Python file parsing: unable to capture a string on a new line

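Problem description

I am parsing a large text file of 56,900 book titles, each listed with its author and etext number, and I am trying to extract the authors. The file looks like this: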

TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia                                56892
 [Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss     56891
 [Subtitle: A picture of Judaism, in the century
  which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile                         56890
 [Subtitle: la sua congiura, i suoi processi e la sua pazzia]
 [Language: Italian]

The Blue Star, by Fletcher Pratt                                         56889

Importanza e risultati degli incrociamenti in avicoltura,                56888
 di Teodoro Pascal
 [Language: Italian]

The Junior Classics, Volume 3: Tales from Greece and Rome, by Various    56887


~ ~ ~ ~ Posting Dates for the below eBooks:  1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~

TITLE and AUTHOR                                                     ETEXT NO.

The American Missionary, Volume 41, No. 1, January, 1887, by Various     56886

Morganin miljoonat, mennessä Sven Elvestad                               56885
 [Author a.k.a. Stein Riverton]
 [Subtitle: Salapoliisiromaani]
 [Language: Finnish]

"Trip to the Sunny South" in March, 1885, by L. S. D                     56884

Balaam and His Master, by Joel Chandler Harris                           56883
 [Subtitle: and Other Sketches and Stories]

Susien saaliina, mennessä Jack London                                    56882
 [Language: Finnish]

Forged Egyptian Antiquities, by T. G. Wakeling                           56881

The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky           56880
 [Subtitle: Third Edition]

No Posting                                                               56879

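The author name usually follows "by". When a line has no "by", the author name follows a comma ",". However, when the line does contain "by", a comma may be part of the title.
So I look for "by" first, then fall back to the comma.

This is what I tried: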

def search_by_author():

    fhand = open('GUTINDEX.ALL')
    print("Search by Author:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("TITLE"):
            if not line.startswith("~"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[:-6] 
                if ", by" in words:

                    words = words[words.find(', by'):]
                    words = words[5:]
                    print (words)

                else:
                    words = words[words.find(', '):]
                    words = words[5:]
                    if "," in words:
                        words = words[words.find(', '):]
                        if words.startswith(','):
                            words =words[words.find(','):]
                            print (words)
                        else:
                            print (words)
                    else:
                        print (words)
                if " by" in words:
                    words = words[words.find('by')]
                    print(words)

search_by_author()

However, it does not seem to find the author name for entries like this one, where the author is on the next line:

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

Tags: python, python-3.x

Solution


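According to your file, the information about one book can span multiple lines, and each book's information is followed by a blank line. I use that blank line to collect all the information about one book, then parse the collected text to get the author.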

import re

def search_by_author():
    book_info = ''

    # open the index in a with-block so the file is closed afterwards
    with open('GUTINDEX.ALL') as fhand:
        for line in fhand:
            line = line.rstrip()

            # skip column headers and "~ ~ ~" posting-date separators
            if line.startswith('TITLE') or line.startswith('~'):
                continue

            if len(line) == 0:
                # a blank line ends the current book record
                # remove info in square bracket from book_info
                book_info = re.sub(r'\[.*$', '', book_info)

                if 'by ' in book_info:
                    tokens = book_info.split('by ')
                else:
                    tokens = book_info.split(',')

                if len(tokens) > 1:
                    authors = tokens[-1].strip()
                    print(authors)

                book_info = ''

            else:
                # remove ETEXT NO. from line
                line = re.sub(r'\d+$', '', line)
                book_info += ' ' + line.rstrip()


search_by_author()
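
The grouping step on its own can be sketched as a small generator. This is only an illustrative sketch, not part of the code above: the name `iter_records` is hypothetical, and it assumes, as the answer does, that records are separated by blank lines. It also yields a final record that is not followed by a blank line, which the loop above would leave unprocessed.

```python
import re

def iter_records(lines):
    """Yield one joined string per blank-line-separated record."""
    parts = []
    for line in lines:
        line = line.rstrip()
        # Skip column headers and "~ ~ ~ Posting Dates ..." separators.
        if line.startswith('TITLE') or line.startswith('~'):
            continue
        if not line:
            # A blank line closes the record collected so far.
            if parts:
                yield ' '.join(parts)
                parts = []
            continue
        # Drop the trailing ETEXT number, then surrounding whitespace.
        parts.append(re.sub(r'\d+$', '', line).strip())
    if parts:
        # The last record may not be followed by a blank line.
        yield ' '.join(parts)


sample = [
    "Aspects of plant life; with special reference to the British flora,      56900\n",
    " by Robert Lloyd Praeger\n",
    "\n",
    "Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896\n",
]
records = list(iter_records(sample))
print(records)
```

Because the generator accepts any iterable of lines, it can be fed an open file object or, as here, a short list for testing.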

Output:

Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery  and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky
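
One caveat with the `'by ' in book_info` test is that it also matches inside words that end in "by", such as "Derby". A possible refinement, sketched here under the assumption that the author is introduced by a comma followed by the whole word "by" (as in the sample), is to search for that pattern with a regular expression. The function name `extract_author` is hypothetical, and the comma fallback mirrors the `split(',')` branch of the answer.

```python
import re

def extract_author(record):
    """Return the author part of a joined record, or None if not found."""
    # Drop bracketed notes such as [Subtitle: ...] or [Language: ...].
    record = re.sub(r'\[.*$', '', record).strip()
    # Prefer the last ", by " where "by" stands alone as a word.
    matches = list(re.finditer(r',\s+by\s+', record))
    if matches:
        return record[matches[-1].end():].strip()
    # Fallback: whatever follows the last comma, if there is one.
    head, sep, tail = record.rpartition(',')
    return tail.strip() if sep else None


print(extract_author("Tom Thatcher's Fortune, by Horatio Alger, Jr."))
print(extract_author("Pensées sans langage, par Francis Picabia [Language: French]"))
# Hypothetical record, not from the index: "Derby " must not count as "by ".
print(extract_author("Derby days, par Some Author"))
```

Records without any comma, such as "No Posting", return `None` and can simply be skipped by the caller.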
