python - Python文件解析,无法在新行中捕获字符串
问题描述
因此,解析一个包含 56,900 个书名的大型文本文件,其中包含作者和 etext 编号。试图找到作者。通过解析文件。该文件是这样的:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
作者姓名通常以“by”开头,或者当行中没有“by”时,作者姓名以逗号“,”开头。但是,如果该行有 by,则“,”可以是标题的一部分。
所以,我先解析它,然后解析逗号。
这是我尝试过的:
def search_by_author():
fhand = open('GUTINDEX.ALL')
print("Search by Author:")
for line in fhand:
if not line.startswith(" [") and not line.startswith("TITLE"):
if not line.startswith("~"):
words = line.rstrip()
words = line.lstrip()
words = words[:-6]
if ", by" in words:
words = words[words.find(', by'):]
words = words[5:]
print (words)
else:
words = words[words.find(', '):]
words = words[5:]
if "," in words:
words = words[words.find(', '):]
if words.startswith(','):
words =words[words.find(','):]
print (words)
else:
print (words)
else:
print (words)
if " by" in words:
words = words[words.find('by')]
print(words)
search_by_author()
但是,它似乎找不到像这样的行的作者姓名
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
解决方案
根据您的文件,关于一本书的信息可以分布在多行中。每本书信息后面都有一个空行。我用它来收集有关一本书的所有信息,然后对其进行解析以获取作者信息。
import re
def search_by_author():
fhand = open('GUTINDEX.ALL')
book_info = ''
for line in fhand:
line = line.rstrip()
if (line.startswith('TITLE') or line.startswith('~')):
continue
if (len(line) == 0):
# remove info in square bracket from book_info
book_info = re.sub(r'\[.*$', '', book_info)
if ('by ' in book_info):
tokens = book_info.split('by ')
else:
tokens = book_info.split(',')
if (len(tokens) > 1):
authors = tokens[-1].strip()
print(authors)
book_info = ''
else:
# remove ETEXT NO. from line
line = re.sub(r'\d+$', '', line)
book_info += ' ' + line.rstrip()
search_by_author()
输出:
Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky
推荐阅读
- gem5 - 用 gem5 测量加速
- php - vscode-phpcs 扩展在编辑器中不起作用,但在版本控制中,我可以修复它吗?
- c++ - C++/opencv/Ubuntu:GoPro 非常低的 fps
- go - 跨多个结构使用相同的方法
- python - 嵌套字典的 CSV 转换和重新排列几个方面
- javascript - Fetch 方法不发送带有 post 的内容
- amazon - 为 Screen Sample App 添加新的 Alexa 技能
- java - 未包含错误但仍无法执行代码
- docker - Kubernetes:同一个 pod 中的两个容器无法通过 localhost 连接
- odoo - 如何将域添加到 Odoo 中的 Many2one 字段?