python - How do I extract part of a txt file based on a keyword in Python?
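Problem description
I have a very large text file containing about 5000 HTML documents. I'm trying to search for a specific document by its DOCNO and print all of that document's lines until the next </DOC> tag is reached.
The text file looks roughly like this: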
<DOC>
<DOCNO>abc4567890</DOCNO>
contents
more contents
<BODY>
even more contents
</BODY>
</DOC>
... repeated roughly 5000 times for different DOC NO's
I'm looking for the following output:
contents
more contents
<BODY>
even more contents
</BODY>
</DOC>
Here is what I've been trying so far:
doc_string = "abc4567890"
with open('myfile.txt', encoding = "utf8") as f:
    for item in f.readlines():
        if "</DOCNO>" in item:
            ID = (item [ item.find("<DOCNO>")+len("<DOCNO>") : ])
            if (ID[0:9] == doc_string):
                print (item)
                if "</DOC>" in item:
                    break
However, the only output I get is:
<DOCNO>abc4567890</DOCNO>
Solution
How about something like this?
# initialize variables:
lines = []
read_lines = False
with open('file.txt', 'r') as file:
    # iterate over the file one line at a time:
    for line in file:
        # append the line while we are inside the wanted document:
        if read_lines:
            lines.append(line)
        # set read_lines to True once the wanted DOCNO line is seen:
        if '<DOCNO>abc4567890</DOCNO>' in line:
            read_lines = True
        # set read_lines to False at the closing tag
        # (the </DOC> line itself was already appended above):
        if '</DOC>' in line:
            read_lines = False
# print each collected line:
for line in lines:
    print(line, end='')
Given your input, it will output:
contents
more contents
<BODY>
even more contents
</BODY>
</DOC>
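Since the file is very large, it may also be worth stopping as soon as the wanted document has been closed instead of scanning the remaining documents. The following is only a minimal sketch of the same flag-based idea written as a generator. The function name `extract_doc` and the in-memory sample (including the made-up `xyz0000001` document) are hypothetical and used for illustration only. It assumes each DOCNO tag sits on a single line, as in the sample above.

```python
import io


def extract_doc(lines, docno):
    """Yield the lines after <DOCNO>docno</DOCNO>, up to and including </DOC>."""
    marker = f"<DOCNO>{docno}</DOCNO>"
    in_doc = False
    for line in lines:
        if in_doc:
            yield line
            if "</DOC>" in line:
                return  # wanted document is closed, stop reading
        elif marker in line:
            in_doc = True  # start yielding from the next line on


# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "<DOC>\n"
    "<DOCNO>xyz0000001</DOCNO>\n"
    "other contents\n"
    "</DOC>\n"
    "<DOC>\n"
    "<DOCNO>abc4567890</DOCNO>\n"
    "contents\n"
    "more contents\n"
    "<BODY>\n"
    "even more contents\n"
    "</BODY>\n"
    "</DOC>\n"
)

result = list(extract_doc(sample, "abc4567890"))
print("".join(result), end="")
```

Because an open text file yields its lines when iterated, the same function should work with the real file by passing the object returned by `open('myfile.txt', encoding='utf8')` in place of `sample`.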