python - How to parse records line by line and print only selected fields
Problem description
I have a huge file with a structure similar to:
PMID- 1
OWN - NLM
STAT- PubMed-not-MEDLINE
LR  - 20191218
TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
PG  - 5714-5722
LID - 10.1021/bi00294a006 [doi]
AB  - We describe the synthesis and characterization of a new fluorescence probe whose
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles.
PST - ppublish
SO  - Biochemistry.

PMID- 2
STAT- Publisher
PB  - National Academies Press (US)
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health
AID - NBK547296 [bookaccession]
AID - 10.17226/19375 [doi]

PMID- 3
STAT- Publisher
DA  - 20140815
ISBN- 030903339X
PB  - National Academies Press (US)
DP  - 1983
BTI - Community Oriented Primary Care: New Directions for Health Services Delivery
CN  - Institute of Medicine (US) Division of Health Care Services
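The task is to print only the lines belonging to the PMID, TI, and AB fields, with the following constraints:
- Print the PMID field only if a TI field is present in the record
- Records should be separated by an empty line
In the original data, the AB field is present only if the TI field is present in the record. The result should be: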
PMID- 1
TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
AB  - We describe the synthesis and characterization of a new fluorescence probe whose
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles.

PMID- 2
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health
The current solution is:
import re

with open("input.txt", "rt") as in_file:
    # A tag is exactly four characters (padded with spaces), followed by "- "
    prog = re.compile("^(....)- (.*)$")
    # Initialise the state so that a leading continuation line cannot raise NameError
    tag = None
    ti_line = ab_line = False
    for line in in_file:
        line = line.rstrip()
        match = prog.match(line)
        if match:
            tag = match.groups()[0]
            field = match.groups()[1]
        # If "PMID" tag, remember the line; it should be printed only if a "TI" field is present in the record
        if tag == "PMID":
            pmid = line
        # If "TI" line, print it
        if tag == "TI  ":
            print(line)
            ti_line = True
        # If line is a "continuation line" and we are in TI field
        elif line.startswith(" ") and ti_line:
            print(line)
        else:
            ti_line = False
        # If "AB" line, print it
        if tag == "AB  ":
            print(line)
            ab_line = True
        # If line is a "continuation line" and we are in AB field
        elif line.startswith(" ") and ab_line:
            print(line)
        else:
            ab_line = False
Output:
TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
AB  - We describe the synthesis and characterization of a new fluorescence probe whose
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles.
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health
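Questions:
- What is the best way to include the PMIDs in the output?
- How can the records be separated by an empty line?
Solution
If you don't need the output to have the same format as the input, I would do it this way: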
import json


def convert_to_json(in_file, out_file) -> None:
    """Write each blank-line-separated record of in_file as one JSON line."""
    current_object = {}
    current_tag = None
    for line in in_file:
        if not line.strip():
            # empty line: the record is complete, so save it to the output file
            if current_object:
                out_file.write(json.dumps(current_object) + "\n")
            current_object = {}
            current_tag = None
            continue
        if line.startswith(" "):
            # it is a continuation of the previous tag
            assert current_tag
            current_object[current_tag] = current_object[current_tag] + " " + line.strip()
        else:
            # we found a new tag
            tag, value = line.strip().split('-', 1)
            tag = tag.strip()
            value = value.strip()
            current_tag = tag
            current_object[tag] = value
    # save the last object manually
    # because the file may not contain an empty line at the end
    if current_object:
        out_file.write(json.dumps(current_object) + "\n")


# STAGE 1: transform the input file into a more machine-friendly format,
# for example JSON rows
with open("input.txt", "rt") as in_file:
    with open('output.json', 'w') as out_file:
        convert_to_json(in_file, out_file)

# STAGE 2: print any information you need,
# formatted as you need
with open('output.json') as new_input:
    for line in new_input:
        obj = json.loads(line.strip())
        if "TI" not in obj:
            # skip records without a title so they leave no stray blank line
            continue
        if "PMID" in obj:
            print(obj["PMID"])
        print(obj["TI"])
        if "AB" in obj:
            print(obj["AB"])
        print()
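The tag/value split inside `convert_to_json` depends on `split('-', 1)` stopping at the first hyphen, which matters because values such as `PubMed-not-MEDLINE` or `5714-5722` contain hyphens themselves. A minimal sketch of that single step; the helper name `split_tag` is hypothetical and is not part of the code above:

```python
def split_tag(line: str) -> tuple:
    """Split a 'TAG - value' line at the first hyphen only (hypothetical helper)."""
    tag, value = line.strip().split("-", 1)
    return tag.strip(), value.strip()


print(split_tag("STAT- PubMed-not-MEDLINE"))  # ('STAT', 'PubMed-not-MEDLINE')
print(split_tag("PG  - 5714-5722"))           # ('PG', '5714-5722')
```

Because each record is stored as a plain dictionary, a tag that occurs twice in a record (such as the two AID lines of record 2) keeps only its last value.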
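As far as I understand, preserving the source data format is not trivial (though certainly possible), because the tags in the input stream may come in any order (for example, PMID may come before TI, but they may also appear in the reverse order).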
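If the original layout does have to be preserved, one possible approach is to buffer the wanted lines of each record and emit them only once the record is known to contain a TI field. The following is a rough sketch rather than a tested solution for the full data set. It assumes that tags occupy four characters followed by "- " (as the question's regex expects) and that a record ends at a blank line or at the next PMID line. The names `iter_filtered_records`, `TAG_RE`, `WANTED`, and `SAMPLE` are made up for illustration, and `SAMPLE` is shortened dummy data.

```python
import io
import re

TAG_RE = re.compile(r"^(.{4})- ")   # four-character tag, then "- "
WANTED = {"PMID", "TI", "AB"}       # fields to keep


def iter_filtered_records(lines):
    """Yield, per record that has a TI field, the list of its PMID/TI/AB lines."""
    record, has_ti, keep = [], False, False
    for raw in lines:
        line = raw.rstrip()
        match = TAG_RE.match(line)
        # Assumption: a blank line or a PMID line marks a record boundary
        if not line or (match and match.group(1) == "PMID"):
            if record and has_ti:
                yield record
            record, has_ti, keep = [], False, False
        if not line:
            continue
        if match:
            tag = match.group(1).strip()
            keep = tag in WANTED
            has_ti = has_ti or tag == "TI"
        # Lines without a tag are continuation lines and inherit `keep`
        if keep:
            record.append(line)
    # The input may not end with a blank line
    if record and has_ti:
        yield record


SAMPLE = """\
PMID- 1
OWN - NLM
TI  - First title
      continued here.
AB  - Abstract text.

PMID- 2
STAT- Publisher
TI  - Second title

PMID- 3
STAT- Publisher
"""

for i, record in enumerate(iter_filtered_records(io.StringIO(SAMPLE))):
    if i:
        print()  # one empty line between records
    print("\n".join(record))
```

Passing an open file object instead of `io.StringIO(SAMPLE)` processes the input one line at a time, so only a single record is held in memory, which suits a huge file. Kept lines are emitted in their input order, so a record in which TI precedes PMID would be printed in that order as well.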