How to parse records line by line and print only selected fields

Problem description

I have a huge file structured like this:

PMID- 1
OWN - NLM
STAT- PubMed-not-MEDLINE
LR  - 20191218
TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
PG  - 5714-5722
LID - 10.1021/bi00294a006 [doi]
AB  - We describe the synthesis and characterization of a new fluorescence probe whose 
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles. 
PST - ppublish
SO  - Biochemistry.

PMID- 2
STAT- Publisher
PB  - National Academies Press (US)
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health
AID - NBK547296 [bookaccession]
AID - 10.17226/19375 [doi]

PMID- 3
STAT- Publisher
DA  - 20140815
ISBN- 030903339X
PB  - National Academies Press (US)
DP  - 1983
BTI - Community Oriented Primary Care: New Directions for Health Services Delivery
CN  - Institute of Medicine (US) Division of Health Care Services

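The task is to print only the lines of the PMID, TI and AB fields, with the following constraint:

In the original data, the AB field is present only if the TI field is present in the record. The result should be: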

PMID- 1
TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
AB  - We describe the synthesis and characterization of a new fluorescence probe whose 
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles.

PMID- 2
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health

The current solution is:

import re

with open("input.txt", "rt") as in_file:
    prog = re.compile("^(....)- (.*)$")
    for line in in_file:
        line = line.rstrip()
        match = prog.match(line)
        if match:
            tag = match.groups()[0]
            field = match.groups()[1]
        # If "PMID" tag, print it but only if "TI" field is present in the record
        if tag == "PMID":
            pmid = line
        # If "TI" line, print it
        if tag == "TI  ":
            print(line)
            ti_line = True
        # If line is a "continuation line" and we are in TI field
        elif line.startswith("      ") and ti_line:
            print(line)
        else:
            ti_line = False
        # If "AB" line, print it
        if tag == "AB  ":
            print(line)
            ab_line = True      
        # If line is a "continuation line" and we are in AB field
        elif line.startswith("      ") and ab_line:
            print(line)
        else:
            ab_line = False

Output:

TI  - Synthesis and Characterization of a Fluorescence Probe of the Phase Transition
      and Dynamic Properties of Membranes.
AB  - We describe the synthesis and characterization of a new fluorescence probe whose
      emission spectra, anisotropies, and wavelength-dependent decay times are highly
      sensitive to the phase state of phospholipid vesicles.
TI  - The National Academies Collection: Reports funded by National Institutes of
      Health

Problem: the PMID lines and the blank lines between records are missing from the output.

Tags: python, parsing

Solution


If you don't need the output to be in the same format as the input, I would do it this way:

import json


def convert_to_json(in_file, out_file) -> None:
    current_object = {}
    current_tag = None

    for line in in_file:
        if not line.strip():
            # empty line marks the end of a record:
            # save the object into the output file,
            # skipping empty objects produced by consecutive blank lines
            if current_object:
                out_file.write(json.dumps(current_object) + "\n")
            current_object = {}
            current_tag = None
            continue

        if line.startswith("      "):
            # it is a continuation of the previous tag
            assert current_tag
            current_object[current_tag] += " " + line.strip()
        else:
            # we found a new tag
            tag, value = line.strip().split('-', 1)
            tag = tag.strip()
            value = value.strip()

            current_tag = tag
            current_object[tag] = value

    # save the last object manually
    # because the file may not contain an empty line at the end
    if current_object:
        out_file.write(json.dumps(current_object) + "\n")


# STAGE 1: transform input file into more machine-friendly format
# for example, JSON rows
with open("input.txt", "rt") as input:
    with open('output.json', 'w') as output:
        convert_to_json(input, output)


# STAGE 2: print any information you need
# format it as you need
with open('output.json') as new_input:
    for line in new_input:
        obj = json.loads(line)
        # skip records without a title, so they produce no stray blank line
        if "TI" not in obj:
            continue
        if "PMID" in obj:
            print(obj["PMID"])
        print(obj["TI"])
        if "AB" in obj:
            print(obj["AB"])
        print()

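As far as I understand, preserving the source data format is not trivial (though certainly possible), because the tags in the input stream may appear in any order (for example, PMID may come before TI, but they may also appear in the reverse order).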
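
One way to keep the source format despite the ordering issue is to buffer each record and decide what to print only once the record ends. The following is a minimal sketch, not a definitive implementation. It assumes that the tag occupies the first four columns of a line (as the question's regular expression does), that continuation lines start with six spaces, and that records are separated by blank lines. The names filter_records and WANTED are made up for this illustration.

```python
import io

WANTED = ("PMID", "TI", "AB")  # hypothetical name: the fields to keep


def filter_records(lines, out):
    """Copy PMID/TI/AB lines unchanged, but only for records that contain TI."""
    record = []   # kept lines of the current record, as (tag, text) pairs
    tag = None    # tag of the field the current line belongs to
    first = True  # no record has been written yet

    def flush():
        nonlocal record, first
        # the whole record is buffered, so the position of TI does not matter
        if any(t == "TI" for t, _ in record):
            if not first:
                out.write("\n")  # blank line between records
            out.write("".join(text for _, text in record))
            first = False
        record = []

    for line in lines:
        if not line.strip():
            # blank line: the record is complete
            flush()
            tag = None
        elif line.startswith("      "):
            # continuation line: belongs to the most recent tag
            if tag in WANTED:
                record.append((tag, line.rstrip() + "\n"))
        else:
            # assumption: the tag sits in the first four columns
            tag = line[:4].strip()
            if tag in WANTED:
                record.append((tag, line.rstrip() + "\n"))
    flush()  # the file may not end with a blank line


sample = """PMID- 1
OWN - NLM
TI  - A title
      continued.
AB  - An abstract.

PMID- 3
STAT- Publisher
"""
result = io.StringIO()
filter_records(io.StringIO(sample), result)
print(result.getvalue(), end="")
```

The kept lines are written in the order in which they appear in the source record. For a real file, pass the open file object as lines and sys.stdout (or another open file) as out.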

