Python text extraction

Question

I am using Python for text extraction, but the output is not what I want.

I have a text file containing records like this:

FN Clarivate Analytics Web of Science
VR 1.0

PT J

AU Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

AF Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

TI Examination of relationships among trait-like individual differences,

   state-like individual differences, and learning performance

SO JOURNAL OF APPLIED PSYCHOLOGY

CT 13th Annual Conference of the

   Society-for-Industrial-and-Organizational-Psychology

CY APR 24-26, 1998

CL DALLAS, TEXAS

SP Soc Ind & Org Psychol

RI Gully, Stanley/D-1302-2012

OI Gully, Stanley/0000-0003-4037-3883

SN 0021-9010

PD DEC

PY 2000

VL 85

IS 6

BP 835

EP 847

DI 10.1037//0021-9010.85.6.835

UT WOS:000165745400001

PM 11125649

ER

When I run my code like this:

import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []

author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))
print("Oh hello sir, how many articles do you like to get?")
amount = input()

random_articles = random.sample(articles_list, k = int(amount))


for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()

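The extraction does not include data that continues after a line break. If I run this code, the output looks like "AU Chen, G" and leaves out the other names; the same happens with the title and other fields.

My output looks like:

Chen, G. Examination of relationships among trait-like individual differences, 2000, doi.dx.10.1037//0021-9010.85.6.835

The desired output should be:

Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835

But the extraction only includes the first line of each field.

Any suggestions?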

Tags: python, string, extraction

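Answer

While parsing the file, you need to keep track of which section you are in. There are cleaner ways to write a state machine, but as a quick and simple example you could do something like the following.

Basically, add all the lines of each section to a list for that section, then join the lists and do whatever you need with them at the end. Note that I have not tested this against your file; it is only a sketch to show the general idea.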

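authors = []
title = []
section = None  # the list that continuation lines are currently appended to

# `articles` is the list of lines read from the file in the question
for raw_line in articles:
    line = raw_line.strip()
    if not line:
        continue

    # A line that does not start with whitespace begins a new section,
    # so select the right list to add to (or None to skip the section)
    if not raw_line[0].isspace():
        if line.startswith("AU "):
            line = line[3:]
            section = authors
        elif line.startswith("TI "):
            line = line[3:]
            section = title
        else:
            # Other sections (PY, DI, ...) would be handled here
            section = None

    # Add line to the current section
    if section is not None:
        section.append(line)

authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)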
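
The snippet above collects a single record. To handle a whole file, the same idea can be extended so that every two-letter tag gets its own list and each "ER" line closes the current record. The following is only a sketch under a few assumptions: continuation lines start with whitespace (as in the sample in the question), and the names `parse_records`, `format_record`, and `sample` are made up for illustration. The field order in the output follows the desired output from the question, and the `https://doi.org/` prefix is taken from the asker's code.

```python
from collections import defaultdict


def parse_records(lines):
    """Yield one dict per record, mapping each two-letter tag to its lines."""
    record = defaultdict(list)
    tag = None  # tag of the field we are currently inside
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace():
            # Continuation line: it belongs to the most recent tag
            if tag is not None:
                record[tag].append(raw.strip())
            continue
        # A new field: split the tag from the rest of the line
        tag, _, value = raw.strip().partition(" ")
        if tag == "ER":
            # End of record: hand it over and start a fresh one
            yield dict(record)
            record = defaultdict(list)
            tag = None
        elif value:
            record[tag].append(value.strip())


def format_record(record):
    """Join the collected lines into 'authors, year, title, DOI link'."""
    authors = ", ".join(record.get("AU", []))
    title = " ".join(record.get("TI", []))
    year = " ".join(record.get("PY", []))
    doi = " ".join(record.get("DI", []))
    return "{}, {}, {}, https://doi.org/{}".format(authors, year, title, doi)


# Shortened sample based on the record shown in the question
sample = """\
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Chen, G
   Gully, SM
AF Chen, G
   Gully, SM
TI Examination of relationships among trait-like individual differences,
   state-like individual differences, and learning performance
PY 2000
DI 10.1037//0021-9010.85.6.835
ER
"""

records = list(parse_records(sample.splitlines()))
for rec in records:
    print(format_record(rec))
```

With the asker's script, `articles_list` could then be built as `[format_record(r) for r in parse_records(articles)]`, provided the lines in `articles` still carry their leading whitespace. Note that the file header lines (FN, VR) end up in the first record; they are harmless here because `format_record` only reads AU, TI, PY, and DI.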
