Extracting and tagging words with nltk - wrong output

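Question

I have a text file (converted from pdf) from which I want to extract names. First, though, I want to tokenize all the words and have nltk tag them (e.g. NNP for proper noun). My code works for one text file but not for another.

The file that works looks like this: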

1
2
GM HEALTH AND SOCIAL CARE STRATEGIC PARTNERSHIP BOARD
MINUTES OF THE MEETING HELD ON 28 APRIL 2017
Bridgewater Community Healthcare NHS
Dorothy Whitaker
Trust
Bolton Council
Councillor Cliff Morris
Margaret Asquith

The file that doesn't work looks like this (this is formatted better than the actual pdf looks):

GREATER MANCHESTER COMBINED AUTHORITY (GMCA) 
ECONOMY, BUSINESS GROWTH AND SKILLS SCRUTINY COMMITTEE 
FRIDAY  13  APRIL  2018  AT  2.00PM,  BOARDROOM,  GMCA, 
CHURCHGATE HOUSE  

Present:  Councillor:  Michael Holly (in the Chair) 

   Councillors:  Susan Haworth (Bolton) 
Roy Walker (Bury) 
Ahmed Ali (Manchester) 
Grace Fletcher-Hackwood (Manchester) 
Kate Lewis (Salford) 
Mark Hunter (Stockport) 
Elise Wilson (Stockport) 

Here is my code:

from nltk import word_tokenize, pos_tag, ne_chunk
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

with open('mergedminutes.txt', 'r') as file: 
    data = file.read()
    data2 = data.split()
    tokens = nltk.word_tokenize(data)
    text = nltk.Text(tokens)

def categorize_words():
    print(pos_tag((tokens)))
output = categorize_words()
file = open("wordsfromminutes.txt", "w")
file.write(str(output))
file.close()

I think it must have something to do with the file. This is the output I get with the second file:

('ÿþI\x00t\x00e\x00m\x00', 'JJ'), ('\x009\x00', 'NNP'), ('\x00', 'NNP'), ('\x00', 'NNP' '), ('\x00', 'NNP'), ('\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00', 'NNP'), ( '\x00', 'NNP'), ('\x00', 'NNP'), ('\x00', 'NN

Does anyone know what is going on here? Thanks.

Tags: pdf, web-scraping, nltk

Answer


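The output for the second file is probably caused by the file's encoding. The leading ÿþ is what a UTF-16 little-endian byte order mark (bytes FF FE) looks like when it is decoded with a single-byte encoding such as Latin-1, and the \x00 after every character is the second byte of each UTF-16 code unit. So the file is most likely UTF-16 that is being read with the wrong codec. The conversion from pdf to txt may be the cause.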
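
One way to test the encoding theory is to look at the first bytes of the file for a byte order mark (BOM) and decode the file accordingly. The following is only a sketch: the helper names `guess_encoding` and `read_text` and the UTF-8 fallback are assumptions for illustration, not part of the original program, and a file without a BOM may still be in some other encoding.

```python
import codecs
import io

# BOM signatures, checked longest first so that a UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Guess a text file's encoding from its BOM; fall back to `default`."""
    with io.open(path, "rb") as handle:
        head = handle.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: files without a BOM are UTF-8


def read_text(path):
    """Read a file as text, using the encoding suggested by its BOM."""
    with io.open(path, "r", encoding=guess_encoding(path)) as handle:
        return handle.read()
```

With this in place, `data = read_text('mergedminutes.txt')` could stand in for the `open(...)`/`file.read()` pair in the question before the text is passed to `nltk.word_tokenize`.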

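Try copying and pasting the contents of the second file into a new file and saving it with a .txt extension. You can use an editor such as Notepad++, gedit, Atom or vim to do this. Then use that file as the input to your program. After copying the sample you provided into a .txt file, I was able to get the following output from your program.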

[('GREATER', 'NNP'), ('MANCHESTER', 'NNP'), ('COMBINED', 'NNP'), ('AUTHORITY', 'NNP'), ('(', '('), ('GMCA', 'NNP'), (')', ')'), ('ECONOMY', 'NNP'), (',', ','), ('BUSINESS', 'NNP'), ('GROWTH', 'NNP'), ('AND', 'NNP'), ('SKILLS', 'NNP'), ('SCRUTINY', 'NNP'), ('COMMITTEE', 'NNP'), ('FRIDAY', 'NNP'), ('13', 'CD'), ('APRIL', 'NNP'), ('2018', 'CD'), ('AT', 'NNP'), ('2.00PM', 'CD'), (',', ','), ('BOARDROOM', 'NNP'), (',', ','), ('GMCA', 'NNP'), (',', ','), ('CHURCHGATE', 'NNP'), ('HOUSE', 'NNP'), ('Present', 'NNP'), (':', ':'), ('Councillor', 'NN'), (':', ':'), ('Michael', 'NNP'), ('Holly', 'NNP'), ('(', '('), ('in', 'IN'), ('the', 'DT'), ('Chair', 'NNP'), (')', ')'), ('Councillors', 'NNS'), (':', ':'), ('Susan', 'NNP'), ('Haworth', 'NNP'), ('(', '('), ('Bolton', 'NNP'), (')', ')'), ('Roy', 'NNP'), ('Walker', 'NNP'), ('(', '('), ('Bury', 'NNP'), (')', ')'), ('Ahmed', 'NNP'), ('Ali', 'NNP'), ('(', '('), ('Manchester', 'NNP'), (')', ')'), ('Grace', 'NNP'), ('Fletcher-Hackwood', 'NNP'), ('(', '('), ('Manchester', 'NNP'), (')', ')'), ('Kate', 'NNP'), ('Lewis', 'NNP'), ('(', '('), ('Salford', 'NNP'), (')', ')'), ('Mark', 'NNP'), ('Hunter', 'NNP'), ('(', '('), ('Stockport', 'NNP'), (')', ')'), ('Elise', 'NNP'), ('Wilson', 'NNP'), ('(', '('), ('Stockport', 'NNP'), (')', ')')]
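
Since the stated goal is to extract names, one rough next step on tagged output like this is to join runs of consecutive NNP tokens into candidate names. This is only a sketch: `group_proper_nouns` is a hypothetical helper, not part of nltk, and because headings and place names are also tagged NNP the result would still need filtering.

```python
def group_proper_nouns(tagged):
    """Join runs of consecutive NNP tokens into candidate names."""
    names, current = [], []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        else:
            # Any other tag ends the current run of proper nouns.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# A slice of the tagged output shown above.
sample = [('Present', 'NNP'), (':', ':'), ('Councillor', 'NN'), (':', ':'),
          ('Michael', 'NNP'), ('Holly', 'NNP'), ('(', '('), ('in', 'IN'),
          ('the', 'DT'), ('Chair', 'NNP'), (')', ')')]
print(group_proper_nouns(sample))  # ['Present', 'Michael Holly', 'Chair']
```

The question also imports `ne_chunk`, which could be tried on the tagged tokens as an alternative to this simple grouping.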

The files and the program I used are available at https://github.com/michaelhochleitner/stackoverflow.com-questions-57148173.

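Writing the output to a file did not work for me with your version of the program. I used the following command to redirect the program's printed output to a file instead.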

python extract_names.py > wordsfromdoesntwork.txt

I am using Python 2.7.15+ and nltk 3.4.4.
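
A likely reason the file write fails in the question's code is that `categorize_words()` only prints the tagged tokens and returns nothing, so `output` is `None` and the string "None" is what gets written. A minimal sketch of a version that returns the tags and writes them out follows. The `tagger` parameter, the `write_tagged` helper, the tab-separated layout and the UTF-8 input encoding in `main` are assumptions for illustration.

```python
import io


def categorize_words(tokens, tagger):
    """Return the tagged tokens; the version in the question only printed them."""
    return tagger(tokens)


def write_tagged(tagged, path):
    """Write one 'word<TAB>tag' pair per line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for word, tag in tagged:
            out.write(u"{0}\t{1}\n".format(word, tag))


def main():
    # Imported here so the helpers above can be used without nltk installed.
    from nltk import word_tokenize, pos_tag

    # Assumption: the input has already been re-saved as UTF-8.
    with io.open("mergedminutes.txt", "r", encoding="utf-8") as source:
        tokens = word_tokenize(source.read())
    write_tagged(categorize_words(tokens, pos_tag), "wordsfromminutes.txt")
```

Calling `main()` would then produce `wordsfromminutes.txt` directly, without redirecting the printed output.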

