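pdf - Extracting and tagging words with nltk - wrong output
Problem description
I have a text file (converted from a pdf) from which I want to extract names. First, though, I want to tokenize all the words and have nltk tag them (e.g. NNP for proper nouns). My code works for one text file but not for another.
The file that works looks like this: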
1
2
GM HEALTH AND SOCIAL CARE STRATEGIC PARTNERSHIP BOARD
MINUTES OF THE MEETING HELD ON 28 APRIL 2017
Bridgewater Community Healthcare NHS
Dorothy Whitaker
Trust
Bolton Council
Councillor Cliff Morris
Margaret Asquith
The file that does not work looks like this (this is formatted better than the actual pdf looks):
GREATER MANCHESTER COMBINED AUTHORITY (GMCA)
ECONOMY, BUSINESS GROWTH AND SKILLS SCRUTINY COMMITTEE
FRIDAY 13 APRIL 2018 AT 2.00PM, BOARDROOM, GMCA,
CHURCHGATE HOUSE
Present: Councillor: Michael Holly (in the Chair)
Councillors: Susan Haworth (Bolton)
Roy Walker (Bury)
Ahmed Ali (Manchester)
Grace Fletcher-Hackwood (Manchester)
Kate Lewis (Salford)
Mark Hunter (Stockport)
Elise Wilson (Stockport)
Here is my code:
from nltk import word_tokenize, pos_tag, ne_chunk
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')

with open('mergedminutes.txt', 'r') as file:
    data = file.read()

data2 = data.split()
tokens = nltk.word_tokenize(data)
text = nltk.Text(tokens)

def categorize_words():
    print(pos_tag((tokens)))

output = categorize_words()

file = open("wordsfromminutes.txt", "w")
file.write(str(output))
file.close()
I think it must have something to do with the file. This is the output I get with the second file:
('ÿþI\x00t\x00e\x00m\x00', 'JJ'), ('\x009\x00', 'NNP'), ('\x00', 'NNP'), ('\x00', 'NNP' '), ('\x00', 'NNP'), ('\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_ \x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00_\x00', 'NNP'), ( '\x00', 'NNP'), ('\x00', 'NNP'), ('\x00', 'NN
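Does anyone know what is going on here? Thanks.
Solution
The output for the second file is most likely caused by the encoding of that file, probably introduced by the pdf-to-txt conversion. The leading 'ÿþ' and the '\x00' after every character are what UTF-16 (little-endian) text looks like when it is read as a single-byte encoding.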
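If the stray 'ÿþ' really is a byte-order mark, the file can also be read directly by picking the codec from its first bytes. The following is only a minimal sketch under that assumption; the helper name read_text_any_bom and its default fallback of UTF-8 are made up for illustration and are not part of the original program.

```python
import codecs
import io


def read_text_any_bom(path, default="utf-8"):
    """Read a text file, choosing the codec from a leading byte-order mark.

    Hypothetical helper: falls back to `default` when no BOM is present.
    """
    with io.open(path, "rb") as fh:
        raw = fh.read()
    # Check UTF-32 first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return raw.decode("utf-32")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode(default)
```

The returned string could then be passed to nltk.word_tokenize in place of the result of file.read().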
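Try copying and pasting the contents of the second file into a new file and saving it with a .txt extension. You can do this with an editor such as notepad++, gedit, atom or vim. Then use that file as the input to your program. By copying the sample you provided into a .txt file, I was able to get the following output from your program.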
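The copy-and-paste step can also be done in code by re-saving the file as UTF-8. This is a rough sketch that assumes the converted file is UTF-16; the function name reencode_to_utf8 and the src_encoding default are illustrative choices, so adjust the encoding if the file turns out to be something else.

```python
import io


def reencode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Rewrite src_path as UTF-8 at dst_path and return the decoded text.

    Assumption: the source really is encoded as `src_encoding`.
    """
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text
```

The new file can then be opened by the original program without further changes.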
[('GREATER', 'NNP'), ('MANCHESTER', 'NNP'), ('COMBINED', 'NNP'), ('AUTHORITY', 'NNP'), ('(', '('), ('GMCA', 'NNP'), (')', ')'), ('ECONOMY', 'NNP'), (',', ','), ('BUSINESS', 'NNP'), ('GROWTH', 'NNP'), ('AND', 'NNP'), ('SKILLS', 'NNP'), ('SCRUTINY', 'NNP'), ('COMMITTEE', 'NNP'), ('FRIDAY', 'NNP'), ('13', 'CD'), ('APRIL', 'NNP'), ('2018', 'CD'), ('AT', 'NNP'), ('2.00PM', 'CD'), (',', ','), ('BOARDROOM', 'NNP'), (',', ','), ('GMCA', 'NNP'), (',', ','), ('CHURCHGATE', 'NNP'), ('HOUSE', 'NNP'), ('Present', 'NNP'), (':', ':'), ('Councillor', 'NN'), (':', ':'), ('Michael', 'NNP'), ('Holly', 'NNP'), ('(', '('), ('in', 'IN'), ('the', 'DT'), ('Chair', 'NNP'), (')', ')'), ('Councillors', 'NNS'), (':', ':'), ('Susan', 'NNP'), ('Haworth', 'NNP'), ('(', '('), ('Bolton', 'NNP'), (')', ')'), ('Roy', 'NNP'), ('Walker', 'NNP'), ('(', '('), ('Bury', 'NNP'), (')', ')'), ('Ahmed', 'NNP'), ('Ali', 'NNP'), ('(', '('), ('Manchester', 'NNP'), (')', ')'), ('Grace', 'NNP'), ('Fletcher-Hackwood', 'NNP'), ('(', '('), ('Manchester', 'NNP'), (')', ')'), ('Kate', 'NNP'), ('Lewis', 'NNP'), ('(', '('), ('Salford', 'NNP'), (')', ')'), ('Mark', 'NNP'), ('Hunter', 'NNP'), ('(', '('), ('Stockport', 'NNP'), (')', ')'), ('Elise', 'NNP'), ('Wilson', 'NNP'), ('(', '('), ('Stockport', 'NNP'), (')', ')')]
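The files and the program I used are available at https://github.com/michaelhochleitner/stackoverflow.com-questions-57148173 .
Writing the output to a file the way your version of the program does did not work for me, so I redirected the program's printed output to a file with the following command.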
python extract_names.py > wordsfromdoesntwork.txt
I am using Python 2.7.15+ and nltk 3.4.4.
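On writing the file from inside the program: in the question's code, categorize_words() prints the tagged list and has no return statement, so output is None and the file only ever receives the text "None". One possible fix, sketched here rather than taken from the original answer, is to return the tagged tokens and write them out explicitly. The save_tagged helper and the tab-separated layout are illustrative choices, and the tagger is passed in as a parameter so the sketch does not depend on nltk being installed.

```python
import io


def categorize_words(tokens, tagger):
    """Return the (token, tag) pairs instead of printing them."""
    return tagger(tokens)


def save_tagged(tagged, out_path):
    """Write one tab-separated token/tag pair per line (hypothetical format)."""
    with io.open(out_path, "w", encoding="utf-8") as fh:
        for token, tag in tagged:
            fh.write(u"{}\t{}\n".format(token, tag))


# Intended use with nltk (needs the punkt and averaged_perceptron_tagger data):
#   from nltk import pos_tag, word_tokenize
#   tokens = word_tokenize(data)
#   save_tagged(categorize_words(tokens, pos_tag), "wordsfromminutes.txt")
```

With this, the shell redirection is no longer needed, although it remains a perfectly reasonable workaround.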