Named entity recognition errors in Python

Problem description

I am trying to write a text preprocessor and get nltk.ne_chunk() to work, but the following code produces a long error traceback.

import nltk

z = "Francois Legault of the CAQ will now become the new premier of Quebec. This is possible as his party defeated the Liberals in the Provincial elections held on October 1st 2018."

def preprocess_pipe1(doc1):
    # Split the document into sentences.
    sent1 = nltk.sent_tokenize(doc1)
    #print(sent1)
    print(" ")
    print("SENTENCE SPLITTER")
    for x in sent1:
        print(x)

    print(" ")

    # Tokenize each sentence into words.
    sent1 = [nltk.word_tokenize(sent2) for sent2 in sent1]
    #print(sent1)
    print(" ")
    print("TOKENIZER")
    for x in sent1:
        print(x)

    print(" ")

    # POS-tag each tokenized sentence; the result is a list of tagged sentences.
    sent1 = [nltk.pos_tag(sent2) for sent2 in sent1]
    #print(sent1)
    print(" ")
    print("POS TAGGER")
    for x in sent1:
        print(x)

    return sent1


sent2 = preprocess_pipe1(z)
sent3 = nltk.ne_chunk(sent2)  # this call raises the AttributeError
print(sent3)

The output and the error are as follows:

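SENTENCE SPLITTER
Francois Legault of the CAQ will now become the new premier of Quebec.
This is possible as his party defeated the Liberals in the Provincial elections held on October 1st 2018.

TOKENIZER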

['Francois', 'Legault', 'of', 'the', 'CAQ', 'will', 'now', 'become', 'the', 'new', 'premier', 'of', 'Quebec', '.']
['This', 'is', 'possible', 'as', 'his', 'party', 'defeated', 'the', 'Liberals', 'in', 'the', 'Provincial', 'elections', 'held', 'on', 'October', '1st', '2018', '.']


POS TAGGER
[('Francois', 'NNP'), ('Legault', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('CAQ', 'NNP'), ('will', 'MD'), ('now', 'RB'), ('become', 'VB'), ('the', 'DT'), ('new', 'JJ'), ('premier', 'NN'), ('of', 'IN'), ('Quebec', 'NNP'), ('.', '.')]
[('This', 'DT'), ('is', 'VBZ'), ('possible', 'JJ'), ('as', 'IN'), ('his', 'PRP$'), ('party', 'NN'), ('defeated', 'VBD'), ('the', 'DT'), ('Liberals', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('Provincial', 'NNP'), ('elections', 'NNS'), ('held', 'VBD'), ('on', 'IN'), ('October', 'NNP'), ('1st', 'CD'), ('2018', 'CD'), ('.', '.')]

Error:

Traceback (most recent call last):
  File "C:/Users/Robin Karlose/PycharmProjects/NLTK Test 1/Code 5 - NER test.py", line 71, in <module>
    sent3=nltk.ne_chunk(sent2)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\__init__.py", line 177, in ne_chunk
    return chunker.parse(tagged_tokens)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 123, in parse
    tagged = self._tagger.tag(tokens)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 63, in tag
    tags.append(self.tag_one(tokens, i, tags))
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 83, in tag_one
    tag = tagger.choose_tag(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 632, in choose_tag
    featureset = self.feature_detector(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 680, in feature_detector
    return self._feature_detector(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 56, in _feature_detector
    pos = simplify_pos(tokens[index][1])
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 186, in simplify_pos
    if s.startswith('V'): return "V"
AttributeError: 'tuple' object has no attribute 'startswith'

Interestingly, NER works fine when I run this code:

import nltk
import nltk.corpus

sent = nltk.corpus.treebank.tagged_sents()[22]
print(sent)
print(nltk.ne_chunk(sent))

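As far as I understand, in both cases I am passing POS-tagged text to the NLTK named entity recognition function (i.e. nltk.ne_chunk()), but for the life of me I cannot see why the first case fails with this traceback.

If anyone could provide some insight into this, I would greatly appreciate it!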

Tags: python, nlp, nltk

Solution
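
The traceback suggests a mismatch in input shape. nltk.ne_chunk() expects one POS-tagged sentence, that is, a flat list of (word, tag) tuples, which is what treebank.tagged_sents()[22] returns. preprocess_pipe1() returns a list of such sentences, so tokens[index] is a whole sentence and tokens[index][1] is a (word, tag) tuple rather than a tag string, hence 'tuple' object has no attribute 'startswith'. One possible fix is to chunk each sentence separately. The sketch below illustrates the idea; the helper names is_tagged_sentence and chunk_sentences are hypothetical and not part of NLTK, and the chunker is passed in as a parameter so the shape logic can be tried without any NLTK models installed.

```python
def is_tagged_sentence(obj):
    """Return True if obj looks like ONE POS-tagged sentence:
    a sequence whose items are all (word, tag) tuples of strings."""
    return all(
        isinstance(tok, tuple)
        and len(tok) == 2
        and isinstance(tok[0], str)
        and isinstance(tok[1], str)
        for tok in obj
    )


def chunk_sentences(tagged, chunker):
    """Apply chunker to each tagged sentence and return a list of results.

    tagged may be a single tagged sentence or a list of tagged sentences.
    chunker is any callable that accepts one tagged sentence
    (for example nltk.ne_chunk).
    """
    if not tagged:
        return []
    if is_tagged_sentence(tagged):
        # A single sentence was passed: wrap the result in a list.
        return [chunker(tagged)]
    for sent in tagged:
        if not is_tagged_sentence(sent):
            raise TypeError("expected a list of (word, tag) sentences")
    return [chunker(sent) for sent in tagged]
```

With the code from the question, this would be called as chunk_sentences(sent2, nltk.ne_chunk), followed by a loop that prints each resulting tree. NLTK also ships nltk.ne_chunk_sents(), which takes a list of tagged sentences directly, and a plain list comprehension such as [nltk.ne_chunk(s) for s in sent2] should work just as well.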

