python - spaCy ValueError: [E103] Trying to set conflicting doc.ents
Question
I am following a spaCy tutorial to extract spans and overwrite doc.ents with them, as shown below:
import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_md')
COUNTRIES = ['Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua']
matcher = PhraseMatcher(nlp.vocab) # initialises the PhraseMatcher
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)
text = 'After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait.'
doc = nlp(text)
for match_id, start, end in matcher(doc): # Iterate over the matches
    span = Span(doc, start, end, label='GPE') # Create a Span with the label for "GPE"
    doc.ents = list(doc.ents) + [span] # Overwrite the doc.ents and add the span
# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])
However, the line doc.ents = list(doc.ents) + [span] raises the following error:
ValueError Traceback (most recent call last)
<ipython-input-141-896d7076e05e> in <module>
3 for match_id, start, end in matcher(doc): # Iterate over the matches
4 span = Span(doc, start, end) # Create a Span with the label for "GPE"
----> 5 doc.ents = list(doc.ents) + [span] # Overwrite the doc.ents and add the span
6
7 # Print the entities in the document
doc.pyx in spacy.tokens.doc.Doc.ents.__set__()
ValueError: [E103] Trying to set conflicting doc.ents: '(74, 75, 'GPE')' and '(74, 75, '')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
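The error makes no sense to me: the Namibia entry in doc.ents has the label GPE, and the Namibia span also has the label GPE, so the two are consistent and should not conflict the way the error claims. Does anyone know why I can't add the two lists (list(doc.ents) and [span]) together? Thanks in advance.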
Answer
The following works on Python 3 with spaCy 2.2.4 and 2.1.0.
Instead of using:
nlp = spacy.load('en_core_web_sm')
use:
from spacy.lang.en import English
nlp = English()
The former raises the error, whereas with the latter we get the following output:
[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
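A likely reason the blank English() pipeline avoids the error is that it has no named entity recognizer, so doc.ents starts out empty and the matcher's spans have nothing to overlap with. With a pretrained model, Namibia is already an entity, and as the error message says, a token can only be part of one entity. If the pretrained entities are still wanted, one option is to drop any existing entity that shares a token with a matched span before assigning doc.ents. The sketch below shows only that overlap logic in plain Python. FakeSpan, overlaps and merge_entities are hypothetical names introduced for illustration, not part of spaCy, and the sketch assumes the new spans do not overlap each other.

```python
from collections import namedtuple

# Hypothetical stand-in for spacy.tokens.Span; only .start and .end are used below.
FakeSpan = namedtuple("FakeSpan", ["start", "end", "label_"])


def overlaps(a, b):
    """Return True if two half-open token ranges [start, end) share a token."""
    return a.start < b.end and b.start < a.end


def merge_entities(existing, new_spans):
    """Keep every new span, plus the existing entities that overlap none of them.

    Assumes the spans in new_spans do not overlap each other.
    """
    new_spans = list(new_spans)
    kept = [ent for ent in existing
            if not any(overlaps(ent, span) for span in new_spans)]
    return sorted(kept + new_spans, key=lambda s: s.start)


# Token offsets (74, 75) mirror the ones in the error message above;
# the EVENT entity is made up for the example.
existing = [FakeSpan(10, 12, "EVENT"), FakeSpan(74, 75, "GPE")]
new_spans = [FakeSpan(74, 75, "GPE")]
merged = merge_entities(existing, new_spans)
print(merged)
```

Because the helper only reads .start and .end, it should also accept real Span objects: collect the matcher's spans in a list inside the loop, then assign doc.ents = merge_entities(doc.ents, spans) once after the loop instead of reassigning doc.ents on every match.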