Spacy ValueError: [E103] Trying to set conflicting doc.ents

Problem description

I am following a SpaCy tutorial to extract spans and overwrite doc.ents with those spans, as shown below:

import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_md')

COUNTRIES = ['Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua']
matcher = PhraseMatcher(nlp.vocab)         # initialises the PhraseMatcher
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

text = 'After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait.'

doc = nlp(text)
for match_id, start, end in matcher(doc):         # Iterate over the matches
    span = Span(doc, start, end, label='GPE')     # Create a Span with the label for "GPE"  
    doc.ents = list(doc.ents) + [span]            # Overwrite the doc.ents and add the span

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

However, the line doc.ents = list(doc.ents) + [span] raises the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-141-896d7076e05e> in <module>
      3 for match_id, start, end in matcher(doc):         # Iterate over the matches
      4     span = Span(doc, start, end)     # Create a Span with the label for "GPE"
----> 5     doc.ents = list(doc.ents) + [span]            # Overwrite the doc.ents and add the span
      6 
      7 # Print the entities in the document

doc.pyx in spacy.tokens.doc.Doc.ents.__set__()

ValueError: [E103] Trying to set conflicting doc.ents: '(74, 75, 'GPE')' and '(74, 75, '')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

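The error makes no sense to me: the entry Namibia in doc.ents has the label GPE, and the span Namibia also has the label GPE, so the two are consistent and should not conflict as the error claims. Does anyone know why I cannot add the two lists (list(doc.ents) and [span]) together? Thanks in advance.

Tags: python, spacy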

Solution

The following code works on Python 3 with Spacy 2.2.4 and 2.1.0:

Instead of using:

nlp = spacy.load('en_core_web_sm')

use:

from spacy.lang.en import English
nlp = English()

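The former raises the error. The latter is a blank pipeline with no NER component, so doc.ents starts out empty and the matched spans have nothing to conflict with. With it, we get the following output: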

[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
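If the pretrained model's own entities should be kept, a different approach is to drop only the existing entities that share tokens with the new span before assigning doc.ents. The sketch below is an illustration under stated assumptions, not the answerer's method: it uses a blank pipeline so that it runs without a downloaded model, it simulates a pre-existing entity by hand (the 'LOC' label is made up for the demo), and add_entity is a hypothetical helper name. It also shows that the conflict comes from two spans covering the same token, whatever their labels are.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; stands in for a loaded model
doc = nlp("The UN launched a mission in Namibia.")

# Locate the token and simulate an entity that an NER component
# might already have set (assumption for illustration only).
idx = [t.i for t in doc if t.text == "Namibia"][0]
doc.ents = [Span(doc, idx, idx + 1, label="LOC")]

new_span = Span(doc, idx, idx + 1, label="GPE")

# Naive approach from the question: two spans now cover the same token,
# so spaCy refuses the assignment.
raised = False
try:
    doc.ents = list(doc.ents) + [new_span]
except ValueError:
    raised = True


def add_entity(doc, span):
    """Hypothetical helper: replace entities overlapping `span` with `span`."""
    kept = [e for e in doc.ents if e.end <= span.start or e.start >= span.end]
    doc.ents = sorted(kept + [span], key=lambda s: s.start)


add_entity(doc, new_span)
result = [(ent.text, ent.label_) for ent in doc.ents]
print(raised, result)
```

In the loop from the question, calling add_entity(doc, span) in place of the doc.ents assignment would let the matcher's spans override the model's overlapping predictions while leaving the other entities untouched.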
