Working with Span objects [spaCy, Python]

Problem description

I'm not sure whether this is a really dumb question, but here goes.

text_corpus = '''Insurance bosses plead guilty\n\nAnother three US insurance executives have pleaded guilty to fraud charges stemming from an ongoing investigation into industry malpractice.\n\nTwo executives from American International Group (AIG) and one from Marsh & McLennan were the latest. The investigation by New York attorney general Eliot Spitzer has now obtained nine guilty pleas. The highest ranking executive pleading guilty on Tuesday was former Marsh senior vice president Joshua Bewlay.\n\nHe admitted one felony count of scheming to defraud and faces up to four years in prison. A Marsh spokeswoman said Mr Bewlay was no longer with the company. Mr Spitzer\'s investigation of the US insurance industry looked at whether companies rigged bids and fixed prices. Last month Marsh agreed to pay $850m (£415m) to settle a lawsuit filed by Mr Spitzer, but under the settlement it "neither admits nor denies the allegations".\n'''

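import spacy

def get_entities(document_text, model):
    # Run the pipeline, then keep only the entity types of interest.
    # en_core_web_sm labels people as "PERSON" (not "PER").
    analyzed_doc = model(document_text)
    entities = [entity for entity in analyzed_doc.ents if entity.label_ in ["PERSON", "ORG", "LOC", "GPE"]]
    return entities

# Each call to get_entities() parses the text again and builds a new Doc.
model = spacy.load("en_core_web_sm")
entities_1 = get_entities(text_corpus, model)
entities_2 = get_entities(text_corpus, model)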

But when I run the following,

entities_1[0] in entities_2

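the output is False.

Why is that? The objects in the two entity lists look identical, yet an item from one list is not found in the other. That seems very odd. Can someone explain why this happens?

Tags: python, spacy

Solution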


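This is due to the way ents are represented in spaCy. They are instances of a class with its own implementation, so even entities_2[0] == entities_1[0] evaluates to False. From the looks of it, the Span class does not implement __eq__, which, at least at first glance, is the simple reason.

If you print the value of entities_2[0], it gives you US, but that is only because the Span class implements a __repr__ method in the same file. If you want to do a boolean comparison, one way is to use the text attribute of Span and do something like this: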

entities_1[0].text in [e.text for e in entities_2]
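Comparing on text alone treats every mention of the same string as equal. If the label and position should also match, one option is to compare on a tuple of attributes. The sketch below is only an illustration: FakeSpan is a hypothetical stand-in (not part of spaCy) so the snippet runs without a model, span_key and contains_entity are made-up helper names, and the offsets are illustrative. It assumes only the text, label_, start_char and end_char attributes, which real Span objects expose, so the helpers should accept the spans returned by get_entities unchanged.

```python
from dataclasses import dataclass


# eq=False keeps Python's default identity-based ==, which mimics the
# behaviour observed in the question for spans from two separate parses.
@dataclass(eq=False)
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span; not part of spaCy."""
    text: str
    label_: str
    start_char: int
    end_char: int


def span_key(span):
    """Build a hashable, value-based key from attributes a Span exposes."""
    return (span.text, span.label_, span.start_char, span.end_char)


def contains_entity(entity, entities):
    """Return True if any item in `entities` has the same key as `entity`."""
    return span_key(entity) in {span_key(e) for e in entities}


# Two separately built lists, mimicking two calls to get_entities().
entities_1 = [FakeSpan("US", "GPE", 45, 47)]
entities_2 = [FakeSpan("US", "GPE", 45, 47)]

print(entities_1[0] in entities_2)                 # False: different objects
print(contains_entity(entities_1[0], entities_2))  # True: same key
```

Because the keys are plain tuples, they can also be stored in a set or used as dictionary keys when the two entity lists need to be deduplicated or intersected.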

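Edit:

As @abb pointed out, Span implements __richcmp__, but that only applies when comparing against the same Span (in practice, spans taken from the same parsed document), because it checks the positions of the tokens themselves.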
