Matcher is returning some duplicates

Problem description

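I want the output ["good customer service", "great ambience"], but I am getting ["good customer", "good customer service", "great ambience"], because the pattern also matches "good customer", a phrase that makes no sense on its own. How can I remove these duplicates?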

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)

# Create a pattern matching an adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]

# spaCy v3 signature; in spaCy v2 this was matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matcher.add("ADJ_NOUN_PATTERN", [pattern])

matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Tags: python, python-3.x, nlp, spacy, matcher

Solution


spaCy has a built-in function that does this: see filter_spans in spacy.util.

The documentation says:

When spans overlap, the (first) longest span is preferred over shorter spans.
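
To make that rule concrete, here is a rough pure-Python sketch of the same idea on plain (start, end) token offsets. The helper name keep_longest is hypothetical and this is not spaCy's actual implementation, only an illustration of "longest first, earliest start on ties".

```python
def keep_longest(spans):
    """Keep non-overlapping (start, end) spans, preferring longer ones.

    Hypothetical helper for illustration; `end` is exclusive, as in doc[start:end].
    """
    # Longest spans first; among equal lengths, the one starting earliest.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, used = [], set()
    for start, end in ordered:
        # Skip any span that shares a token with one already kept.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# Offsets corresponding to the three matches from the question.
result = keep_longest([(0, 2), (0, 3), (4, 6)])
print(result)
```

Here (0, 2) shares tokens with the longer (0, 3), so it is dropped, while (4, 6) overlaps nothing and is kept.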

Example:

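from spacy.util import filter_spans

doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
# doc[0:4] overlaps both shorter spans, so it is the only one kept
filtered = filter_spans(spans)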
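
Applied to the matcher from the question, a minimal sketch might look like the following. It assumes spaCy v3. So that it runs without downloading a model, it builds the Doc with hand-assigned POS tags, which is an assumption standing in for what en_core_web_sm would predict; with the real pipeline, doc = nlp(...) from the question takes its place.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

vocab = Vocab()

# Assumption: POS tags are set by hand here instead of being predicted
# by a trained pipeline such as en_core_web_sm.
words = ["good", "customer", "service", "and", "great", "ambience"]
pos = ["ADJ", "NOUN", "NOUN", "CCONJ", "ADJ", "NOUN"]
doc = Doc(vocab, words=words, pos=pos)

matcher = Matcher(vocab)
# spaCy v3 signature: a list of patterns is passed as the second argument.
matcher.add("ADJ_NOUN_PATTERN", [[{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]])

# The matcher yields (match_id, start, end) tuples, while filter_spans
# expects Span objects, so convert the matches first.
spans = [doc[start:end] for _, start, end in matcher(doc)]
phrases = [span.text for span in filter_spans(spans)]
print("Matches:", phrases)
```

The only extra step compared with the question's code is turning each match into a Span before filtering.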
