Spacy 3.0 Matcher: removing overlaps while keeping information about the pattern used

Question

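Is there a shorter, cleaner, or built-in way to remove overlapping results from the Matcher while keeping the ID of the pattern that produced each match, so that you know which pattern matched? The pattern ID is part of the original matcher results, but the overlap-removal solutions I have seen drop it.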

This is what I currently use as a solution. It works, but it is a bit long:

import spacy
from spacy.matcher import Matcher

# The patterns below rely on POS tags, so a pipeline with a tagger is needed.
# The model name is an assumption; any English pipeline with a tagger should work.
nlp = spacy.load("en_core_web_sm")

text = "United States vs Canada, Canada vs United States, United States vs United Kingdom, Mark Jefferson vs College, Clown vs Jack Cadwell Jr., South America Snakes vs Lopp, United States of America, People vs Jack Spicer"

doc = nlp(text)

# Matcher
matcher = Matcher(nlp.vocab)
# Two patterns
pattern1 = [{"POS": "PROPN", "OP": "+", "IS_TITLE": True}, {"TEXT": {"REGEX": "vs$"}}, {"POS": "PROPN", "OP": "+", "IS_TITLE": True}]
pattern2 = [{"POS": "ADP"}, {"POS": "PROPN", "IS_TITLE": True}]
matcher.add("Games", [pattern1])
matcher.add("States", [pattern2])

# Output is a list of tuples: (match ID, start token index, end token index)
matches = matcher(doc)

First, I store the results in a dictionary, with the pattern name as the key and a list of (start, end) tuples as the value:

result = {}
for match_id, start, end in matches:
    # nlp.vocab.strings maps the hashed match ID back to the pattern name
    result.setdefault(nlp.vocab.strings[match_id], []).append((start, end))
print(result)

This prints:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],
 'Games': [(1, 4), (0, 4), (5, 8), (5, 9), (11, 14), (10, 14), (11, 15), (10, 15), (17, 20),
  (16, 20), (21, 24), (21, 25), (21, 26), (38, 41), (38, 42)]}

Then I iterate over the results and use filter_spans to remove the overlaps, appending each start and end as a tuple:

for key, value in result.items():
    new_vals = [doc[start:end] for start, end in value]
    val2 =[]
    for span in spacy.util.filter_spans(new_vals):
        val2.append((span.start, span.end))
    result[key]=val2

print(result)

This prints the results without overlaps:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],
 'Games': [(0, 4), (5, 9), (10, 15), (16, 20), (21, 26), (38, 42)]}

To get the text values, just loop over each pattern's results and print the spans:

print("---Games---")
for start, end in result['Games']:
    span = doc[start:end]
    print(span.text)

print(" ")

print("---States---")
for start, end in result['States']:
    span = doc[start:end]
    print(span.text)

Output:

---Games---
United States vs Canada
Canada vs United States
United States vs United Kingdom
Mark Jefferson vs College
Clown vs Jack Cadwell Jr.
People vs Jack Spicer
 
---States---
vs Canada
vs United
vs United
vs College
vs Jack
vs Lopp
of America
vs Jack

Tags: python-3.x, nlp, pattern-matching, spacy

Answer


In your processing, you can create new spans that keep the label, instead of using doc[start:end], which does not include the label:

from spacy.tokens import Span
span = Span(doc, start, end, label=match_id)
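
To show where start, end and match_id come from, here is a minimal sketch of that approach in a full loop. It assumes a blank English pipeline and two toy patterns; the labels "A" and "B" and the sample text are illustrative and not taken from the question.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank pipeline is enough here because the toy patterns
# only look at the token text (ORTH), not at POS tags.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])

doc = nlp("a a a a b")

# Build one labelled Span per match; the label carries the match ID.
labelled = [
    Span(doc, start, end, label=match_id)
    for match_id, start, end in matcher(doc)
]

# filter_spans drops overlaps but leaves each surviving span's label intact.
kept = [(span.label_, span.start, span.end) for span in filter_spans(labelled)]
print(kept)
```

Each tuple in kept holds the pattern name first, so the information about which pattern matched survives the overlap removal.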

With spaCy v3.0+, this is easier using the Matcher option as_spans:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])

matched_spans = matcher(nlp("a a a a b"), as_spans=True)
for span in spacy.util.filter_spans(matched_spans):
    print(span.label_, ":", span.text)
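
Note that calling filter_spans on all spans at once also resolves overlaps between different patterns, whereas the code in the question filters each pattern separately. If the per-pattern behaviour is what you want, one possible sketch is to group the spans by label first. The extra pattern "AB" is hypothetical and only there to create an overlap across patterns.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
# Hypothetical second pattern that overlaps with the matches of "A".
matcher.add("AB", [[{"ORTH": "a"}, {"ORTH": "b"}]])

doc = nlp("a a a a b")

# Group the labelled spans by pattern name.
spans_by_label = defaultdict(list)
for span in matcher(doc, as_spans=True):
    spans_by_label[span.label_].append(span)

# Remove overlaps within each pattern only, keeping (start, end) tuples
# in the same shape as the dictionary built in the question.
result = {
    label: [(span.start, span.end) for span in filter_spans(spans)]
    for label, spans in spans_by_label.items()
}
print(result)
```

Here the "AB" match at tokens 3 to 5 is kept even though it overlaps the longer "A" match, because the two patterns are filtered independently.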
