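python-3.x - spaCy 3.0 Matcher: removing overlaps while keeping information about the pattern used
Question
Is there a shorter, cleaner, or built-in way to remove overlapping results from a Matcher while keeping the name of the pattern that produced each match, so that you can tell which pattern matched? The matcher results include the pattern ID, but the overlap-removal solutions I have seen discard it.
This is what I currently use. It works, but it is a bit long: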
import spacy
from spacy.matcher import Matcher
# Assumption: the original snippet never defines nlp. The patterns use POS tags,
# so any English pipeline with a tagger should work; en_core_web_sm is used here.
nlp = spacy.load("en_core_web_sm")
text = "United States vs Canada, Canada vs United States, United States vs United Kingdom, Mark Jefferson vs College, Clown vs Jack Cadwell Jr., South America Snakes vs Lopp, United States of America, People vs Jack Spicer"
doc = nlp(text)
# Matcher
matcher = Matcher(nlp.vocab)
# Two patterns
pattern1 = [{"POS": "PROPN", "OP": "+", "IS_TITLE": True}, {"TEXT": {"REGEX": "vs$"}}, {"POS": "PROPN", "OP": "+", "IS_TITLE": True}]
pattern2 = [{"POS": "ADP"}, {"POS": "PROPN", "IS_TITLE": True}]
matcher.add("Games", [pattern1])
matcher.add("States", [pattern2])
# Output is a list of tuples: (match ID, i.e. the hash of the pattern name, start token, end token)
matches = matcher(doc)
First, I store the results in a dictionary, with the pattern name as the key and a list of (start, end) tuples as the value:
result = {}
for match_id, start, end in matches:
    # nlp.vocab.strings maps the hash back to the pattern name
    result.setdefault(nlp.vocab.strings[match_id], []).append((start, end))
print(result)
This prints:
{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],
'Games': [(1, 4), (0, 4), (5, 8), (5, 9), (11, 14), (10, 14), (11, 15), (10, 15), (17, 20),
(16, 20), (21, 24), (21, 25), (21, 26), (38, 41), (38, 42)]}
Then I iterate over the results and use filter_spans to remove the overlaps, storing the start and end of each remaining span as a tuple:
for key, value in result.items():
    # Turn the (start, end) tuples back into Span objects for filter_spans
    new_vals = [doc[start:end] for start, end in value]
    val2 = []
    for span in spacy.util.filter_spans(new_vals):
        val2.append((span.start, span.end))
    result[key] = val2
print(result)
This prints the results without overlaps:
{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],
'Games': [(0, 4), (5, 9), (10, 15), (16, 20), (21, 26), (38, 42)]}
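For reference, the rule that filter_spans is documented to apply (prefer the longest span, and the first one among equally long spans) can be sketched in plain Python on (start, end, label) tuples. The helper name filter_labeled is hypothetical and is not part of spaCy; the sketch only illustrates that the rule itself does not require discarding the pattern name. The input is the first part of the 'Games' list shown above.

```python
def filter_labeled(spans):
    """Keep the longest non-overlapping (start, end, label) tuples.

    Hypothetical helper, not a spaCy function: it mimics the documented
    behaviour of spacy.util.filter_spans on plain tuples.
    """
    # Longest spans first; among equally long spans, the earlier start first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end, label in ordered:
        # Skip the span if any of its tokens is already covered.
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end, label))
            seen.update(range(start, end))
    return sorted(kept)

games = [(1, 4), (0, 4), (5, 8), (5, 9), (11, 14), (10, 14), (11, 15), (10, 15)]
filtered = filter_labeled([(s, e, "Games") for s, e in games])
print(filtered)
# [(0, 4, 'Games'), (5, 9, 'Games'), (10, 15, 'Games')]
```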
To get the text, loop over each pattern's results and print the spans:
print("---Games---")
for start, end in result['Games']:
    span = doc[start:end]
    print(span.text)
print(" ")
print("---States---")
for start, end in result['States']:
    span = doc[start:end]
    print(span.text)
Output:
---Games---
United States vs Canada
Canada vs United States
United States vs United Kingdom
Mark Jefferson vs College
Clown vs Jack Cadwell Jr.
People vs Jack Spicer
---States---
vs Canada
vs United
vs United
vs College
vs Jack
vs Lopp
of America
vs Jack
Answer
In your processing, you can create new spans that keep the label, instead of using doc[start:end], which does not include the label:
from spacy.tokens import Span
span = Span(doc, start, end, label=match_id)
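As a minimal sketch of how that fragment might fit into a full run: the example below uses a blank English pipeline, so the patterns rely on IS_TITLE and ORTH rather than the POS tags from the question. The "Conj" pattern and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank pipeline has no tagger, so IS_TITLE/ORTH stand in
# for the POS-based patterns used in the question.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Games", [[{"IS_TITLE": True, "OP": "+"}, {"ORTH": "vs"}, {"IS_TITLE": True, "OP": "+"}]])
matcher.add("Conj", [[{"ORTH": "and"}]])  # illustrative second pattern

doc = nlp("United States vs Canada and Mark Jefferson vs College")

# Build labeled spans from the (match_id, start, end) tuples.
labeled_spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# filter_spans removes the overlaps; span.label_ still names the pattern.
kept = [(span.label_, span.text) for span in filter_spans(labeled_spans)]
print(kept)
# [('Games', 'United States vs Canada'), ('Conj', 'and'), ('Games', 'Mark Jefferson vs College')]
```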
With spaCy v3.0+ this is easier using the matcher option as_spans:
import spacy
from spacy.matcher import Matcher
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])
matched_spans = matcher(nlp("a a a a b"), as_spans=True)
for span in spacy.util.filter_spans(matched_spans):
    print(span.label_, ":", span.text)
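One difference from the code in the question is worth noting: calling filter_spans on all matches at once also removes overlaps between different patterns, whereas the question filters each pattern separately. If the per-pattern dictionary is what you want, a possible sketch is to group the spans by label first and filter each group. The patterns here are simplified stand-ins (IS_TITLE and ORTH instead of POS) so that a blank pipeline is enough.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: simplified stand-ins for the question's POS-based patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Games", [[{"IS_TITLE": True, "OP": "+"}, {"ORTH": "vs"}, {"IS_TITLE": True, "OP": "+"}]])
matcher.add("States", [[{"ORTH": "vs"}, {"IS_TITLE": True}]])

doc = nlp("United States vs Canada")
spans = matcher(doc, as_spans=True)

# Filtering everything together: "vs Canada" loses to the longer Games span.
global_result = [(s.label_, s.start, s.end) for s in filter_spans(spans)]

# Filtering per pattern: group by label, then filter each group on its own.
by_label = defaultdict(list)
for span in spans:
    by_label[span.label_].append(span)
per_pattern = {
    label: [(s.start, s.end) for s in filter_spans(group)]
    for label, group in by_label.items()
}

print(global_result)  # [('Games', 0, 4)]
print(per_pattern)    # 'Games' -> [(0, 4)], 'States' -> [(2, 4)]
```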