nlp - PhraseMatcher: matching on a different token attribute
Problem description
We would like to match a set of phrases using PhraseMatcher. However, we would like to match not only the verbatim text but also a normalized version of the input: for instance, lowercased, with accents removed, etc.
We tried adding a custom attribute to the Token and passing it to the PhraseMatcher's init so it would match on it, but that did not work.
We could transform the text with a custom pipeline, but we want to keep the original text so we can still use spaCy's other components.
def deaccent(text):
    ...
    return modified_text

def get_normalization(doc):
    return deaccent(doc.text)

Token.set_extension('get_norm', getter=get_normalization)

patterns_ = [{"label": "TECH", "pattern": "java"}]
ruler = EntityRuler(nlp, phrase_matcher_attr="get_norm")
ruler.add_patterns(patterns_)
nlp.add_pipe(ruler)
What is the way to do this?
Solution
Since the EntityRuler is based on the PhraseMatcher, here is a working example with spaCy v2.2.0. Follow the comments to see how the "NORM" attribute of the tokens is used.
At the end, you can see how the word "FÁCIL" matches the pattern "facil", because it has been normalized.
import re
import spacy
from unicodedata import normalize
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.lang.es import Spanish


# Custom pipeline component that overwrites the "norm" attribute of each token
class Deaccentuate(object):
    def __init__(self, nlp):
        self._nlp = nlp

    def __call__(self, doc):
        for token in doc:
            token.norm_ = self.deaccent(token.lower_)  # write the norm_ attribute!
        return doc

    @staticmethod
    def deaccent(text):
        """Remove accentuation from the given string"""
        text = re.sub(
            r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
            r"\1",
            normalize("NFD", text), 0, re.I
        )
        return normalize("NFC", text)


nlp = Spanish()

# Add the component to the pipeline
custom_component = Deaccentuate(nlp)
nlp.add_pipe(custom_component, first=True, name='normalizer')

# Initialize the matcher with the patterns to be matched
matcher = PhraseMatcher(nlp.vocab, attr="NORM")  # match on the NORM token attribute
patterns_ = nlp.pipe(['facil', 'dificil'])
matcher.add('MY_ENTITY', None, *patterns_)

# Run an example and print the results
doc = nlp("esto es un ejemplo FÁCIL")
matches = matcher(doc)
for match_id, start, end in matches:
    span = Span(doc, start, end, label=match_id)
    print("MATCHED: " + span.text)
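The deaccent regex above is dense. As a stdlib-only illustration of what it does, here is a hypothetical helper (`deaccent_simple`, not part of the original answer) built on `unicodedata.combining`: NFD decomposition splits each accented character into a base character plus combining marks, and we drop the marks except for the tilde that forms "ñ"/"Ñ", mirroring the regex's `n` exception.

```python
from unicodedata import combining, normalize

def deaccent_simple(text):
    # NFD splits each accented character into base char + combining marks
    decomposed = normalize("NFD", text)
    kept = []
    prev = ""
    for ch in decomposed:
        # drop combining marks, except the tilde that forms "ñ" / "Ñ"
        if combining(ch) and not (prev in ("n", "N") and ch == "\u0303"):
            continue
        kept.append(ch)
        prev = ch
    return normalize("NFC", "".join(kept))

print(deaccent_simple("FÁCIL".lower()))      # facil
print(deaccent_simple("está bien, señor"))   # esta bien, señor
```

Note how the "ñ" in "señor" survives while the accent on "está" is stripped, which is exactly the behavior the pipeline component relies on.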
This bug was fixed in v2.1.8: https://github.com/explosion/spaCy/issues/4002
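For spaCy v3.x the API changed: components are registered by a string name and the EntityRuler is added via `nlp.add_pipe`. A sketch of the same approach under v3 (the component name "normalizer" and the example pattern are illustrative, not from the original answer):

```python
import re
from unicodedata import normalize

import spacy  # assumes spaCy v3.x
from spacy.language import Language

def deaccent(text):
    # strip combining accents after NFD decomposition, keeping "ñ"
    text = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
        r"\1",
        normalize("NFD", text), 0, re.I,
    )
    return normalize("NFC", text)

@Language.component("normalizer")
def normalizer(doc):
    for token in doc:
        token.norm_ = deaccent(token.lower_)  # overwrite the NORM attribute
    return doc

nlp = spacy.blank("es")
nlp.add_pipe("normalizer", first=True)
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "NORM"})
ruler.add_patterns([{"label": "TECH", "pattern": "facil"}])

doc = nlp("esto es un ejemplo FÁCIL")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In v3 the ruler runs string phrase patterns through the pipeline components that precede it, so the pattern "facil" is compared against the normalized form of "FÁCIL".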