PhraseMatcher to match in a different token attribute

Problem description

We would like to match a set of phrases using PhraseMatcher. However, we would like to match not only on the verbatim text but on a normalized version of the input: for instance, lower-cased, with accents removed, and so on.

We have tried to add a custom attribute to the Token and use it in the init of the PhraseMatcher, but it did not work.

We could transform the text with a custom pipeline component, but we want to keep the original text so we can still use other spaCy components.

from spacy.tokens import Token
from spacy.pipeline import EntityRuler

# nlp is an already-loaded spaCy pipeline


def deaccent(text):
    ...
    return modified_text


def get_normalization(token):
    # Getter for the custom attribute: returns the de-accented token text
    return deaccent(token.text)


Token.set_extension('get_norm', getter=get_normalization)


patterns_ = [{"label": "TECH", "pattern": "java"}]
# Try to make the EntityRuler's PhraseMatcher use the custom attribute
ruler = EntityRuler(nlp, phrase_matcher_attr="get_norm")
ruler.add_patterns(patterns_)

nlp.add_pipe(ruler)

What is the way to do this?

Tags: nlp, spacy

Solution


Since EntityRuler is based on PhraseMatcher, I am reproducing here a working example with spaCy v2.2.0. Follow the comments to see how the built-in "NORM" attribute of the tokens is used.

At the end you can see how the word "FÁCIL" matches the pattern "facil", because it has been normalized.

import re
import spacy 
from unicodedata import normalize
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.lang.es import Spanish

# Custom pipeline component that overwrites the built-in "norm" attribute of each token
class Deaccentuate(object):
    def __init__(self, nlp):
        self._nlp = nlp

    def __call__(self, doc):
        for token in doc:
            token.norm_ = self.deaccent(token.lower_)  # write norm_ attribute!
        return doc

    @staticmethod
    def deaccent(text):
        """ Remove accentuation from the given string """
        text = re.sub(
            r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
            normalize("NFD", text), 0, re.I
        )
        return normalize("NFC", text)


nlp = Spanish()
# Add component to pipeline
custom_component = Deaccentuate(nlp)
nlp.add_pipe(custom_component, first=True, name='normalizer')
# Initialize the matcher with the patterns to be matched
matcher = PhraseMatcher(nlp.vocab, attr="NORM")  # match on the token's NORM attribute
patterns_ = nlp.pipe(['facil', 'dificil'])
matcher.add('MY_ENTITY', None, *patterns_)

# Run an example and print results
doc = nlp("esto es un ejemplo FÁCIL")
matches = matcher(doc)
for match_id, start, end in matches:
    span = Span(doc, start, end, label=match_id)
    print("MATCHED: " + span.text)

This bug was fixed in release v2.1.8: https://github.com/explosion/spaCy/issues/4002
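
For completeness, here is a minimal sketch (not part of the original answer) of how the same idea could be wired through EntityRuler, as attempted in the question: a component writes the normalized form into the built-in NORM attribute, and EntityRuler is told to match on it via phrase_matcher_attr="NORM". It assumes spaCy v2.2; the strip_accents helper, the normalizer function, and the "TECH" label are illustrative, and this simplified helper, unlike the component above, does not preserve "ñ".

import unicodedata

from spacy.lang.es import Spanish
from spacy.pipeline import EntityRuler


def strip_accents(text):
    # Simplified normalization: decompose and drop combining marks
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(c)
    )


def normalizer(doc):
    # Overwrite the built-in NORM attribute of every token
    for token in doc:
        token.norm_ = strip_accents(token.lower_)
    return doc


nlp = Spanish()
nlp.add_pipe(normalizer, first=True, name="normalizer")

# The PhraseMatcher inside EntityRuler matches on NORM instead of the verbatim text
ruler = EntityRuler(nlp, phrase_matcher_attr="NORM")
ruler.add_patterns([{"label": "TECH", "pattern": "facil"}])
nlp.add_pipe(ruler)

doc = nlp("esto es un ejemplo FÁCIL")
print([(ent.text, ent.label_) for ent in doc.ents])  # expected: [('FÁCIL', 'TECH')]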
