How do I refer to text after extracting it with the spaCy Matcher?

Question

I use the spaCy Matcher to extract keywords.

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab, validate=True)

patterns = [{"LOWER": "self"}, {'IS_PUNCT': True, 'OP':'*'}, {"LOWER": "employed"}]
patterns1 = [{'LOWER': 'finance'}]
patterns2 = [{'LOWER': 'accounting'}]
    
matcher.add("Experience", None, patterns)
matcher.add("CFA", None, patterns1)
matcher.add("CPA", None, patterns2)
    
text = """ I am a self employed working in a remote factory. However, I study finance and accounting by myself in
my spare time."""

doc = nlp(text)
matches = matcher(doc)

Afterwards, I created a data frame containing all the keywords:

L=[]
M=[]
for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
        span = doc[start : end]  # get the matched slice of the doc
        L.append(rule_id)
        M.append(span.text)

import pandas as pd
df = pd.DataFrame(
    {'Keywords': L,
     'Profession': M,})
print(df)

#Output
     Keywords     Profession
0  Experience  self employed
1         CFA        finance
2         CPA     accounting

Next, I want to build a subset data frame containing only the rows where the profession is self-employed.

#Output
     Keywords     Profession
0  Experience  self employed

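If I hard-code the comparison, I have to adjust it every time depending on the extracted text. For example, the matched text could be any of the spellings the pattern allows, such as "self employed" or "self-employed".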

I would appreciate any ideas. Thanks.

Tags: python, nlp, pattern-matching

Answer


In your case, making the IS_PUNCT token optional should do it:

patterns = [{"LOWER": "self"}, {'IS_PUNCT': True, 'OP':'?'}, {"LOWER": "employed"}]
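
To see what the optional punctuation token buys you, here is a minimal sketch that runs the pattern over a few spellings. It assumes spaCy v3, where Matcher.add takes a list of patterns, and it uses a blank English pipeline, which is enough here because LOWER and IS_PUNCT are lexical attributes. The sample sentences are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes; no trained model is needed for LOWER / IS_PUNCT.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "self"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "employed"}]
# Assumption: spaCy v3 signature. In spaCy v2 the call was
# matcher.add("Experience", None, pattern), as in the question.
matcher.add("Experience", [pattern])

# Hypothetical sample sentences with different spellings of the phrase
samples = ["I am self employed.", "I am Self-Employed.", "I am self - employed."]

matched_texts = []
for text in samples:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        matched_texts.append(doc[start:end].text)  # text exactly as written

print(matched_texts)
```

Each spelling is matched by the same rule, but the matched text keeps its original form, which is why comparing on the text alone is fragile.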

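I am still not sure I understand what you want to achieve. Do you want to always store "self employed" whenever your pattern matches? If so, here is a possible solution: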

for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
        span = doc[start : end]  # get the matched slice of the doc
        exp_span = span.text
        if rule_id == "Experience":
            exp_span = "self employed"
        L.append(rule_id)
        M.append(exp_span)
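
If the goal is the subset data frame from the question, one option is to filter on the rule label rather than on the matched text, since the label stays the same however the phrase is spelled. A minimal sketch, assuming pandas and hard-coded sample rows standing in for the matcher output (the last row is a hypothetical extra spelling):

```python
import pandas as pd

# Sample rows standing in for the lists L and M built in the loop above
df = pd.DataFrame({
    "Keywords": ["Experience", "CFA", "CPA", "Experience"],
    "Profession": ["self employed", "finance", "accounting", "self-employed"],
})

# Select by rule label, which does not depend on the exact matched text
subset = df[df["Keywords"] == "Experience"]
print(subset)
```

With this approach no spelling needs to be hard-coded; only the rule name passed to matcher.add has to be known.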
