Pattern-based punctuation using spaCy

Problem description

As a test, I am using spaCy to punctuate text after identifying spans.

import spacy, en_core_web_sm
from spacy.matcher import Matcher

# Load the small English model
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
Punctuation_patterns = [[{'POS': 'NOUN'},{'POS': 'NOUN'},{'POS': 'NOUN'}],
                        ]

matcher.add('PUNCTUATION', None, *Punctuation_patterns)
doc = nlp("The cat cat cat sat on the mat. The dog sat on the mat.")
matches = matcher(doc)
spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    layer1 = (' '.join(['"{}"'.format(span.text)if token.dep_ == 'ROOT'  else '{}'.format(token) for token in doc]))
    print (layer1)

Output:

The cat cat cat "cat cat cat" on the mat . The dog "cat cat cat" on the mat .

Expected output:

The "cat cat cat" sat on the mat. The dog sat on the mat.

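I am only using ROOT as a test. How can I use the span matches that spaCy identifies to get the desired output?

Edit 1: in the case of multiple detections, such as "dog dog dog"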

for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    result = doc.text

for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
    print (result)

Current output:

The "cat cat cat" sat on the mat. The dog dog dog sat on the mat.
The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

Expected:

  The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

Tags: python, spacy

Solution


You can use

result = doc.text
for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
print (result)

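That is, you define a variable, result, to hold the output and initialize it with doc.text. You then iterate over the matches and replace each matched span with the same span text wrapped in double quotes. The third argument, 1, limits each call to the first occurrence, and printing once after the loop, rather than inside it, gives a single line with every match quoted.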
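One caveat: str.replace looks for the span text anywhere in the string, so it could quote the wrong occurrence if the same words also appear earlier in unmatched text. A minimal sketch of an offset-based alternative follows. It assumes the spans do not overlap and are given as (start_char, end_char) pairs, such as those taken from span.start_char and span.end_char in the question. The helper name quote_spans and the hard-coded offsets are hypothetical, used only for illustration; the sketch runs without spaCy.

```python
def quote_spans(text, spans):
    """Wrap each (start_char, end_char) range of text in double quotes.

    Assumes the ranges do not overlap.
    """
    result = text
    # Insert from the end of the string backwards so that the offsets of
    # the spans still to be processed are not shifted by earlier inserts.
    for start, end in sorted(spans, reverse=True):
        result = result[:start] + '"' + result[start:end] + '"' + result[end:]
    return result


text = "The cat cat cat sat on the mat. The dog dog dog sat on the mat."
# Hypothetical offsets for "cat cat cat" and "dog dog dog"; with spaCy
# they would come from (span.start_char, span.end_char) for each match.
spans = [(4, 15), (36, 47)]
print(quote_spans(text, spans))
```

If the pattern can produce overlapping matches (for example, four nouns in a row), the spans would need to be reduced to a non-overlapping set first; spacy.util.filter_spans is one option for that.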

