spaCy does not split off the sentence-final period

Question

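How can I fix or adjust the fact that spaCy does not split off the sentence-final period when the last "word" is a non-word that itself contains a period?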

>>> import spacy
>>> nlp = spacy.load('en_core_web_md')
>>> doc = nlp("The Eiffel Tower is located at 48.86N 2.29E.")
>>> print(doc[-1])
2.29E.
>>> print(nlp("The Eiffel Tower is very beautiful.")[-1])
.
   

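I am trying to extract latitude/longitude references from documents (named entity recognition), but I cannot find a way to make the extracted entity correspond to the text "48.86N 2.29E" without the trailing period.

I would like to leave all the other usual (English) tokenization rules unmodified.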

Tags: spacy

Solution


You need to register a custom suffix rule with the tokenizer. This can be done as follows:

import re
import spacy
from spacy.tokenizer import Tokenizer

# Suffix rule: a period at the very end of a token is split off as its own token.
suffix_re = re.compile(r'''\.$''')

def custom_tokenizer(nlp):
    # Only the suffix rule is passed in, so this tokenizer applies
    # no prefix, infix, or exception rules.
    return Tokenizer(nlp.vocab, suffix_search=suffix_re.search)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("The Eiffel Tower is very beautiful.")
print([t.text for t in doc])

doc2 = nlp("The Eiffel Tower is located at 48.86N 2.29E.")
print([t.text for t in doc2])

doc3 = nlp("The Eiffel Tower, Notre Dame and Champs Elysee are located at 48.86N 2.29E.")
print([t.text for t in doc3])

Output

['The', 'Eiffel', 'Tower', 'is', 'very', 'beautiful', '.']
['The', 'Eiffel', 'Tower', 'is', 'located', 'at', '48.86N', '2.29E', '.']
['The', 'Eiffel', 'Tower,', 'Notre', 'Dame', 'and', 'Champs', 'Elysee', 'are', 'located', 'at', '48.86N', '2.29E', '.']
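
Note that the tokenizer above is built from the vocabulary and a single suffix rule, so spaCy's default prefix, infix, and exception rules are no longer applied. That is why 'Tower,' keeps its comma in the last output line. If the goal is to keep the default English rules, one option is to append a narrower suffix pattern to the defaults rather than replace the tokenizer. The following is a minimal sketch, not part of the original answer. It assumes spaCy's `nlp.Defaults.suffixes` and `spacy.util.compile_suffix_regex`, and uses `spacy.blank("en")` so that no trained model needs to be downloaded. The pattern `(?<=[0-9][NSEW])\.` (a period directly after a digit and a compass letter) is an illustrative assumption about how the coordinates are written and may need adjusting for other formats.

```python
import spacy
from spacy.util import compile_suffix_regex

# Blank English pipeline: default English tokenizer rules, no trained model.
nlp = spacy.blank("en")

# Assumption: coordinates end in a digit followed by N, S, E, or W.
# compile_suffix_regex anchors each pattern at the end of the token.
coordinate_period = r"(?<=[0-9][NSEW])\."

# Keep every default suffix rule and add the extra one.
suffixes = list(nlp.Defaults.suffixes) + [coordinate_period]
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("The Eiffel Tower, Notre Dame and Champs Elysee are located at 48.86N 2.29E.")
tokens = [t.text for t in doc]
print(tokens)
```

Because the default rules stay in place, the comma after "Tower" should be split off as usual, while the final period is separated from "2.29E". The same assignment to `nlp.tokenizer.suffix_search` can be applied to a loaded pipeline such as `en_core_web_sm`.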

