python - spaCy - modify tokenizer for numeric patterns
Problem description
I have seen some ways to create a custom tokenizer, but I am a little confused. I am using the PhraseMatcher to match patterns. However, it matches a 4-digit number pattern, say 1234, inside 111-111-1234, because the tokenizer splits on the dash.
All I want to do is modify the current tokenizer (from nlp = English()) and add a rule so that it does not split on certain characters, but only within numeric patterns.
Solution
To do this you will need to override spaCy's default infix tokenization scheme with your own. You can do this by taking the infix patterns spaCy uses by default and modifying the ones that affect your input.
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])

# modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        # Hyphen removed from this character class, so digits joined by "-" stay together
        r"(?<=[0-9])[+\*^](?=[0-9-])",
        # Dot made optional; this also splits at a lowercase-to-uppercase boundary (abcDE)
        r"(?<=[{al}{q}])\.?(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

# compile the patterns and install them on the existing tokenizer
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])
Output
With default tokenizer:
['111', '-', '222', '-', '1234', 'for', 'abcDE']
With custom tokenizer:
['111-222-1234', 'for', 'abc', 'DE']
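To see why dropping the hyphen from the numeric infix pattern has this effect, the two patterns can be compared directly with Python's re module, without spaCy. This is only a sketch: default_numeric is the answer's pattern with the hyphen restored, which I believe matches spaCy's default, and the variable names are made up for illustration.

```python
import re

# Assumed default-style pattern: an operator or hyphen between digits
# is an infix, i.e. a point where the tokenizer splits.
default_numeric = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")

# Pattern from the answer: the hyphen is no longer in the character class.
custom_numeric = re.compile(r"(?<=[0-9])[+\*^](?=[0-9-])")

text = "111-222-1234"

# Collect (position, matched character) for every infix found.
default_hits = [(m.start(), m.group()) for m in default_numeric.finditer(text)]
custom_hits = [(m.start(), m.group()) for m in custom_numeric.finditer(text)]

print(default_hits)  # [(3, '-'), (7, '-')]
print(custom_hits)   # []

# Other operators between digits are still treated as infixes.
plus_hits = [(m.start(), m.group()) for m in custom_numeric.finditer("2+3")]
print(plus_hits)     # [(1, '+')]
```

With no infix match inside 111-222-1234, the tokenizer has nothing to split on there, so the whole string stays one token and a pattern for 1234 alone can no longer match inside it.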