Spacy: how to create a tokenizer with your own delimiters

Problem description

I need to create a tokenizer that splits my text on commas.

For this text:

"5g, dynamic vision sensor (dvs), 3-d reconstruction, neuromorphic engineering, neural networks, humanoid robots, neuromorphics, closed loop systems, field programmable gate arrays, spiking motor controller, neuromorphic implementation, icub, relation neural network"

I want to get this output:

['5g', 'dynamic vision sensor (dvs)', '3-d reconstruction', 'neuromorphic engineering', 'neural networks', 'humanoid robots', 'neuromorphics', 'closed loop systems', 'field programmable gate arrays', 'spiking motor controller', 'neuromorphic implementation', 'icub', 'relation neural network']

I tried using a custom tokenizer:

import re
import spacy
from spacy.tokenizer import Tokenizer

def custom_tokenizer(nlp):
    # match runs of lowercase letters, digits, whitespace, parentheses and hyphens
    pattern = re.compile(r'([\sa-z0-9\(\)-]+)')
    return Tokenizer(nlp.vocab,
                     token_match=pattern.finditer)

nlp = spacy.blank("en")  # assumed here; the question does not show how nlp was created
nlp.tokenizer = custom_tokenizer(nlp)

But it returned:

['5g,', 'dynamic', 'vision', 'sensor', '(dvs),', '3-d', 'reconstruction,', 'neuromorphic', 'engineering,', 'neural', 'networks,', 'humanoid', 'robots,', 'neuromorphics,', 'closed', 'loop', 'systems,', 'field', 'programmable', 'gate', 'arrays,', 'spiking', 'motor', 'controller,', 'neuromorphic', 'implementation,', 'icub,', 'relation', 'neural', 'network']

I checked the pattern and it works fine. How can I stop the text from being split on whitespace?

Tags: python, spacy, tokenize

Solution
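
spaCy's rule-based Tokenizer always splits the input on whitespace before applying its prefix, suffix, infix and token_match rules, so a regex passed as token_match can never merge text back together across spaces (token_match also expects a function such as pattern.match that returns a match object, not pattern.finditer, but even with that fixed the whitespace split happens first). A minimal sketch of one way to tokenize on commas instead, assuming a blank English pipeline and an illustrative helper name comma_tokenizer (neither appears in the question), is to replace nlp.tokenizer with a plain callable that builds a Doc directly from the comma-separated chunks:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

def comma_tokenizer(text):
    # Each comma-separated chunk becomes one token; surrounding
    # whitespace is stripped and empty chunks are dropped.
    words = [chunk.strip() for chunk in text.split(",") if chunk.strip()]
    return Doc(nlp.vocab, words=words)

nlp.tokenizer = comma_tokenizer

doc = nlp("5g, dynamic vision sensor (dvs), 3-d reconstruction")
print([token.text for token in doc])
# ['5g', 'dynamic vision sensor (dvs)', '3-d reconstruction']

Note that a bare function used this way handles tokenization only; it has no to_disk/from_disk methods, so a pipeline that must be serialized would need a small wrapper class instead.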

