Preventing keras text_to_word_sequence from splitting words that contain a hyphen

Question

I am using:

from keras.preprocessing.text import text_to_word_sequence

text = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'

words = set(text_to_word_sequence(text))

print(words)

This results in:

{'oxidative', 'contrast', '6', 'affects', 'in', 'dehydrogenase', 'visual', 'stress', 'glucose', 'phosphate', 'along', 'activity', 'with', 'alcoholics', 'decreased', 'sensitivity'}

Is there a way to prevent the word glucose-6-phosphate from being split?

Tags: python-3.x, keras, nlp

Answer


Yes: remove the hyphen from the filters argument, so that it is no longer treated as a separator.

from keras_preprocessing.text import text_to_word_sequence

text = 'Decreased glucose-6-phosphate ...'

words = set(text_to_word_sequence(text,
 filters='!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n'))
words

{'activity',
 'affects',
 'alcoholics',
 'along',
 'contrast',
 'decreased',
 'dehydrogenase',
 'glucose-6-phosphate',
 'in',
 'oxidative',
 'sensitivity',
 'stress',
 'visual',
 'with'}

This will, of course, affect every word in the text that contains a hyphen, not just glucose-6-phosphate.
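To see that side effect without installing keras, here is a minimal sketch. The helper split_words is a hypothetical stand-in that only approximates what text_to_word_sequence does (lowercase, replace every filtered character with a space, split); it is not the library's implementation. DEFAULT_FILTERS is assumed to match the documented default of the filters parameter, and the sample sentence is made up for illustration.

```python
# Assumed to match the documented default of the `filters` parameter.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Derive the hyphen-free variant instead of retyping the whole string.
FILTERS_KEEP_HYPHEN = DEFAULT_FILTERS.replace('-', '')


def split_words(text, filters=DEFAULT_FILTERS):
    """Hypothetical stand-in: lowercase, blank out filtered chars, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()


# Made-up sample containing hyphenated words and a free-standing hyphen.
sample = 'Glucose-6-phosphate and well-known markers - see notes.'

print(split_words(sample))
# ['glucose', '6', 'phosphate', 'and', 'well', 'known', 'markers', 'see', 'notes']

print(split_words(sample, FILTERS_KEEP_HYPHEN))
# ['glucose-6-phosphate', 'and', 'well-known', 'markers', '-', 'see', 'notes']
```

Note that once the hyphen is no longer filtered, a free-standing hyphen used as a dash also survives as its own token ('-'), so it may need to be dropped in a later step.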

