TensorFlow text Tokenizer incorrect tokenization

Problem description

I am trying to use the TF Tokenizer for an NLP model:

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ", 
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

tokenizer.fit_on_texts(sample_text)

print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))

Output:

[[1, 7, 8, 9]]

Word index:

print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'

The problem is that in this case the Tokenizer creates tokens on the ".". I am passing split=" " to the Tokenizer, so I was expecting the following output:

[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'

since I want the tokenizer to split words on spaces (" ") only and not on any special characters.

How can I make the tokenizer create tokens only on spaces?

Tags: tensorflow, keras, text, tensorflow2.0

Solution


The Tokenizer accepts another parameter called filters, which by default contains all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.

If you look at the source code of Tokenizer, in particular the fit_on_texts method, you will see that it uses the function text_to_word_sequence, which receives the filters characters and treats them the same way as the split string it receives:

def text_to_word_sequence(...):
    ...
    # every character in `filters` is mapped to the `split` string
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)

    # the text is then split on `split`, so the filtered characters
    # effectively become split points as well
    seq = text.split(split)
    return [i for i in seq if i]
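
The effect of that mapping can be observed by calling text_to_word_sequence directly (a small check; text_to_word_sequence is importable from the same tensorflow.keras.preprocessing.text module):

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"

# With the default filters, the "." is replaced by the split string,
# so the last word is broken into two tokens.
print(text_to_word_sequence(text, split=" "))
# ['sample', 'person', 'ab', 'cdefghijklmnopqrstuvwxyz']

# With filters="" nothing is replaced and the text is split on spaces only.
print(text_to_word_sequence(text, split=" ", filters=""))
# ['sample', 'person', 'ab.cdefghijklmnopqrstuvwxyz']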

Therefore, to split only on the specified split string and not on the punctuation characters, simply pass an empty string to the filters parameter.
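
A minimal sketch of the fix, reusing the sample data from the question (the exact token indices depend on word frequencies after fitting):

from tensorflow.keras.preprocessing.text import Tokenizer

# filters="" disables the punctuation replacement, so tokens are
# created on spaces only.
tokenizer = Tokenizer(num_words=200, split=" ", filters="")

sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)

print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# e.g. [[1, 7, 8]]
print(tokenizer.index_word[8])
# 'ab.cdefghijklmnopqrstuvwxyz'  (kept as a single token)

With filters="", the "." is no longer replaced, so AB.CDEFGHIJKLMNOPQRSTUVWXYZ stays a single token, which matches the output expected in the question.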

