What does tensorflow.keras.preprocessing.text.Tokenizer.texts_to_matrix do?

Question

Please explain what tokenizer.texts_to_matrix does and what the result means.

from tensorflow.keras.preprocessing.text import Tokenizer

# The sentence shown in the printed output below
text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'

tokenizer = Tokenizer(oov_token="<OOV>")

sentences = [text]
print(sentences)
tokenizer.fit_on_texts(sentences)  # learn the vocabulary from the text

word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
matrix = tokenizer.texts_to_matrix(sentences)
print(word_index)
print(sequences)
print(matrix)
---
['The fool doth think he is wise, but the wise man knows himself to be a fool.']

# word_index
{'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5, 'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}

# sequences
[[2, 3, 5, 6, 7, 8, 4, 9, 2, 4, 10, 11, 12, 13, 14, 15, 3]]

# matrix
[[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

Tags: tensorflow, keras

Answer


In binary mode (the default), it indicates which words from the learned vocabulary appear in the input text. You have trained your tokenizer on

['The fool doth think he is wise, but the wise man knows himself to be a fool.']

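So when you convert the same text to a matrix, every word is marked with a 1 except the OOV token: all the words are known, so position 1 (<OOV>, see word_index) stays 0. Position 0 is also 0 because word indices start at 1.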
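The binary-mode behaviour described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the Keras source: the helper name binary_row is hypothetical, and the number of columns is assumed to be len(word_index) + 1, which matches the 16 values printed in the question.

```python
# Hypothetical helper (not part of Keras): marks which word indices occur.
def binary_row(sequence, num_columns):
    row = [0.0] * num_columns
    for index in sequence:
        row[index] = 1.0  # presence only; repeats do not change the value
    return row

# word_index and sequence copied from the question's output
word_index = {'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5,
              'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10,
              'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}
sequence = [2, 3, 5, 6, 7, 8, 4, 9, 2, 4, 10, 11, 12, 13, 14, 15, 3]

num_columns = len(word_index) + 1  # assumption: index 0 is reserved
row = binary_row(sequence, num_columns)
print(row)
```

Columns 0 and 1 stay at 0 because no word maps to index 0 and the training text contains no out-of-vocabulary word.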

Some examples

tokenizer.texts_to_matrix(['foo'])
# only OOV in this one text
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
tokenizer.texts_to_matrix(['he he'])
# known word, twice (does not matter how often)
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
tokenizer.texts_to_matrix(['the fool'])
array([[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
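To see why these three inputs light up columns 1, 7, and 2 and 3, here is a rough sketch of the lookup step. It is a simplification, not the Keras implementation: the function names are hypothetical, and the tokenization (lowercase, then split on anything that is not a letter, digit, or apostrophe) only approximates the default filters.

```python
import re

# word_index copied from the question's output
WORD_INDEX = {'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5,
              'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10,
              'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Hypothetical helper: map words to indices, unknown words to the OOV index."""
    words = re.findall(r"[a-z0-9']+", text.lower())  # simplified tokenization
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in words]

def to_binary_row(text, word_index):
    """Hypothetical helper: one row with 1.0 at every index that occurs."""
    row = [0.0] * (len(word_index) + 1)
    for index in to_sequence(text, word_index):
        row[index] = 1.0
    return row

for sample in ['foo', 'he he', 'the fool']:
    print(sample, to_sequence(sample, WORD_INDEX), to_binary_row(sample, WORD_INDEX))
```

This sketch sizes each row from the question's word_index, giving 16 columns. The arrays printed above have 17, which suggests they came from a tokenizer in a slightly different state; the positions of the 1s are the same either way.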

Other modes

The other modes are easier to interpret:

  • count - how many times each vocabulary word occurs in the text
tokenizer.texts_to_matrix(['He, he the fool'], mode="count")
array([[0., 0., 1., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
  • freq - the counts normalized so that they sum to 1.0
tokenizer.texts_to_matrix(['he he the fool'], mode="freq")
array([[0.  , 0.  , 0.25, 0.25, 0.  , 0.  , 0.  , 0.5 , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]])
  • tfidf - the counts weighted by TF-IDF
tokenizer.texts_to_matrix(['he he the fool'], mode="tfidf")
array([[0.        , 0.        , 0.84729786, 0.84729786, 0.        ,
        0.        , 0.        , 1.43459998, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])
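The arithmetic behind the three modes above can be sketched as follows. This is a plain-Python approximation, not the library code: matrix_row is a hypothetical helper, and the tf-idf formula, (1 + ln(count)) * ln(1 + document_count / (1 + doc_freq)), reflects my understanding of the legacy Keras Tokenizer. The values document_count=4 and doc_freq=2 are assumptions chosen because they reproduce the printed numbers; the answer does not say how the tokenizer was fitted.

```python
import math
from collections import Counter

def matrix_row(sequence, num_columns, mode="binary",
               document_count=1, doc_freq=None):
    """Hypothetical helper: build one row from a sequence of word indices."""
    doc_freq = doc_freq or {}
    row = [0.0] * num_columns
    for index, count in Counter(sequence).items():
        if mode == "binary":
            row[index] = 1.0
        elif mode == "count":
            row[index] = float(count)
        elif mode == "freq":
            row[index] = count / len(sequence)
        elif mode == "tfidf":
            tf = 1 + math.log(count)
            # assumed formula; doc_freq = number of fitted texts containing the word
            idf = math.log(1 + document_count / (1 + doc_freq.get(index, 0)))
            row[index] = tf * idf
        else:
            raise ValueError("unknown mode: " + mode)
    return row

seq = [7, 7, 2, 3]  # 'he he the fool' under the question's word_index
count_row = matrix_row(seq, 16, "count")
freq_row = matrix_row(seq, 16, "freq")
# assumed fit state: 4 documents seen, each word present in 2 of them
tfidf_row = matrix_row(seq, 16, "tfidf", document_count=4,
                       doc_freq={2: 2, 3: 2, 7: 2})
print(count_row[7], freq_row[7], tfidf_row[7])
```

With a tokenizer fitted once on the single sentence from the question, the tf-idf weights would differ from the ones printed above, because the idf term depends on how many texts the tokenizer has seen.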
