How to choose num_words parameter for keras Tokenizer?

Problem Description

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=my_max)

I am using the keras preprocessing tokenizer to process a corpus of text for a machine learning model. One of the parameters for the Tokenizer is the num_words parameter, which defines the number of words in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word is included, but certain words that appear only once might be more useful if grouped together under a single "out of vocabulary" token. What is the strategy for setting this parameter?
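For reference, here is a minimal sketch of how num_words interacts with an out-of-vocabulary token. The sample texts are hypothetical; the Tokenizer keeps only the most frequent num_words - 1 words when converting texts, and maps everything else to the oov_token index:

from keras.preprocessing.text import Tokenizer

# Hypothetical sample corpus
texts = ["my boss is bullying me", "my boss is great"]

# Words outside the top (num_words - 1) map to the "<OOV>" index
tokenizer = Tokenizer(num_words=4, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

print(tokenizer.word_index)                 # full vocabulary, not capped by num_words
print(tokenizer.texts_to_sequences(texts))  # rarer words collapse to the <OOV> index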

My particular use case is a model processing tweets, so every entry is under 140 characters and there is some overlap in the types of words used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g. "my boss is bullying me" returns "bullying me").

Tags: tensorflow, machine-learning, keras, nlp, tokenize

Solution


The basic question here is "what kinds of words establish sentiment, and how frequently do they occur in tweets?"

There is, of course, no hard and fast answer to that.

Here is how I would approach the problem:

  1. Preprocess your data so that conjunctions, stop words, and "junk" are removed from the tweets.
  2. Get the number of unique words in the corpus. Are all of these words necessary to convey sentiment?
  3. Analyze the highest-frequency words (a word-count sketch follows this list). Are these words that convey sentiment? Could they be removed in your preprocessing? The tokenizer records the top N unique words until the dictionary holds num_words entries, so these popular words are far more likely to end up in your dictionary.
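As a starting point for steps 2 and 3, a small sketch like the following could count unique words and surface the most frequent ones. The tweets variable is a hypothetical stand-in for your corpus:

from collections import Counter

# Hypothetical stand-in for your preprocessed tweets
tweets = ["my boss is bullying me", "my boss is great", "great day today"]

word_counts = Counter(word for tweet in tweets for word in tweet.lower().split())

print(len(word_counts))             # number of unique words in the corpus
print(word_counts.most_common(10))  # highest-frequency words: do they carry sentiment?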

Then I would begin experimenting with different values and observing the effect on your output.
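One way to guide that experimentation, as a rough sketch, is to check what fraction of all word occurrences the top num_words words would cover for a few candidate values (word_counts is the hypothetical counter from the sketch above, rebuilt here so the snippet stands alone):

from collections import Counter

tweets = ["my boss is bullying me", "my boss is great", "great day today"]  # hypothetical corpus
word_counts = Counter(word for tweet in tweets for word in tweet.lower().split())

# For each candidate num_words, what share of token occurrences stays in-vocabulary?
total = sum(word_counts.values())
for n in (5, 10, 20):
    covered = sum(count for _, count in word_counts.most_common(n))
    print(f"num_words={n}: {covered / total:.1%} of tokens covered")

On a real tweet corpus you would sweep much larger candidates (hundreds to thousands) and favor the smallest num_words beyond which coverage flattens out.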

Apologies for not having a "real" answer. I would argue there is no single true strategy for choosing this value. Instead, the answer should come from exploiting the characteristics and statistics of your data.

