keras - Should the vocabulary be restricted to the training-set vocabulary when training an NN model with pretrained word2vec like GLOVE?
Problem Description
I want to use pre-trained GloVe vectors for the Embedding layer in my neural network. Do I need to restrict the vocabulary to the training set when constructing the word2index dictionary? Wouldn't that lead to a limited, non-generalizable model? Is using the full GloVe vocabulary a recommended practice?
Solution
Yes, it is better to restrict your vocabulary size. Pre-trained embeddings such as GloVe (and Word2Vec as well) contain many words that are not useful for your task, and the larger the vocabulary, the more RAM you need, among other problems.
Select your tokens from all of your data (see the sketch at the end of this answer). This won't lead to a limited, non-generalizable model if your data set is big enough. If you think your data does not contain as many tokens as needed, keep two things in mind:
- Your data is not good enough and you have to gather more.
- Your model can't generalize well to tokens it hasn't seen during training, so there is no point in keeping many unused words in your embedding; it is better to gather more data that covers those words.
I have posted an answer showing how to select a small subset of word vectors from a pre-trained model here.
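As a rough illustration of the approach described above, here is a minimal sketch of restricting the vocabulary to the training set and initializing a Keras Embedding layer from GloVe. The GloVe file name, embedding dimension, and toy training texts are placeholders of my own, not part of the original answer.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

# Placeholder training texts; replace with your own corpus.
train_texts = ["a small example sentence", "another training sentence"]

# 1. Restrict the vocabulary to tokens seen in the training data.
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(train_texts)
word_index = tokenizer.word_index          # word -> integer id (training-set vocab only)
vocab_size = len(word_index) + 1           # +1 for the padding index 0

# 2. Load GloVe vectors only for words that appear in word_index.
embedding_dim = 100
embedding_matrix = np.zeros((vocab_size, embedding_dim))
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # assumed GloVe file
    for line in f:
        values = line.split()
        word, vector = values[0], values[1:]
        if word in word_index:
            embedding_matrix[word_index[word]] = np.asarray(vector, dtype="float32")

# 3. Initialise the Embedding layer with the restricted matrix.
embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False,                       # keep the pre-trained vectors fixed
)
```

Words from the training set that are missing from GloVe simply keep their zero rows here; you could instead initialise them randomly if that suits your model better.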