word2vec - Handling OOV words in GoogleNews-vectors-negative300.bin
Question
I need to calculate the word vectors for each word of a sentence that is tokenized as follows:
['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'].
When I use the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, OOV words are handled. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for words that are OOV? I searched online and could not find anything. Of course, one way to deal with this is to remove every sentence that contains a word not listed in Google's word2vec, but I noticed that only 5550 out of 16134 sentences have all of their words in the embedding.
I also tried:
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)
model.train(sentences_with_OOV_words)
However, this returns an error under TensorFlow 2.
Any help would be greatly appreciated.
Solution
If a word is not found in the vocabulary, initialize it with a zero vector of the same size (Google's word2vec vectors have 300 dimensions):
import numpy as np

# model was loaded with KeyedVectors.load_word2vec_format,
# so call get_vector directly (there is no .wv attribute on KeyedVectors)
try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
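To vectorize the whole tokenized sentence from the question, the same try/except can be wrapped in a small helper. This is a sketch, not part of gensim: `sentence_vectors` is a hypothetical function name, and it only assumes the model object raises KeyError for OOV words, as gensim's KeyedVectors does.

```python
import numpy as np

# Hypothetical helper: build a (len(tokens), dim) matrix for one sentence,
# substituting a zero vector for any out-of-vocabulary token.
def sentence_vectors(model, tokens, dim=300):
    rows = []
    for token in tokens:
        try:
            rows.append(model.get_vector(token))  # KeyedVectors raises KeyError on OOV
        except KeyError:
            rows.append(np.zeros((dim,)))  # zero vector stands in for the OOV word
    return np.stack(rows)

# Usage with the sentence from the question:
# vectors = sentence_vectors(model, ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'])
# vectors.shape  ->  (7, 300)
```

Note that zero vectors are only a baseline; averaging over in-vocabulary neighbors or switching to subword embeddings such as fastText (which the question mentions) gives more informative OOV vectors.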