Handling OOV words in GoogleNews-vectors-negative300.bin

Problem description

I need to calculate the word vectors for each word of a sentence that is tokenized as follows:

['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. 

If I use the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, I can get vectors for OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate word vectors for words that are OOV? I searched online but could not find anything. Of course, one way around this is to remove every sentence containing a word not listed in Google's word2vec, but I noticed that only 5550 out of 16134 sentences have all of their words in the embedding.
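The coverage count above can be reproduced with a simple membership check. A minimal sketch, using a plain set as a stand-in for the model's vocabulary (with the real gensim KeyedVectors, `word in model.key_to_index` works the same way; the sentences here are toy data):

```python
# Stand-in for the model's vocabulary; with the real model you would use
# `word in model.key_to_index` (gensim 4) instead of this toy set.
vocab = {'my', 'aunt', 'me', 'a', 'teddy'}

sentences = [
    ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'],  # 'give', 'ruxpin' are OOV here
    ['my', 'aunt'],                                        # fully covered
]

# Keep only sentences whose every word is in the vocabulary.
fully_covered = [s for s in sentences if all(w in vocab for w in s)]
print(len(fully_covered))  # 1
```

This is the filtering strategy described above; as the 5550/16134 figure shows, it discards most of the data, which is why an OOV fallback is usually preferable.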

I also tried:

model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True) 
model.train(sentences_with_OOV_words)

However, this raises an error (a KeyedVectors object holds only the final vectors and does not support further training).

Any help would be greatly appreciated.

Tags: word2vec, oov

Solution


If a word is not found in the vocabulary, fall back to a zero vector of the same size (the Google News word2vec vectors are 300-dimensional):

import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:  # word is OOV
    word_vector = np.zeros((300,))
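Applied to the whole tokenized sentence from the question, the fallback can be sketched as follows. A plain dict stands in for the loaded model so the example runs without the 3.5 GB file; with the real KeyedVectors you would call `model.get_vector(word)`, which raises the same KeyError for OOV words:

```python
import numpy as np

DIM = 300  # GoogleNews vectors are 300-dimensional

# Hypothetical toy vocabulary standing in for the loaded KeyedVectors model.
toy_model = {
    'my': np.ones(DIM),
    'aunt': np.ones(DIM) * 2.0,
}

def vectorize(word, model):
    """Return the word's vector, or a zero vector if the word is OOV."""
    try:
        return model[word]  # real model: model.get_vector(word)
    except KeyError:
        return np.zeros(DIM)

sentence = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']
vectors = np.stack([vectorize(w, toy_model) for w in sentence])
print(vectors.shape)  # (7, 300)
```

Note that zero vectors carry no information; depending on the downstream task, a small random vector or the mean of all known vectors are common alternatives for the OOV fallback.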
