
Question

I ran the following code and am wondering why the top 3 most similar words for "exposure" don't include "charge" and "lend", even though they appear in the same sentence:

from gensim.models import Word2Vec

# Toy corpus: "exposure" shares a sentence with "charge" and "lend".
corpus = [['total', 'exposure', 'charge', 'lend'],
          ['customer', 'paydown', 'rate', 'months', 'month']]

# Skip-gram (sg=1), 300-dimensional vectors, single worker for reproducibility.
gens_mod = Word2Vec(corpus, min_count=1, vector_size=300, window=2, sg=1, workers=1, seed=1)
keyword = "exposure"
gens_mod.wv.most_similar(keyword)

Output:
[('customer', 0.12233059108257294),
 ('month', 0.008674687705934048),
 ('total', -0.011738087050616741),
 ('rate', -0.03600010275840759),
 ('months', -0.04291829466819763),
 ('paydown', -0.044823747128248215),
 ('lend', -0.05356598272919655),
 ('charge', -0.07367636263370514)]

Tags: python, nlp, gensim, word2vec, word-embedding

Answer


The word2vec algorithm is only useful & valuable with large amounts of training data, where every word of interest has a variety of realistic, subtly-contrasting usage examples.

A toy-sized dataset won't show its value. It's always a bad idea to set min_count=1: a word with only a single usage example can't get a good vector, and keeping such words just adds noise to the training of the others. And it's nonsensical to try to train 300-dimensional word-vectors on a corpus of just 9 words, every one of them unique, where most words share the exact same neighbors.
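One way to see that these similarities are noise rather than learned structure is to retrain the identical toy model under several seeds and watch the neighbor ranking reshuffle. A minimal sketch (the exact words and scores will vary by gensim version and seed):

from gensim.models import Word2Vec

corpus = [['total', 'exposure', 'charge', 'lend'],
          ['customer', 'paydown', 'rate', 'months', 'month']]

# Retrain the same toy model with different seeds; the "most similar"
# ranking reshuffles each time, because the 300-dimensional vectors
# never move far from their random initialization.
for s in (1, 2, 3):
    model = Word2Vec(corpus, min_count=1, vector_size=300,
                     window=2, sg=1, workers=1, seed=s)
    print(s, [w for w, _ in model.wv.most_similar('exposure', topn=3)])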

Try it on a more realistic dataset (tens of thousands of unique words, each with multiple usage examples) and you'll see more intuitively-correct similarity results.
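For instance, here is a sketch using the text8 corpus available through gensim's downloader API; it assumes a one-time download of roughly 30 MB and a few minutes of training are acceptable:

import gensim.downloader as api
from gensim.models import Word2Vec

# text8: ~17 million words of cleaned Wikipedia text; large enough
# for skip-gram to learn meaningful word neighborhoods.
corpus = api.load('text8')  # an iterable of token lists

model = Word2Vec(corpus, vector_size=100, window=5,
                 min_count=5, sg=1, workers=4)

# With realistic data volumes, nearest neighbors become intuitive.
print(model.wv.most_similar('exposure', topn=5))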

