python - How to explain gensim word2vec output?
Problem description
I ran the following code and wonder why the top 3 most similar words for "exposure" don't include "charge" and "lend".
from gensim.models import Word2Vec
corpus = [['total', 'exposure', 'charge', 'lend'],
['customer', 'paydown', 'rate', 'months', 'month']]
gens_mod = Word2Vec(corpus, min_count=1, vector_size=300, window=2, sg=1, workers=1, seed=1)
keyword="exposure"
gens_mod.wv.most_similar(keyword)
Output:
[('customer', 0.12233059108257294),
('month', 0.008674687705934048),
('total', -0.011738087050616741),
('rate', -0.03600010275840759),
('months', -0.04291829466819763),
('paydown', -0.044823747128248215),
('lend', -0.05356598272919655),
('charge', -0.07367636263370514)]
Solution
The word2vec algorithm is only useful & valuable with large amounts of training data, where every word of interest has a variety of realistic, subtly-contrasting usage examples.
A toy-sized dataset won't show its value. It's always a bad idea to set min_count=1. And it's nonsensical to try to train 300-dimensional word-vectors from a corpus of only 9 words in total, all 9 of them unique, with most of the words having the exact same neighbors.
Try it on a more realistic dataset (tens of thousands of unique words, all with multiple usage examples) and you'll see more intuitively correct similarity results.
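To make the point about corpus size and shared neighbors concrete, here is a minimal sketch in plain Python (no gensim needed) that lists the context words each token can see with window=2. The helper name `context_neighbors` is hypothetical and is not part of gensim. It treats the window as a fixed maximum, whereas gensim may use a smaller effective window for a given position, so this should be read as an upper bound on what the model sees.

```python
# Same toy corpus as in the question.
corpus = [['total', 'exposure', 'charge', 'lend'],
          ['customer', 'paydown', 'rate', 'months', 'month']]

def context_neighbors(sentences, window=2):
    """Hypothetical helper: map each word to the set of words that fall
    within `window` positions of it in the same sentence."""
    neighbors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            left = sentence[max(0, i - window):i]
            right = sentence[i + 1:i + window + 1]
            neighbors.setdefault(word, set()).update(left + right)
    return neighbors

neighbors = context_neighbors(corpus, window=2)
total_tokens = sum(len(sentence) for sentence in corpus)

print("tokens:", total_tokens, "unique words:", len(neighbors))
for word, context in neighbors.items():
    print(word, "->", sorted(context))
```

Every word occurs exactly once, so the model gets a single usage example per word, and words never share a context across the two sentences. With so few training pairs, the resulting 300-dimensional vectors are unlikely to reflect anything meaningful about the words, which is consistent with the near-zero similarities in the question's output.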