Word not in vocabulary after training a gensim word2vec model: why?

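Question

I want to use word embeddings to get some handy cosine similarity values. After creating the model and checking the similarity of the word "not" (which is in the data I gave the model), it tells me that the word is not in the vocabulary.

Why can't it find similarities for the word "not"?

The description data looks like this: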
[['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent',
  'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in',
  'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a',
  'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the',
  'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our',
  'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look',
  'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various',
  'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw',
  'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions',
  'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion',
  'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]

Note that I have already tried supplying the data as separate sentences rather than as separate words.

def word_vec_sim_sum(row):
    description = row.product_description.split()
    description_embedding = gensim.models.Word2Vec([description], size=150,
        window=10,
        min_count=2,
        workers=10,
        iter=10)       
    print(description_embedding.wv.most_similar(positive="not"))

Tags: python, gensim, word2vec

Answer


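You need to lower min_count.

From the documentation: min_count (int, optional) – Ignores all words with total frequency lower than this. In the data you provided, "not" occurs only once, so with min_count=2 it is dropped from the vocabulary. Setting min_count to 1 makes it work.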
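
To see why a threshold of 2 removes "not", here is a minimal sketch of the frequency filter in plain Python. It only imitates the effect that the documentation describes for min_count and is not gensim's actual implementation. The helper name surviving_vocab and the shortened token list are made up for illustration.

```python
from collections import Counter


def surviving_vocab(tokens, min_count):
    """Return the words that occur at least min_count times.

    Hypothetical helper: it mimics the documented effect of min_count
    (words with a lower total frequency are ignored), nothing more.
    """
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Shortened, illustrative token list (not the full product description).
tokens = ['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger',
          'a', 'wide', 'variety', 'of', 'angles', 'joints', 'a']

kept_at_2 = surviving_vocab(tokens, min_count=2)
kept_at_1 = surviving_vocab(tokens, min_count=1)

print('not' in kept_at_2)  # "not" occurs once, so a threshold of 2 drops it
print('not' in kept_at_1)  # a threshold of 1 keeps every word
```

Running the same count over your own token list before training shows which words a given min_count would discard.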

import gensim

data = [['not', 'only', 'do', 'angles', 'make', 'joints', 'stronger', 'they', 'also', 'provide', 'more', 'consistent',
         'straight', 'corners', 'simpson', 'strongtie', 'offers', 'a', 'wide', 'variety', 'of', 'angles', 'in',
         'various', 'sizes', 'and', 'thicknesses', 'to', 'handle', 'lightduty', 'jobs', 'or', 'projects', 'where', 'a',
         'structural', 'connection', 'is', 'needed', 'some', 'can', 'be', 'bent', 'skewed', 'to', 'match', 'the',
         'project', 'for', 'outdoor', 'projects', 'or', 'those', 'where', 'moisture', 'is', 'present', 'use', 'our',
         'zmax', 'zinccoated', 'connectors', 'which', 'provide', 'extra', 'resistance', 'against', 'corrosion', 'look',
         'for', 'a', 'z', 'at', 'the', 'end', 'of', 'the', 'model', 'numberversatile', 'connector', 'for', 'various',
         'connections', 'and', 'home', 'repair', 'projectsstronger', 'than', 'angled', 'nailing', 'or', 'screw',
         'fastening', 'alonehelp', 'ensure', 'joints', 'are', 'consistently', 'straight', 'and', 'strongdimensions',
         'in', 'x', 'in', 'x', 'inmade', 'from', 'gauge', 'steelgalvanized', 'for', 'extra', 'corrosion',
         'resistanceinstall', 'with', 'd', 'common', 'nails', 'or', 'x', 'in', 'strongdrive', 'sd', 'screws']]


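def word_vec_sim_sum(row):
    description = row
    # Word2Vec expects an iterable of sentences, so wrap the token list in a list.
    # Note: gensim 4.x renamed `size` to `vector_size` and `iter` to `epochs`.
    description_embedding = gensim.models.Word2Vec([description], size=150,
                                                   window=10,
                                                   min_count=1,  # keep words that occur only once, such as "not"
                                                   workers=10,
                                                   iter=10)
    print(description_embedding.wv.most_similar(positive="not"))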


word_vec_sim_sum(data[0])

And the output:

[('do', 0.21456070244312286), ('our', 0.1713767945766449), ('can', 0.1561305820941925), ('repair', 0.14236785471439362), ('screw', 0.1322808712720871), ('offers', 0.13223429024219513), ('project', 0.11764446645975113), ('against', 0.08542445302009583), ('various', 0.08226475119590759), ('use', 0.08193354308605194)]
