most_similar method in spacy/sense2vec prints everything

Question

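When I use the most_similar method provided by sense2vec, it returns the entire vocabulary, so I don't think it is working properly. For example, I passed n=50000000 just to test "decrease|VERB", and I got back a list of 188325 entries: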

import spacy
from sense2vec import Sense2Vec
from sense2vec import Sense2VecComponent
nlp = spacy.load("en_core_web_sm")
s2v = Sense2Vec().from_disk("./s2v_old/")
most_similar = s2v.most_similar("decrease|VERB", n=50000000)
# Strip the "|TAG" suffix, turn underscores into spaces, keep purely alphabetic entries
words = [key.split("|")[0].replace("_", " ") for key, _ in most_similar]
j = sorted({w.lower() for w in words if w.isalpha()})
print(len(j)) # 188325

print(j[:100]) 
['a',
 'aa',
 'aaa',
 'aaaa',
 'aaaaa',
 'aaaaaa',
 'aaaaaaa',
 'aaaaaaaa',
 'aaaaaaaaa',
 'aaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'aaaaaaaaaaaaaaand',...]

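None of these are related to the meaning of "decrease". I believe there is a bug somewhere and the similarity calculation is simply being ignored.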

Tags: spacy, sense2vec

Answer


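If you request the 50000000 most similar words, you will get the entire vocabulary, because the table contains far fewer entries than that. Try n=3 or n=10 instead: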

s2v.most_similar("decrease|VERB", n=10)
# [('increase|VERB', 0.961), ('decreasing|VERB', 0.9295), ('increasing|VERB', 0.9273), ('decreases|VERB', 0.9251), ('increases|VERB', 0.9062), ('reducing|VERB', 0.904), ('increases|NOUN', 0.8928), ('decrease|NOUN', 0.8826), ('decreases|NOUN', 0.8751), ('reduce|VERB', 0.87)]

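Note that the results are already sorted by decreasing similarity. The code in the question discards that ordering by calling sorted() on the words, which is why the output starts with 'a', 'aa', 'aaa' and so on.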
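To see why a very large n returns everything, here is a minimal pure-Python sketch of a top-n nearest-neighbour lookup by cosine similarity. It is not sense2vec's implementation; the helper names `cosine` and `top_n_similar` and the toy 2-dimensional vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_similar(query, table, n=10):
    """Return up to n (key, score) pairs, most similar first (hypothetical helper)."""
    scored = [
        (key, cosine(table[query], vec))
        for key, vec in table.items()
        if key != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Slicing never returns more than the table holds, so a huge n
    # simply yields every other entry, however dissimilar.
    return scored[:n]

# Toy table with invented vectors (an assumption, not real sense2vec data)
table = {
    "decrease|VERB": [1.0, 0.1],
    "reduce|VERB": [0.9, 0.2],
    "increase|VERB": [0.8, 0.3],
    "aaaa|NOUN": [-0.2, 1.0],
}

small = top_n_similar("decrease|VERB", table, n=2)
everything = top_n_similar("decrease|VERB", table, n=50000000)
print([key for key, _ in small])
print(len(everything))
```

With n=2 only the two closest keys come back; with n=50000000 the unrelated "aaaa|NOUN" entry is included too, at the very end of the list with a low score.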

If you were trying to compare the string "decrease" with the string "50000000", this is not the right way to do it. Here is a link to the usage guide: https://github.com/explosion/sense2vec#-quickstart
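If the goal was to collect every term above a certain similarity rather than a fixed count, one option is to request a moderate n and then filter the (key, score) pairs by score. The sketch below reuses the scores printed in the answer as sample data; the helper name `filter_by_score` and the 0.9 cutoff are assumptions chosen for illustration.

```python
# Sample data copied from the n=10 output shown in the answer
results = [
    ("increase|VERB", 0.961), ("decreasing|VERB", 0.9295),
    ("increasing|VERB", 0.9273), ("decreases|VERB", 0.9251),
    ("increases|VERB", 0.9062), ("reducing|VERB", 0.904),
    ("increases|NOUN", 0.8928), ("decrease|NOUN", 0.8826),
    ("decreases|NOUN", 0.8751), ("reduce|VERB", 0.87),
]

def filter_by_score(pairs, min_score):
    """Keep only pairs whose similarity is at least min_score (hypothetical helper)."""
    return [(key, score) for key, score in pairs if score >= min_score]

close = filter_by_score(results, min_score=0.9)  # 0.9 is an arbitrary cutoff
print([key for key, _ in close])
```

Because the input is already ordered by decreasing similarity, the filtered list keeps that order.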

