Gensim find vectors/words in ball of radius r

Problem description

I would like to take a word, say "book", get its vector representation v_1, and find all words whose vector representation lies within the ball of radius r around v_1, i.e. ||v_1 - v_i|| <= r for some real number r.

I know gensim has the most_similar function, which lets me specify how many top vectors to return, but that is not quite what I need. I could certainly use a brute-force search to get the answer, but it would be too slow.

Tags: python, gensim, word-embedding

Solution


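If you call most_similar() with topn=0, it will return the raw, unsorted cosine similarities to all other words known to the model. (These similarities will not be in tuples with the words, but simply in the same order as the words in the index2entity property.) Note that this describes older gensim releases: in gensim 4.x the documented way to get all similarities is topn=None, and the word list is named index_to_key.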

You could then keep only the similarities above your preferred threshold and return the corresponding indexes/words, using a function like numpy's argwhere.

For example:

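import numpy as np

# wv is assumed to be an already-loaded gensim KeyedVectors instance (e.g. model.wv)
target_word = 'apple'
threshold = 0.9
# topn=0 returns all similarities in older gensim releases; gensim 4.x expects topn=None
all_sims = wv.most_similar(target_word, topn=0)
# argwhere returns an (n, 1) array, so flatten it to get plain integer indexes
satisfactory_indexes = np.argwhere(all_sims > threshold).ravel()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]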
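
The question asks for a Euclidean ball, while the snippet above thresholds cosine similarity. The two agree only when every vector has unit length, since then ||a - b||^2 = 2 - 2*cos(a, b), so a radius r corresponds to a similarity threshold of 1 - r^2/2. The following is a minimal sketch of both options using plain NumPy arrays. The function names and the toy data are hypothetical and not part of gensim. With a real model, the array and word list would presumably come from wv.vectors and the model's index-to-word list.

```python
import numpy as np

def cosine_threshold_for_radius(r):
    """Cosine-similarity threshold matching Euclidean radius r.

    Valid only if all vectors are unit-length: ||a - b||^2 = 2 - 2*cos(a, b).
    """
    return 1.0 - (r ** 2) / 2.0

def words_in_ball(vectors, words, target_word, r):
    """Return the words whose raw vector lies within Euclidean distance r of target_word.

    vectors: (n_words, dim) array; words: list of strings in the same row order.
    The target word itself is included, since its distance is 0.
    """
    center = vectors[words.index(target_word)]
    # One vectorized pass over all rows: still O(n * dim), but in compiled NumPy code
    distances = np.linalg.norm(vectors - center, axis=1)
    return [words[i] for i in np.flatnonzero(distances <= r)]

# Toy data standing in for a real model's vectors (hypothetical values)
words = ["book", "novel", "banana"]
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]])

print(words_in_ball(vectors, words, "book", r=0.7))
print(cosine_threshold_for_radius(0.7))
```

This is still a full scan over the vocabulary, but as a single vectorized operation it is usually fast enough for vocabularies of a few hundred thousand words. If it is not, an approximate nearest-neighbor index would be the next thing to try.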
