python - Gensim find vectors/words in ball of radius r
Problem description
I would like to take a word ("book", for example), get its vector representation v_1, and find all words whose vector representation v_i lies within a ball of radius r around v_1, i.e. ||v_1 - v_i|| <= r for some real number r.
I know gensim has a most_similar function, which lets you specify how many top vectors to return, but that is not quite what I need. I could certainly use a brute-force search to get the answer, but it would be too slow.
Solution
If you call most_similar() with topn=0, it will return the raw, unsorted cosine similarities to all words known to the model, including the target word itself. (These similarities will not be in tuples with the words, but simply in the same order as the words in the index2entity property.)
You could then filter those similarities for the ones above your preferred threshold and return just those indexes/words, using a function like numpy's argwhere.
For example:
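import numpy as np
# wv is assumed to be an already-loaded gensim KeyedVectors instance
target_word = 'apple'
threshold = 0.9
# topn=0 returns the raw similarity array, ordered like wv.index2entity
all_sims = wv.most_similar(target_word, topn=0)
# argwhere yields an (n, 1) array for 1-D input, so flatten it to plain indexes
satisfactory_indexes = np.argwhere(all_sims > threshold).flatten()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]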
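The question asks for a Euclidean ball, while the snippet above thresholds cosine similarity. For unit-length vectors the two are linked by ||u - v||^2 = 2 - 2*cos(u, v), so a radius r corresponds to a cosine threshold of 1 - r^2/2. The following is a minimal sketch using NumPy only, on toy data. The helper names `cosine_threshold_for_radius` and `words_within_radius` are hypothetical, not part of gensim. With a real model, the matrix would presumably be `wv.vectors` and the word list `wv.index2entity` (which, as far as I know, was renamed `index_to_key` in gensim 4, where `topn=None` rather than `topn=0` appears to be the way to get all similarities).

```python
import numpy as np


def cosine_threshold_for_radius(r):
    """Cosine similarity equivalent to Euclidean radius r (unit-length vectors only)."""
    return 1.0 - (r ** 2) / 2.0


def words_within_radius(vectors, words, target_word, r):
    """Return words whose vectors lie within Euclidean distance r of the target's vector.

    vectors: (n_words, dim) array; words: list of n_words strings in the same order.
    """
    target = vectors[words.index(target_word)]
    # One vectorized pass over the whole matrix instead of a Python-level loop.
    dists = np.linalg.norm(vectors - target, axis=1)
    hits = np.flatnonzero(dists <= r)
    # Skip the target itself, which is always at distance 0.
    return [words[i] for i in hits if words[i] != target_word]


# Toy vocabulary standing in for a real model (illustrative values only).
toy_words = ["book", "novel", "car"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nearby = words_within_radius(toy_vectors, toy_words, "book", 0.5)
```

This is still a linear scan over the vocabulary, but it runs as a single NumPy operation, which is usually fast enough for vocabularies of a few million words. Note that the raw vectors in a trained model are generally not unit-length, so the cosine conversion applies only after normalizing them.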