python - How to combine vectors generated by PV-DM and PV-DBOW methods of doc2vec?
Problem description
I have around 20k documents of 60-150 words each. For 400 of these documents, the similar documents are known. These 400 documents serve as my test data.
I am trying to find similar documents for these 400 test documents using gensim doc2vec. The paper "Distributed Representations of Sentences and Documents" says that "The combination of PV-DM and PV-DBOW often work consistently better (7.42% in IMDB) and therefore recommended."
So I would like to combine the vectors from these two methods, compute the cosine similarity against all the training documents, and select the top 5 with the smallest cosine distance.
What is an effective way to combine the vectors from these two methods: adding, averaging, or something else?
After combining the two vectors I can normalise each combined vector and then compute the cosine distance.
Solution
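The paper implies that they concatenated the vectors from the two methods. For example, given a 300d PV-DBOW vector and a 300d PV-DM vector, you would have a 600d text vector after concatenation.
However, note that their bottom-line results on IMDB have been hard for outsiders to reproduce. My tests have only sometimes shown a small advantage for these concatenated vectors. (I especially wonder whether 300d PV-DBOW + 300d PV-DM from separately trained, concatenated models is any better than just training a true 600d model of either kind for the same amount of time, with fewer steps and complications.)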
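As a minimal sketch of the concatenation described above, combined with the normalisation and top-5 cosine ranking from the question: the snippet below assumes the per-document vectors from each model are already available as NumPy arrays with one row per document. The random arrays and the helper names `combine` and `top_k_similar` are hypothetical stand-ins, not part of gensim or the paper.

```python
import numpy as np

def combine(dbow_vecs, dm_vecs):
    """Concatenate the two models' vectors row by row, then L2-normalise each row."""
    combined = np.hstack([dbow_vecs, dm_vecs])          # (n_docs, 300 + 300)
    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    return combined / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero

def top_k_similar(query_vec, train_vecs, k=5):
    """Return (row index, cosine similarity) for the k most similar rows.

    Both inputs are assumed to be unit length, so the dot product
    equals the cosine similarity.
    """
    sims = train_vecs @ query_vec
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Placeholder data standing in for real PV-DBOW and PV-DM document vectors.
rng = np.random.default_rng(0)
dbow = rng.normal(size=(1000, 300))
dm = rng.normal(size=(1000, 300))

train = combine(dbow, dm)
print(train.shape)
print(top_k_similar(train[0], train, k=5))
```

If the two models produce vectors of noticeably different magnitude, the larger half will dominate the cosine score; normalising each half before concatenating is one way to weight them equally.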
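You can see a demo of my repeating some of the experiments from the original "Paragraph Vector" paper in one of the example notebooks included in gensim's docs/notebooks directory: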
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
Among other things, it includes some steps and helpful methods for treating a pair of models as one concatenated whole.