python - Calculate Cross-Lingual Phrase Similarity (using e.g., MUSE and Gensim)
Problem description
I am new to NLP and word embeddings and still have many concepts to learn, so any pointers would be appreciated. This question is related to this and this, and I think there may have been developments since those questions were asked. Facebook MUSE provides aligned, supervised word embeddings for 30 languages, which can be used to calculate word similarity across languages. As far as I understand, the embeddings provided by MUSE satisfy the requirement of coordinate space compatibility. It seems possible to load these embeddings into libraries such as Gensim, but I wonder:
- Is it possible to load multiple-language word embeddings into Gensim (or other libraries), and if so:
- What type of similarity measure might fit in this use case?
- How can these loaded word embeddings be used to calculate a cross-lingual similarity score for phrases* instead of words?
*e.g., "ÖPNV" in German vs "Trasporto pubblico locale" in Italian for the English term "Public Transport".
I am open to any implementation (libraries/languages/embeddings), though I may need some time to learn this topic. Thank you in advance.
Solution
It is quite usual to average multiple word embeddings to get a phrase or sentence representation. After all, this is exactly what FastText does by default when it is used for sentence classification.
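The averaging step can be sketched roughly as follows. This is only a minimal illustration: `phrase_vector` and the toy dictionary are hypothetical names, the vectors are made-up 3-dimensional values rather than real MUSE embeddings, and whitespace tokenization with lowercasing is an assumption that should be adapted to however the embeddings were tokenized.

```python
def phrase_vector(phrase, vectors):
    """Average the vectors of the tokens in `phrase` that are found in `vectors`.

    `vectors` is any mapping from token to a sequence of floats. A plain dict
    is used here; a Gensim KeyedVectors object is assumed to work the same way.
    """
    # Assumption: whitespace tokenization and lowercasing; unknown tokens are skipped.
    tokens = [t for t in phrase.lower().split() if t in vectors]
    if not tokens:
        raise ValueError(f"no known tokens in phrase: {phrase!r}")
    dim = len(vectors[tokens[0]])
    sums = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            sums[i] += value
    return [s / len(tokens) for s in sums]


# Hypothetical 3-dimensional toy vectors, not real MUSE values.
toy_it = {
    "trasporto": [1.0, 0.0, 0.0],
    "pubblico": [0.0, 1.0, 0.0],
    "locale": [0.0, 0.0, 1.0],
}

print(phrase_vector("Trasporto pubblico locale", toy_it))
```

Skipping out-of-vocabulary tokens is one possible design choice; for abbreviations such as "ÖPNV" that may be missing from the vocabulary entirely, the function raises an error instead of returning a meaningless vector.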
You can, of course, load as many sets of word embeddings into Gensim as you like, but you would need to implement the cross-lingual comparison yourself. You can get a word's vector using square-bracket notation:
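from gensim.models.fasttext import load_facebook_model
model = load_facebook_model('your_path')  # path to a native FastText .bin model
vector = model.wv['computer']  # in Gensim 4, word vectors are accessed through model.wv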
Use cosine similarity to compare the vectors. If you don't want to write it yourself, use SciPy: scipy.spatial.distance.cosine returns the cosine distance, so the similarity is one minus that value.
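If you do want to write it yourself, cosine similarity is short enough to sketch in plain Python. The function name and the three example vectors below are hypothetical; in practice the inputs would be phrase vectors built from aligned embeddings.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical phrase vectors, e.g. averages of aligned word vectors.
german_phrase = [0.8, 0.1, 0.1]
italian_phrase = [0.7, 0.2, 0.1]
unrelated_phrase = [0.0, 0.1, 0.9]

print(cosine_similarity(german_phrase, italian_phrase))
print(cosine_similarity(german_phrase, unrelated_phrase))
```

Values close to 1 indicate vectors pointing in nearly the same direction. Because the measure ignores vector length, it is not affected by how many words were averaged into each phrase vector.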