Calculate Cross-Lingual Phrase Similarity (using e.g., MUSE and Gensim)

Problem Description

I am new to NLP and word embeddings and still need to learn many of the concepts in these topics, so any pointers would be appreciated. This question is related to this and this, and I think there may have been developments since those questions were asked. Facebook MUSE provides aligned, supervised word embeddings for 30 languages, and it can be used to calculate word similarity across different languages. As far as I understand, the embeddings provided by MUSE satisfy the requirement of coordinate-space compatibility. It seems that it is possible to load these embeddings into libraries such as Gensim, but I wonder:

  1. Is it possible to load multiple-language word embeddings into Gensim (or other libraries), and if so:
  2. What type of similarity measure might fit in this use case?
  3. How can these loaded word embeddings be used to calculate a cross-lingual similarity score for phrases* instead of words?

*e.g., "ÖPNV" in German vs "Trasporto pubblico locale" in Italian for the English term "Public Transport".

I am open to any implementation (libraries/languages/embeddings), though I may need some time to learn this topic. Thank you in advance.

Tags: python, nlp, multilingual, gensim, word-embedding

Solution


It is quite usual to average multiple word embeddings to get a phrase or sentence representation. After all, this is exactly what FastText does by default when it is used for sentence classification.
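For example, a minimal averaging helper might look like this (a sketch only: it assumes the embeddings are already loaded as a Gensim KeyedVectors object, that the vocabulary is lowercased as the MUSE Wikipedia vectors are, and that whitespace tokenisation is good enough for your phrases):

import numpy as np

def phrase_vector(kv, phrase):
    # Average the vectors of all in-vocabulary tokens of the phrase.
    tokens = phrase.lower().split()
    vectors = [kv[t] for t in tokens if t in kv]
    if not vectors:
        raise ValueError(f"no token of {phrase!r} is in the vocabulary")
    return np.mean(vectors, axis=0)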

You can, of course, load as many word-embedding sets into Gensim as you like, but you would need to implement the cross-lingual comparison yourself. You can get a word's vector just using square-bracket notation:

import gensim

model = gensim.models.fasttext.load_facebook_model('your_path')  # Facebook-native .bin format
vector = model.wv['computer']  # word-vector lookup via the model's KeyedVectors
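Note that the MUSE embeddings are distributed as plain-text .vec files (word2vec text format) rather than FastText .bin files, so it is probably easier to load them with KeyedVectors.load_word2vec_format. The file names below are assumptions based on the usual MUSE download names and may differ on your machine:

from gensim.models import KeyedVectors

# One aligned embedding set per language (paths are placeholders).
de = KeyedVectors.load_word2vec_format('wiki.multi.de.vec')
it = KeyedVectors.load_word2vec_format('wiki.multi.it.vec')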

Just use cosine similarity to compare the vectors. If you don't want to write it yourself, use scipy (scipy.spatial.distance.cosine gives you the cosine distance, i.e. 1 minus the similarity).
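Putting the pieces together, a rough end-to-end sketch for the German/Italian example could look like this (it reuses the hypothetical phrase_vector helper and the de/it models from above):

from scipy.spatial.distance import cosine

de_vec = phrase_vector(de, 'ÖPNV')
it_vec = phrase_vector(it, 'trasporto pubblico locale')

# The MUSE spaces are aligned, so vectors from different languages
# can be compared directly; cosine() is a distance, so invert it.
similarity = 1 - cosine(de_vec, it_vec)
print(f'cross-lingual similarity: {similarity:.3f}')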

