python-3.x - Finding the most similar sentence with word2vec
Problem description
I am trying to build a model that uses word2vec to determine which sentence is most similar to another sentence.
The idea is: to find the sentence most similar to a given one, I build an average vector from the word vectors of the words that make up that sentence.
Then I should use these embeddings to predict the most similar sentence. My question is: after building the average vector of the source sentence, how do I determine the most similar target sentence?
Here is the code:
import gensim
from gensim import utils
import numpy as np
import sys
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk import download
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
download('punkt') #tokenizer, run once
download('stopwords') #stopwords dictionary, run once
stop_words = stopwords.words('english')
def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()]  # keep alphabetic tokens only
    return doc
############ doc content -> num label -> string label
#note to self: texts[XXXX] -> y[XXXX] = ZZZ -> ng20.target_names[ZZZ]
# Fetch ng20 dataset
ng20 = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))
# text and ground truth labels
texts, y = ng20.data, ng20.target
corpus = [preprocess(text) for text in texts]
def filter_docs(corpus, texts, labels, condition_on_doc):
    """
    Filter corpus, texts and labels given the function condition_on_doc,
    which takes a doc.
    The document doc is kept if condition_on_doc(doc) is true.
    """
    number_of_docs = len(corpus)
    print(number_of_docs)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]
    labels = [i for (i, doc) in zip(labels, corpus) if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)]
    print("{} docs removed".format(number_of_docs - len(corpus)))
    return (corpus, texts, labels)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: (len(doc) != 0))
def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)
def has_vector_representation(word2vec_model, doc):
    """Check whether at least one word of the document is in the
    word2vec vocabulary."""
    return not all(word not in word2vec_model.vocab for word in doc)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: has_vector_representation(model, doc))
x = []
for doc in corpus:  # look up each doc in the model
    x.append(document_vector(model, doc))
X = np.array(x)  # list to array
model.most_similar(positive=X, topn=1)
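The averaging step in `document_vector` can be illustrated in isolation. This is a minimal sketch using a hypothetical 3-dimensional embedding dict in place of the real GoogleNews model (the words and vectors are made up); it shows how out-of-vocabulary tokens are dropped before the mean is taken:

```python
import numpy as np

# Hypothetical toy embeddings standing in for the word2vec model.
embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "sat": np.array([0.0, 1.0, 0.0]),
    "mat": np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(tokens, vectors):
    # Keep only in-vocabulary tokens, then average their word vectors.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

# "oov" is not in the vocabulary, so it is ignored.
v = sentence_vector(["cat", "sat", "mat", "oov"], embeddings)
# v is the elementwise mean of the three known vectors: [1/3, 1/3, 1/3]
```

With the real model, `document_vector(model, doc)` does the same thing, using `model.vocab` for the membership test.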
Solution
Just use the cosine distance. It is implemented in scipy.
For efficiency, you can also implement it yourself and pre-compute the norms of the vectors in X:
X_norm = np.expand_dims(np.linalg.norm(X, axis=1), 0)
The call to expand_dims ensures that the dimensions broadcast correctly. Then, for a set of vectors Y, you can get the most similar ones with:
def get_most_similar_in_X(Y):
    Y_norm = np.expand_dims(np.linalg.norm(Y, axis=1), 1)
    similarities = np.dot(Y, X.T) / Y_norm / X_norm
    return np.argmax(similarities, axis=1)
This gives you, for each vector in Y, the index of the most similar vector in X.
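To see the lookup in action, here is a self-contained sketch with made-up 2-D vectors (the data is purely illustrative): each row of Y is matched against the rows of X by cosine similarity.

```python
import numpy as np

# Three "sentence" vectors to search against (made-up data).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
X_norm = np.expand_dims(np.linalg.norm(X, axis=1), 0)  # shape (1, 3)

def get_most_similar_in_X(Y):
    Y_norm = np.expand_dims(np.linalg.norm(Y, axis=1), 1)  # shape (m, 1)
    similarities = np.dot(Y, X.T) / Y_norm / X_norm        # shape (m, 3)
    return np.argmax(similarities, axis=1)

# Two query vectors.
Y = np.array([[0.9, 0.1],   # points almost exactly along X[0]
              [0.5, 0.6]])  # closest in direction to X[2]
idx = get_most_similar_in_X(Y)
# idx -> array([0, 2])
```

Note that because cosine similarity only depends on direction, the norms cancel out; pre-computing `X_norm` once just avoids recomputing it for every query batch.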