Need help returning the sentence with the highest cosine similarity

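Problem description

I have computed the cosine similarity, and tfidf_matrix is where my documents are stored, but I don't know how to iterate over it to find the entry that corresponds to the cosine similarity score, so that I can return the most relevant sentence.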

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import pandas as pd
import sys
from sklearn.metrics.pairwise import cosine_similarity

def query(article, question):
    # sklearn's TfidfVectorizer needs a list of strings as input.
    question = question.lower()
    dataset = [question]

    with open(article, 'r') as f:
        output = f.read()
        output = output.lower()
        output = [output]

    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                                 min_df=1, stop_words=None)

    tfidf_matrix = vectorizer.fit_transform(output)
    query_tfidf = vectorizer.transform([question.lower()])

    CosSim = cosine_similarity(tfidf_matrix, query_tfidf)

    # Format the TF-IDF table as a pd.DataFrame.
    #for x in tfidf_matrix:
    vocab = vectorizer.get_feature_names_out()
    documents_tfidf_lol = [{word: tfidf_value for word, tfidf_value in zip(vocab, sent)}
                           for sent in tfidf_matrix.toarray()]

    documents_tfidf = pd.DataFrame(documents_tfidf_lol)
    documents_tfidf.fillna(0, inplace=True)

    documents_tfidf2 = pd.DataFrame(CosSim)
    documents_tfidf2.fillna(0, inplace=True)

    # Attempt to find the row of tfidf_matrix that matches the cosine similarity score.
    t = (tfidf_matrix[:, None] == CosSim).all(-1)
    np.where(t.any(0), t.argmax(0), np.nan)

    print(t)
    #print(documents_tfidf2)

query("a1.txt", "what is a dog")
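
A note on why comparing tfidf_matrix with the similarity scores is unlikely to locate a sentence: tfidf_matrix holds one row of term weights per document, while cosine_similarity returns one score per document, so the two arrays contain different kinds of values and have different shapes. In the posted code the whole file is also wrapped as a single string, so both arrays have only one row. The sketch below illustrates the shapes with a made-up three-sentence corpus (the sentences are hypothetical and not taken from a1.txt):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; each string is treated as one document.
docs = ["a dog is an animal", "cats sleep all day", "the sky is blue"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
query_tfidf = vectorizer.transform(["what is a dog"])
scores = cosine_similarity(tfidf_matrix, query_tfidf)

print(tfidf_matrix.shape)  # (number of documents, vocabulary size)
print(scores.shape)        # (number of documents, 1)
```

Row i of the scores array belongs to row i of tfidf_matrix, so the index of the largest score is also the index of the best-matching document.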

Tags: python, scikit-learn, nlp, tf-idf, cosine-similarity

Solution
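
One possible approach, offered as a sketch under assumptions rather than as an accepted answer: split the article into sentences, vectorize each sentence as its own document, and pick the index of the largest cosine similarity with np.argmax. The helper name most_similar_sentence and the regex-based sentence splitter are assumptions made for illustration; a proper sentence tokenizer would probably behave better on real text.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def most_similar_sentence(text, question):
    """Return (sentence, score) for the sentence closest to the question."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # TfidfVectorizer lowercases by default, so no manual .lower() is needed.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    query_tfidf = vectorizer.transform([question])

    # cosine_similarity returns shape (n_sentences, 1); flatten to one score per sentence.
    scores = cosine_similarity(tfidf_matrix, query_tfidf).ravel()
    best = int(np.argmax(scores))
    return sentences[best], float(scores[best])


# Hypothetical sample text standing in for the contents of the article file.
sample = "A dog is a domesticated animal. Cats sleep most of the day. The sky is blue."
sentence, score = most_similar_sentence(sample, "what is a dog")
print(sentence, score)
```

To apply this to the posted code, the string returned by f.read() would be passed as text instead of being wrapped in a one-element list.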

