How to compute the cosine similarity of two texts

Problem description

I have two columns that look like this:

mark-identification     statement_text
one of the top          data is very interesting
over                    this is a towel
have hour               time description

I am trying to compute the cosine similarity between the two.

The model I am using:

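from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Encode every statement_text into a sentence embedding
sent_encode = model.encode(sub1_sample['statement_text'].tolist())
# Store each embedding as a plain list so it fits in a DataFrame cell
sent_encode = [list(v) for v in sent_encode]
sub1_sample['sent_encode'] = sent_encode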

Defining fw:

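import gensim

def gensim_process(text):
    try:
        # Note: [0] keeps only the first token of the phrase
        return gensim.utils.simple_preprocess(text)[0]
    except Exception:
        # No tokens or unsupported input: return the value unchanged
        return text

df_result['mark-identification'] = df_result['mark-identification'].apply(gensim_process)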

Computing the cosine similarity between mark-identification and statement_text:

import numpy as np

def get_sim_of_words(fw, text, model):
    # Encode fw and scale it to unit length
    fw_encode = model.encode([fw])
    l2_norm_fw = np.sqrt((fw_encode * fw_encode).sum(axis=1))
    fw_encode_norm = fw_encode / l2_norm_fw.reshape(-1, 1)

    # Split the text into words; fall back to the raw text if nothing is left
    word_of_text = gensim.utils.simple_preprocess(text)
    if len(word_of_text) == 0:
        word_of_text = [text]

    # Encode each word and scale every row to unit length
    text_encode = model.encode(word_of_text)
    l2_norm = np.sqrt((text_encode * text_encode).sum(axis=1))
    text_encode_norm = text_encode / l2_norm.reshape(-1, 1)

    # Rows are unit vectors, so the matrix product gives the cosine similarities
    sim_list = np.matmul(fw_encode_norm, text_encode_norm.T)
    return [np.min(sim_list), np.mean(sim_list), np.median(sim_list), np.max(sim_list)]
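
The normalise-then-multiply step used in get_sim_of_words can be checked on small hand-made vectors. The following is only a minimal sketch: the toy 2-D arrays stand in for real model embeddings, and the helper name cosine_matrix is hypothetical, not part of the original code.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length, so a dot product equals the cosine.
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy vectors standing in for embeddings (not real model output).
fw_vecs = [[1.0, 0.0]]
text_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

sims = cosine_matrix(fw_vecs, text_vecs)
print(sims.shape)  # one row per fw vector, one column per text vector
print(sims)
```

A vector pointing the same way as fw scores 1, an orthogonal one scores 0, and the length of the vectors does not matter because every row is normalised first.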

Output:

df_result['ms_stat_info'] = df_result.apply(lambda x:get_sim_of_words(x['mark-identification'],x['statement_text'], model), axis=1)
df_result['sim_score_2'] = df_result['ms_stat_info'].apply(lambda x:x[3])
df_result.head(3)

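My current code only picks up the first word of mark-identification. Is there any way to iterate over every word in mark-identification, and to compute vectors for both the left-hand and the right-hand text?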

Tags: python, pandas, similarity

Solution
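
One possible approach, offered as a sketch under assumptions rather than a verified answer: only the first word survives because gensim_process returns simple_preprocess(text)[0], so the full phrase never reaches the similarity function. If that preprocessing step is skipped, both columns can be tokenised inside one function and every word of mark-identification compared with every word of statement_text. In the sketch below the names tokenize, sim_stats_all_words, sim_whole_sentence and toy_encode are all hypothetical; tokenize is a simple regex stand-in for gensim.utils.simple_preprocess, and toy_encode is a letter-count encoder used only so the example runs without downloading a model.

```python
import re
import numpy as np

def tokenize(text):
    # Stand-in for gensim.utils.simple_preprocess: lowercase alphabetic tokens.
    return re.findall(r"[a-z]+", str(text).lower())

def unit_rows(x):
    # Scale every row of a 2-D array to unit length.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sim_stats_all_words(mark, text, encode):
    """Compare every word of `mark` with every word of `text`.

    `encode` is any callable mapping a list of strings to a 2-D array,
    for example model.encode from the question (an assumption here).
    """
    mark_words = tokenize(mark) or [str(mark)]
    text_words = tokenize(text) or [str(text)]
    m = unit_rows(encode(mark_words))
    t = unit_rows(encode(text_words))
    sims = m @ t.T  # shape: (len(mark_words), len(text_words))
    return [float(sims.min()), float(sims.mean()),
            float(np.median(sims)), float(sims.max())]

def sim_whole_sentence(mark, text, encode):
    """Cosine similarity between the two texts, each encoded as a whole."""
    vecs = unit_rows(encode([str(mark), str(text)]))
    return float(vecs[0] @ vecs[1])

def toy_encode(texts):
    # Hypothetical encoder for demonstration only: 26 letter counts per text.
    out = np.zeros((len(texts), 26))
    for i, s in enumerate(texts):
        for ch in s.lower():
            if "a" <= ch <= "z":
                out[i, ord(ch) - ord("a")] += 1
    return out

stats = sim_stats_all_words("one of the top", "the top", toy_encode)
print(stats)  # [min, mean, median, max] over all word pairs
print(sim_whole_sentence("have hour", "time description", toy_encode))
```

With the model from the question, the same functions would presumably be called as df_result.apply(lambda x: sim_stats_all_words(x['mark-identification'], x['statement_text'], model.encode), axis=1), using the original, unshortened mark-identification column. Whether word-by-word statistics or the whole-sentence score is more useful depends on the data; a sentence embedding model such as paraphrase-MiniLM-L6-v2 is designed to encode complete sentences, so the whole-sentence comparison may be the more natural fit.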

