How to compute the simple matching coefficient similarity for two binary vectors in Python?

Problem description

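I have a DataFrame like the one below, and I want to compute the simple matching coefficient, the Tanimoto coefficient, and the Jaccard coefficient to see which one works best as my similarity metric.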

I don't see any option for calling them in sklearn, so I was wondering whether anyone knows a way to do this.

import pandas as pd

data = {'part1':[0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
        'part2':[0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
        'part3':[0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
        'part4':[0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
        'part5':[0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
        'part6':[0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
        'part7':[0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
        'part8':[0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
        'part9':[0, 1, 1, 0, 1, 0, 1, 0, 1, 0 ],
        'part10':[0, 1, 0, 1, 0, 0, 1, 1 , 0, 1],
        'part11':[0, 1, 1, 0, 1, 0, 1, 0, 1, 0 ],
        'part12':[0, 1, 0, 1, 0, 0, 1, 1 , 0, 0]
        }
        # 'int_combined':[12.0, 10.0, 12.0, 10.0]}
 
# Creates pandas DataFrame.
df = pd.DataFrame(data, index =['test1',
                                'test2',
                                'test3',
                                'test4',
                                'test5',
                                'test6',
                                'test7',
                                'test8',
                                'test9',
                                'test10'
                                ])

So far, this is what I have for the Jaccard coefficient and cosine similarity:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances

sns.set(rc={'figure.figsize': (12, 8)})

# Jaccard expects boolean input. Note that 1 - "hamming" distance would give
# the simple matching coefficient, not the Jaccard coefficient.
jac_sim = 1 - pairwise_distances(df.to_numpy().astype(bool), metric="jaccard")
cosine_similarity = 1 - pairwise_distances(df, metric="cosine")

# Wrap the results so the heatmaps are labelled with the test names.
jac_sim = pd.DataFrame(jac_sim, index=df.index, columns=df.index)
cosine_similarity = pd.DataFrame(cosine_similarity, index=df.index, columns=df.index)

fig, axs = plt.subplots(ncols=2)
fig.suptitle('Plotting both Jaccard coefficient and Cosine similarity matrix')
sns.heatmap(jac_sim, annot=True, fmt='.2g', vmin=0, vmax=1, center=0.5, cmap='coolwarm', linewidths=2, linecolor='black', square=True, cbar=False, ax=axs[0])
axs[0].set_title('Jaccard similarity matrix')
sns.heatmap(cosine_similarity, annot=True, fmt='.2g', vmin=0, vmax=1, center=0.5, cmap='coolwarm', linewidths=2, linecolor='black', square=True, cbar=False, ax=axs[1])
axs[1].set_title('Cosine similarity matrix')
plt.show()

Any help will be greatly appreciated.

Tags: python, pandas, similarity, cosine-similarity

Solution
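
No answer was recorded for this question, so what follows is only a minimal sketch, not an accepted solution. It computes the coefficients directly from the four agreement counts of two 0/1 vectors. The helper names (`binary_counts`, `simple_matching`, `jaccard`, `pairwise_similarity`) and the example vectors `x` and `y` are hypothetical and chosen for illustration. For binary data the Tanimoto coefficient reduces to the Jaccard coefficient, so one function covers both. Treating two all-zero vectors as identical is an assumption that may not suit every use case.

```python
import numpy as np
import pandas as pd


def binary_counts(a, b):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    n11 = int(np.sum(a & b))    # positions where both are 1
    n10 = int(np.sum(a & ~b))   # 1 in a only
    n01 = int(np.sum(~a & b))   # 1 in b only
    n00 = int(np.sum(~a & ~b))  # positions where both are 0
    return n11, n10, n01, n00


def simple_matching(a, b):
    """SMC = (n11 + n00) / n: fraction of positions where a and b agree."""
    n11, n10, n01, n00 = binary_counts(a, b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard(a, b):
    """Jaccard = n11 / (n11 + n10 + n01); same as Tanimoto for binary data."""
    n11, n10, n01, _ = binary_counts(a, b)
    union = n11 + n10 + n01
    # Assumption: two all-zero vectors count as identical (similarity 1).
    return 1.0 if union == 0 else n11 / union


def pairwise_similarity(frame, func):
    """Apply func to every pair of rows of a 0/1 DataFrame."""
    rows = frame.to_numpy()
    n = len(rows)
    sim = np.array([[func(rows[i], rows[j]) for j in range(n)] for i in range(n)])
    return pd.DataFrame(sim, index=frame.index, columns=frame.index)


# Two short example vectors (illustrative, not taken from the question's data).
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(simple_matching(x, y))  # 3 of 5 positions agree -> 0.6
print(jaccard(x, y))          # 2 shared ones, 4 positions with any one -> 0.5
```

With the question's `df`, `pairwise_similarity(df, simple_matching)` would give the row-by-row matrix, which can be passed to `sns.heatmap` in the same way as the other matrices. The double loop is fine for a handful of rows. For larger data, `1 - pairwise_distances(df, metric="hamming")` yields the same simple matching values, because that Hamming distance is the fraction of positions that differ.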

