首页 > 解决方案 > 使用余弦相似度将列表与 pandas 中的行进行比较并获得排名

问题描述

我有一个 Pandas 数据框和一个用户输入,我需要将用户输入与数据框中的每一行进行比较,并根据余弦相似度获取数据框中的行排名列表。

Department  Country Age Grade   Score
Math    India   Young   A   97
Math    India   Young   B   86
Math    India   Young   D   68
Science India   Young   A   92
Science India   Young   B   81
Science India   Young   C   76
Social  India   Young   B   88
Social  India   Young   D   62
Social  India   Young   C   72

用户输入:

Country Age Grade   Score
India   Young   B   84
India   Young   D   65
India   Young   A   98

我更愿意将数据框的所有行视为列表,并将用户输入视为列表。说User_list1 = ['India','Young','B','84']并使用余弦相似度将其与数据帧的每一行(将它们视为列表)进行比较,并获得 的 Ranked 输出Department

在我的情况下,输出将是 Ranked list of Department : Out = ['Math','Science','Social']: 这应该基于 Cosine Similarity 结果。

标签: pythonpython-3.xcosine-similarity

解决方案


考虑到上述两个数据帧,

df
   Department Country Age Grade Score
0   Math    India   Young   A   97
1   Math    India   Young   B   86
2   Math    India   Young   D   68
3   Science India   Young   A   92
4   Science India   Young   B   81
5   Science India   Young   C   76
6   Social  India   Young   B   88
7   Social  India   Young   D   62
8   Social  India   Young   C   72

input

Country Age Grade   Score
0   India   Young   B   84
1   India   Young   D   65
2   India   Young   A   98

一种可能的解决方案是,

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
import numpy as np
from collections import OrderedDict
import sys

scikit-learn使用包将分类特征转换为数字,

df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df

输出:

Department Country Age Grade    Score
0       Math    0      0    0   97
1       Math    0      0    1   86
2       Math    0      0    3   68
3      Science  0      0    0   92
4      Science  0      0    1   81
5      Science  0      0    2   76
6      Social   0      0    1   88
7      Social   0      0    3   62
8      Social   0      0    2   72

input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input

输出:

 Country  Age   Grade  Score
0   0       0     1     84
1   0       0     2     65
2   0       0     0     98

定义一个cosine-similarity函数,

def cosine_similarity(a, b):
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim

dept = list(df['Department'].values)
dept = list(OrderedDict.fromkeys(dept).keys())
results = []
for i in range(len(input)):
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i] 
        b = df.iloc[j, 1:]
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)

    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)

    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results

输出:

['Math', 'Social', 'Math']

推荐阅读