Comparing text across multiple pandas DataFrame columns in Python using cosine similarity

Problem description

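I want to use cosine similarity to compute the similarity between columns of a pandas DataFrame. I have 6 text columns split into 2 sections: the first 3 columns form section 1 [textA, textB, textC], and the remaining 3 form section 2 [text1, text2, text3]. I need to compare each column in sec1 against every column in sec2 and, in separate new columns, return the matched text, the similarity score, and True or False depending on whether a match was found.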

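I tried to achieve this with the code below, but I could not work out how to vectorize the columns and compute the similarity. Can anyone guide me on this?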

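import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# `data` is the DataFrame loaded from the CSV shown below.
list_1 = ['Text A', 'Text B', 'Text C']
list_2 = ['Text 1', 'Text 2', 'Text 3']

# Assumption: the original passed an empty column selection (data[[]]);
# here each row's six text cells are joined into one document instead.
count_vectorizer = TfidfVectorizer(stop_words='english')
row_text = data[list_1 + list_2].fillna('').astype(str).agg(' '.join, axis=1)
sparse_matrix = count_vectorizer.fit_transform(row_text)

doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=data.index)

# Collect the column names that belong to each section.
first_list = []
next_list = []

for col in data.columns:
    for col1 in list_1:
        if col1 in col:
            first_list.append(col)

    for col2 in list_2:
        if col2 in col:
            next_list.append(col)
print(first_list, next_list)

def lets_match(x):
    # True if any section-2 value occurs as a substring of any section-1 value.
    for text1 in next_list:
        for text2 in first_list:
            try:
                if x[text1] in x[text2]:
                    return True
            except TypeError:
                # Skip non-string cells such as NaN.
                continue
    return False

data['output'] = data.apply(lets_match, axis=1)
print(data)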

The expected output is the last 3 columns of the data below.

The data in CSV format:

Text A, Text B, Text C, Text 1, Text 2, Text 3, Match, Similarity Score, Result
SIDDIS JEWELS INDIA LLP, SANJAY SHRESTHA, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI,  BHUTAN], MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F SILVERCORP INTERNATIONAL TOWER], SANJAY SHRESTHA, 0.53, TRUE
T BANK LIMITED, PUNJAB NATIONAL BANK, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI,  BHUTAN], KINGXIN INTERNATIONAL TRADE CO, PUNJAB BANK, [SILVERCORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], SIDDIS JEWELS INDIA LLP, France, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.53, FALSE
SIDDIS JEWELS INDIA LLP, Italy, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], SIDDIS JEWELS INDIA LLP, Anil Kumar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, 0.34, TRUE
BABA DAWOO COMMERCIAL VEHICLES, Syrian Arab Republic, [CORP NATIONAL TOWER, HONG KONG], T BANK LIMITED, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], Syria, 0.95, TRUE
T BANK LIMITED, UAE, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI,  BHUTAN], KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.83, FALSE
ANDANI GLOBAL PTE LTD, North Korea, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], NTS (ASIA PACIFIC) PTE LTD, North Korea, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], North Korea, 0.53, TRUE
KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], ADANI GLOBAL FZE, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
AMIAN DIAMONDS NV, Vijay Malya, [CORP INTERNATIONAL TOWER, HONG KONG], AMIAN DIAMONDS NV, Vijay Malya, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI,  BHUTAN], AMIAN DIAMONDS NV, Vijay Malya , 0.53, TRUE
AMIAN DIAMONDS NV, Mohammad Ali, [LION LIMITED,FLAT/RMA5,9/F CORP NATIONAL TOWER], ANDANI GLOBAL FZE, Ali Mohammad, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], Ali Mohammad, 0.95, TRUE
NET ELECTRONICS L L C, Iran, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], AMIAN DIAMONDS NV, Iran, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI,  BHUTAN], Iran, 0.83, TRUE
GEGA INTERNATIONAL COMMERCIAL BANK, Rajendra Nagar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, Rajendra Nagar, [CORP INTERNATIONAL ,HONG KONG], Rajendra Nagar, CORP INTERNATIONAL ,HONG KONG, 0.83, TRUE

Tags: python, pandas, nlp, cosine-similarity, textmatching

Solution
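
One possible way to approach the question, sketched under assumptions the question does not state: every cell of the six text columns is treated as its own document, a single TF-IDF vocabulary is fitted over all of them, and within each row the 3 x 3 cosine similarity matrix between the section 1 and section 2 cells is computed. The best-scoring pair gives the score, and its section 2 text is reported as the match when the score reaches a cut-off. The function name `best_matches` and the `0.8` threshold are hypothetical, and this sketch is not expected to reproduce the scores in the sample table, since the rule that produced them is not given.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SEC1 = ["Text A", "Text B", "Text C"]
SEC2 = ["Text 1", "Text 2", "Text 3"]
THRESHOLD = 0.8  # hypothetical cut-off; tune it on real data


def best_matches(data, sec1=SEC1, sec2=SEC2, threshold=THRESHOLD):
    """Return a copy of `data` with Match, Similarity Score and Result columns."""
    text = data[sec1 + sec2].fillna("").astype(str)

    # One shared vocabulary: every cell of the six columns is one document.
    vectorizer = TfidfVectorizer()
    vectorizer.fit(text.to_numpy().ravel())

    matches, scores, results = [], [], []
    for _, row in text.iterrows():
        left = vectorizer.transform(row[sec1].tolist())
        right = vectorizer.transform(row[sec2].tolist())
        sim = cosine_similarity(left, right)  # shape (len(sec1), len(sec2))

        # Best-scoring (section 1, section 2) pair for this row.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        score = float(sim[i, j])
        found = score >= threshold

        matches.append(row[sec2[j]] if found else "")
        scores.append(round(score, 2))
        results.append(found)

    out = data.copy()
    out["Match"] = matches
    out["Similarity Score"] = scores
    out["Result"] = results
    return out


if __name__ == "__main__":
    demo = pd.DataFrame({
        "Text A": ["SIDDIS JEWELS INDIA LLP", "T BANK LIMITED"],
        "Text B": ["SANJAY SHRESTHA", "UAE"],
        "Text C": ["HOTEL TAJ TASHI BHUTAN", "HOTEL TAJ TASHI BHUTAN"],
        "Text 1": ["MEGA INTERNATIONAL COMMERCIAL BANK", "KINGXIN INTERNATIONAL TRADE CO"],
        "Text 2": ["SANJAY SHRESTHA", "Neerav Modi"],
        "Text 3": ["SILVERCORP INTERNATIONAL TOWER", "MCC COMPLEX BUILDING"],
    })
    print(best_matches(demo)[["Match", "Similarity Score", "Result"]])
```

Word-level TF-IDF only rewards shared tokens, so reordered names such as "Mohammad Ali" and "Ali Mohammad" score highly while spelling variants such as "Syria" and "Syrian Arab Republic" do not. If near-spellings should match, passing `analyzer="char_wb"` with an `ngram_range` to `TfidfVectorizer` may be worth trying.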

