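python - Comparing text across multiple pandas DataFrame columns using cosine similarity
Problem description
I want to use cosine similarity to compute the similarity between columns of a pandas DataFrame. I have 6 text columns split into 2 sections: the first 3 columns form section 1 [textA, textB, textC] and the remaining 3 form section 2 [text1, text2, text3]. I need to compare each column in sec1 against all columns in sec2 and, by creating separate columns, return the matched text, the similarity score, and True or False depending on whether a match was found.
I tried to do this with the code below, but could not finish it because I am not sure how to vectorize the columns and compute the similarity. Could someone guide me on this?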
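import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# `data` is the DataFrame read from the CSV shown below.
# Vectorize the text columns. The column selection is left empty:
# this is the part I could not work out.
count_vectorizer = TfidfVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(data[[]])
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=data.index)

list_1 = ['Text A', 'Text B', 'Text C']
list_2 = ['Text 1', 'Text 2', 'Text 3']
first_list = []
next_list = []

# Collect the column names that belong to each section.
for col in data.columns:
    for col1 in list_1:
        if col1 in col:
            first_list.append(col)
    for col2 in list_2:
        if col2 in col:
            next_list.append(col)

def lets_match(x):
    # True if any section 2 text occurs as a substring of a section 1 text.
    for text1 in next_list:
        for text2 in first_list:
            try:
                if x[text1] in x[text2]:
                    return True
            except TypeError:
                # Skip cells that are not strings, e.g. NaN.
                continue
    return False

data['output'] = data.apply(lets_match, axis=1)
print(data)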
The expected output is the last 3 columns of the data below.
Here is the data in CSV format:
Text A, Text B, Text C, Text 1, Text 2, Text 3, Match, Similarity Score, Result
SIDDIS JEWELS INDIA LLP, SANJAY SHRESTHA, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F SILVERCORP INTERNATIONAL TOWER], SANJAY SHRESTHA, 0.53, TRUE
T BANK LIMITED, PUNJAB NATIONAL BANK, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], KINGXIN INTERNATIONAL TRADE CO, PUNJAB BANK, [SILVERCORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
MEGA INTERNATIONAL COMMERCIAL BANK, SANJAY SHRESTHA, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], SIDDIS JEWELS INDIA LLP, France, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.53, FALSE
SIDDIS JEWELS INDIA LLP, Italy, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], SIDDIS JEWELS INDIA LLP, Anil Kumar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, 0.34, TRUE
BABA DAWOO COMMERCIAL VEHICLES, Syrian Arab Republic, [CORP NATIONAL TOWER, HONG KONG], T BANK LIMITED, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], Syria, 0.95, TRUE
T BANK LIMITED, UAE, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], , 0.83, FALSE
ANDANI GLOBAL PTE LTD, North Korea, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], NTS (ASIA PACIFIC) PTE LTD, North Korea, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], North Korea, 0.53, TRUE
KINGXIN INTERNATIONAL TRADE CO, Neerav Modi, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], ADANI GLOBAL FZE, Syria, [CORP INTERNATIONAL TOWER, HONG KONG], , 0.67, FALSE
AMIAN DIAMONDS NV, Vijay Malya, [CORP INTERNATIONAL TOWER, HONG KONG], AMIAN DIAMONDS NV, Vijay Malya, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], AMIAN DIAMONDS NV, Vijay Malya , 0.53, TRUE
AMIAN DIAMONDS NV, Mohammad Ali, [LION LIMITED,FLAT/RMA5,9/F CORP NATIONAL TOWER], ANDANI GLOBAL FZE, Ali Mohammad, [LION LIMITED,FLAT/RMA5,9/F CORP INTERNATIONAL TOWER], Ali Mohammad, 0.95, TRUE
NET ELECTRONICS L L C, Iran, [MCC COMPLEX BUILDING, OPPOSITE TO HOTEL TAJ TASHI], AMIAN DIAMONDS NV, Iran, [FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN], Iran, 0.83, TRUE
GEGA INTERNATIONAL COMMERCIAL BANK, Rajendra Nagar, [CORP INTERNATIONAL TOWER, HONG KONG], SIDDIS JEWELS INDIA LLP, Rajendra Nagar, [CORP INTERNATIONAL ,HONG KONG], Rajendra Nagar, CORP INTERNATIONAL ,HONG KONG, 0.83, TRUE
Solution
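One possible way to approach the comparison described in the question is sketched below. It is a minimal, dependency-free illustration, not a definitive implementation: it uses plain term-frequency vectors rather than TF-IDF, and the names `tokenize`, `cosine_similarity`, `best_match` and the `THRESHOLD` value of 0.5 are assumptions made for this example. The question does not say how its expected scores were produced, so the scores computed here are not expected to reproduce those numbers.

```python
import math
import re
from collections import Counter

SEC1 = ["Text A", "Text B", "Text C"]  # section 1 columns
SEC2 = ["Text 1", "Text 2", "Text 3"]  # section 2 columns
THRESHOLD = 0.5  # hypothetical cut-off for calling a pair a match


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    if text is None or text != text:  # None or NaN (NaN is not equal to itself)
        return []
    return re.findall(r"[a-z0-9]+", str(text).lower())


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(row, sec1=SEC1, sec2=SEC2, threshold=THRESHOLD):
    """Compare every sec1 cell with every sec2 cell of one row.

    Returns the best-scoring sec2 text, its score, and whether the
    score reaches the threshold.
    """
    best_text, best_score = "", 0.0
    for c1 in sec1:
        for c2 in sec2:
            score = cosine_similarity(row[c1], row[c2])
            if score > best_score:
                best_text, best_score = row[c2], score
    found = best_score >= threshold
    return {
        "Match": best_text if found else "",
        "Similarity Score": round(best_score, 2),
        "Result": found,
    }


# First data row from the question, written out as a plain dict.
row = {
    "Text A": "SIDDIS JEWELS INDIA LLP",
    "Text B": "SANJAY SHRESTHA",
    "Text C": "[FINANCIAL DEPARTMENT,HOTEL TAJ TASHI, BHUTAN]",
    "Text 1": "MEGA INTERNATIONAL COMMERCIAL BANK",
    "Text 2": "SANJAY SHRESTHA",
    "Text 3": "[LION LIMITED,FLAT/RMA5,9/F SILVERCORP INTERNATIONAL TOWER]",
}
result = best_match(row)
print(result)
```

Because `best_match` only indexes the row by column name, it should also work on a pandas row, for example `data.join(data.apply(lambda r: pd.Series(best_match(r)), axis=1))`, assuming the DataFrame columns are named exactly as in `SEC1` and `SEC2`. Swapping the term-frequency vectors for scikit-learn's `TfidfVectorizer` would down-weight common words such as "INTERNATIONAL", at the cost of fitting the vectorizer on all six columns first.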