python - Fuzzy matching two columns in the same dataframe using Python
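Question
I have two datasets in the same dataframe, each showing a list of companies. One dataset is from 2017 and the other is from this year. I am trying to match the two company datasets to each other and figured fuzzy matching (FuzzyWuzzy) was the best way to do it. Using a partial ratio, I simply want to list columns with the following values: last year's company name, the highest fuzzy matching ratio, and this year's company associated with that highest score. The original dataframe is assigned to the variable "data", with last year's company names under the column 'Company' and this year's company names under the column 'Company name'. To accomplish this, I tried to create a function using the extractOne fuzzy matching process and then apply that function to each value/row in the dataframe.
Here is the code: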
names_array=[]
ratio_array=[]
def match_names(last_year,this_year):
    for row in last_year:
        x=process.extractOne(row,this_year)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array,ratio_array

#last year company names dataset
last_year=data['Company'].dropna().values
#this year company dataset
this_year=data['Company name'].values

name_match,ratio_match=match_names(last_year,this_year)
data['this_year']=pd.Series(name_match)
data['match_rating']=pd.Series(ratio_match)
data.to_csv("test.csv")
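However, every time I run this part of the code, the two columns I added do not show up in the CSV. In fact, "test.csv" is just the same dataframe as before, even though the computer shows the file as recently created. If anyone could point out the problem or help me in any way, it would be much appreciated.
Edit (dataframe preview):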
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
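Then, after the 'Company' entries (last year's company dataset) end, the 'Company name' column (this year's company dataset) begins as follows: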
4168 NaN LEWIS TENNIS
4169 NaN CHUCKS PRO SHOP AT
4170 NaN CHUCK KINYON
4171 NaN LAKE COUNTRY RACQUET CLUB
4172 NaN SPORTS ACADEMY & RAC CLUB
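Answer
The structure of your dataframe is odd, given that one column only starts where the other one ends, but we can make it work.
Let's take the following example dataframe, data, which you provided: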
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
11 NaN LEWIS TENNIS
12 NaN CHUCKS PRO SHOP AT
13 NaN CHUCK KINYON
14 NaN LAKE COUNTRY RACQUET CLUB
15 NaN SPORTS ACADEMY & RAC CLUB
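For anyone who wants to reproduce this, the example dataframe can be rebuilt roughly as follows. This is only a sketch: the company names are copied from the preview above, while the helper names `last_year_names`, `this_year_names`, and `nan` are made up here for illustration.

```python
import pandas as pd

# Names copied from the preview above; the list names are illustrative only.
last_year_names = [
    "BODYPHLO SPORTIQUE", "JOSEPH A PERRY", "PCH RESORT TENNIS SHOP",
    "GREYSTONE GOLF CLUB INC.", "MUSGROVE COUNTRY CLUB",
    "CITY OF PELHAM RACQUET CLUB", "NORTHRIVER YACHT CLUB", "LAKE FOREST",
    "TNL TENNIS PRO SHOP", "SOUTHERN ATHLETIC CLUB",
    "ORANGE BEACH TENNIS CENTER",
]
this_year_names = [
    "LEWIS TENNIS", "CHUCKS PRO SHOP AT", "CHUCK KINYON",
    "LAKE COUNTRY RACQUET CLUB", "SPORTS ACADEMY & RAC CLUB",
]

nan = float("nan")

# Each column is padded with NaN where the other column has its values,
# which mirrors the "one column starts where the other ends" layout.
data = pd.DataFrame({
    "Company": last_year_names + [nan] * len(this_year_names),
    "Company name": [nan] * len(last_year_names) + this_year_names,
})
```

Calling `data["Company name"].dropna()` on this frame leaves only the five names from this year, which is the candidate list the matching step needs.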
Then perform the matching:
import pandas as pd
from fuzzywuzzy import process, fuzz

# Candidate names: this year's companies, with the empty rows dropped
known_list = data['Company name'].dropna()

def find_match(x):
    # Best-scoring candidate for this row's 'Company' value
    match = process.extractOne(x['Company'], known_list, scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

# Only rows that have a last-year name are matched
data[['this year','match_rating']] = data.dropna(subset=['Company']).apply(find_match, axis=1, result_type='expand')
This yields:
Company Company name this year \
0 BODYPHLO SPORTIQUE NaN SPORTS ACADEMY & RAC CLUB
1 JOSEPH A PERRY NaN CHUCKS PRO SHOP AT
2 PCH RESORT TENNIS SHOP NaN LEWIS TENNIS
3 GREYSTONE GOLF CLUB INC. NaN LAKE COUNTRY RACQUET CLUB
4 MUSGROVE COUNTRY CLUB NaN LAKE COUNTRY RACQUET CLUB
5 CITY OF PELHAM RACQUET CLUB NaN LAKE COUNTRY RACQUET CLUB
6 NORTHRIVER YACHT CLUB NaN LAKE COUNTRY RACQUET CLUB
7 LAKE FOREST NaN LAKE COUNTRY RACQUET CLUB
8 TNL TENNIS PRO SHOP NaN LEWIS TENNIS
9 SOUTHERN ATHLETIC CLUB NaN SPORTS ACADEMY & RAC CLUB
10 ORANGE BEACH TENNIS CENTER NaN LEWIS TENNIS
match_rating
0 47.0
1 43.0
2 67.0
3 43.0
4 67.0
5 72.0
6 48.0
7 64.0
8 67.0
9 50.0
10 67.0
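Since the question asks for a plain listing of last year's name, the best score, and the matched name from this year, one possible follow-up is to cut the result down to those three columns before exporting. The snippet below is a sketch rather than part of the original answer: it builds a small stand-in `data` frame from three rows of the output above (plus one row that only has a this-year name) so that it runs on its own, and the names `result` and `csv_text` are made up for illustration.

```python
import pandas as pd

nan = float("nan")

# Stand-in for the matched dataframe: three rows copied from the output above,
# plus one row that only holds a this-year name.
data = pd.DataFrame({
    "Company": ["PCH RESORT TENNIS SHOP", "LAKE FOREST", "TNL TENNIS PRO SHOP", nan],
    "Company name": [nan, nan, nan, "LEWIS TENNIS"],
    "this year": ["LEWIS TENNIS", "LAKE COUNTRY RACQUET CLUB", "LEWIS TENNIS", nan],
    "match_rating": [67.0, 64.0, 67.0, nan],
})

# Keep only last year's rows and the three requested columns,
# with the highest-scoring matches first (mergesort keeps ties in order).
result = (
    data.loc[data["Company"].notna(), ["Company", "match_rating", "this year"]]
    .sort_values("match_rating", ascending=False, kind="mergesort")
    .reset_index(drop=True)
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = result.to_csv(index=False)
```

Passing a filename such as "test.csv" to `to_csv` would write the same table to disk. Sorting by score also makes it easier to review the weakest matches by hand, since a low score still produces a "best" match.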