python - Using a fuzzy matching algorithm to determine whether the Name column has similar entries
Problem description
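I am trying to use the FuzzyWuzzy package to inspect the "Name" column and determine whether similar names exist in it. I also transpose the document, because that is the format required for a project I am building on GMPP data. Any help would be greatly appreciated :)
The data is currently loaded into a DataFrame, because I thought that would be the easiest way to approach the problem. I also tried working directly from the CSV file, with no luck. I also tried splitting the data into two parts and comparing them, but that does not achieve what I am after when the data is dynamic: the process should still run when more data is added.
https://drive.google.com/open?id=1Aq_2NTfjRtf9b3L5hNHilctVQfcHynPD This link contains the data file and the code I am using.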
import pandas as pd
import numpy as np
# fuzzywuzzy is used for string matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Read the data in and sort it alphabetically by name
data = pd.read_csv("data(2).csv", encoding="ISO-8859-1")
print(data.head())
data = data.sort_values('Name')
print(data.head())
# Use the same encoding as above so both reads behave the same way
df = pd.read_csv("data(2).csv", encoding="ISO-8859-1")
org_list = data['Name']
# fuzzywuzzy scores run from 0 to 100, so the original value of 1300 could
# never be exceeded; 80 is only an illustrative value to tune
threshold = 80

def find_match(x):
    # fuzz.partial_token_sort_ratio attempts to account for similar strings out of order
    # limit=2 because the best hit is normally the name itself; the second hit is the candidate
    matches = process.extract(x, org_list, limit=2,
                              scorer=fuzz.partial_token_sort_ratio)
    if len(matches) < 2:
        return np.nan
    match = matches[1]
    # match[1] is the score
    return match if match[1] > threshold else np.nan

df['match found'] = [find_match(row) for row in df['Name']]
print(df)

# Transposing CSV
transposed_data = data.T
transposed_df = df.T
transposed_df.to_csv(r'Desktop\Transposed_Data(1).csv')
transposed_data.to_csv(r'Desktop\Transposed_Data.csv')
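The matching step can be tried without the CSV file. What follows is a minimal sketch and not the FuzzyWuzzy implementation: it swaps in `difflib.SequenceMatcher` from the standard library as the scorer, sorts the words first to imitate a token-sort comparison, and relies on hypothetical helpers (`token_sort_score`, `best_other_match`), made-up sample names and an assumed threshold of 80 on a 0 to 100 scale.

```python
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Score two names from 0 to 100 after lowercasing and sorting their words."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def best_other_match(index, names, threshold=80):
    """Return (other_name, score) for the closest entry at a different position, or None."""
    best = None
    for j, other in enumerate(names):
        if j == index:
            continue  # skip the entry itself, which would always score 100
        score = token_sort_score(names[index], other)
        if best is None or score > best[1]:
            best = (other, score)
    if best is not None and best[1] > threshold:
        return best
    return None


# Hypothetical sample data, not taken from the linked file
names = ["HRMC Programme", "Programme HRMC", "Broadband Rollout", "Digital Tax"]
matches = [best_other_match(i, names) for i in range(len(names))]
print(matches)
```

Skipping the entry by position rather than by value means that two identical names in different rows are still reported as a match for each other.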
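As an example of what I am trying to achieve: if one project is named "HRMC Programme" and another "HRMC Project", I want them to be recognised as similar and placed next to each other, with the redundant one removed, while keeping the same alphabetical ordering.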
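One way to picture this goal is to sort the names and then fold each name into the first existing group whose leading name it resembles. The sketch below is an illustration under stated assumptions, not a FuzzyWuzzy recipe: `similarity`, `group_similar`, the sample names and the cut-off of 60 are all hypothetical, and `difflib.SequenceMatcher` stands in for a fuzzy scorer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two names, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(names, threshold=60):
    """Greedily group names; each group is led by its alphabetically first name."""
    groups = []  # list of lists; group[0] is the representative of each group
    for name in sorted(names):
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)  # similar names end up next to each other
                break
        else:
            groups.append([name])  # no similar group found, start a new one
    return groups


# Hypothetical sample data, not taken from the linked file
names = ["HRMC Project", "Digital Tax", "HRMC Programme", "Broadband Rollout"]
groups = group_similar(names)
# Keep one representative per group; the result stays in alphabetical order
deduplicated = [group[0] for group in groups]
print(groups)
print(deduplicated)
```

Greedy grouping like this depends on the sort order and on the chosen cut-off, so the threshold would need tuning against the real names before any rows are dropped.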
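I have also tried term frequency-inverse document frequency (TF-IDF) and was able to output some of the names, but the output does not contain the complete key information. It did, however, remove "the" and other redundant words, so could this be a way forward?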
import pandas as pd
import numpy as np
# fuzzywuzzy is used for string matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the data in
data = pd.read_csv("data(2).csv", encoding="ISO-8859-1")
# Alphabetical order
data = data.sort_values('Name')
# Term frequency-inverse document frequency (TF-IDF):
# converts raw documents into a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
# The text in the Name column is the corpus
corpus = data['Name'].values
# Learn the vocabulary and build the TF-IDF matrix (one row per name)
tfidf_matrix = vectorizer.fit_transform(corpus)
# Print the learned vocabulary: individual words, not complete names
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out())

def ChunkIterator(filename, chunksize=1000):
    # chunksize is the number of rows read into a DataFrame at a single time,
    # to fit into local memory; the default of 1000 is an arbitrary choice
    for chunk in pd.read_csv(filename, encoding="ISO-8859-1", chunksize=chunksize):
        for document in chunk['Name'].values:
            yield document
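To compare whole names rather than list the vocabulary, TF-IDF vectors can be compared pairwise with cosine similarity. The following is a rough standard-library sketch and not scikit-learn's implementation: the whitespace tokeniser, the smoothed IDF formula `log((1 + n) / (1 + df)) + 1`, the helper names `tfidf_vectors` and `cosine`, and the sample names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Build one L2-normalised TF-IDF vector (a dict of word -> weight) per name."""
    docs = [name.lower().split() for name in names]
    n = len(docs)
    # Document frequency: in how many names each word appears
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


# Hypothetical sample data, not taken from the linked file
names = ["HRMC Programme", "HRMC Project", "Broadband Rollout"]
vectors = tfidf_vectors(names)
sim_hrmc = cosine(vectors[0], vectors[1])   # the two names share the word "hrmc"
sim_other = cosine(vectors[0], vectors[2])  # the two names share no words
print(sim_hrmc, sim_other)
```

With word-level tokens, "Programme" and "Project" count as unrelated, so only the shared word contributes to the score. That may be part of why TF-IDF on its own appeared to lose key information.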
Solution