Using a fuzzy matching algorithm to check whether the Name column contains similar entries

Problem description

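I am trying to use the FuzzyWuzzy package to scan the "Name" column and determine whether similar names exist. I have also transposed the document, because that is the format required for a project I am building on GMPP data. Any help would be greatly appreciated :)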

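The data is currently loaded into a DataFrame, because I thought that would be the easiest way to approach the problem. I also tried working directly from the CSV file, without success. I tried splitting the data into two parts and comparing them as well, but that does not achieve what I am after when the data is dynamic: the process should still run if more data is added.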

The data file and code I am using are available at https://drive.google.com/open?id=1Aq_2NTfjRtf9b3L5hNHilctVQfcHynPD

import pandas as pd
import numpy as np
# fuzzywuzzy is used for string matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Read the data and sort it alphabetically by name
data = pd.read_csv("data(2).csv", encoding="ISO-8859-1")
print(data.head())
data = data.sort_values('Name')
print(data.head())

df = pd.read_csv("data(2).csv", encoding="ISO-8859-1")

org_list = data['Name']

# Score cutoff; note that fuzzywuzzy scorers return values from 0 to 100
threshold = 1300

def find_match(x):
    # fuzz.partial_token_sort_ratio attempts to account for similar strings out of order.
    # Index [1] takes the second-best candidate, since the best one is usually x itself.
    match = process.extract(x, org_list, limit=2,
                            scorer=fuzz.partial_token_sort_ratio)[1]
    match = match if match[1] > threshold else np.nan
    return match

df['match found'] = [find_match(row) for row in df['Name']]

print(df)
# Transpose both frames and write them to CSV
transposed_data = data.T
transposed_df = df.T
transposed_df.to_csv(r'Desktop\Transposed_Data(1).csv')
transposed_data.to_csv(r'Desktop\Transposed_Data.csv')
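
One thing worth checking in the code above is the threshold. FuzzyWuzzy scorers return an integer between 0 and 100, so a cutoff of 1300 can never be exceeded and every row would end up as NaN. The sketch below illustrates the scale using difflib from the standard library as a stand-in; it is not the partial_token_sort_ratio scorer itself, and the project names are made up for illustration.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale (difflib stand-in, not a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# Illustrative name pairs, not taken from the real data file
pairs = [
    ("HRMC Programme", "HRMC Programme"),
    ("HRMC Programme", "HRMC Project"),
    ("HRMC Programme", "Rail Upgrade"),
]
scores = [score(a, b) for a, b in pairs]
print(scores)

# No score can pass a cutoff that lies above the top of the scale
threshold = 1300
print(any(s > threshold for s in scores))
```

A cutoff somewhere inside the 0 to 100 range, tuned by inspecting a few known pairs, is probably what was intended.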

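As an example of what I am trying to achieve: if one project is named "HRMC Programme" and another "HRMC Project", I would like them to be recognised as similar and placed next to each other, with the redundant one removed, while keeping the same alphabetical ordering.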

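I have also tried term frequency-inverse document frequency (TF-IDF) and was able to output some of the names, but the output lacks the full key information. It did, however, strip out "the" and other redundant words, so would this be a way forward?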

import pandas as pd
import numpy as np
# fuzzywuzzy is used for string matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the data and sort it alphabetically by name
data = pd.read_csv("data(2).csv", encoding="ISO-8859-1")
data = data.sort_values('Name')

# Term frequency-inverse document frequency (TF-IDF):
# converts raw documents into a matrix of TF-IDF features
vectorizer = TfidfVectorizer()
# The text in the Name column is the corpus
corpus = data['Name'].values
# Learn the vocabulary and compute the TF-IDF matrix
vectorizer.fit_transform(corpus)
# Print the learned vocabulary (individual words, not whole names);
# on scikit-learn older than 1.0 use get_feature_names() instead
print(vectorizer.get_feature_names_out())

def ChunkIterator(filename, chunksize=1000):
    # chunksize is the number of rows read into a DataFrame at a time so that
    # each chunk fits into local memory (1000 is an arbitrary illustrative value)
    for chunk in pd.read_csv(filename, encoding="ISO-8859-1", chunksize=chunksize):
        for document in chunk['Name'].values:
            yield document
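
On the TF-IDF question: the vocabulary printed above only lists individual words. To find similar names, the TF-IDF vectors of the names still have to be compared with each other, for example by cosine similarity. Below is a minimal hand-rolled sketch of that idea in pure Python. It is not the scikit-learn implementation (tokenisation is a plain whitespace split, and only the smoothed IDF formula is borrowed), and the names are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one {word: weight} TF-IDF vector per document (smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


names = ["HRMC Programme", "HRMC Project", "Rail Upgrade"]  # illustrative names
vectors = tfidf_vectors(names)
sim_hrmc = cosine(vectors[0], vectors[1])  # the two names share the word "hrmc"
sim_rail = cosine(vectors[0], vectors[2])  # the two names share no words
print(sim_hrmc, sim_rail)
```

Because this works on whole words, "Programme" and "Project" contribute nothing to the similarity; only the shared "HRMC" does. With scikit-learn, the matrix returned by fit_transform could be passed to sklearn.metrics.pairwise.cosine_similarity to get the same kind of pairwise scores.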

Tags: python, pandas, fuzzy-search

Solution
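
One possible way to get similar names next to each other and drop the redundant ones is to sort the names and group them greedily by similarity. The following is only a sketch under some assumptions: it uses difflib from the standard library as a stand-in scorer, the cutoff of 65 and the function names are hypothetical, and the sample names are invented rather than taken from the data file.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(names, threshold=65):
    """Greedily group names whose score against a group's first member meets the cutoff."""
    groups = []
    for name in sorted(names):  # alphabetical order is preserved within and across groups
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one
            groups.append([name])
    return groups


# Illustrative names, not taken from the real data file
names = ["HRMC Programme", "Rail Upgrade", "HRMC Project", "Rail Upgrade"]
groups = group_similar(names)
# Keep only the first name of each group to remove the redundant entries
deduplicated = [group[0] for group in groups]
print(groups)
print(deduplicated)
```

With the DataFrame from the question, data['Name'].tolist() could be passed in as names, and similarity could be swapped for fuzz.partial_token_sort_ratio, which also returns a score from 0 to 100. The cutoff needs tuning against real data: too low merges unrelated projects, too high misses near-duplicates.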

