Replacing incorrect country names in a dataset with the correct ones

Problem description

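I am facing a problem and do not know how to solve it. I have a large dataset with two columns, country and city name. Because of human error, many entries have misspelled country and city names. For example, England is written as Egnald.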

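Can anyone advise how to check and correct them in Python?

I was able to find the incorrect entries with the code below, but I am not sure how to correct them with an automated process, since I cannot do it manually.

Thanks

Here is what I have done so far: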

import pycountry as pc

# df is the DataFrame that holds the data; convert billing country to upper case
input_country_list = [element.upper() for element in df['Billing Country']]

def country_name_check():
    alpha_2 = []
    alpha_3 = []
    name = []
    common_name = []
    official_name = []
    # spellings that pycountry does not list but that should not be reported
    tobe_deleted = ['IRAN', 'SOUTH KOREA', 'NORTH KOREA', 'SUDAN', 'MACAU',
                    'REPUBLIC OF IRELAND']
    # collect every code and name that pycountry knows, in upper case
    for i in pc.countries:
        alpha_2.append(i.alpha_2.upper())
        alpha_3.append(i.alpha_3.upper())
        name.append(i.name.upper())
        if hasattr(i, "common_name"):
            common_name.append(i.common_name.upper())
        if hasattr(i, "official_name"):
            official_name.append(i.official_name.upper())
    valid_names = set(alpha_2 + alpha_3 + name + common_name + official_name)
    # anything that matches no known code or name is treated as misspelled
    invalid_countrynames = {j for j in input_country_list if j not in valid_names}
    invalid_countrynames = [item for item in invalid_countrynames
                            if item not in tobe_deleted]
    print(invalid_countrynames)
    return invalid_countrynames

By running the code above I can get the misspelled country names. Can someone advise how to replace them with the correct names?

Tags: python

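Solution

You can use SequenceMatcher from the standard-library difflib module. It has a ratio() method that lets you compare the similarity of two strings (a higher number means more similar, and 1.0 means the strings are identical):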

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None,'Dog','Cat').ratio()
0.0
>>> SequenceMatcher(None,'Dog','Dogg').ratio()
0.8571428571428571
>>> SequenceMatcher(None,'Cat','Cta').ratio()
0.6666666666666666
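
Note that SequenceMatcher compares characters exactly, so upper-case and lower-case letters count as different. Since the question upper-cases the input, it may help to normalise both strings before comparing them. The following is only a minimal sketch; the helper name similarity is made up for illustration and is not part of difflib:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0.0, 1.0] (hypothetical helper)."""
    # strip() and casefold() normalise surrounding whitespace and letter case
    # first, because SequenceMatcher compares characters exactly.
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


# Without normalisation only the leading "E" matches, so the ratio is low.
print(SequenceMatcher(None, "ENGLAND", "England").ratio())
# With normalisation the two spellings are treated as identical.
print(similarity("ENGLAND", "England"))
print(similarity("Egnald", "England"))
```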

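My idea is to make a list of correct country names, compare each record in the dataframe with every item in that list, and pick the most similar one, so you should end up with the correct country name. You can then put this into a function and apply it to every record in the country column of the dataframe: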

>>> #let's say we have following dataframe
>>> df
   number  country
0       1  Austria
1       2  Autrisa
2       3   Egnald
3       4   Sweden
4       5  England
5       6  Swweden
>>>
>>> #let's specify correct names
>>> correct_names = {'Austria','England','Sweden'}
>>>
>>> #let's specify the function that selects the most similar word
>>> def get_most_similar(word,wordlist):
...     top_similarity = 0.0
...     most_similar_word = word
...     for candidate in wordlist:
...         similarity = SequenceMatcher(None,word,candidate).ratio()
...         if similarity > top_similarity:
...             top_similarity = similarity
...             most_similar_word = candidate
...     return most_similar_word
...
>>> #now apply this function over 'country' column in dataframe
>>> df['country'].apply(lambda x: get_most_similar(x,correct_names))
0    Austria
1    Austria
2    England
3     Sweden
4    England
5     Sweden
Name: country, dtype: object
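
One caveat with always picking the most similar name is that an entry that resembles nothing in the list still gets replaced by something. A possible variation, sketched here under assumptions, uses difflib.get_close_matches with a cutoff so that such entries are left unresolved for manual review. The function name correct_country, the constant CORRECT_NAMES and the sample inputs are illustrative only, and the cutoff of 0.6 is simply the difflib default, not a tuned value:

```python
from difflib import get_close_matches

# Illustrative list of valid names; in practice this could come from pycountry.
CORRECT_NAMES = ["Austria", "England", "Sweden"]


def correct_country(name, valid_names=CORRECT_NAMES, cutoff=0.6):
    """Return the closest valid name, or None when nothing is similar enough."""
    # Compare case-insensitively, then map the match back to its canonical spelling.
    lookup = {valid.casefold(): valid for valid in valid_names}
    matches = get_close_matches(name.strip().casefold(), list(lookup),
                                n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


for raw in ["Autrisa", "Egnald", "Swweden", "Atlantis"]:
    # Entries with no sufficiently similar candidate come back as None.
    print(raw, "->", correct_country(raw))
```

With a dataframe, the same function could be passed to df['country'].apply(correct_country) and the result assigned back to the column. Raising the cutoff makes the matching stricter, at the cost of leaving more heavily scrambled entries uncorrected.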
