How do I clean a DataFrame column filled with names using Python?

Problem description

I have the following DataFrame:

import pandas as pd
df = pd.DataFrame(columns=['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']

I want to clean the column to get the following result:

df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df

The cleaned names are based on the following reference table:

ref = pd.DataFrame( columns = ['Cleaned Names']) 
ref['Cleaned Names'] = ['adam','beth']

I know about fuzzy matching, but I am not sure it is the most efficient way to solve this problem.

Tags: python, pandas, data-cleaning

Solution

You can try:

lst = ['adam', 'beth']
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Caution: ffill() copies the value from the previous row, so it can assign
# the wrong name when an unmatched row follows a row for a different name

Explanation:

lst = ['adam', 'beth']
# Create the list of reference names
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
# For each reference name, check (ignoring case) whether each value in 'Name'
# contains it. This gives a boolean Series; map({True: x}) turns True into the
# reference name and False into NaN. Concatenating the resulting Series along
# axis=1 produces a DataFrame with one column per reference name
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Backward-fill along axis=1 and take the first column, which then holds the
# first matched name in each row
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Forward-fill the remaining missing values (rows such as 'beht' that contain
# neither reference name)
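
Since the question mentions fuzzy matching, here is a minimal sketch of that alternative, which does not depend on row order the way ffill() does. It is not part of the original answer. It assumes the standard-library difflib module is acceptable; the helper name closest_name, the normalization step, and the 0.6 cutoff are illustrative choices that would likely need tuning on real data.

```python
import difflib
import re

def closest_name(name, choices, cutoff=0.6):
    """Return the reference name most similar to `name`, or None if none reaches `cutoff`."""
    # Normalize: lowercase and drop anything that is not a letter (e.g. the dot in 'Adam.')
    cleaned = re.sub(r'[^a-z]', '', name.lower())
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Sample data copied from the question
names = ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.', 'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']
reference = ['adam', 'beth']

corrected = [closest_name(n, reference) for n in names]
print(corrected)
```

To apply this to the DataFrame, something like df['Name corrected'] = df['Name'].map(lambda n: closest_name(n, ref['Cleaned Names'].tolist())) should work. Rows with no sufficiently similar reference name come back as missing values instead of silently inheriting a neighbor's name.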
