首页 > 解决方案 > pandas 中的数据清理:如果这些字符串包含在另一列中,则用特定字符串替换空值

问题描述

我目前正在研究一个汽车排放数据集,我想在其中清理/标准化汽车型号名称。数据集相当大,但这里是前 10 行:

cars_em_df = pd.DataFrame({'manufacturer_name_mapped': ['FIAT', 'FIAT','FIAT','FIAT','FIAT','BMW AG','BMW AG','BMW AG','BMW AG','BMW AG'],
'commercial_name':['124 gt multiair auto', '500l wagon pop star t-jet', 
'doblo combi 1.4 95', 'panda  0.9t sge 85 natural power', 'punto 1.4  77 lpg', 'x4 xdrive20d se auto', '216d active tourer b37 f45','220d gran tourer b47 f46','x1 xdrive18d sport','320i xdrive m sport gt auto'],
'fuel_type_mapped':['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG','Diesel','Diesel','Diesel','Diesel','Petrol'],
'file_year':[2018, 2018, 2018, 2018, 2018,2018, 2018, 2018, 2018, 2018], 'emissions': [153,158,165,86,114,131,166,200,151,149], 'commercial_name_cleaned':['124','500',None,'panda','punto','x4',None,None,'x1',None]})  

右侧的“commercial_name_cleaned”列是我第一次清理练习的结果,其中我将“commercial_name”列中的名称与来自不同来源的标准化名称列表相匹配。如您所见,这些都是非常简单和简短的名称。每当我无法匹配模型名称时,我的函数都会返回“无”。

作为第二步,我现在想要执行以下操作:如果它是“无”,则在相邻的“commercial_name”列中搜索特定字符串并将其替换为我指定的模型名称。我试过这个:

    def str_ops(commercial_name_cleaned,commercial_name):
          if commercial_name_cleaned == None:
             if '216' in commercial_name:
                return '2-series'
             elif '220' in commercial_name:
                return '2-series'
             elif '320' in commercial_name:
                return '3-series'

然后我会将此函数应用于数据框:

cars_em_df['commercial_name_cleaned'] = cars_em_df.apply(lambda x: str_ops(str(x.commercial_name_cleaned), str(x.commercial_name)), axis=1)

需要注意的是,如果在'commercial_name'中找不到 '320' 或 '220' 等,则该函数不应更改任何内容,而只返回已在'commercial_name_cleaned'中的值。但是,当我应用该函数时,整个'commercial_name_cleaned'列就变成了 'None' 值。所以这个功能一定有问题。有谁知道如何解决这个问题?

非常感谢您的帮助,谢谢!

标签: pythonpandasdata-cleaning

解决方案


None在列中获取值是commercial_name_cleaned因为您没有从函数返回任何内容str_ops当您没有显式返回任何内容时,会隐None式返回类型。

代替:

def str_ops(commercial_name_cleaned,commercial_name):
    if commercial_name_cleaned == None:
        if '216' in commercial_name:
            return '2-series'
        elif '220' in commercial_name:
            return '2-series'
        elif '320' in commercial_name:
            return '3-series'

和:

def str_ops(commercial_name_cleaned,commercial_name):
    if commercial_name_cleaned == 'None':
        if '216' in commercial_name:
            return '2-series'
        elif '220' in commercial_name:
            return '2-series'
        elif '320' in commercial_name:
            return '3-series'
    else:
        return commercial_name_cleaned

输出:

manufacturer_name_mapped                   commercial_name  ... emissions  commercial_name_cleaned
0                     FIAT              124 gt multiair auto  ...       153                      124
1                     FIAT         500l wagon pop star t-jet  ...       158                      500
2                     FIAT                doblo combi 1.4 95  ...       165                     None
3                     FIAT  panda  0.9t sge 85 natural power  ...        86                    panda
4                     FIAT                 punto 1.4  77 lpg  ...       114                    punto
5                   BMW AG              x4 xdrive20d se auto  ...       131                       x4
6                   BMW AG        216d active tourer b37 f45  ...       166                 2-series
7                   BMW AG          220d gran tourer b47 f46  ...       200                 2-series
8                   BMW AG                x1 xdrive18d sport  ...       151                       x1
9                   BMW AG       320i xdrive m sport gt auto  ...       149                 3-series

推荐阅读