Is there a way to automate data cleaning for pandas DataFrames?

Problem description

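I am cleaning data for a machine learning project, replacing missing values in the 'Age' column with zero and missing values in the 'Fare' column with the column mean. The code is as follows: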

train_data['Age'] = train_data['Age'].fillna(0) 
mean = train_data['Fare'].mean()    
train_data['Fare'] = train_data['Fare'].fillna(mean)

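Since I have to do this many times for other datasets, I want to automate the process with a generic function that takes a DataFrame as input, modifies it, and returns the modified DataFrame. The code is as follows: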

def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df

However, when I pass in the training data DataFrame:

train_data = data_cleaning(train_data)

I get the following warning and error:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
      1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
      3 cross_val_data = data_cleaning(cross_val_data)

/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
      2     df['Age'] = df['Age'].fillna(0)
      3     fare_mean = df['Fare'].mean()
----> 4     df['Fare'] = df['Fare'].fillna()
      5     return df

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value, method, axis, inplace, limit, downcast)
   4820             inplace=inplace,
   4821             limit=limit,
-> 4822             downcast=downcast,
   4823         )
   4824 

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6311         """
   6312         inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313         value, method = validate_fillna_kwargs(value, method)
   6314 
   6315         self._consolidate_inplace()

/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in validate_fillna_kwargs(value, method, validate_scalar_dict_value)
    368 
    369     if value is None and method is None:
--> 370         raise ValueError("Must specify a fill 'value' or 'method'.")
    371     elif value is None and method is not None:
    372         method = clean_fill_method(method)

ValueError: Must specify a fill 'value' or 'method'.

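After some research, I found that I might have to use the apply() and map() functions, but I am not sure how to pass in a column's mean. This also does not scale well, because I would have to compute all the fillna values before passing them to the function, which is cumbersome. So my question is: is there a better way to automate data cleaning?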

Tags: python, pandas, data-cleaning

Solution


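So yes, the other answer explains where the error is coming from: fillna() is called on the 'Fare' column without a fill value, so the computed fare_mean is never used.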

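However, the warning at the beginning has nothing to do with filling NaNs. It is telling you that you are setting values on a copy of a slice of a DataFrame, which usually means train_data was itself created by slicing another DataFrame. One way to avoid it, assuming the function does not need to modify the caller's object in place, is to work on an explicit copy inside the function:

def data_cleaning(df):
    df = df.copy()  # explicit copy, so the assignments below do not target a slice
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna(fare_mean)  # <- and also fix this error
    return df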
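
To address the scaling concern in the question, the per-column fill values can be described once and computed inside the function. The following is only a sketch: fill_missing, fill_rules, and the sample data are hypothetical names, not part of the original code. It relies on DataFrame.fillna accepting a dict that maps column names to fill values.

```python
import pandas as pd


def fill_missing(df, fill_rules):
    """Return a copy of df with NaNs filled according to fill_rules.

    fill_rules maps a column name to either a constant or a callable
    that receives the column (a Series) and returns the fill value.
    """
    df = df.copy()  # leave the caller's DataFrame untouched
    fill_values = {}
    for column, rule in fill_rules.items():
        # Callables such as pd.Series.mean are evaluated per column,
        # so the statistic is computed from the data being cleaned.
        fill_values[column] = rule(df[column]) if callable(rule) else rule
    # DataFrame.fillna accepts a dict of {column: value}.
    return df.fillna(fill_values)


# Hypothetical sample data standing in for the real training set.
train_data = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [10.0, 20.0, None],
})

rules = {"Age": 0, "Fare": pd.Series.mean}
cleaned = fill_missing(train_data, rules)
print(cleaned)
```

The same rules dict could then be passed along with any other DataFrame that has these columns, such as cross_val_data, so no fill value has to be computed by hand before the call.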

I also suggest searching for that specific warning on this site, as there are many posts detailing it and how to deal with it.
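
For context, this warning commonly appears when the DataFrame being modified was produced by slicing a larger one. The question does not show how train_data was built, so the following is a minimal sketch under that assumption, using a hypothetical full_data frame and a simple row split. It takes the explicit copy at the point where the subset is created.

```python
import pandas as pd

# Hypothetical full dataset; the question does not show how train_data was built.
full_data = pd.DataFrame({
    "Age": [22.0, None, 35.0, 41.0],
    "Fare": [10.0, None, 30.0, 8.0],
})

# Taking an explicit copy makes train_data an independent DataFrame,
# so later column assignments are unambiguous.
train_data = full_data.iloc[:3].copy()

train_data["Age"] = train_data["Age"].fillna(0)
train_data["Fare"] = train_data["Fare"].fillna(train_data["Fare"].mean())

print(train_data)
print(full_data)  # the original still contains its missing values
```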

