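python - Is there a way to automate data cleaning for pandas DataFrames?
Question
I am cleaning data for a machine learning project, replacing missing values in the "Age" column with zero and missing values in the "Fare" column with the column mean. The code is: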
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
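Since I have to do this several times for other datasets, I want to automate the process with a generic function that takes a DataFrame as input, modifies it, and returns the modified DataFrame. The code is: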
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df
However, when I pass in the training DataFrame:
train_data = data_cleaning(train_data)
I get the following warning and error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
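After some research, I found that I would have to use the apply() and map() functions, but I am not sure how to pass in a column's mean. This also does not scale well, because I would have to compute all the fillna values before passing them to the function, which is cumbersome. So I would like to ask: is there a better way to automate data cleaning?
Answer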
So yes, the other answer explains where the error is coming from.
However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you are assigning to a copy of a slice of a DataFrame. Change your code to:
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
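For what it's worth, this warning usually shows up when the DataFrame passed in was itself produced by slicing another DataFrame (for example, a train/validation split). The sketch below is one possible way to handle that, not necessarily what the asker's notebook needs: the function works on an explicit copy, so the assignment never targets a slice. The sample data and the "Split" column are made up for illustration.

```python
import numpy as np
import pandas as pd


def data_cleaning(df):
    # Work on an explicit copy so the caller's frame (possibly a slice
    # of a larger DataFrame) is left untouched.
    df = df.copy()
    df["Age"] = df["Age"].fillna(0)
    df["Fare"] = df["Fare"].fillna(df["Fare"].mean())
    return df


# Hypothetical data: "Split" is an assumed column, used only to build a slice.
full = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan],
        "Fare": [7.25, np.nan, 8.05, 53.10],
        "Split": ["train", "train", "train", "val"],
    }
)
train_data = full[full["Split"] == "train"]  # a slice of `full`
cleaned = data_cleaning(train_data)
print(cleaned)
```

Because the function returns a new object, the original slice keeps its NaNs, which may or may not be what you want. Calling .copy() at the point where train_data is first created is another option.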
I also suggest searching for that specific warning on this site, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
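As for the scaling concern in the question, here is a minimal sketch of a more general helper, assuming each column needs either a constant or a simple statistic as its fill value. The name fill_missing, the strategies argument, and the sample columns are hypothetical and not part of pandas. The only pandas feature it relies on is that DataFrame.fillna accepts a dict mapping column names to fill values.

```python
import numpy as np
import pandas as pd

# Names of Series reductions that this sketch treats as "compute from the column".
_STATISTICS = {"mean", "median", "min", "max"}


def fill_missing(df, strategies):
    """Return a new DataFrame with missing values filled per column.

    `strategies` maps a column name to either a constant fill value or
    the name of a statistic listed in _STATISTICS.
    """
    fill_values = {}
    for column, strategy in strategies.items():
        if isinstance(strategy, str) and strategy in _STATISTICS:
            # Compute the statistic from the column itself, e.g. df["Fare"].mean().
            fill_values[column] = getattr(df[column], strategy)()
        else:
            # Anything else is used as a constant fill value.
            fill_values[column] = strategy
    # fillna with a dict fills each listed column with its own value
    # and returns a new DataFrame.
    return df.fillna(value=fill_values)


# Hypothetical example data.
raw = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 30.0],
        "Fare": [10.0, np.nan, 20.0],
        "Cabin": ["A", None, "B"],
    }
)
cleaned = fill_missing(raw, {"Age": 0, "Fare": "mean", "Cabin": "unknown"})
print(cleaned)
```

With this shape, the per-dataset differences live in the strategies dict rather than in the function body, so the fill values do not have to be computed by hand before each call. One caveat for machine learning use: statistics computed on a validation or test set differ from the training ones, so you may prefer to compute them once on the training data and pass them in as constants.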