Imputing with Sklearn changes "numeric" columns to "object" (in addition to filling missing data)

Problem description

Before imputation, `X_train` contains numeric columns, which I list with `numeric_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]`.

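After imputation, the new DataFrame `imputed_X_train_missing` no longer has any numeric columns: every column that used to be numeric is now `object`. This is a potential problem when applying XGBRegressor.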

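Here is my code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Work on copies so the original frames stay untouched.
X_train_missing = X_train.copy()
X_valid_missing = X_valid.copy()

my_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit on the training data only, then transform both sets.
my_imputer.fit(X_train_missing)
imputed_X_train_missing = pd.DataFrame(my_imputer.transform(X_train_missing))
imputed_X_valid_missing = pd.DataFrame(my_imputer.transform(X_valid_missing))

# transform() returns a NumPy array, so the column names must be restored.
imputed_X_train_missing.columns = X_train_missing.columns
imputed_X_valid_missing.columns = X_valid_missing.columns
```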

Tags: python

Solution


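The problem is the imputer. When any one of the columns is `object`, `SimpleImputer` converts the whole input to a single NumPy array of dtype `object`, so every column comes out as `object` after imputation: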

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = [['dddd', 2, 3], ['dddd', np.nan, 6], ['dddd', 5, 9]]
X_test = [[np.nan, 2, 3], ['dddd', np.nan, 6], ['dddd', np.nan, 9]]

col_names = ['c1', 'c2', 'c3']

df_x_train = pd.DataFrame(X_train, columns=col_names)
df_x_test = pd.DataFrame(X_test, columns=col_names)
df_x_train.info()
```

Before imputation, the columns have three different dtypes:

```
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      3 non-null      object
 1   c2      2 non-null      float64
 2   c3      3 non-null      int64
```

```python
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(df_x_train)
imputed_x_train = pd.DataFrame(imp.transform(df_x_train))
imputed_x_train.dtypes
```

Now all the columns are `object`:

```
0    object
1    object
2    object
dtype: object
```
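The answer above shows the cause but stops short of a fix. One possible workaround, sketched here on the same toy data, is to convert the columns back after imputation. The names `restored` and `inferred` are illustrative and not part of the original answer. This assumes the imputed values can be cast back to each column's original dtype, which should hold for `most_frequent` because the fill value is taken from the column itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_x_train = pd.DataFrame(
    [['dddd', 2, 3], ['dddd', np.nan, 6], ['dddd', 5, 9]],
    columns=['c1', 'c2', 'c3'],
)

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Keep the original column names and index when wrapping the NumPy result.
imputed_x_train = pd.DataFrame(
    imp.fit_transform(df_x_train),
    columns=df_x_train.columns,
    index=df_x_train.index,
)

# Option 1 (illustrative): cast every column back to its pre-imputation dtype.
restored = imputed_x_train.astype(df_x_train.dtypes.to_dict())

# Option 2 (illustrative): let pandas re-infer a dtype for each object column.
inferred = imputed_x_train.infer_objects()

print(restored.dtypes)
print(inferred.dtypes)
```

Casting back with `astype` is explicit and keeps the training and validation frames consistent if the same dtype mapping is applied to both. `infer_objects` needs no mapping, but the dtype it picks depends on the values present in each column.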
