Custom wrapper class for scikit-learn's IterativeImputer for use with cross_val_score()

Question

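Scikit-learn's IterativeImputer can impute missing values in a round-robin fashion. To compare its performance with other, conventional regressors, one could build a simple pipeline and get scoring metrics from cross_val_score. The problem is that IterativeImputer has no predict method, as the error shows: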

AttributeError: 'IterativeImputer' object has no attribute 'predict'

See the minimal example of the attempted implementation:

# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# define scaler, model and pipeline
scaler = StandardScaler() # use any scaler
imputer = IterativeImputer() # with any estimator, default = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])

train, test = df.values, df['A'].values 
scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
print(scores)

What are the possible solutions? If a custom wrapper is needed, how should it be written so that it includes a predict method?

Tags: python, machine-learning, scikit-learn, missing-data, imputation

Answer


cross_val_score needs a pipeline whose last step is a model (something that has predict):

from sklearn.linear_model import BayesianRidge

scaler  = StandardScaler()
imputer = IterativeImputer()
model   = BayesianRidge()  # any model

pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])

cross_val_score makes no sense without a model.
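
If the goal really is to score the imputer itself as a predictor, the wrapper idea from the question could be sketched roughly as below. This is only an illustration under assumptions: the class name ImputerRegressor is hypothetical (it is not part of scikit-learn), and it assumes the target is appended as the last column at fit time and passed as an all-NaN column at predict time, so that the imputer fills it in. The synthetic data is made up for the demonstration.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score


class ImputerRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: predicts y by letting IterativeImputer impute it."""

    def __init__(self, max_iter=10, random_state=None):
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        # Fit the imputer on features and target together (target is the last column)
        self.imputer_ = IterativeImputer(max_iter=self.max_iter,
                                         random_state=self.random_state)
        self.imputer_.fit(np.hstack([X, y]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Present the target as a fully missing column and read back the imputed values
        y_unknown = np.full((X.shape[0], 1), np.nan)
        filled = self.imputer_.transform(np.hstack([X, y_unknown]))
        return filled[:, -1]


# Synthetic data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

scores = cross_val_score(ImputerRegressor(random_state=0), X, y, cv=5, scoring='r2')
print(scores)
```

With the default BayesianRidge inside the imputer this behaves much like a plain BayesianRidge regression on complete data, so the simpler pipeline with an explicit model step is usually the more direct choice.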


I also see another problem, related to the values you pass as train and test in cross_val_score.

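They should be called X and y rather than train and test. Those are only names, so that is not so important; what matters is what you assign to the variables.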

The problem is that X should not contain y, but you use train = df.values, so you create an X that includes y.

df_train = pd.DataFrame({
                'X': range(20), 
                'y': range(20),
           })

X_train = df_train[ ['X'] ]  # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[  'y'  ]  # it has to be single column (Series)

scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')

(BTW: you don't have to use .values.)

The same with more columns:

df_train = pd.DataFrame({
                'A': range(20), 
                'B': range(20), 
                'y': range(20),
           })

X_train = df_train[ ['A', 'B'] ]
y_train = df_train[ 'y' ]
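
When the target sits in the same DataFrame as the features, another way to keep it out of X is to drop it by name instead of listing every feature column. A small sketch, reusing the fake column names from above (the variable name target is just illustrative):

```python
import pandas as pd

df_train = pd.DataFrame({
    'A': range(20),
    'B': range(20),
    'y': range(20),
})

target = 'y'
X_train = df_train.drop(columns=[target])  # every column except the target (DataFrame)
y_train = df_train[target]                 # the target alone (Series)

print(list(X_train.columns))
print(y_train.name)
```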

Minimal working code, but with fake data (useless):

# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import BayesianRidge

df_train = pd.DataFrame({
                'A': range(100),  # fake data
                'B': range(100),  # fake data
                'y': range(100),  # fake data
           })

df_test = pd.DataFrame({
                'A': range(20),  # fake data
                'B': range(20),  # fake data
                'y': range(20),  # fake data
           })

# define scaler, model and pipeline
scaler  = StandardScaler()
imputer = IterativeImputer()
model   = BayesianRidge()

pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])

X_train = df_train[ ['A', 'B'] ]  # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ]         # it has to be single column (Series)

scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
print(scores)

X_test = df_test[['A', 'B']]
y_test = df_test['y']

scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
print(scores)
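
The fake data above contains no missing values, so the imputer has nothing to do. A variant with some NaN entries injected might look like the following. The synthetic data, the roughly 10% missing rate, and the random seeds are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and a target that depends on them (fake data)
df = pd.DataFrame({'A': rng.normal(size=200), 'B': rng.normal(size=200)})
df['y'] = 2.0 * df['A'] - df['B'] + rng.normal(scale=0.1, size=200)

X = df[['A', 'B']]
y = df['y']

# Replace roughly 10% of the feature values with NaN; the target stays complete
missing = rng.random(X.shape) < 0.1
X = X.mask(missing)

pipeline = Pipeline(steps=[
    ('s', StandardScaler()),                  # ignores NaN when fitting, keeps it when transforming
    ('i', IterativeImputer(random_state=0)),  # fills the NaN values
    ('m', BayesianRidge()),                   # final step provides predict
])

scores = cross_val_score(pipeline, X, y, cv=10, scoring='r2')
print(scores)
```

Because the imputer is a pipeline step, it is refit on the training part of each fold, so the imputation does not see the held-out rows.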
