Building a Transformer for an sklearn Pipeline: replacing null values with KNN predictions

Problem description

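I want to write a class that replaces the missing values of one target variable in a DataFrame, based on the k nearest neighbors identified using the other variables. The class "fits" a KNN on the training set, and that KNN later "predicts" the missing values in both the training set and the test set.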

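This class has to be included in an sklearn.pipeline.Pipeline, which means it must provide fit() and transform() methods that the pipeline will call. I can't find a good way to write this class.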
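
For reference, the fit()/transform() contract that a Pipeline relies on can be sketched with a much simpler transformer. ColumnMeanFiller below is a hypothetical example, not part of the question's code: it learns one value in fit() and reuses it in transform(), which is the same pattern the KNN imputer needs.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnMeanFiller(TransformerMixin, BaseEstimator):
    """Hypothetical example: fill NaNs in one numeric column with the training mean."""

    def __init__(self, column):
        # Constructor arguments are stored unchanged, under their own names.
        self.column = column

    def fit(self, X, y=None):
        # State learned from the data gets a trailing underscore by convention.
        self.mean_ = X[self.column].mean()
        return self

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        X[self.column] = X[self.column].fillna(self.mean_)
        return X


train = pd.DataFrame({"age": [20.0, None, 40.0]})
test = pd.DataFrame({"age": [None, 50.0]})

pipe = Pipeline([("fill_age", ColumnMeanFiller(column="age"))])
pipe.fit(train)                     # learns the mean on the training set only
filled_test = pipe.transform(test)  # reuses that mean on the test set
```

Anything computed from the data belongs in fit(); transform() should only apply what was learned.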

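What my code does so far:

  1. Prepare the data for the KNN: (a) fill null values using standard techniques (b) one-hot encode the categorical variables
  2. Fit the KNN on the DataFrame with the target variable removed
  3. Predict the target variable on the rows where the target is null

My main problem is that steps 1.a and 1.b create temporary DataFrames that should not be "refit" on the test set.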
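
One way to avoid refitting steps 1.a and 1.b on the test set is to keep the fitted preprocessing objects and only call transform() on new data. The sketch below assumes scikit-learn's built-in SimpleImputer and OneHotEncoder as stand-ins for the question's own helper classes, and the small train/test frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "age": [39.0, 28.0, np.nan, 21.0],
    "workclass": ["State-gov", "Private", "Private", np.nan],
})
test = pd.DataFrame({
    "age": [np.nan, 43.0],
    "workclass": ["Self-emp", np.nan],  # "Self-emp" never appears in train
})

prep = ColumnTransformer(
    [
        # Step 1.a for numeric columns: fill NaNs with the training mean.
        ("num", SimpleImputer(strategy="mean"), ["age"]),
        # Steps 1.a and 1.b for categorical columns: most frequent value, then one-hot.
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")), ["workclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

prep.fit(train)                      # statistics and categories come from train only
train_matrix = prep.transform(train)
test_matrix = prep.transform(test)   # same columns, nothing is refit here
```

Because the encoder was fitted on the training set, the test matrix has the same number of columns, and the unseen category is encoded as all zeros instead of creating a new column.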

I need your help restructuring my code snippet in a programmatic way.

Here is my code so far:

col = 'native-country' #one specific column where nans should be replaced using KNN
n_neighbors = 3

######
#I guess this block should be in a pipeline so that we transform the test set with the same dict as the train set
######
miss = TreatMissingsWithCommons() #this class replaces numerical nans by mean() and categorical nans by most frequent value
miss.fit(data)
data_full = miss.transform(data)

#One Hot Encode categorical variables to pass the data to KNN
ohe = DummyTransformer()
ohe.fit(data_full)
#OHE categorical features on lines where col is not null
data_ohe_full = ohe.transform(data_full[~data[col].isnull()].drop(col, axis=1))

#Fit the KNN on lines where col is not null
if data[col].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
    knn = KNeighborsRegressor(n_neighbors = n_neighbors)
    knn.fit(data_ohe_full, data[col][~data[col].isnull()])
else:
    knn = KNeighborsClassifier(n_neighbors = n_neighbors)
    knn.fit(data_ohe_full, data[col][~data[col].isnull()])

#OHE on lines where col is null, and make the prediction
ohe_nulls = ohe.transform(data_full[data[col].isnull()].drop(col,axis=1))
knn.predict(ohe_nulls)
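
The dtype check in the snippet above lists dtype names by hand, which misses types such as int8 or uint8. A possible alternative, sketched here, is pandas' is_numeric_dtype. The helper name make_knn_for is hypothetical. Note that is_numeric_dtype also returns True for boolean columns, which may or may not be what you want.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor


def make_knn_for(series, n_neighbors=3):
    """Hypothetical helper: regressor for numeric targets, classifier otherwise."""
    if is_numeric_dtype(series):
        return KNeighborsRegressor(n_neighbors=n_neighbors)
    return KNeighborsClassifier(n_neighbors=n_neighbors)


numeric_target = pd.Series([1.0, np.nan, 3.0])
text_target = pd.Series(["Cuba", np.nan, "United-States"])

knn_for_numbers = make_knn_for(numeric_target)  # a KNeighborsRegressor
knn_for_text = make_knn_for(text_target)        # a KNeighborsClassifier
```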

Here is some data to help reproduce the problem:

import numpy as np
import pandas as pd

data = pd.DataFrame({'age': {0: 39,
  4: 28,
  10777: 53,
  14430: 21,
  19061: 19,
  19346: 39,
  24046: 39,
  25524: 43,
  30902: 18},
 'education-num': {0: 13,
  4: 13,
  10777: 9,
  14430: 7,
  19061: 8,
  19346: 13,
  24046: 4,
  25524: 10,
  30902: 5},
 'native-country': {0: 'United-States',
  4: 'Cuba',
  10777: np.nan,
  14430: 'United-States',
  19061: 'El-Salvador',
  19346: np.nan,
  24046: 'Dominican-Republic',
  25524: 'United-States',
  30902: np.nan},
 'workclass': {0: 'State-gov',
  4: 'Private',
  10777: 'Private',
  14430: np.nan,
  19061: 'Private',
  19346: 'Private',
  24046: 'Private',
  25524: np.nan,
  30902: 'Private'}})

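Edit: after a good night's sleep, I cleared up my thinking and came up with a solution. It is quite dirty, so I would appreciate some feedback on the good practices I am missing.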

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

class KnnImputer(TransformerMixin, BaseEstimator):

    def __init__(self, target, n_neighbors = 5):
        #store init params under their own names so get_params() and clone() work
        self.target = target
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        not_null = X[self.target].notnull()

        #this class replaces numerical nans by mean() and categorical nans by most frequent value
        #keep the fitted object so that transform() can reuse it on the test set
        self.miss = TreatMissingsWithCommons()
        self.miss.fit(X)
        X_full = self.miss.transform(X)

        #One Hot Encode categorical variables to pass the data to KNN
        self.ohe = DummyTransformer()
        self.ohe.fit(X_full)
        #DataFrame without any nulls, categ variables are OHE, only rows where the target is known
        X_ohe_full = self.ohe.transform(X_full[not_null].drop(self.target, axis=1))

        #Fit the KNN on lines where the target is not null
        if X[self.target].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
            self.knn = KNeighborsRegressor(n_neighbors = self.n_neighbors)
        else:
            self.knn = KNeighborsClassifier(n_neighbors = self.n_neighbors)
        self.knn.fit(X_ohe_full, X.loc[not_null, self.target])

        return self

    def transform(self, X, y=None):
        nulls = X[self.target].isnull()
        X_imputed = X.copy()
        if not nulls.any():
            return X_imputed

        #reuse the objects fitted on the train set, never refit them here
        X_full = self.miss.transform(X)
        #OHE on lines where the target is null, and make the prediction
        ohe_nulls = self.ohe.transform(X_full[nulls].drop(self.target, axis=1))

        #write the predictions back in place so row order and index are preserved
        X_imputed.loc[nulls, self.target] = self.knn.predict(ohe_nulls)

        return X_imputed #the dataframe with a full target
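
For comparison, here is a self-contained sketch of the same idea that runs without the question's helper classes. It assumes scikit-learn's SimpleImputer and OneHotEncoder in place of TreatMissingsWithCommons and DummyTransformer (whose code is not shown above), and the class name KnnTargetImputer and the small train/test frames are made up for illustration. Treat it as one possible layout rather than a definitive implementation.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


class KnnTargetImputer(TransformerMixin, BaseEstimator):
    """Sketch: fill NaNs in `target` with KNN predictions from the other columns."""

    def __init__(self, target, n_neighbors=5):
        self.target = target
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        features = X.drop(columns=self.target)
        known = X[self.target].notna()
        # Assumed stand-in for the question's helpers: impute, then one-hot encode.
        self.prep_ = ColumnTransformer(
            [
                ("num", SimpleImputer(strategy="mean"),
                 make_column_selector(dtype_include=np.number)),
                ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                                      OneHotEncoder(handle_unknown="ignore")),
                 make_column_selector(dtype_exclude=np.number)),
            ],
            sparse_threshold=0,  # always return a dense array
        )
        self.prep_.fit(features)
        numeric = is_numeric_dtype(X[self.target])
        knn_cls = KNeighborsRegressor if numeric else KNeighborsClassifier
        self.knn_ = knn_cls(n_neighbors=self.n_neighbors)
        # Train only on the rows where the target is known.
        self.knn_.fit(self.prep_.transform(features[known]), X.loc[known, self.target])
        return self

    def transform(self, X):
        X = X.copy()
        missing = X[self.target].isna()
        if missing.any():
            features = X.loc[missing].drop(columns=self.target)
            # Fill in place so that row order and index stay untouched.
            X.loc[missing, self.target] = self.knn_.predict(self.prep_.transform(features))
        return X


train = pd.DataFrame({
    "age": [39, 28, 53, 21, 19, 39],
    "workclass": ["State-gov", "Private", "Private", np.nan, "Private", "Private"],
    "native-country": ["United-States", "Cuba", np.nan,
                       "United-States", "El-Salvador", np.nan],
})
test = pd.DataFrame({
    "age": [43, 18],
    "workclass": [np.nan, "Private"],
    "native-country": ["United-States", np.nan],
})

imputer = KnnTargetImputer(target="native-country", n_neighbors=3).fit(train)
train_imputed = imputer.transform(train)
test_imputed = imputer.transform(test)  # uses only what was learned on train
```

All fitted state lives on the instance (prep_ and knn_), so transform() can be called on any new DataFrame with the same columns without touching the training data again.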

Tags: python, oop, scikit-learn, pipeline

Solution
