Why does the target encoder encode some values as NaN?

Question

I am using the target encoder from category_encoders to encode features. This is the code I am using:

from category_encoders import TargetEncoder
def encode_large_features(features, X_train, X_test, y_train):
    print('target encoding features ...')
    for _ in features:
        target_encoder = TargetEncoder(_)
        target_encoder.fit(X_train[_], y_train)
        name = _ + '_encoded'
        X_train[name] = target_encoder.transform(X_train[_])
        X_train.drop([_], axis=1, inplace=True)
        X_test[name] = target_encoder.transform(X_test[_])
        X_test.drop([_], axis=1, inplace=True)
    return X_train, X_test

The target encoder encodes some values as NaN, and I don't know why. Here is an example:

[image]

Tags: python

Answer


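I faced the same problem and raised an issue in the repo.

I found a workaround by building a custom K-fold target encoder, which I consider better than the library version. A K-fold target encoder computes each row's encoding from the other folds only, so it is less susceptible to data leakage and less likely to overfit.

Unlike the category_encoders library, it does not return NaN on the training dataset.

Example below: chid is a categorical column, and KFoldTargetEncoder is applied to it.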

Required libraries:

import numpy as np
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn import base

Training dataset:

class KFoldTargetEncoderTrain(base.BaseEstimator, base.TransformerMixin):

    def __init__(self, colnames, targetName, n_fold=5, verbosity=True, discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(self.targetName, str)
        assert isinstance(self.colnames, str)
        assert self.colnames in X.columns
        assert self.targetName in X.columns

        mean_of_target = X[self.targetName].mean()
        # random_state is omitted: newer scikit-learn versions raise a
        # ValueError when it is set together with shuffle=False.
        kf = KFold(n_splits=self.n_fold, shuffle=False)

        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan

        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            # Encode the held-out fold with category means from the other folds.
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                X_tr.groupby(self.colnames)[self.targetName].mean())

        # Categories absent from the other folds map to NaN;
        # fall back to the global target mean.
        X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)

        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'.format(
                col_mean_name,
                self.targetName,
                np.corrcoef(X[self.targetName].values, encoded_feature)[0][1]))

        if self.discardOriginal_col:
            # Note: as in the original code, this drops the target column.
            X = X.drop(self.targetName, axis=1)

        return X

Fit and transform the training data:

targetc_chid = KFoldTargetEncoderTrain('chid','target',n_fold=5)
train_df = targetc_chid.fit_transform(train_df)
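
To see why the training output ends up without NaN, the out-of-fold step can be traced on a tiny frame. This is only a sketch, not the class above: the toy data and the variable names (`toy`, `encoded`, `n_missing_before`) are made up for illustration, and it assumes the same fallback to the global target mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data. With 3 unshuffled folds, both "b" rows land in the
# same fold and "c" occurs only once, so neither category is visible in the
# other folds when its own rows are held out.
toy = pd.DataFrame({
    "chid":   ["a", "a", "b", "b", "a", "c"],
    "target": [1,   0,   1,   1,   0,   1],
})

global_mean = toy["target"].mean()
encoded = pd.Series(np.nan, index=toy.index, dtype=float)

kf = KFold(n_splits=3, shuffle=False)
for tr_ind, val_ind in kf.split(toy):
    fold_train, fold_val = toy.iloc[tr_ind], toy.iloc[val_ind]
    # Per-category target mean computed without the held-out rows.
    fold_means = fold_train.groupby("chid")["target"].mean()
    # Categories missing from fold_means come back as NaN here.
    encoded.iloc[val_ind] = fold_val["chid"].map(fold_means).to_numpy()

# Count the gaps left by the out-of-fold mapping, then fill them.
n_missing_before = int(encoded.isna().sum())
encoded = encoded.fillna(global_mean)

print(n_missing_before, int(encoded.isna().sum()))
```

The NaN values come from categories that the other folds never saw; the final `fillna` with the global mean is what removes them.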

Test dataset:

class KFoldTargetEncoderTest(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self,train,colNames,encodedName):
        
        self.train = train
        self.colNames = colNames
        self.encodedName = encodedName
        
    def fit(self, X, y=None):
        return self
    def transform(self,X):
        mean =  self.train[[self.colNames,
                self.encodedName]].groupby(
                                self.colNames).mean().reset_index() 

        dd = {}
        for row in tqdm(mean.itertuples(index=False)):
            dd[row[0]] = row[1]
        X[self.encodedName] = X[self.colNames]
        X[self.encodedName] = X[self.encodedName].map(dd.get)
        return X

Fit and transform the test data:

test_targetc_chid = KFoldTargetEncoderTest(train_df,'chid','chid_Kfold_Target_Enc')
valid_df = test_targetc_chid.fit_transform(valid_df)
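
One caveat: a category that appears in the test data but never in the training data has no entry in the lookup, so mapping it still produces a missing value. A possible way to handle this, sketched below on made-up frames, is to fall back to the global target mean of the training data. The fallback choice and the toy values are assumptions for illustration, not part of the classes above.

```python
import pandas as pd

# Hypothetical encoded training frame, plus a validation frame containing
# a category ("z") that never occurred in training.
train_df = pd.DataFrame({
    "chid": ["a", "a", "b"],
    "target": [1, 0, 1],
    "chid_Kfold_Target_Enc": [0.5, 0.75, 0.25],
})
valid_df = pd.DataFrame({"chid": ["a", "b", "z"]})

# Mean of the out-of-fold encodings per category, the same quantity the
# test-side class computes with groupby(...).mean().
category_means = train_df.groupby("chid")["chid_Kfold_Target_Enc"].mean()

# Assumption: use the global target mean as the default for unseen categories.
fallback = train_df["target"].mean()

valid_df["chid_Kfold_Target_Enc"] = (
    valid_df["chid"].map(category_means).fillna(fallback)
)

print(valid_df)
```

Whether the global mean is the right default depends on the model and data; the point is only that unseen categories need some explicit default.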
