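python - Why does the target encoder encode some values as NaN?
Problem description
I am using the target encoder from category_encoders to encode features. This is the code I am using: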
from category_encoders import TargetEncoder

def encode_large_features(features, X_train, X_test, y_train):
    print('target encoding features ...')
    for col in features:
        # Pass the column through `cols`: the first positional argument of
        # TargetEncoder is `verbose`, not the column name.
        target_encoder = TargetEncoder(cols=[col])
        target_encoder.fit(X_train[[col]], y_train)
        name = col + '_encoded'
        # transform returns a DataFrame, so select the encoded column from it.
        X_train[name] = target_encoder.transform(X_train[[col]])[col]
        X_train.drop([col], axis=1, inplace=True)
        X_test[name] = target_encoder.transform(X_test[[col]])[col]
        X_test.drop([col], axis=1, inplace=True)
    return X_train, X_test
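The target encoder encodes some values as NaN, and I don't know why. Here is an example:
Solution
I faced the same problem and raised an issue in the repo.
I found a workaround by building a custom K-fold target encoder, which worked better for me than the library version. A K-fold target encoder is less prone to data leakage and less likely to overfit.
Unlike the category_encoders library, it does not return NaN on the training dataset.
Example below: chid is a categorical column, and the K-fold target encoder is applied to it.
Required libraries: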
import numpy as np  # needed for np.nan and np.corrcoef below
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn import base
Training dataset:
class KFoldTargetEncoderTrain(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, colnames, targetName, n_fold=5, verbosity=True, discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(self.targetName, str)
        assert isinstance(self.colnames, str)
        assert self.colnames in X.columns
        assert self.targetName in X.columns

        mean_of_target = X[self.targetName].mean()
        # random_state is omitted: it has no effect with shuffle=False, and
        # recent scikit-learn versions raise an error when it is set.
        kf = KFold(n_splits=self.n_fold, shuffle=False)
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan

        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            # Encode each validation fold with target means from the other folds.
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                X_tr.groupby(self.colnames)[self.targetName].mean())

        # Categories absent from the training folds map to NaN; fall back to
        # the global target mean (assigned back instead of inplace fillna).
        X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)

        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'.format(
                col_mean_name,
                self.targetName,
                np.corrcoef(X[self.targetName].values, encoded_feature)[0][1]))

        if self.discardOriginal_col:
            # Note: as written, this drops the target column.
            X = X.drop(self.targetName, axis=1)
        return X
Fit and transform the training data:
targetc_chid = KFoldTargetEncoderTrain('chid','target',n_fold=5)
train_df = targetc_chid.fit_transform(train_df)
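To see why the training column ends up without NaN, here is a minimal sketch of the same out-of-fold idea on a toy DataFrame. It uses only pandas and NumPy: the helper name out_of_fold_target_mean and the toy data are hypothetical, and contiguous folds from np.array_split stand in for KFold with shuffle=False. It is an illustration under those assumptions, not the encoder above.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_mean(df, col, target, n_fold=5):
    """Encode df[col] with target means computed on the other folds only."""
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    # Contiguous folds of row positions (assumed stand-in for KFold, no shuffling).
    folds = np.array_split(np.arange(len(df)), n_fold)
    for val_pos in folds:
        train_mask = np.ones(len(df), dtype=bool)
        train_mask[val_pos] = False
        # Per-category target mean from rows outside the current fold.
        fold_means = df[train_mask].groupby(col)[target].mean()
        encoded.iloc[val_pos] = df[col].iloc[val_pos].map(fold_means).to_numpy()
    # A category missing from the other folds maps to NaN; use the global mean.
    return encoded.fillna(global_mean)

# Hypothetical toy data.
toy = pd.DataFrame({
    "chid": ["a", "a", "b", "b", "c", "a"],
    "target": [1, 0, 1, 1, 0, 1],
})
toy["chid_enc"] = out_of_fold_target_mean(toy, "chid", "target", n_fold=3)
print(toy)
```

In this toy run, both "b" rows fall in the same fold and "c" appears only once, so their out-of-fold means are undefined and the global target mean fills the gap.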
Test dataset:
class KFoldTargetEncoderTest(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, train, colNames, encodedName):
        self.train = train
        self.colNames = colNames
        self.encodedName = encodedName

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Average the encoded training values per category.
        mean = self.train[[self.colNames,
                           self.encodedName]].groupby(
                               self.colNames).mean().reset_index()

        # Build a category -> encoded value lookup.
        dd = {}
        for row in tqdm(mean.itertuples(index=False)):
            dd[row[0]] = row[1]

        # Categories not present in the training data map to NaN here.
        X[self.encodedName] = X[self.colNames]
        X[self.encodedName] = X[self.encodedName].map(dd.get)
        return X
Fit and transform the test data:
test_targetc_chid = KFoldTargetEncoderTest(train_df,'chid','chid_Kfold_Target_Enc')
valid_df = test_targetc_chid.fit_transform(valid_df)
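Note that the test-side lookup can still produce NaN for categories that never occur in the training data. A minimal sketch of one way to handle that, assuming the global training target mean is an acceptable fallback (the helper encode_test_column and the toy frames are hypothetical, not part of the encoder above):

```python
import pandas as pd

def encode_test_column(train, test, col, encoded_name, fallback):
    """Map each test category to the mean of its encoded training values."""
    mapping = train.groupby(col)[encoded_name].mean()
    out = test.copy()
    # Categories absent from the training data map to NaN, so fill them.
    out[encoded_name] = out[col].map(mapping).fillna(fallback)
    return out

# Hypothetical toy data; "z" does not occur in the training frame.
train = pd.DataFrame({
    "chid": ["a", "a", "b"],
    "target": [1, 0, 1],
    "chid_Kfold_Target_Enc": [0.5, 1.0, 0.25],
})
test = pd.DataFrame({"chid": ["a", "b", "z"]})

encoded = encode_test_column(
    train, test, "chid", "chid_Kfold_Target_Enc",
    fallback=train["target"].mean(),
)
print(encoded)
```

Whether the global mean is the right fallback depends on the model; a separate "unknown" indicator column is another option.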