Cross-validation does not work with a custom meta-estimator

Problem description

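I have a two-stage meta-estimator that is initialized with two pipelines. The estimator is meant to classify observations as 1, -1, or 0. The first pipeline learns to separate 0 from (1, -1); the second learns to separate 1 from -1, with all 0s removed. Here is the code for the meta-estimator: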

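from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_is_fitted

class TwoStageEstimator(BaseEstimator, ClassifierMixin):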
    def __init__(self, pipeline_1, pipeline_2):
        self.pipeline_1 = pipeline_1
        self.pipeline_2 = pipeline_2
        
    def fit(self, X, y):
        
        # First-stage training 
        self.pipeline_1 = clone(self.pipeline_1)
        y_train_1 = abs(y)
        self.pipeline = self.pipeline_1.fit(X, y_train_1)
        
        # Second-stage training 
        self.pipeline_2 = clone(self.pipeline_2)
        y_train_2 = y[y != 0]
        X_train_2 = X.loc[y_train != 0, ]
        self.pipeline = self.pipeline_2.fit(X_train_2, y_train_2)

        # Set fit status
        self.is_fit_ = True
        
        return self
    
    def predict(self, X):
        
        # Check that fit has been called
        check_is_fitted(self)

        y = self.pipeline_1.predict(X) * self.pipeline_2.predict(X)
        
        return y

Calling the estimator directly like this works:

tsm = TwoStageEstimator(pipeline, pipeline)
prd_stance = tsm.fit(X_train, y_train).predict(X_test)

But when I try to use cross-validation (CV), it breaks:

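# Assumption: `ms` is an alias for sklearn.model_selection
import sklearn.model_selection as ms
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    tsm, X, y, scoring='accuracy', cv=ms.StratifiedKFold(n_splits=7, shuffle=True)
)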
scores

The error message seems to indicate a conflict between the indexing done inside fit and the indexing done by CV.

raise IndexingError(
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
...
raise NotImplementedError(
NotImplementedError: iLocation based boolean indexing on an integer type is not available

Can anyone point me to a solution here?

Tags: python-3.x, pandas, scikit-learn, scikit-learn-pipeline

Solution
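
The first error in the traceback can be reproduced in isolation. This is a minimal sketch, assuming (the post does not confirm it) that `y_train` inside `fit` resolves to a global Series left over from an earlier train/test split, so its index differs from the index of the fold that `cross_val_score` passes in as `X`. The toy objects `X_fold` and `y_global` are hypothetical:

```python
import pandas as pd

# Hypothetical fold handed to fit() by cross-validation: rows 0, 1, 2.
X_fold = pd.DataFrame({"feature": [0.1, 0.2, 0.3]}, index=[0, 1, 2])

# Hypothetical global labels from an earlier split: rows 1, 5, 6.
y_global = pd.Series([1, 0, -1], index=[1, 5, 6])

try:
    # The boolean Series has labels 5 and 6 that X_fold lacks, and it is
    # missing labels 0 and 2, so pandas cannot align the two indexes.
    X_fold.loc[y_global != 0]
    message = None
except Exception as exc:  # pandas raises its IndexingError here
    message = str(exc)

print(message)
```

A direct call with `X_train` does not hit this because the global `y_train` and `X_train` share the same index; only the CV folds are misaligned with it.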

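One possible fix, sketched under the assumption that the root cause is the `y_train` reference inside `fit` (the argument is named `y`, so `y_train` can only resolve to a variable outside the method). The version below builds the row mask from the `y` that is actually passed in and converts it to a NumPy array so that no pandas index alignment takes place. It also stores the fitted clones under trailing-underscore names instead of overwriting the constructor arguments, sets `classes_`, and lists `ClassifierMixin` before `BaseEstimator`, which is the ordering scikit-learn recommends for mixins. The toy data and the `StandardScaler` plus `LogisticRegression` pipeline are hypothetical stand-ins, not the pipelines from the question:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


class TwoStageEstimator(ClassifierMixin, BaseEstimator):
    def __init__(self, pipeline_1, pipeline_2):
        self.pipeline_1 = pipeline_1
        self.pipeline_2 = pipeline_2

    def fit(self, X, y):
        y_arr = np.asarray(y)
        # Mask derived from the y passed to fit, as a plain NumPy array.
        nonzero = y_arr != 0

        # First stage: 0 vs. (1, -1).
        self.pipeline_1_ = clone(self.pipeline_1).fit(X, np.abs(y_arr))

        # Second stage: 1 vs. -1, with all 0 rows removed.
        X_2 = X.loc[nonzero] if hasattr(X, "loc") else X[nonzero]
        self.pipeline_2_ = clone(self.pipeline_2).fit(X_2, y_arr[nonzero])

        # Scorers may look up classes_ on classifiers.
        self.classes_ = np.unique(y_arr)
        return self

    def predict(self, X):
        check_is_fitted(self, ["pipeline_1_", "pipeline_2_"])
        # Stage 1 predicts 0 or 1; stage 2 predicts 1 or -1.
        return self.pipeline_1_.predict(X) * self.pipeline_2_.predict(X)


# Hypothetical toy data, only to exercise the estimator under CV.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(210, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.choice([-1, 0, 1], size=210))

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
tsm = TwoStageEstimator(pipeline, pipeline)

cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
scores = cross_val_score(tsm, X, y, scoring="accuracy", cv=cv)
print(scores)
```

Because the mask is a NumPy array, the same `fit` works whether `X` arrives as a DataFrame or as a plain array.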
