scikit-learn pipeline: normalising after PCA produces unwanted random results

Question

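I am running a pipeline that normalises the inputs, runs PCA, normalises the PCA factors, and finally fits a logistic regression.

However, the resulting confusion matrix varies from run to run.

I found that if I remove the third step ('normalise_pca'), my results stop varying.

I have set random_state=0 for all pipeline steps. Any idea why I am getting variable results?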

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

def exp2_classifier(X_train, y_train):
    estimators = [('robust_scaler', RobustScaler()),
                  ('reduce_dim', PCA(random_state=0)),
                  # applied because the distributions of the PCA factors were skewed
                  ('normalise_pca', PowerTransformer()),
                  # solver specified to suppress warnings; it doesn't seem to affect the grid search
                  ('clf', LogisticRegression(random_state=0, solver="liblinear"))]
    pipe = Pipeline(estimators)

    return pipe

exp2_eval = Evaluation().print_confusion_matrix
logit_grid = Experiment().run_experiment(asdp.data, "heavy_drinker", exp2_classifier, exp2_eval);

Tags: scikit-learn, pca

Answer


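I cannot reproduce your problem. I tried one of sklearn's example datasets and got consistent results across multiple runs, so the variability is probably not caused by normalise_pca.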
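One supporting observation: RobustScaler and PowerTransformer do not take a random_state argument at all, so there is nothing to seed in those steps. A minimal sketch of how to check which steps expose the parameter, using the step names from the question (the `steps` and `has_seed` names are only illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Same estimators as in the question's pipeline.
steps = {
    "robust_scaler": RobustScaler(),
    "reduce_dim": PCA(random_state=0),
    "normalise_pca": PowerTransformer(),
    "clf": LogisticRegression(random_state=0, solver="liblinear"),
}

# get_params() lists every constructor parameter of an estimator,
# so a membership test shows whether the step can be seeded at all.
has_seed = {name: "random_state" in est.get_params() for name, est in steps.items()}

for name, flag in has_seed.items():
    print(f"{name}: accepts random_state = {flag}")
```

Only the PCA and LogisticRegression steps report a random_state parameter. The reproduction attempt on a sample dataset follows.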

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler,PowerTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

from sklearn.model_selection import train_test_split

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [('robust_scaler', RobustScaler()),
              ('reduce_dim', PCA(random_state=0)),
              # applied because the distributions of the PCA factors were skewed
              ('normalise_pca', PowerTransformer()),
              # solver specified to suppress warnings; it doesn't seem to affect the grid search
              ('clf', LogisticRegression(random_state=0, solver="liblinear"))]
pipe = Pipeline(estimators)

pipe.fit(X_train,y_train)

print('train data :')
print(confusion_matrix(y_train,pipe.predict(X_train)))
print('test data :')
print(confusion_matrix(y_eval,pipe.predict(X_eval)))

Output:

train data :
[[166   3]
 [  4 282]]
test data :
[[40  3]
 [ 3 68]]
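To make the "consistent across multiple runs" claim explicit, the check can be scripted. The sketch below refits the same pipeline several times on a fixed split and compares the test confusion matrices, then repeats the evaluation with different split seeds. The second part is only an assumption about where variability could come from: the internals of Experiment().run_experiment are not shown in the question, so an unseeded split or cross-validation inside it is a guess, not a diagnosis. The helper names make_pipe and eval_confusion are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler


def make_pipe():
    """Build a fresh, unfitted copy of the pipeline from the question."""
    return Pipeline([
        ("robust_scaler", RobustScaler()),
        ("reduce_dim", PCA(random_state=0)),
        ("normalise_pca", PowerTransformer()),
        ("clf", LogisticRegression(random_state=0, solver="liblinear")),
    ])


def eval_confusion(split_seed):
    """Fit on a train split chosen by split_seed; return the test confusion matrix."""
    X, y = datasets.load_breast_cancer(return_X_y=True)
    X_tr, X_ev, y_tr, y_ev = train_test_split(
        X, y, test_size=0.2, random_state=split_seed
    )
    pipe = make_pipe().fit(X_tr, y_tr)
    return confusion_matrix(y_ev, pipe.predict(X_ev))


# Part 1: same split, five independent fits.
fixed = [eval_confusion(42) for _ in range(5)]
same_split_identical = all(np.array_equal(fixed[0], m) for m in fixed[1:])
print("identical with fixed split:", same_split_identical)

# Part 2: a different split each time (assumed stand-in for an unseeded harness).
for seed in (0, 1, 2):
    print(f"split seed {seed}:")
    print(eval_confusion(seed))
```

If the matrices match in the first part but differ in the second, the pipeline itself is deterministic and the data splitting (or any cross-validation wrapped around it) is the more likely place to look.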
