Python: I want to perform 5-fold cross-validation on a logistic regression and report the score. Should I use LogisticRegressionCV() or cross_val_score()?

Problem description

cross_val_score gives different results from LogisticRegressionCV, and I don't know why.

Here is my code:

import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import scale

seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X), Y, test_size=test_size, random_state=seed)

#Below is my model that I use throughout the program.

model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
        
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.

scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
        
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)

coef=np.round(model.coef_,2)

print("Accuracy: %.2f%%" % (accuracy * 100.0))

The output looks like this:

Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%

Why is the maximum accuracy from cross_val_score higher than the accuracy LogisticRegressionCV ends up using?

Also, I realize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I'm having trouble understanding why it performs worse. Likewise, I'm not sure how to get the standard deviation of the predictors from LogisticRegressionCV.

Tags: python, machine-learning, scikit-learn, data-science

Solution


To my mind, there are a few points to consider:

  1. Cross-validation is typically used when you need to simulate a validation set (e.g., when the training set is not large enough to be split into training, validation, and test sets), and it operates on the training data. In your case, you are computing model's accuracy on the test data, so the two results cannot be compared directly.
  2. According to the documentation:

Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.

If you look at this snippet, you can see that this is indeed what happens:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split

data = load_breast_cancer()
X, y = data['data'], data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
    'C': np.power(10.0, np.arange(-10, 10)), 
}

gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_)                        # 0.953846153846154

lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                        cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max())    # 0.953846153846154

I'd also suggest taking a look here to understand the details of lrcv.scores_[1].mean(axis=0).max()
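To unpack that expression, here is a minimal sketch of what lrcv.scores_ contains, reusing the breast-cancer setup from the snippet above (with a fixed random_state added to the split so the shapes and numbers are reproducible):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                            cv=5, scoring='accuracy', solver='liblinear',
                            random_state=42)
lrcv.fit(X_train, y_train)

# scores_ is a dict keyed by class label; for a binary problem it holds a
# single entry (under the positive class, 1) of shape (n_folds, n_Cs).
fold_scores = lrcv.scores_[1]
print(fold_scores.shape)             # (5, 20): 5 folds x 20 candidate C values

# mean accuracy over the folds for each C (axis=0), then the best mean:
print(fold_scores.mean(axis=0).max())
```

So .mean(axis=0) averages over folds for each candidate C, and .max() picks the best of those averages, which is exactly the number GridSearchCV reports as best_score_.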

  3. Finally, to get the same result from cross_val_score, it is better to write:

     score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
     score.mean()                             # 0.953846153846154
    
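As for the standard deviation the question asks about: LogisticRegressionCV does not report one directly, but it can be computed from the per-fold scores stored in scores_. A minimal sketch, again assuming the breast-cancer setup above with a fixed split:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                            cv=5, scoring='accuracy', solver='liblinear',
                            random_state=42)
lrcv.fit(X_train, y_train)

fold_scores = lrcv.scores_[1]                  # shape (n_folds, n_Cs)
best_c_idx = fold_scores.mean(axis=0).argmax() # column of the selected C

# per-fold accuracies at the selected C -> mean and standard deviation,
# mirroring what scores.mean() and scores.std() report for cross_val_score
mean_acc = fold_scores[:, best_c_idx].mean()
std_acc = fold_scores[:, best_c_idx].std()
print("%.2f%% average accuracy with a standard deviation of %.2f" %
      (mean_acc * 100, std_acc * 100))
```

This gives you the same mean-plus-spread summary as the cross_val_score printout in the question, but from the fitted LogisticRegressionCV model.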
