首页 > 解决方案 > Scikit-learn:在超参数调整后对整个数据集使用交叉验证

问题描述

我在 scikit-learn 中使用决策树对垃圾邮件进行分类。在阅读了这里和其他地方的各种帖子后,我将我的初始数据集拆分为训练和测试,并使用交叉验证对训练集进行了超参数调整。据我了解,应该在训练和测试上都计算分数,以检查模型是否过拟合;考虑到测试集的分数很好,我可以排除这种情况并呈现从整个数据集中获得的分数吗?或者我应该展示我的测试集的结果吗?这是训练/测试集的代码:

scores = cross_val_score(tree, x_train, y_train, cv=10)
scores_pre = cross_val_score(tree, x_train, y_train, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_train, y_train, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_train, y_train, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.02)
Precision: 0.98 (+/- 0.02)
F-Measure: 0.98 (+/- 0.01)
Recall: 0.98 (+/- 0.02)

scores = cross_val_score(tree, x_test, y_test, cv=10)
scores_pre = cross_val_score(tree, x_test, y_test, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_test, y_test, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_test, y_test, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.95 (+/- 0.03)
Precision: 0.96 (+/- 0.02)
F-Measure: 0.96 (+/- 0.02)
Recall: 0.97 (+/- 0.03)

这是整个数据集的代码:

scores = cross_val_score(tree, X, y, cv=10)
scores_pre = cross_val_score(tree, X, y, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, X, y, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, X, y, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.04)
Precision: 0.98 (+/- 0.03)
F-Measure: 0.98 (+/- 0.03)
Recall: 0.98 (+/- 0.03)

标签: pythonscikit-learn

解决方案


不,您最终报告的分数应该始终只在测试集上,实际上是验证集。


推荐阅读