Should cross-validation be performed on the original data or on the train/test split?

Problem description

When I want to evaluate my model with cross-validation, should I perform the cross-validation on the original data (i.e. the data not split into training and test sets) or on the train/test split?

I know the training data is used to fit the model and the test data is used for evaluation. If I use cross-validation, should I still split the data into training and test sets?

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

features = df.iloc[:, 4:-1]
results = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

clf = LogisticRegression()
model = clf.fit(x_train, y_train)

# cross-validate on the held-out test set only
accuracy_test = cross_val_score(clf, x_test, y_test, cv=5)

Or should I do this instead:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

features = df.iloc[:, 4:-1]
results = df.iloc[:, -1]

clf = LogisticRegression()
model = clf.fit(features, results)

# cross-validate on the whole (unsplit) data
accuracy_test = cross_val_score(clf, features, results, cv=5)

Or perhaps something different altogether?

Tags: python, machine-learning, scikit-learn, cross-validation

Solution


Both of your approaches are wrong.

  • In the first one, you apply cross-validation to the test set, which makes no sense.

  • In the second one, you first fit the model on the whole data and then perform cross-validation, which again makes no sense. Moreover, that fit is redundant: cross_val_score does not use your fitted clf; it performs the fitting itself on each fold (see the sketch below).
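To illustrate why the prior fit is redundant, here is a minimal sketch, assuming a synthetic dataset from make_classification rather than the df from the question: cross_val_score fits internal clones of the estimator on each training fold, so the clf object you pass in is left untouched.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

X, y = make_classification(n_samples=200, random_state=0)

clf = LogisticRegression()

# no prior fit is needed; cross_val_score fits clones of clf on each fold
scores = cross_val_score(clf, X, y, cv=5)

# the original estimator is still unfitted after the call
try:
    check_is_fitted(clf)
except NotFittedError:
    print("clf itself was never fitted by cross_val_score")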

Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance estimation), there are two ways:

  • use a separate test set
  • use cross-validation

First way (test set):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

clf = LogisticRegression()
model = clf.fit(x_train, y_train)   # fit on the training set only

y_pred = clf.predict(x_test)        # predict on the held-out test set

accuracy_test = accuracy_score(y_test, y_pred)

Second way (cross-validation):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

clf = LogisticRegression()

# shuffle the data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv=5, scoring='accuracy')

# afterwards, if satisfied with the performance, fit the model on the whole data:
model = clf.fit(features, results)
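cross_val_score returns one accuracy value per fold; a common way to report the result is the mean and standard deviation over the folds. A minimal follow-up to the snippet above:

# summarize the 5 per-fold accuracies
print("CV accuracy: %.3f +/- %.3f" % (accuracy_cv.mean(), accuracy_cv.std()))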
