How does test_size relate to 10-fold cross-validation in Python sklearn?

Problem description

I am trying to implement an ML algorithm using 10-fold cross-validation, and I would like to confirm that my procedure is correct.

I am doing a binary classification and have about 50 samples of each class in each of the 10 folders that I created, called fold 1, fold 2, and so on.

My sklearn command is:

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.3, random_state=1000)

Am I totally wrong here and this procedure is actually just doing a 30% test and 70% train process? For the 10 fold cross validation, I should be using:

from sklearn.model_selection import KFold
kf = KFold(n_splits=2, random_state=42, shuffle=True)

Thanks!

Tags: python-3.x, training-data, sklearn-pandas

Solution


Am I totally wrong here and this procedure is actually just doing a 30% test and 70% train process?

Yes. Setting test_size=0.3 gives you a 30% test set and a 70% training set, as the train_test_split documentation states:

test_size: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
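As a quick check of the quoted behaviour, here is a minimal sketch on made-up data. The array shapes and the 100-sample size are assumptions for illustration, not the asker's dataset; the variable names mirror the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 4 features, balanced binary labels.
X = np.arange(400).reshape(100, 4)
yy = np.array([0, 1] * 50)

# A single random split: 30% of the rows go to the test set, the rest to training.
x_train, x_test, y_train, y_test = train_test_split(
    X, yy, test_size=0.3, random_state=1000
)

print(len(x_train), len(x_test))
```

With 100 samples this yields 70 training rows and 30 test rows. It is one holdout split, not a set of folds.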

If you repeat this 10 times with different random_state values, some samples will appear in more than one of the 10 test sets. The purpose of k-fold cross-validation is to partition the data into k disjoint sets and use each set in turn as the holdout. Your procedure is not cross-validation, because the 10 test sets it produces cannot all be disjoint: ten test sets of 30% each add up to 300% of the data, so by the pigeonhole principle some samples must land in several test sets.
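The overlap can be seen directly by counting how often each sample ends up in a test set. This is a sketch under assumed settings (100 samples, seeds 0 through 9), not the asker's actual data or seeds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 100  # assumed dataset size, for illustration only
indices = np.arange(n_samples)
test_counts = np.zeros(n_samples, dtype=int)

# Repeat the 70/30 split ten times with different seeds and record
# how many times each sample index is drawn into a test set.
for seed in range(10):
    _, test_idx = train_test_split(indices, test_size=0.3, random_state=seed)
    test_counts[test_idx] += 1

total_test_slots = test_counts.sum()  # 10 splits x 30 test samples each
max_reuse = test_counts.max()         # most often reused sample

print("total test slots:", total_test_slots)
print("largest number of test sets sharing one sample:", max_reuse)
```

Because 300 test slots are spread over only 100 samples, at least one sample must be used as test data three or more times, which is what the pigeonhole argument predicts.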

kf = KFold(n_splits=2, random_state=42, shuffle=True)

This isn't 10-fold CV because n_splits=2. According to the KFold documentation, n_splits is the number of folds, so for 10 folds you need n_splits=10.
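A minimal sketch of what a 10-fold setup could look like follows. The synthetic dataset from make_classification and the LogisticRegression model are stand-ins chosen for illustration; the question does not say which model or features are used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (assumption: stands in for the real X, yy).
X, yy = make_classification(n_samples=100, n_features=5, random_state=0)

# n_splits=10 gives ten disjoint test folds of 10 samples each.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Collect the test indices of every fold to confirm they partition the data.
all_test_idx = np.concatenate([test_idx for _, test_idx in kf.split(X)])

# Fit and score the model once per fold; each fold is the holdout exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, yy, cv=kf)

print("folds:", len(scores))
print("every sample tested exactly once:",
      np.array_equal(np.sort(all_test_idx), np.arange(len(X))))
```

Here test_size never appears: the fold size follows from n_splits, so each test fold holds one tenth of the data.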

