Actual data from KFold split indices

Question

Suppose I have the following data:

import numpy as np
import pandas as pd

y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})
X
    a   b
0   11  82
1   19  82
2   15  80
3   15  86
4   14  82
5   18  87
6   13  83
7   12  83
8   10  82
9   18  87

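Splitting it into 5 folds gives the following indices:

from sklearn.model_selection import KFold

kf = KFold()  # n_splits defaults to 5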
data = list(kf.split(X,y))
data
[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

But I would like to go one step further and prepare data so that it holds the actual values, organized in the following format:

data =
   [(train1,trainlabel1,test1,testlabel1),
    (train2,trainlabel2,test2,testlabel2),
     ..,
    (train5,trainlabel5,test5,testlabel5)]

Expected output (from the given MWE):

[array([
        (array([[15,80],[15,86],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]), array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold1 train/label
        (array([[11,82],[19,82]]), array([[1],[1]])),  #fold1 test/label

        (array([[11,82],[19,82],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]),array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold2 train/label
        (array([[15,80],[15,86]]),array([[1],[1]])) #fold2 test/label

        ....
])]

Tags: python, machine-learning, scikit-learn, cross-validation, k-fold

Answer


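As you know, KFold().split(data) returns the selected indices for each fold. The simplest way to select rows of a pandas.DataFrame with a list of indices is the loc method (this works here because X has the default integer index; iloc is the purely positional equivalent):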

for train_idx, test_idx in KFold(n_splits=2).split(X):
    x_train = X.loc[train_idx]
    x_test = X.loc[test_idx]

    # y is a NumPy array, so index it directly rather than with .loc
    y_train = y[train_idx]
    y_test = y[test_idx]

You can then append the subset data frames to a list.
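One way to collect the folds into the format the question asks for is sketched below. It is only an illustration under a few assumptions: it uses n_splits=5 to match the question, selects rows with iloc (positional) instead of loc, converts each subset to a NumPy array with to_numpy(), and reshapes the labels into column vectors to mirror the expected output. The data values are random, as in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Same kind of data as in the question (feature values are random).
y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})

data = []
for train_idx, test_idx in KFold(n_splits=5).split(X, y):
    # KFold yields positional indices, so iloc is used for the DataFrame
    # and plain indexing for the NumPy label array.
    data.append((
        X.iloc[train_idx].to_numpy(),    # train features, shape (8, 2)
        y[train_idx].reshape(-1, 1),     # train labels as a column
        X.iloc[test_idx].to_numpy(),     # test features, shape (2, 2)
        y[test_idx].reshape(-1, 1),      # test labels as a column
    ))
```

Each element of data is then a (train, trainlabel, test, testlabel) tuple, so a fold can be unpacked with, for example, train1, trainlabel1, test1, testlabel1 = data[0].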

