Train/validation split strategy for KFold with a large DataFrame

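Problem description

I have a time-series training set (item sales), and I want to train a model with a more principled time-series approach. I am trying forward chaining, where the procedure looks like this: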

- fold 1 : training (1), test [2]
- fold 2 : training (1 2), test [3]
- fold 3 : training (1 2 3), test [4]
- fold 4 : training (1 2 3 4), test [5]
- fold 5 : training (1 2 3 4 5), test [6]
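
The fold layout above can be sketched in plain Python. This is only an illustration of the expanding-window idea; the helper name `forward_chaining_folds` is hypothetical and is not part of scikit-learn or of the original code.

```python
def forward_chaining_folds(n_blocks):
    """Yield (train_blocks, test_block) pairs for an expanding-window split.

    Blocks are numbered from 1 in time order. Each fold trains on every
    block seen so far and tests on the block that follows.
    """
    for k in range(1, n_blocks):
        train_blocks = list(range(1, k + 1))
        test_block = k + 1
        yield train_blocks, test_block


# Six time-ordered blocks give the five folds listed above.
folds = list(forward_chaining_folds(6))
for fold_number, (train_blocks, test_block) in enumerate(folds, start=1):
    print(f"fold {fold_number} : training {train_blocks}, test [{test_block}]")
```

The training set never contains data from after the test block, which is the property that makes this scheme suitable for time series.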

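But the training I coded takes far too long. I don't know whether that is because I have too much data (more than a million rows) or because I am training on my data incorrectly: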

# Preparing data for modeling

## Setting feature and target variables
X = data[features].fillna(value=0)
y = data[target].fillna(value=0)

# splitting train and test
X_train = X[:int(X.shape[0]*0.7)]
X_test = X[int(X.shape[0]*0.7):]
y_train = y[:int(X.shape[0]*0.7)]
y_test = y[int(X.shape[0]*0.7):]

>>>X_train.shape
(1126386, 153)

Here is an excerpt of data[features]:

    month   shop_id item_id item_price  lat lon type_Аксессуары type_Билеты (Цифра) type_Доставка товара    type_Игровые консоли    ... sub_type_Стандартные издания    sub_type_Сувениры   sub_type_Сувениры (в навеску)   sub_type_Сумки, Альбомы, Коврики д/мыши sub_type_Фигурки    sub_type_Художественная литература  sub_type_Цифра  sub_type_Чистые носители (шпиль)    sub_type_Чистые носители (штучные)  sub_type_Элементы питания
0   1   0   32  221.0   NaN NaN 0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
1   1   0   33  347.0   NaN NaN 0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0

I tried to apply the explanation I found in this post on how to split a time series with scikit-learn:

import numpy as np  # needed for np.linspace in the loops below
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
tscv = TimeSeriesSplit(n_splits=5)
i = 1
score = []
for tr_index, val_index in tscv.split(X_train):
    print(tr_index, val_index)
    X_tr, X_val = X_train.iloc[tr_index,:], X_train.iloc[val_index,:]
    y_tr, y_val = y_train.iloc[tr_index,:], y_train.iloc[val_index,:]
    for mf in np.linspace(100, 150, 6):
        for ne in np.linspace(50, 100, 6):
            for md in np.linspace(20, 40, 5):
                for msl in np.linspace(30, 100, 8):
                    rfr = RandomForestRegressor(
                        max_features=int(mf),
                        n_estimators=int(ne),
                        max_depth=int(md),
                        min_samples_leaf=int(msl))
                    rfr.fit(X_tr, y_tr)
                    score.append([i,
                                  mf, 
                                  ne,
                                  md, 
                                  msl, 
                                  rfr.score(X_val, y_val)])
    i += 1

But I never get past the first round.
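
Counting the work the nested loops request may help explain this. The sketch below is a back-of-the-envelope calculation, not a measurement. The four lists are the values the `np.linspace` calls in the loops produce, written out by hand, and the variable names are illustrative.

```python
from itertools import product

# Values produced by the four np.linspace calls in the loops, written out by hand.
max_features_grid = [100, 110, 120, 130, 140, 150]
n_estimators_grid = [50, 60, 70, 80, 90, 100]
max_depth_grid = [20, 25, 30, 35, 40]
min_samples_leaf_grid = [30, 40, 50, 60, 70, 80, 90, 100]
n_folds = 5  # TimeSeriesSplit(n_splits=5)

# Every combination of the four grids is fitted once per fold.
combos = list(product(max_features_grid, n_estimators_grid,
                      max_depth_grid, min_samples_leaf_grid))
fits_per_fold = len(combos)
trees_per_fold = sum(n_estimators for _, n_estimators, _, _ in combos)

print(fits_per_fold)             # 1440 forests per fold
print(fits_per_fold * n_folds)   # 7200 forests over all folds
print(trees_per_fold * n_folds)  # 540000 individual trees over all folds
```

Each of those forests is fitted on a training fold with 153 columns, and `RandomForestRegressor` uses a single core unless `n_jobs` is set, so the first fold alone requires 1440 sequential fits.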

Tags: python-3.x, date, validation, scikit-learn, train-test-split

Solution
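
One possible direction, offered as a minimal sketch and not as a tuned recipe, is to keep the forward-chaining folds but sample a handful of parameter combinations instead of trying all of them, and to let each forest use every available core. The example assumes the rows are already sorted by time. It runs on synthetic data, and `X_demo`, `y_demo`, and the candidate values are placeholders chosen for a quick run, not taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for the sales data (assumption: rows are in time order).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 10))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=600)

# Candidate values; the search samples from these lists
# instead of fitting every combination.
param_distributions = {
    "max_features": [3, 5, 8],
    "n_estimators": [20, 40],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [5, 30, 100],
}

search = RandomizedSearchCV(
    # n_jobs=-1 lets each forest build its trees on all available cores.
    estimator=RandomForestRegressor(n_jobs=-1, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                        # only 5 sampled combinations are tried
    cv=TimeSeriesSplit(n_splits=5),  # same forward-chaining folds as before
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of combinations evaluated
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the candidate lists by ranges suited to 153 features. Trying the search on a subsample of rows first gives a rough timing before committing to the full set.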

