Training set sizes for learning_curve

Question

I am trying to understand the result of my call to learning_curve():

X_train1_be.shape
> (1360, 2)
y_train1_be.shape
> (1360, 2)

train_sizes, train_scores, test_scores = learning_curve(grid_best
                                                        , X_train1_be
                                                        , y_train1_be
                                                        , n_jobs=n_jobs
                                                        , scoring = 'neg_mean_squared_error'
                                                        , cv=TimeSeriesSplit(n_splits = 5)
                                                        , verbose=2
                                                        , shuffle = False
                                                        , train_sizes = [1
                                                                         , round(len(X_train1_be)/10)
                                                                         , round(len(X_train1_be)/5)
                                                                         , round(len(X_train1_be)/3)
                                                                         , round(len(X_train1_be)/2)
                                                                         , round(len(X_train1_be)/1)
                                                                        ]
                                                        )

But this results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-178-9216e6224b3b> in <module>
     12                                                                          , round(len(X_train1_be)/3)
     13                                                                          , round(len(X_train1_be)/2)
---> 14                                                                          , round(len(X_train1_be)/1)
     15                                                                         ]
     16                                                         )

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in learning_curve(estimator, X, y, groups, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, shuffle, random_state, error_score)
   1257     # use the first 'n_max_training_samples' samples.
   1258     train_sizes_abs = _translate_train_sizes(train_sizes,
-> 1259                                              n_max_training_samples)
   1260     n_unique_ticks = train_sizes_abs.shape[0]
   1261     if verbose > 0:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _translate_train_sizes(train_sizes, n_max_training_samples)
   1341                              % (n_max_training_samples,
   1342                                 n_min_required_samples,
-> 1343                                 n_max_required_samples))
   1344 
   1345     train_sizes_abs = np.unique(train_sizes_abs)

ValueError: train_sizes has been interpreted as absolute numbers of training samples and must be within (0, 230], but is within [1, 1360].

In contrast, the following works:

grid_best = grid_result.best_estimator_
train_sizes, train_scores, test_scores = learning_curve(grid_best
                                                        , X_train1_be
                                                        , y_train1_be
                                                        , n_jobs=n_jobs
                                                        , scoring = 'neg_mean_squared_error'
                                                        , cv=TimeSeriesSplit(n_splits = 5)
                                                        , verbose=2
                                                        , shuffle = False
                                                        , train_sizes = np.linspace(0.001, 1, 10))

> [learning_curve] Training set sizes: [  1  25  51  76 102 127 153 178 204 230]

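According to this link, it should work the way I first tried it:

Determining the training set sizes

Let's first decide which training set sizes we want to use for generating the learning curves. The minimum value is 1. The maximum is given by the total number of instances in the training set. Our training set has 9568 instances, so the maximum value is 9568. However, we haven't yet put aside a validation set. We'll do that using an 80:20 ratio, ending up with a training set of 7654 instances (80%) and a validation set of 1914 instances (20%). Given that our training set will have 7654 instances, the maximum value we can use to generate our learning curves is 7654. For our case, here, we use these six sizes: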

train_sizes = [1, 100, 500, 2000, 5000, 7654]

Tags: python, keras, neural-network

Answer


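It seems this issue was already raised some time ago: github.com/scikit-learn/scikit-learn/issues/7834. In short, what you tried is currently not possible, and it does not look like that will change soon.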
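The 230 in the error message can be reproduced directly. learning_curve takes the size of the first training split produced by the cv object as the upper bound for train_sizes, and with TimeSeriesSplit the first training split is the smallest one. A minimal sketch, assuming placeholder data with the same number of rows as X_train1_be (X_dummy is a hypothetical stand-in, only its length matters):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 1360                      # same number of rows as X_train1_be
X_dummy = np.zeros((n_samples, 2))    # hypothetical placeholder; values are irrelevant

cv = TimeSeriesSplit(n_splits=5)

# Collect the size of every training and test split, in order.
train_lengths = []
test_lengths = []
for train_idx, test_idx in cv.split(X_dummy):
    train_lengths.append(len(train_idx))
    test_lengths.append(len(test_idx))

print(train_lengths)  # [230, 456, 682, 908, 1134]
print(test_lengths)   # [226, 226, 226, 226, 226]
```

The first entry, 230, matches the upper bound in the ValueError, which is why an absolute size of 1360 is rejected while np.linspace(0.001, 1, 10) is scaled to sizes between 1 and 230.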

For me, a workaround was to replicate the dataset so that the first training split contains the entire dataset.
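One possible reading of this workaround is sketched below; it is not necessarily what the answer's author did. The synthetic X and y, the LinearRegression estimator (a hypothetical stand-in for grid_best) and the chosen train_sizes are all assumptions made for illustration. Stacking n_splits + 1 copies of the data makes the first TimeSeriesSplit training split exactly one full copy, so absolute sizes up to 1360 are accepted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, learning_curve

rng = np.random.default_rng(0)
n_samples, n_splits = 1360, 5

# Synthetic stand-ins for X_train1_be / y_train1_be (assumption, not the real data).
X = rng.normal(size=(n_samples, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n_samples)

# Stack n_splits + 1 copies so the first training split is one full copy.
n_copies = n_splits + 1
X_rep = np.tile(X, (n_copies, 1))
y_rep = np.tile(y, n_copies)

cv = TimeSeriesSplit(n_splits=n_splits)
first_train_idx, _ = next(cv.split(X_rep))
print(len(first_train_idx))  # 1360

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(),              # hypothetical stand-in for grid_best
    X_rep,
    y_rep,
    cv=cv,
    scoring="neg_mean_squared_error",
    shuffle=False,
    train_sizes=[10, 136, 272, 680, 1360],
)
print(train_sizes)  # [  10  136  272  680 1360]
```

With this layout every validation split is another copy of the same rows the model was trained on, so the validation scores are optimistic and no longer reflect a forward-in-time evaluation; the sketch only shows how the size limit is lifted.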

