Accuracy is really bad for LSTM and cross_val_predict

Problem description

I am trying to validate the scores of an LSTM forecasting a stock-market time series (link to the dataset: https://www.kaggle.com/camnugent/sandp500; I am using the AAL stock). The data has the following shape:

    open    high
0   15.07   15.12
1   14.89   15.01
2   14.45   14.51
3   14.30   14.94
4   14.94   14.96
... ... ...
1254    54.00   54.64
1255    53.49   53.99
1256    51.99   52.39
1257    49.32   51.50
1258    50.91   51.98
1259 rows × 2 columns

When using model.fit and model.predict, I can see that the results are not great, but they at least appear to follow the real data. (The image shows only the predictions, since training uses 80% of the dataset.)

Green is the prediction

Now, when using cross_val_predict or cross_val_score, the results are really bad: something like 0.30 that eventually drops to 0.003. The full code is:

import numpy as np
import math
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_predict
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

from tscv import GapKFold
from keras.wrappers.scikit_learn import KerasClassifier

sc = MinMaxScaler()
# define parameters
prevision_days = 5
verbose, epochs, batch_size = 1, 20, 50
size_test = 0.2  #20%

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# load the dataset file
original_dataset = pd.read_csv('..\\dataset\\all_stocks_5yr.csv')
original_dataset.loc[original_dataset['low'].isnull(),'low'] = original_dataset['close']
original_dataset.loc[original_dataset['open'].isnull(),'open'] = original_dataset['close']
original_dataset.loc[original_dataset['high'].isnull(),'high'] = original_dataset['close']
dataset = original_dataset[original_dataset.Name == 'AAL'].drop(['date', 'volume', 'Name'], axis=1)

dataset = dataset[['open','high']]

#breaking in train/test
test_size = -1*int(prevision_days * round((math.floor(len(dataset)*size_test))/prevision_days))

dataset_scaled = sc.fit_transform(dataset)

#Preparing the data
data = []
target = []
for i in range(prevision_days, len(dataset_scaled)):
    data.append(dataset_scaled[i-prevision_days:i, 0])
    target.append(dataset_scaled[i, 0])
data, target = np.array(data), np.array(target)
data = np.reshape(data, (data.shape[0], data.shape[1], 1))

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()

    model.add(LSTM(units = 50, return_sequences = True, input_shape = (data.shape[1], 1)))
    model.add(Dropout(0.2))

    model.add(LSTM(units = 50, return_sequences = True))
    model.add(Dropout(0.2))

    model.add(LSTM(units = 50, return_sequences = True))
    model.add(Dropout(0.2))

    model.add(LSTM(units = 50))
    model.add(Dropout(0.2))

    model.add(Dense(units = 1))

    model.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics=['accuracy'])

    return model

model = KerasClassifier(build_fn=create_model, epochs=epochs, batch_size=batch_size, verbose=verbose)
results = cross_val_predict(model, data, target, cv=5)
print(results)

The results are:

[[0.30032859]
 [0.30032859]
 [0.30032859]
 ...
 [0.00306681]
 [0.00306681]
 [0.00306681]]

Any idea what could be causing these results? I have already increased the epochs to 50 and the batch_size to 50 as well, but the results are exactly the same, which is also strange.
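As an aside while debugging, the sliding-window step from the code above can be sanity-checked in isolation on a toy series (the numbers below are made up and stand in for the scaled 'open' column):

```python
import numpy as np

# Toy series standing in for the scaled 'open' column (made-up values).
series = np.arange(10, dtype=float)
prevision_days = 5

# Same sliding-window construction as in the loop above:
# each sample is the previous `prevision_days` values, the target is the next one.
data, target = [], []
for i in range(prevision_days, len(series)):
    data.append(series[i - prevision_days:i])
    target.append(series[i])
data, target = np.array(data), np.array(target)
data = data.reshape(data.shape[0], data.shape[1], 1)  # (samples, timesteps, features)

print(data.shape)  # (5, 5, 1)
print(target)      # [5. 6. 7. 8. 9.]
```

The shapes here match what goes into the LSTM, i.e. (samples, timesteps, 1), so the data preparation itself does not seem to be the problem.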

Many thanks, João

Tags: python, numpy, tensorflow, machine-learning, keras

Solution
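A likely culprit is that a regression task (predicting a continuous scaled price) is being run through classification machinery: KerasClassifier wraps the model with class-prediction logic, and the 'accuracy' metric passed to compile() is not defined for a continuous target. The usual fix (an assumption about the intent here, not verified against this data) is to use keras.wrappers.scikit_learn.KerasRegressor instead and score with an error metric such as mean squared error. Note also that GapKFold is imported but never passed to cross_val_predict, so cv=5 falls back to plain unshuffled KFold, which ignores the temporal ordering. A minimal sketch of why accuracy is the wrong tool, using scikit-learn only:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Continuous targets like the scaled 'open' prices (toy values, made up).
y_true = np.array([0.30, 0.31, 0.29, 0.28])
y_pred = np.array([0.30, 0.30, 0.30, 0.30])

# Classification accuracy rejects continuous targets outright...
try:
    accuracy_score(y_true, y_pred)
except ValueError as err:
    print("accuracy_score failed:", err)

# ...while a regression metric is well defined.
print("MSE:", mean_squared_error(y_true, y_pred))
```

With the wrapper swapped to KerasRegressor, scoring would go through something like cross_val_score(model, data, target, cv=..., scoring='neg_mean_squared_error') rather than accuracy.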

