Transforming data from a Pandas DataFrame to time series training data for a Keras LSTM

Problem description

I'm using Keras along with Hyperas to work with an LSTM model that predicts price valuations. I'm having problems formatting my data from the Pandas DataFrame for use as training and testing data in the LSTM model.

This is how I read and split the data at the moment:

def data():
    maxlen = 100
    max_features = 20000
    #read the data
    df = DataFrame(pd.read_json('eth_usd_polo.json'))

    #normalize data
    scaler = MinMaxScaler(feature_range=(-1,1))
    df[['weightedAverage']] = scaler.fit_transform(df[['weightedAverage']])
    X = df[df.columns[-1:]]
    Y = df['weightedAverage']
    X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.33)


    return X_train, X_test, y_train, y_test, max_features, maxlen

From the DataFrame I'm really only interested in the "weightedAverage" column and its corresponding prices, since I'm doing univariate time series forecasting.

And this is where I build the model:

def create_model(X_train, X_test, y_train, y_test, max_features, maxlen):
    #Build the model
    model = Sequential()
    model.add(LSTM(input_shape=(10, 1), return_sequences=True, units=20))
    model.add(Dropout(1))
    model.add(LSTM(20, return_sequences=False))
    #model.add(Flatten())
    model.add(Dropout(0.2))
    model.add(Dense(units=1))
    #model.add(Activation("linear"))

    #compile
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}})

    #the monitor and earlystopping for the model training
    #monitor = EarlyStopping(monitor ='val_loss', patience=5,verbose=1, mode='auto')

    #fit everything together
    #model.fit(x_train ,y_train, validation_data=(x_test, y_test), callbacks =[monitor], verbose=2, epochs=1000)
    model.fit(X_train, y_train,
        batch_size={{choice([64, 128])}},
        epochs=1,
        verbose=2,
        validation_data=(X_test, y_test))

    score, acc = model.evaluate(X_test, y_test, verbose=0)

    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}

The problem seems to arise in how I extract and handle the data from the Pandas DataFrame. The returned data (X_train, X_test, etc.) should be in the form of:

(25000, 10)
[[ data data data .... data data]
 [ data data data .... data data]
.
.
.
[ data data data .... data data]]

Instead it's formatted as:

   (7580, 1)
        weightedAverage
12420       255.151685
20094       871.386896
12099       300.802114

I thought the train_test_split function would help me in splitting and formatting my data to the correct size, but it doesn't seem to do what I want from it.
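
To make the mismatch concrete, here is a minimal sketch on made-up data (the 100-row frame is a hypothetical stand-in for the price file, not the real dataset). It suggests that train_test_split only selects rows and leaves the number of columns as it was, so it cannot by itself produce windows of 10 time steps:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the price column: 100 consecutive values.
df = pd.DataFrame({'weightedAverage': np.arange(100, dtype=float)})

X = df[['weightedAverage']]
y = df['weightedAverage']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# The split only picks (shuffled) rows; each sample still has a single column,
# so no sliding windows over time have been created.
print(X_train.shape, X_test.shape)
```

The windows themselves have to be built before the split, which is the reshaping step the data is missing.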

Any help is greatly appreciated with this!

Tags: python, pandas, tensorflow, keras, lstm

Solution


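After a lot of fiddling and trial and error, I got it working. The data is now formatted correctly for my LSTM, and the model runs well.

It now also handles multivariate input, which I hope will improve the quality of the predictions.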

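import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def data():
    maxlen = 10  # number of past time steps in each input window
    steps = 10   # number of future time steps to predict
    # read the data
    print('Loading data...')
    df = pd.read_json('eth_usd_polo.json')
    df = df.drop('date', axis=1)
    # normalize each column to [-1, 1], keeping one scaler per column
    scalerList = []
    for head in df.columns:
        scaler = MinMaxScaler(feature_range=(-1, 1))
        df[[head]] = scaler.fit_transform(df[[head]])
        scalerList.append(scaler)
    Xtemp = np.array(df)
    # X: (samples, maxlen, features); Y: (samples, steps)
    X = np.zeros((len(Xtemp) - maxlen - steps, maxlen, len(Xtemp[0])))
    Y = np.zeros((len(X), steps))
    for i in range(len(X)):
        # targets: the next `steps` values of column 6
        # (presumably weightedAverage once 'date' is dropped)
        for j in range(steps):
            Y[i][j] = Xtemp[maxlen + i + j][6]
        # inputs: the `maxlen` rows just before the first target, oldest first
        for j in range(len(X[0])):
            for k in range(len(X[0][0])):
                X[i][len(X[0]) - 1 - j][k] = Xtemp[maxlen + i - j - 1][k]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, shuffle=True)
    return X_train, X_test, y_train, y_test, maxlen, steps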
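
The nested loops above can probably be replaced by a vectorized version. The following is only a sketch, not part of the original answer: it assumes NumPy 1.20 or later (for sliding_window_view), and the helper name make_windows, the target_col parameter, and the toy array are all made up for illustration. It aims to build the same X (samples, maxlen, features) and Y (samples, steps) arrays:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, maxlen, steps, target_col):
    """Hypothetical helper: X[i] = values[i:i+maxlen],
    Y[i] = values[i+maxlen:i+maxlen+steps, target_col]."""
    values = np.asarray(values, dtype=float)
    n_samples = len(values) - maxlen - steps  # same sample count as the loops
    # The window axis is appended last: (n_windows, n_features, maxlen).
    x_windows = sliding_window_view(values, maxlen, axis=0)
    # Reorder to (samples, maxlen, features), the layout an LSTM input expects.
    X = x_windows[:n_samples].transpose(0, 2, 1).copy()
    # Windows of length `steps` over the target column, starting after each input window.
    y_windows = sliding_window_view(values[:, target_col], steps)
    Y = y_windows[maxlen:maxlen + n_samples].copy()
    return X, Y


# Toy data: 30 time steps, 3 features (stand-in for the scaled price table).
toy = np.arange(90, dtype=float).reshape(30, 3)
X, Y = make_windows(toy, maxlen=10, steps=10, target_col=2)
print(X.shape, Y.shape)  # (10, 10, 3) (10, 10)
```

In data(), this would correspond to calling make_windows(Xtemp, maxlen, steps, 6) in place of the loops, under the same assumption that column 6 is the target.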
