python - Transforming data from Pandas dataframe to time series training data for keras LSTM
Problem description
I'm using Keras along with Hyperas to work with an LSTM network for predicting prices. I'm having problems converting my data from a Pandas DataFrame into training and testing data for the LSTM model.
This is how I read and split the data at the moment:
def data():
    maxlen = 100
    max_features = 20000
    # read the data
    df = DataFrame(pd.read_json('eth_usd_polo.json'))
    # normalize data
    scaler = MinMaxScaler(feature_range=(-1,1))
    df[['weightedAverage']] = scaler.fit_transform(df[['weightedAverage']])
    X = df[df.columns[-1:]]
    Y = df['weightedAverage']
    X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.33)
    return X_train, X_test, y_train, y_test, max_features, maxlen
From the dataframe I'm really only interested in the "weightedAverage" column and its corresponding prices, since I'm doing univariate time series forecasting.
And this is where I build the model:
def create_model(X_train, X_test, y_train, y_test, max_features, maxlen):
    # Build the model
    model = Sequential()
    model.add(LSTM(input_shape=(10, 1), return_sequences=True, units=20))
    model.add(Dropout(1))
    model.add(LSTM(20, return_sequences=False))
    #model.add(Flatten())
    model.add(Dropout(0.2))
    model.add(Dense(units=1))
    #model.add(Activation("linear"))
    # compile
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}})
    # the monitor and early stopping for the model training
    #monitor = EarlyStopping(monitor ='val_loss', patience=5,verbose=1, mode='auto')
    # fit everything together
    #model.fit(x_train ,y_train, validation_data=(x_test, y_test), callbacks =[monitor], verbose=2, epochs=1000)
    model.fit(X_train, y_train,
              batch_size={{choice([64, 128])}},
              epochs=1,
              verbose=2,
              validation_data=(X_test, y_test))
    score, acc = model.evaluate(X_test, y_test, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}
The problem seems to arise in how I extract and handle the data from the Pandas DataFrame. The returned data (X_train, X_test, etc.) should have this form:
(25000, 10)
[[ data data data .... data data]
[ data data data .... data data]
.
.
.
[ data data data .... data data]]
Instead it's formatted to:
(7580, 1)
weightedAverage
12420 255.151685
20094 871.386896
12099 300.802114
I thought the train_test_split function would help me in splitting and formatting my data to the correct size, but it doesn't seem to do what I want from it.
Any help is greatly appreciated with this!
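For context, here is a minimal sketch of why the shape stays at one column. It uses a toy array standing in for the "weightedAverage" column (the values and sizes are made up for illustration). train_test_split only partitions the rows it is given, shuffling them by default, so it cannot turn a single column into windows of 10 time steps.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the single 'weightedAverage' column: 20 rows, 1 column.
prices = np.arange(20, dtype=float).reshape(-1, 1)

# train_test_split only partitions (and by default shuffles) the rows;
# the number of columns is left unchanged.
X_train, X_test = train_test_split(prices, test_size=0.25, random_state=0)

print(X_train.shape)  # (15, 1)
print(X_test.shape)   # (5, 1)
```

The windows of past values therefore have to be built explicitly before splitting.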
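Solution
After a lot of fiddling and trial and error, I got it working. The data for my LSTM network is now formatted correctly and the model runs well.
It now also handles multivariate input, which I hope will improve the quality of the predictions.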
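import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def data():
    maxlen = 10   # number of past time steps in each input window
    steps = 10    # number of future time steps to predict
    # read the data
    print('Loading data...')
    df = pd.read_json('eth_usd_polo.json')
    df = df.drop('date', axis=1)
    # normalize each column to [-1, 1], keeping one scaler per column
    scalerList = []
    for head in df.dtypes.index:
        scaler = MinMaxScaler(feature_range=(-1,1))
        df[[head]] = scaler.fit_transform(df[[head]])
        scalerList.append(scaler)
    Xtemp = np.array(df)
    # X: (samples, maxlen, features), Y: (samples, steps)
    X = np.zeros((len(Xtemp)-maxlen-steps,maxlen,len(Xtemp[0])))
    Y = np.zeros((len(X),steps))
    for i in range(0, len(X)):
        # targets: the next `steps` values of column 6 (presumably 'weightedAverage')
        for j in range(steps):
            Y[i][j] = Xtemp[maxlen+i+j][6]
        # inputs: the `maxlen` rows preceding the targets, all features
        for j in range(len(X[0])):
            for k in range(len(X[0][0])):
                X[i][len(X[0])-1-j][k] = Xtemp[maxlen+i-j-1][k]
    X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.33, shuffle=True)
    return X_train, X_test, y_train, y_test, maxlen, steps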
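For each sample i, the nested loops above take the rows Xtemp[i:i+maxlen] as the input window and the next `steps` values of column 6 as the target. As a rough sketch, the same arrays could be built without Python loops using NumPy's sliding_window_view (available from NumPy 1.20). The helper name make_windows and the toy data are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, maxlen, steps, target_col):
    """Return X of shape (samples, maxlen, features) and Y of shape (samples, steps)."""
    values = np.asarray(values, dtype=float)
    # Same sample count as the loop version.
    n_samples = len(values) - maxlen - steps
    # Windows along the time axis come out as (n_windows, features, maxlen).
    X = sliding_window_view(values, maxlen, axis=0)[:n_samples]
    # Reorder to (samples, maxlen, features), the layout an LSTM layer expects.
    X = X.transpose(0, 2, 1)
    # Targets: the `steps` values of the target column that follow each window.
    Y = sliding_window_view(values[maxlen:, target_col], steps)[:n_samples]
    # Copy so the results are ordinary writable arrays rather than views.
    return X.copy(), Y.copy()

# Toy data: 12 time steps, 2 features (hypothetical values).
values = np.arange(24, dtype=float).reshape(12, 2)
X, Y = make_windows(values, maxlen=3, steps=2, target_col=1)

print(X.shape)  # (7, 3, 2)
print(Y.shape)  # (7, 2)
```

With the data from the answer, the call would presumably be make_windows(Xtemp, maxlen, steps, target_col=6), followed by the same train_test_split as before.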