How to convert a Pandas DataFrame into input for a Keras RNN for a multivariate classification problem

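Problem description

I have a pandas DataFrame and I want to build a recurrent neural network model on it. Can anyone explain how to convert a pandas DataFrame into sequences?

Everywhere I looked only explains how an RNN works with simple arrays, not with pandas DataFrames. My target variable is the "Label" column, which has 5 classes.

Below is my code. I get an error when I call model.fit; I have attached an image of the error.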

import numpy
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from sklearn import metrics
# fix random seed for reproducibility
numpy.random.seed(7)

# AllDataSelFeLabEnc DataFrame
    Flow_IAT_Max    Fwd_IAT_Std   Pkt_Len_Max   Fwd_Pkt_Len_Std   Label
0   591274.0        11125.35538   32             0.0                3
1   633973.0        12197.74612   32             0.0                3
2   591242.0        12509.82212   32             0.0                3
3   2.0             0.0           0              0.0                2
4   1.0             0.0           0              0.0                2
5   460.0           0.000000      0              0.000000           1
6   10551.0         311.126984    326            188.216188         1
7   476.0           0.000000      0              0.000000           1
8   4380481.0       2185006.405   935            418.144712         0
9   4401241.0       2192615.483   935            418.144712         0
10  3364844.0       1675797.985   935            418.144712         0
11  4380481.0       2185006.405   935            418.144712         0
12  43989.0         9929.900528    0             0.0                4

# define y variable, i.e., what I want to predict
y_col='Label' 

X = AllDataSelFeLabEnc.drop(y_col,axis=1).copy()
y = AllDataSelFeLabEnc[[y_col]].copy() 
# the double brackets here keep y as a DataFrame; otherwise it would be a pandas Series
print(X.shape,y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

length = 500


n_input = 25 #how many samples/rows/timesteps to look in the past in order to forecast the next sample
n_features= X_train.shape[1] # how many predictors/Xs/features we have to predict y
b_size = 32 # Number of timeseries samples in each batch


# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(5000, embedding_vecor_length, input_length=length))
model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(model.summary())


model.fit(X_train, y_train, epochs=3, batch_size=64)

[![Error I'm getting][1]][1]


# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


y_pred = model.predict(X_test)

# Print the confusion matrix
print(metrics.confusion_matrix(y_test,y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_test, y_pred, digits=3))

Tags: python, pandas, keras, recurrent-neural-network

Solution


From the Keras documentation for LSTM:

Input: a 3D tensor with shape [batch, timesteps, feature].

So in your case the required shape is [32, 25, 4], i.e. [b_size, n_input, n_features].
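As a rough sketch of a model that accepts input of that shape, assuming the goal is 5-class classification on the Label column: the layer size of 32, the softmax output and the sparse_categorical_crossentropy loss below are illustrative assumptions, not something stated in the original post. No Embedding layer is used here on the assumption that the features are already numeric rather than integer token indices.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense

n_input = 25     # timesteps per sequence (from the question)
n_features = 4   # feature columns in the question's DataFrame
n_classes = 5    # distinct values in the Label column

model = Sequential([
    Input(shape=(n_input, n_features)),        # one sample = (timesteps, features)
    LSTM(32),                                  # 32 units is an arbitrary choice
    Dense(n_classes, activation="softmax"),    # one probability per class
])
# Assumption: labels are integer class ids 0..4, hence the sparse loss.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])

# Dummy batch of 2 sequences, only to check the shapes line up.
probs = model.predict(np.zeros((2, n_input, n_features)), verbose=0)
```

With a setup like this, predicted class ids would be obtained with probs.argmax(axis=1) before passing them to something like confusion_matrix.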

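I don't think a DataFrame can represent this directly, unless you turn the input data into an array of DataFrames.

So here is how to do it with numpy, which I think is the simplest and most efficient way: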

import numpy as np

# Keep only the rows that fill complete windows of n_input timesteps; the integer
# division handles the case where the row count is not a multiple of n_input.
# .iloc slices by position (end exclusive), so no "- 1" is needed.
n_windows = len(X_train) // n_input
x = X_train.iloc[:n_windows * n_input].to_numpy()
X_train = np.reshape(x, (n_windows, n_input, n_features))
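The answer does not say how the labels should follow the reshaped features, so the following is only a self-contained sketch of the same window slicing with one possible convention: each window takes the label of its last row. The DataFrame df, its random values and the name feature_cols are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 60 rows, 4 features, labels 0..4.
rng = np.random.default_rng(7)
df = pd.DataFrame(
    rng.random((60, 4)),
    columns=["Flow_IAT_Max", "Fwd_IAT_Std", "Pkt_Len_Max", "Fwd_Pkt_Len_Std"],
)
df["Label"] = rng.integers(0, 5, size=len(df))

n_input = 25                                   # timesteps per window
feature_cols = [c for c in df.columns if c != "Label"]
n_features = len(feature_cols)

# Number of complete windows; leftover rows at the end are dropped.
n_windows = len(df) // n_input
usable = df.iloc[: n_windows * n_input]

# Features: (rows, features) -> (windows, timesteps, features)
X_seq = usable[feature_cols].to_numpy().reshape(n_windows, n_input, n_features)

# Assumption: a window is labelled with the label of its last row.
y_seq = usable["Label"].to_numpy().reshape(n_windows, n_input)[:, -1]

print(X_seq.shape, y_seq.shape)                # (2, 25, 4) (2,)
```

Because train_test_split shuffles rows by default, it is probably safer to build the windows first and split the resulting windows afterwards, so that each window still contains consecutive rows.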

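Note

The code above does not build rolling windows; it cuts the data into non-overlapping window slices. With 50 samples you get only 2 windows, rather than the 26 overlapping windows 1-25, 2-26, 3-27, and so on up to 26-50.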
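If overlapping windows are what you need, one way to build them is with numpy index arithmetic. This is a minimal sketch, not part of the original answer: rolling_windows is a hypothetical helper name, and the array x is filled with made-up values.

```python
import numpy as np

def rolling_windows(x, n_input):
    """Return every overlapping window of n_input consecutive rows of a 2D array."""
    n_windows = x.shape[0] - n_input + 1
    if n_windows < 1:
        raise ValueError("need at least n_input rows")
    # Row i of idx holds the row numbers of window i: [i, i+1, ..., i+n_input-1].
    idx = np.arange(n_windows)[:, None] + np.arange(n_input)[None, :]
    return x[idx]  # shape: (n_windows, n_input, n_features)

# 50 samples with 4 features each (made-up values).
x = np.arange(50 * 4, dtype=float).reshape(50, 4)
windows = rolling_windows(x, n_input=25)
print(windows.shape)  # (26, 25, 4)
```

Note that this copies the data for every window, so memory use grows roughly by a factor of n_input compared with the non-overlapping slicing.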

