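python - How do I build a DataFrame for time-series data with explicit timestamps?
Question
For my experiment I have a formatted CSV file that looks like an [N x M] matrix, where N = 40 is the total number of cycles (timestamps) and M = 1440 is the number of pixels. For each cycle I have 1440 pixel values, one per pixel, as shown below: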
timestamps[row_index] | feature1 | feature2 | ... | feature1439 | feature1440 |
-----------------------------------------------------------------
1 | 1.00 | 0.32 | 0.30 | 0.30 | 0.30 |
2 | 0.35 | 0.33 | 0.30 | 0.30 | 0.30 |
3 | 1.00 | 0.33 | 0.30 | 0.30 | 0.30 |
... | .... | .... | .... | .... | .... |
| -1.00 | 0.26 | 0.30 | 0.30 | 0.30 |
| 0.67 | 0.03 | 0.30 | 0.30 | 0.30 |
30 | 0.75 | 0.42 | 0.30 | 0.30 | 0.30 |
________________________________________________________________________________
31 | -0.36 | 0.42 | 0.30 | 0.30 | 0.30 |
... | .... | .... | .... | .... | .... |
40 | 1.00 | 0.34 | 0.30 | 0.30 | -1.00 |
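I want to split the dataset into a training set and a test set such that:
The training set contains the information for timestamps [1-30]
The test set contains the information for timestamps [31-40]
The problem is that after training the NN I cannot get a correct continuous plot. This is most likely caused by a poor data-splitting technique: I used train_test_split and never tried TimeSeriesSplit, as follows: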
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2 , shuffle=False)
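Since I already pass shuffle=False, I expect the last 0.2 of the data to be treated as test data. I can plot it, but I still cannot access the cycle numbers of the rows that ended up in the test data, so the test curve starts at 0 instead of continuing from the last cycle of the training data!
I wonder whether it would be better to load the data into a pd.DataFrame and try slicing it with pd.Timestamp, as described in this post. Would that help, or is it unnecessary?
Update - full code: my column labels follow the pattern below, and I only need to predict 960 of the 1440 columns: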
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.layers import Dense , Activation , BatchNormalization
from keras.layers import Dropout
from keras.layers import LSTM,SimpleRNN
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop
data_train = pd.read_csv(r"D:\train.csv", header=None)  # raw string so that \t is not read as a tab
#select interested columns to predict 980 out of 1440
j=0
index=[]
for i in range(1439):
    if j==2:
        j=0
        continue
    else:
        index.append(i)
        j+=1
Y_train= data_train[index]
data_train = data_train.values
print("data_train size: {}".format(Y_train.shape))
Creating the history (look-back windows)
def create_dataset(dataset,data_train,look_back=1):
    dataX,dataY = [],[]
    print("Len:",len(dataset)-look_back-1)
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), :]
        dataX.append(a)
        dataY.append(data_train[i + look_back, :])
    return np.array(dataX), np.array(dataY)
look_back = 10
trainX,trainY = create_dataset(data_train,Y_train, look_back=look_back)
#testX,testY = create_dataset(data_test,Y_test, look_back=look_back)
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2)
print("train size: {}".format(trainX.shape))
print("train Label size: {}".format(trainY.shape))
print("test size: {}".format(testX.shape))
print("test Label size: {}".format(testY.shape))
Len: 29
train size: (23, 10, 1440)
train Label size: (23, 960)
test size: (6, 10, 1440)
test Label size: (6, 960)
The RNN, LSTM, and GRU implementations are similar:
# create and fit the SimpleRNN model
model_RNN = Sequential()
model_RNN.add(SimpleRNN(units=1440, input_shape=(trainX.shape[1], trainX.shape[2])))
model_RNN.add(Dense(960))
model_RNN.add(BatchNormalization())
model_RNN.add(Activation('tanh'))
model_RNN.compile(loss='mean_squared_error', optimizer='adam')
callbacks = [
    EarlyStopping(patience=10, verbose=1),
    ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.00001, verbose=1)]
hist_RNN=model_RNN.fit(trainX, trainY, epochs =50, batch_size =20,validation_data=(testX,testY),verbose=1, callbacks=callbacks)
Finally, I expect the following output plot:
Y_RNN_Test_pred=model_RNN.predict(testX)
test_RNN= pd.DataFrame.from_records(Y_RNN_Test_pred)
test_RNN.to_csv('New/ttest_RNN_history.csv', sep=',', header=None, index=None)
test_MSE=mean_squared_error(testY, Y_RNN_Test_pred)
plt.plot(trainY[:,0],'b-',label='Train data')
plt.plot(testY[:,0],'c-',label='Test data')
plt.plot(Y_RNN_Test_pred[:,0],'r-',label='prediction')
Answer
It is just a small problem with the index.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
df = pd.read_csv('Train.csv', header=None)
# I'm not sure which column is the label, so I use df[0]
# and exclude this column from the data via df.loc[:,df.columns!=0]
trainX,testX,trainY,testY = train_test_split(df.loc[:,df.columns!=0],df[0], test_size=0.2, shuffle=False)
plt.plot(trainY)
plt.plot(testY)
Looks good. :-)
So now we want to predict:
from sklearn.svm import SVR
reg = SVR(C=1, gamma='auto')
reg.fit(trainX, trainY)
predY = reg.predict(testX)
plt.plot(trainY)
plt.plot(testY)
plt.plot(predY)
The index is wrong :-( Let's fix that, for example by using the index of testY:
plt.plot(trainY)
plt.plot(testY)
plt.plot(testY.index,predY)
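Why the prediction starts at 0: predict() returns a plain NumPy array, which has no index, so matplotlib numbers its points from 0. The sketch below illustrates this with made-up data; labels, the 4/5 split by position, and the "+ 0.05" predictions are all assumptions chosen for illustration and are not taken from the question's file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the label column: 40 cycles, made-up values.
labels = pd.Series(np.linspace(0.0, 1.0, 40))

# Chronological split by position (first 4/5 for training, the rest for testing).
split = len(labels) * 4 // 5
train_y, test_y = labels.iloc[:split], labels.iloc[split:]

# Hypothetical predictions: a bare array, as a model's predict() would return.
pred_y = test_y.to_numpy() + 0.05

# Re-attach the test index so the predictions line up with the test rows.
pred_series = pd.Series(pred_y, index=test_y.index)

print(test_y.index.tolist()[:3])       # [32, 33, 34]
print(pred_series.index.tolist()[:3])  # [32, 33, 34]
```

Plotting pred_series (or passing testY.index as the x values, as above) then places the prediction over the test segment instead of at the origin.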
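Edit
A more general solution is to take a range over the length of the training set and use it as the index. Do the same for testY and predY, just with a different start value (the length of trainY):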
trainY.index = range(len(trainY))
testY.index = range(len(trainY), len(trainY)+len(testY))
#Maybe convert to DataFrame first
predY = pd.DataFrame(predY)
predY.index = range(len(trainY), len(trainY)+len(predY))
plt.plot(trainY)
plt.plot(testY)
plt.plot(predY)
Edit based on your new code
trainY.index = range(len(trainY))
testY.index = range(len(trainY), len(trainY)+len(testY))
test_RNN.index = range(len(trainY), len(trainY)+len(test_RNN))
plt.plot(trainY,'b-',label='Train data')
plt.plot(testY,'c-',label='Test data')
plt.plot(test_RNN,'r-',label='prediction')
Edit 2
OK, let's walk through your code step by step:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from keras.layers import Dense , Activation , BatchNormalization
from keras.layers import Dropout
from keras.layers import LSTM,SimpleRNN
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop
data_train = pd.read_csv("Train.csv", header=None)
#select interested columns to predict 980 out of 1440
Actually, you only select 960 columns for prediction, see below.
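#j=0
#index=[]
#for i in range(1439):
#    if j==2:
#        j=0
#        continue
#    else:
#        index.append(i)
#        j+=1
index = [i for i in range(1440) if i%3!=2]
If I understand your loop correctly, you want to keep two out of every three columns. A list comprehension does that a bit faster: index = [i for i in range(1440) if i%3!=2]. You probably also want to consider all columns, so use 1440 instead of 1439.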
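As a quick sanity check, the original loop and the list comprehension can be compared directly. This is only a sketch; the function names loop_indices and comprehension_indices are made up for the comparison.

```python
def loop_indices(n_columns):
    """Reproduce the question's loop: keep two columns, skip the third."""
    selected, j = [], 0
    for i in range(n_columns):
        if j == 2:
            j = 0
            continue
        selected.append(i)
        j += 1
    return selected


def comprehension_indices(n_columns):
    """Same selection written as a list comprehension."""
    return [i for i in range(n_columns) if i % 3 != 2]


# Both approaches pick exactly the same columns.
assert loop_indices(1439) == comprehension_indices(1439)

# Column 1439 is itself a "third" column (1439 % 3 == 2), so it is skipped anyway.
print(len(comprehension_indices(1439)), len(comprehension_indices(1440)))  # 960 960
```

So the upper bound of 1439 happens to give the same 960 columns as 1440, which matches the (40, 960) shape reported for Y_train.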
Y_train= data_train[index]
# data_train = data_train.values  # left out: data_train has to stay a DataFrame for the column selection that follows
print("data_train size: {}".format(Y_train.shape))
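In your code, the shape of Y_train is (40, 960). So you want to predict 960 variables, right? If so, the "clean" way is to remove those columns from data_train (and build X_train from what is left):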
index2 = [i for i in list(range(1440)) if i%3==2]
X_train = data_train[index2]
Now let's check the shapes:
print("X_train size: {}".format(X_train.shape))
print("Y_train size: {}".format(Y_train.shape))
>X_train size: (40, 480)
>Y_train size: (40, 960)
Seems right... ;-)
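An alternative way to express "remove those columns" is DataFrame.drop, which derives the inputs as the complement of the targets, so no second index list is needed. A minimal sketch on an all-zero frame with the same shape as the data in the question; the names data and target_cols are made up here.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the question's data: 40 cycles x 1440 pixels of zeros.
data = pd.DataFrame(np.zeros((40, 1440)))

# Columns to predict: two out of every three.
target_cols = [c for c in range(1440) if c % 3 != 2]

Y = data[target_cols]
X = data.drop(columns=target_cols)  # everything that is not a target

print(X.shape, Y.shape)  # (40, 480) (40, 960)
```

Because X is defined as "whatever is left", the two column sets cannot overlap by construction.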
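I made a few changes to the next part:
- You don't need to subtract 1 in the range (for i in range(len(dataset)-look_back):). Unlike in some other programming languages, range in Python does not include the end value; for example, list(range(0,3)) gives [0,1,2]. This may be where your missing values (the last one) went...
- I also pass the values of Y_train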
def create_dataset(dataset,data_train,look_back=1):
    dataX,dataY = [],[]
    for i in range(len(dataset)-look_back):
        a = dataset[i:(i+look_back), :]
        dataX.append(a)
        dataY.append(data_train[i+look_back, :])
    return np.array(dataX), np.array(dataY)
look_back = 10
trainX,trainY = create_dataset(X_train.values, Y_train.values, look_back=look_back)
trainX, testX, trainY, testY = train_test_split(trainX,trainY, test_size=0.2)
print("train size: {}".format(trainX.shape))
print("train Label size: {}".format(trainY.shape))
print("test size: {}".format(testX.shape))
print("test Label size: {}".format(testY.shape))
>train size: (24, 10, 480)
>train Label size: (24, 960)
>test size: (6, 10, 480)
>test Label size: (6, 960)
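One thing to watch: train_test_split shuffles by default, so if the test windows are meant to be the last ones in time, shuffle=False has to be passed here as well. The sketch below repeats the windowing on small made-up arrays (3 input and 2 target columns instead of 480 and 960) and checks that the test targets are the final rows; the array contents are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def create_dataset(features, targets, look_back=1):
    """Pair each window of look_back rows with the row that follows it."""
    data_x, data_y = [], []
    for i in range(len(features) - look_back):
        data_x.append(features[i:i + look_back, :])
        data_y.append(targets[i + look_back, :])
    return np.array(data_x), np.array(data_y)


# Toy data: same number of cycles as the question, far fewer columns.
n_cycles, look_back = 40, 10
features = np.arange(n_cycles * 3, dtype=float).reshape(n_cycles, 3)
targets = np.arange(n_cycles * 2, dtype=float).reshape(n_cycles, 2)

X, Y = create_dataset(features, targets, look_back=look_back)
print(X.shape, Y.shape)  # (30, 10, 3) (30, 2)

# shuffle=False keeps the windows in chronological order.
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.2, shuffle=False)
print(train_x.shape, test_x.shape)  # (24, 10, 3) (6, 10, 3)
```

With this split the first test target is row 34 and the last one is row 39 (0-based), so the test segment directly continues the training segment.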
For training I had to add two imports, from keras.callbacks import EarlyStopping, ReduceLROnPlateau, so:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
# create and fit the SimpleRNN model
model_RNN = Sequential()
model_RNN.add(SimpleRNN(units=1440, input_shape=(trainX.shape[1], trainX.shape[2])))
model_RNN.add(Dense(960))
model_RNN.add(BatchNormalization())
model_RNN.add(Activation('tanh'))
model_RNN.compile(loss='mean_squared_error', optimizer='adam')
callbacks = [
    EarlyStopping(patience=10, verbose=1),
    ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.00001, verbose=1)]
hist_RNN=model_RNN.fit(trainX, trainY, epochs =50, batch_size =20,validation_data=(testX,testY),verbose=1, callbacks=callbacks)
Make the predictions (unmodified):
Y_RNN_Test_pred=model_RNN.predict(testX)
test_RNN= pd.DataFrame.from_records(Y_RNN_Test_pred)
#test_RNN.to_csv('New/ttest_RNN_history.csv', sep=',', header=None, index=None)
test_MSE=mean_squared_error(testY, Y_RNN_Test_pred)
And plot the data with the adjusted x-axis values, as described above:
x_start = range(look_back, look_back+len(trainY))
x_train_start = range(look_back + len(trainY), look_back + len(trainY)+len(testY))
x_pred_start = range(look_back + len(trainY), look_back +len(trainY)+len(Y_RNN_Test_pred))
plt.plot(x_start, trainY[:,0],'b-',label='Train data')
plt.plot(x_train_start, testY[:,0],'c-',label='Test data')
plt.plot(x_pred_start, Y_RNN_Test_pred[:,0],'r-',label='prediction')
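The x-axis arithmetic above can be wrapped in a small helper. This is a sketch under the assumption that the split is chronological (shuffle=False), because otherwise the test targets are not the last rows; the name target_positions is made up here.

```python
def target_positions(n_train, n_test, look_back):
    """Return the 0-based row positions of the train and test targets.

    The first look_back rows only serve as input for the first window,
    so the first target sits at position look_back.
    """
    train_pos = range(look_back, look_back + n_train)
    test_pos = range(look_back + n_train, look_back + n_train + n_test)
    return train_pos, test_pos


train_pos, test_pos = target_positions(n_train=24, n_test=6, look_back=10)
print(train_pos[0], train_pos[-1])  # 10 33
print(test_pos[0], test_pos[-1])    # 34 39
```

Both the test data and the prediction can be plotted against test_pos; adding 1 to each position gives the 1-based cycle numbers used in the question's table.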