Index and date problem when plotting predictions

Question

I have a dataframe:

import yfinance as yf
df = yf.download('AAPL',
                 start='2001-01-01',
                 end='2005-12-31',
                 progress=False)

I then split it into training and test sets at an 80:20 ratio. Here is some code that checks the indexes of my training and test sets.

train_df.index

The output is:

[image in the original post: output of train_df.index]

test_df.index

The output is:

[image in the original post: output of test_df.index]

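After fitting the model on the training data, I predict on the 252 test rows, and the result is:

[image in the original post: prediction output with an integer index]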

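How can I turn the prediction output into a dataframe with a datetime (%Y%m%d) index instead of an integer index? I have read many articles and answers on Stack Overflow, but I have not found a solution.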

Tags: python, dataframe, datetime, indexing, prediction

Answer


One thing you can do is save the datetime index before model training/inference, and then join it back onto the RangeIndex afterwards.

That is:

time_index = df.reset_index()[['utc']] #replace utc with your index name
df = df.reset_index()

Train the model, then join on the RangeIndex, and finally set the index back to the DatetimeIndex.

prediction = prediction.join(time_index)
prediction.set_index('utc', inplace=True)

Working example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':np.arange(10)}, index=pd.date_range('2021-01-01', '2021-01-10'))
df.index.name = 'Date'
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]

#Some arbitrary prediction dataframe with a RangeIndex
prediction = pd.DataFrame({'predictions':np.arange(0,10)})

#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)

#Sets index to the time_index
prediction.set_index('Date', inplace=True)

You will now have a dataframe that looks like this:

            predictions
Date
2021-01-01            0
2021-01-02            1
2021-01-03            2
2021-01-04            3
2021-01-05            4
2021-01-06            5
2021-01-07            6
2021-01-08            7
2021-01-09            8
2021-01-10            9
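
If the test set still carries its DatetimeIndex (as test_df in the question appears to), a shorter route may be to pass that index straight to the prediction dataframe, with no join at all. A minimal sketch, assuming a chronological 80:20 split and synthetic data in place of the real prices; the column names feature and target are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the price data, indexed by business days (hypothetical data)
dates = pd.date_range('2021-01-01', periods=100, freq='B', name='Date')
df = pd.DataFrame({'feature': np.arange(100, dtype=float)}, index=dates)
df['target'] = 2.0 * df['feature'] + 1.0

# Chronological 80:20 split, no shuffling, so both parts keep their DatetimeIndex
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split], df.iloc[split:]

model = LinearRegression()
model.fit(train_df[['feature']], train_df['target'])

# predict returns a plain NumPy array; reuse the test index to label the rows
predicted_values = model.predict(test_df[['feature']])
prediction = pd.DataFrame({'y-pred': predicted_values}, index=test_df.index)
```

This only works when the rows passed to predict are in the same order as test_df, which is the case here because the array comes directly from test_df.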

Just to drive this home, here is a concrete example using your data source:

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = yf.download('AAPL',
                 start='2001-01-01',
                 end='2005-12-31',
                 progress=False)

#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
df = df.reset_index()

#Assuming we predict Volume
y = df[['Volume']]
X = df.drop(columns=['Volume', 'Date'])

#Note: train_test_split shuffles rows by default, so the test dates are not in chronological order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

#Predict values, transpose to fit into dataframe
predicted_values = model.predict(X_test).T[0]

#Create prediction dataframe
prediction = pd.DataFrame({'y-pred':predicted_values}, index=X_test.index)

#join test or true data to prediction for comparison
prediction = prediction.join(y_test)

#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)

#Sets index to the time_index
prediction.set_index('Date', inplace=True)

This results in:


                  y-pred      Volume
Date
2001-07-26  3.893012e+08   369140800
2004-12-20  1.191681e+09  1168126400
2005-02-17  8.905975e+08  1518473600
2002-12-03  2.004725e+08   227869600
2005-10-10  8.430103e+08   50750560
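
The dates above come out unordered because train_test_split shuffles by default. If you want them sorted, or shown as %Y%m%d strings as the question asks, something like the following should work. This is a small sketch on a hand-made frame that mimics the result above (three of the y-pred values are copied from that output); the name formatted is just illustrative:

```python
import pandas as pd

# Hand-made frame shaped like the result above, dates deliberately out of order
prediction = pd.DataFrame(
    {'y-pred': [3.893012e+08, 1.191681e+09, 2.004725e+08]},
    index=pd.to_datetime(['2001-07-26', '2004-12-20', '2002-12-03']),
)
prediction.index.name = 'Date'

# Put the rows back in chronological order
prediction = prediction.sort_index()

# Optional: replace the DatetimeIndex with %Y%m%d strings.
# The result is an index of plain strings, no longer a DatetimeIndex.
formatted = prediction.copy()
formatted.index = prediction.index.strftime('%Y%m%d')
```

Keeping the real DatetimeIndex in prediction and formatting only a copy for display leaves date-based slicing and resampling available on the original.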
