python - How do I add lags to time series data?
Problem description
I have a DataFrame df with TV spot data for each product id:
| start_date | end_date | id | f1 | f2
0 | 2020-01-01 | 2020-01-02 | 1 | 111 | 222
1 | 2020-01-05 | 2020-01-07 | 1 | 111 | 222
2 | 2020-01-01 | 2020-01-02 | 3 | 333 | 444
3 | 2020-01-05 | 2020-01-07 | 3 | 555 | 666
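Now I want to add lags of 0 to 2 days to use as features in a forecasting model.
The date range given by "start_date" and "end_date" should then be exploded into a single "date" column, so that I have one row per date instead of a date range.
But I don't know how to achieve this.
The final result should look like this: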
| date | id | f1_lag_0 | f2_lag_0 | f1_lag_1 | f2_lag_1 | f1_lag_2 | f2_lag_2
0 | 2020-01-01 | 1 | 111 | 222 | 111 | 222 | 111 | 222
1 | 2020-01-02 | 1 | 111 | 222 | 111 | 222 | 111 | 222
2 | 2020-01-03 | 1 | NaN | NaN | 111 | 222 | 111 | 222
3 | 2020-01-04 | 1 | NaN | NaN | NaN | NaN | 111 | 222
0 | 2020-01-05 | 1 | 111 | 222 | 111 | 222 | 111 | 222
1 | 2020-01-06 | 1 | 111 | 222 | 111 | 222 | 111 | 222
2 | 2020-01-07 | 1 | 111 | 222 | 111 | 222 | 111 | 222
3 | 2020-01-08 | 1 | NaN | NaN | 111 | 222 | 111 | 222
4 | 2020-01-09 | 1 | NaN | NaN | NaN | NaN | 111 | 222
0 | 2020-01-01 | 3 | 333 | 444 | 333 | 444 | 333 | 444
1 | 2020-01-02 | 3 | 333 | 444 | 333 | 444 | 333 | 444
2 | 2020-01-03 | 3 | NaN | NaN | 333 | 444 | 333 | 444
3 | 2020-01-04 | 3 | NaN | NaN | NaN | NaN | 333 | 444
0 | 2020-01-05 | 3 | 555 | 666 | 555 | 666 | 555 | 666
1 | 2020-01-06 | 3 | 555 | 666 | 555 | 666 | 555 | 666
2 | 2020-01-07 | 3 | 555 | 666 | 555 | 666 | 555 | 666
3 | 2020-01-08 | 3 | NaN | NaN | 555 | 666 | 555 | 666
4 | 2020-01-09 | 3 | NaN | NaN | NaN | NaN | 555 | 666
Code to create the dummy df:
import pandas as pd

df = pd.DataFrame(
    {
        "start_date": [
            "2020-01-01",
            "2020-01-05",
            "2020-01-01",
            "2020-01-06",
        ],
        "end_date": [
            "2020-01-02",
            "2020-01-07",
            "2020-01-02",
            "2020-01-07",
        ],
        "id": ["1", "1", "3", "3"],
        "feature1": ["111", "111", "333", "555"],
        "feature2": ["222", "222", "444", "666"],
    }
)
Solution
Use:
# list of feature columns
cols = ['feature1', 'feature2']
# convert both date columns to datetimes
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# number of extra days to append after each range (the maximum lag)
N = 2
# rows needed per range: inclusive day count plus N trailing days
dif = df['end_date'].sub(df['start_date']).dt.days + 1 + N
# repeat each row by its row count (the original index labels are kept)
df = df.loc[df.index.repeat(dif)].copy()
# add a running day offset to each start date
df['start_date'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='D')
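To see what the repeat-and-offset step does in isolation, here is a minimal sketch on a single date range. The frame `demo` and the variable names are made up for illustration, and it assumes 2 extra days are appended so that lags up to 2 have a row to land on.

```python
import pandas as pd

# Hypothetical one-row frame, used only to illustrate the expansion step.
demo = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01"]),
    "end_date": pd.to_datetime(["2020-01-02"]),
})

N = 2  # assumed number of trailing days (the maximum lag)
n_rows = demo["end_date"].sub(demo["start_date"]).dt.days + 1 + N

# Repeat the row n_rows times; every copy keeps the original index label,
# so groupby(level=0) can later identify the copies of one source row.
expanded = demo.loc[demo.index.repeat(n_rows)].copy()

# cumcount() numbers the copies 0, 1, 2, ... within each source row.
offset = expanded.groupby(level=0).cumcount()
expanded["date"] = expanded["start_date"] + pd.to_timedelta(offset, unit="D")

dates = expanded["date"].dt.strftime("%Y-%m-%d").tolist()
print(dates)
```

The range 2020-01-01 to 2020-01-02 covers 2 days, so with 2 trailing days the row is repeated 4 times and the dates run through 2020-01-04.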
Finally, use shift within each group:
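# lag j is the feature shifted up by (2 - j) rows within each original range
for j, i in enumerate(range(2, -1, -1)):
    df[[f'f1_lag_{j}', f'f2_lag_{j}']] = df.groupby(level=0)[cols].shift(-i)
# drop the helper columns and rename the expanded start_date to date
df = (df.drop(cols, axis=1)
        .drop('end_date', axis=1)
        .rename(columns={'start_date': 'date'})
        .reset_index(drop=True))
print(df)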
date id f1_lag_0 f2_lag_0 f1_lag_1 f2_lag_1 f1_lag_2 f2_lag_2
0 2020-01-01 1 111 222 111 222 111 222
1 2020-01-02 1 111 222 111 222 111 222
2 2020-01-03 1 NaN NaN 111 222 111 222
3 2020-01-04 1 NaN NaN NaN NaN 111 222
4 2020-01-05 1 111 222 111 222 111 222
5 2020-01-06 1 111 222 111 222 111 222
6 2020-01-07 1 111 222 111 222 111 222
7 2020-01-08 1 NaN NaN 111 222 111 222
8 2020-01-09 1 NaN NaN NaN NaN 111 222
9 2020-01-01 3 333 444 333 444 333 444
10 2020-01-02 3 333 444 333 444 333 444
11 2020-01-03 3 NaN NaN 333 444 333 444
12 2020-01-04 3 NaN NaN NaN NaN 333 444
13 2020-01-06 3 555 666 555 666 555 666
14 2020-01-07 3 555 666 555 666 555 666
15 2020-01-08 3 NaN NaN 555 666 555 666
16 2020-01-09 3 NaN NaN NaN NaN 555 666
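The same steps could be wrapped in a reusable function. The following is only a sketch: `add_lags` and its parameters are hypothetical names, not a pandas API, and it names the new columns after the source columns (`feature1_lag_0` and so on) rather than the shorter `f1_lag_0` used above. It assumes `start_date` and `end_date` are inclusive and parseable by `pd.to_datetime`.

```python
import pandas as pd


def add_lags(df, cols, max_lag):
    """Expand inclusive date ranges to daily rows and add lag 0..max_lag columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    cols = list(cols)
    out = df.reset_index(drop=True).copy()  # unique labels before repeating
    out["start_date"] = pd.to_datetime(out["start_date"])
    out["end_date"] = pd.to_datetime(out["end_date"])

    # One row per day in the range, plus max_lag trailing days.
    n_rows = out["end_date"].sub(out["start_date"]).dt.days + 1 + max_lag
    out = out.loc[out.index.repeat(n_rows)].copy()
    offset = out.groupby(level=0).cumcount()
    out["start_date"] = out["start_date"] + pd.to_timedelta(offset, unit="D")

    for lag in range(max_lag + 1):
        # Shifting up by (max_lag - lag) rows blanks the last (max_lag - lag)
        # days of each expanded range, which are out of reach for this lag.
        shifted = out.groupby(level=0)[cols].shift(lag - max_lag)
        for col in cols:
            out[f"{col}_lag_{lag}"] = shifted[col].to_numpy()

    return (out.drop(columns=cols + ["end_date"])
               .rename(columns={"start_date": "date"})
               .reset_index(drop=True))


# Usage with the dummy data from the question.
df = pd.DataFrame({
    "start_date": ["2020-01-01", "2020-01-05", "2020-01-01", "2020-01-06"],
    "end_date": ["2020-01-02", "2020-01-07", "2020-01-02", "2020-01-07"],
    "id": ["1", "1", "3", "3"],
    "feature1": ["111", "111", "333", "555"],
    "feature2": ["222", "222", "444", "666"],
})
result = add_lags(df, ["feature1", "feature2"], max_lag=2)
print(result)
```

Each source row is handled independently here, as in the answer above, so two ranges for the same id that overlap or sit closer together than `max_lag` days would produce duplicate dates; those would need to be merged in a separate step.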