首页 > 解决方案 > 熊猫每个给定的时间间隔插入行

问题描述

具有以下DF:

    id           init_time          event_time  value
0    0 2020-10-21 00:21:25 2020-10-21 00:21:30      1
1    0 2020-10-21 00:21:25 2020-10-21 00:21:35      3
2    0 2020-10-21 00:21:25 2020-10-21 00:23:25      7
3    1 2020-10-21 15:21:25 2020-10-21 15:21:30      3
4    1 2020-10-21 15:21:25 2020-10-21 15:22:25      6
5    1 2020-10-21 15:21:25 2020-10-21 16:21:25      9
6    1 2020-10-21 15:21:25 2020-10-21 18:21:25     10
7    1 2020-10-21 15:21:25 2020-10-21 18:30:25     10
8    2 2020-10-21 08:11:20 2020-10-21 08:21:25      1
9    2 2020-10-21 08:11:20 2020-10-21 09:11:20      2
10   2 2020-10-21 08:11:20 2020-10-21 09:21:25      3
11   2 2020-10-21 08:11:20 2020-10-21 10:12:12      4
12   2 2020-10-21 08:11:20 2020-10-21 10:30:00      5
13   2 2020-10-21 08:11:20 2020-10-21 11:50:48      6
14   2 2020-10-21 08:11:20 2020-10-21 11:59:15      7

我想按 分组id,然后基于init_time每隔 X 间隔插入一行并重复 Y 次。请注意,init_time这对于所有行都是相同的,并且event_time是创建行的时间。


例如,如果Xis1hYis5次,则结果 DF 将是:

    id           init_time          event_time  value
0    0 2020-10-21 00:21:25 2020-10-21 00:21:25     -- ⎫
1    0 2020-10-21 00:21:25 2020-10-21 00:21:30      1 ⎪
2    0 2020-10-21 00:21:25 2020-10-21 00:21:35      3 ⎬ 1H interval (00:21:25 - 01:21:25)
3    0 2020-10-21 00:21:25 2020-10-21 00:23:25      7 ⎪
4    0 2020-10-21 00:21:25 2020-10-21 01:21:25     -- ⎭
5    0 2020-10-21 00:21:25 2020-10-21 02:21:25     --  
6    0 2020-10-21 00:21:25 2020-10-21 03:21:25     --
7    0 2020-10-21 00:21:25 2020-10-21 04:21:25     --

8    1 2020-10-21 15:21:25 2020-10-21 15:21:25     -- ⎫
9    1 2020-10-21 15:21:25 2020-10-21 15:21:30      3 ⎪
10   1 2020-10-21 15:21:25 2020-10-21 15:22:25      6 ⎬ 1H interval (15:21:25 - 16:21:25)
11   1 2020-10-21 15:21:25 2020-10-21 16:21:25     -- ⎭
12   1 2020-10-21 15:21:25 2020-10-21 16:21:25      9
13   1 2020-10-21 15:21:25 2020-10-21 17:21:25     --
14   1 2020-10-21 15:21:25 2020-10-21 18:21:25     --
15   1 2020-10-21 15:21:25 2020-10-21 18:21:25     10
16   1 2020-10-21 15:21:25 2020-10-21 18:30:25     10
17   1 2020-10-21 15:21:25 2020-10-21 19:21:25     --

18   2 2020-10-21 08:11:20 2020-10-21 08:11:20     -- ⎫
19   2 2020-10-21 08:11:20 2020-10-21 08:21:25      1 ⎬ 1H interval (08:11:20 - 09:11:20)
20   2 2020-10-21 08:11:20 2020-10-21 09:11:20     -- ⎭
21   2 2020-10-21 08:11:20 2020-10-21 09:11:20      2
22   2 2020-10-21 08:11:20 2020-10-21 09:21:25      3
23   2 2020-10-21 08:11:20 2020-10-21 10:11:20     --
24   2 2020-10-21 08:11:20 2020-10-21 10:12:12      4
25   2 2020-10-21 08:11:20 2020-10-21 10:30:00      5
26   2 2020-10-21 08:11:20 2020-10-21 11:11:20     --
27   2 2020-10-21 08:11:20 2020-10-21 11:50:48      6
28   2 2020-10-21 08:11:20 2020-10-21 11:59:15      7
29   2 2020-10-21 08:11:20 2020-10-21 12:11:20     --

任何帮助,将不胜感激 :)

更新: 这是我到目前为止所做的:

df2 = pd.DataFrame()
groups = df.groupby('id')['init_time'].first()
for g in groups:
  tmp = pd.DataFrame(index=pd.date_range(g, periods=5, freq="H"))
  tmp['value'] = -1
  df2 = df2.append(tmp,ignore_index=False)
df2 = df2.reset_index(drop=False)
df2['event_time'] = df2.pop('index')

代码创建缺失的行:

    value          event_time
0      -1 2020-10-21 00:21:25
1      -1 2020-10-21 01:21:25
2      -1 2020-10-21 02:21:25
3      -1 2020-10-21 03:21:25
4      -1 2020-10-21 04:21:25
5      -1 2020-10-21 15:21:25
6      -1 2020-10-21 16:21:25
7      -1 2020-10-21 17:21:25
8      -1 2020-10-21 18:21:25
9      -1 2020-10-21 19:21:25
10     -1 2020-10-21 08:11:20
11     -1 2020-10-21 09:11:20
12     -1 2020-10-21 10:11:20
13     -1 2020-10-21 11:11:20
14     -1 2020-10-21 12:11:20

现在我需要弄清楚如何合并 DF,或者找到更好的选择

Update2:我正在寻找优化@Serge Ballesta 给出的解决方案,以获得性能时间。

标签: pythonpandas

解决方案


我找不到聪明的方法。所以我只提议:

  1. 每列获取一行数据(假设id是唯一的)idinit_time
  2. 添加一个event_time从 a 构建的新列date_range并连接所有内容
  3. 将其与初始数据框连接,排序并构建一个新索引

代码可能是(假设数据框是df并且xxx_time列具有datetime64[ns]dtypes):

# extract init_time per id
df1 = df[['id', 'init_time']].drop_duplicates()

# build the new event_time data
df2 = pd.concat(pd.DataFrame({'id': [s[1]['id']] * 5,
                              'init_time': [s[1]['init_time']] * 5,
                              'event_time': pd.date_range(s[1]['init_time'],
                                                          periods=5,
                                                          freq='1H')})
                for s in df1.iterrows())

# concat with the initial dataframe
final = pd.concat((df, df2)).sort_values(['id', 'event_time']).reset_index(
    drop=True)

使用您的示例数据,它提供:

    id           init_time          event_time  value
0    0 2020-10-21 00:21:25 2020-10-21 00:21:25    NaN
1    0 2020-10-21 00:21:25 2020-10-21 00:21:30    1.0
2    0 2020-10-21 00:21:25 2020-10-21 00:21:35    3.0
3    0 2020-10-21 00:21:25 2020-10-21 00:23:25    7.0
4    0 2020-10-21 00:21:25 2020-10-21 01:21:25    NaN
5    0 2020-10-21 00:21:25 2020-10-21 02:21:25    NaN
6    0 2020-10-21 00:21:25 2020-10-21 03:21:25    NaN
7    0 2020-10-21 00:21:25 2020-10-21 04:21:25    NaN
8    1 2020-10-21 15:21:25 2020-10-21 15:21:25    NaN
9    1 2020-10-21 15:21:25 2020-10-21 15:21:30    3.0
10   1 2020-10-21 15:21:25 2020-10-21 15:22:25    6.0
11   1 2020-10-21 15:21:25 2020-10-21 16:21:25    9.0
12   1 2020-10-21 15:21:25 2020-10-21 16:21:25    NaN
13   1 2020-10-21 15:21:25 2020-10-21 17:21:25    NaN
14   1 2020-10-21 15:21:25 2020-10-21 18:21:25   10.0
15   1 2020-10-21 15:21:25 2020-10-21 18:21:25    NaN
16   1 2020-10-21 15:21:25 2020-10-21 18:30:25   10.0
17   1 2020-10-21 15:21:25 2020-10-21 19:21:25    NaN
18   2 2020-10-21 08:11:20 2020-10-21 08:11:20    NaN
19   2 2020-10-21 08:11:20 2020-10-21 08:21:25    1.0
20   2 2020-10-21 08:11:20 2020-10-21 09:11:20    2.0
21   2 2020-10-21 08:11:20 2020-10-21 09:11:20    NaN
22   2 2020-10-21 08:11:20 2020-10-21 09:21:25    3.0
23   2 2020-10-21 08:11:20 2020-10-21 10:11:20    NaN
24   2 2020-10-21 08:11:20 2020-10-21 10:12:12    4.0
25   2 2020-10-21 08:11:20 2020-10-21 10:30:00    5.0
26   2 2020-10-21 08:11:20 2020-10-21 11:11:20    NaN
27   2 2020-10-21 08:11:20 2020-10-21 11:50:48    6.0
28   2 2020-10-21 08:11:20 2020-10-21 11:59:15    7.0
29   2 2020-10-21 08:11:20 2020-10-21 12:11:20    NaN

推荐阅读