Grouped time difference when a condition is met

Problem description

I'm working with structured log data shaped as follows (here is a pastebin snippet of mock data for easy tinkering):

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")
print(df)

     id        date  info_a_cnt  info_b_cnt  has_err
0   123  2020-01-01         123          32        0
1   123  2020-01-02           2          43        0
2   123  2020-01-03          43           4        1
3   123  2020-01-04          43           4        0
4   123  2020-01-05          43           4        0
5   123  2020-01-06          43           4        0
6   123  2020-01-07          43           4        1
7   123  2020-01-08          43           4        0
8   232  2020-01-04          56           4        0
9   232  2020-01-05          97           1        0
10  232  2020-01-06          23          74        0
11  232  2020-01-07          91          85        1
12  232  2020-01-08          91          85        0
13  232  2020-01-09          91          85        0
14  232  2020-01-10          91          85        1

The variables are pretty straightforward: id identifies the log group, date is the log date, info_a_cnt and info_b_cnt count two types of info events, and has_err flags the rows where an error occurred.

Now, I'd like to group the dataframe by id and create a variable that stores the number of days left until the next error event. The desired dataframe should look like this:

     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   232  2020-01-04          56           4        0            3
8   232  2020-01-05          97           1        0            2
9   232  2020-01-06          23          74        0            1
10  232  2020-01-07          91          85        1            0
11  232  2020-01-08          91          85        0            2
12  232  2020-01-09          91          85        0            1
13  232  2020-01-10          91          85        1            0

I'm having a hard time figuring out the right implementation with the appropriate grouping functions.

Edit:

All the answers below work perfectly when the dates have a daily granularity. I'm wondering how to adapt @jezrael's solution below to a dataframe that contains timestamps (the logs will be batched in 15-minute intervals):

df = pd.read_csv("https://pastebin.com/raw/YZukAhBz")
print(df)

     id                 date  info_a_cnt  info_b_cnt  has_err
0   123  2020-01-01 12:00:00         123          32        0
1   123  2020-01-01 12:15:00           2          43        0
2   123  2020-01-01 12:30:00          43           4        1
3   123  2020-01-01 12:45:00          43           4        0
4   123  2020-01-01 13:00:00          43           4        0
5   123  2020-01-01 13:15:00          43           4        0
6   123  2020-01-01 13:30:00          43           4        1
7   123  2020-01-01 13:45:00          43           4        0
8   232  2020-01-04 17:00:00          56           4        0
9   232  2020-01-05 17:15:00          97           1        0
10  232  2020-01-06 17:30:00          23          74        0
11  232  2020-01-07 17:45:00          91          85        1
12  232  2020-01-08 18:00:00          91          85        0
13  232  2020-01-09 18:15:00          91          85        0
14  232  2020-01-10 18:30:00          91          85        1

I'm wondering how to adjust @jezrael's answer to achieve the following:

     id                 date  info_a_cnt  info_b_cnt  has_err  mins_to_err
0   123  2020-01-01 12:00:00         123          32        0           30
1   123  2020-01-01 12:15:00           2          43        0           15
2   123  2020-01-01 12:30:00          43           4        1            0
3   123  2020-01-01 12:45:00          43           4        0           45
4   123  2020-01-01 13:00:00          43           4        0           30
5   123  2020-01-01 13:15:00          43           4        0           15
6   123  2020-01-01 13:30:00          43           4        1            0
7   123  2020-01-01 13:45:00          43           4        0           60
8   232  2020-01-04 17:00:00          56           4        0           45
9   232  2020-01-05 17:15:00          97           1        0           30
10  232  2020-01-06 17:30:00          23          74        0           15
11  232  2020-01-07 17:45:00          91          85        1            0
12  232  2020-01-08 18:00:00          91          85        0           30
13  232  2020-01-09 18:15:00          91          85        0           15
14  232  2020-01-10 18:30:00          91          85        1            0

Tags: python, pandas, dataframe, pandas-groupby

Solution


Use GroupBy.cumcount with ascending=False, grouping by the id column together with a helper Series built with Series.cumsum computed from the back of the frame, hence the added Series.iloc[::-1] indexing:

# reverse cumulative sum of the error flag: every row gets the number of errors
# at or below it, so rows leading up to the same error share one value
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
# within each (id, g) run, count rows from the bottom up: the error row gets 0
df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)
print(df)
     id        date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123  2020-01-01         123          32        0            2
1   123  2020-01-02           2          43        0            1
2   123  2020-01-03          43           4        1            0
3   123  2020-01-04          43           4        0            3
4   123  2020-01-05          43           4        0            2
5   123  2020-01-06          43           4        0            1
6   123  2020-01-07          43           4        1            0
7   123  2020-01-08          43           4        0            0
8   232  2020-01-04          56           4        0            3
9   232  2020-01-05          97           1        0            2
10  232  2020-01-06          23          74        0            1
11  232  2020-01-07          91          85        1            0
12  232  2020-01-08          91          85        0            2
13  232  2020-01-09          91          85        0            1
14  232  2020-01-10          91          85        1            0
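
For a closer look at how the rows are grouped, here is a minimal, self-contained sketch (using the daily mock data from the question) that prints the helper Series and also shows one way to flag rows with no error after them; the desired output in the question simply drops such rows (2020-01-08 for id 123), while the solution above counts them down to 0:

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa")

# every row gets the number of errors at or below it, so all rows leading up to
# the same error share one value
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())
# [4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1]

df['days_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False)

# optional: runs that contain no error row have no upcoming error, so mask them
# with NaN instead of letting them count down to 0
no_future_err = df.groupby(['id', g])['has_err'].transform('max').eq(0)
df['days_to_err'] = df['days_to_err'].mask(no_future_err)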

EDIT: To get the cumulative sum of the date differences, use a custom lambda function with GroupBy.transform:

# 'date' must already be a datetime column (e.g. read_csv(..., parse_dates=['date']));
# per group, accumulate the day differences from the start of the run, then flip
# the whole column so each row counts down to its error row
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.days.cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)
     id       date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01         123          32        0          2.0
1   123 2020-01-02           2          43        0          1.0
2   123 2020-01-03          43           4        1          0.0
3   123 2020-01-04          43           4        0          3.0
4   123 2020-01-05          43           4        0          2.0
5   123 2020-01-06          43           4        0          1.0
6   123 2020-01-07          43           4        1          0.0
7   123 2020-01-08          43           4        0          0.0
8   232 2020-01-04          56           4        0          3.0
9   232 2020-01-05          97           1        0          2.0
10  232 2020-01-06          23          74        0          1.0
11  232 2020-01-07          91          85        1          0.0
12  232 2020-01-08          91          85        0          2.0
13  232 2020-01-09          91          85        0          1.0
14  232 2020-01-10          91          85        1          0.0
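
The trailing .to_numpy()[::-1] flips the whole column at once, which matches the expected values for this sample. A variant that instead reverses the dates inside each group, so every row is measured directly against its own error row, could look like the sketch below (not part of the original answer; it assumes date has been parsed as a datetime column):

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/qrqTMrGa", parse_dates=['date'])

# within each (id, run) group, diff the dates from the error row backwards,
# then flip back so each row holds the number of days to that error
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.iloc[::-1].diff().dt.days.cumsum().iloc[::-1])
                       .abs()
                       .fillna(0))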

EDIT1: Use Series.dt.total_seconds and divide by 60:

# data sample cleaning: move every timestamp onto the same day so that
# consecutive rows within an id are 15 minutes apart
df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])
df['date'] = df['date'].apply(lambda x: x.replace(month=1, day=1))
print(df)

# same idea as above, but accumulate minute differences instead of day differences
df['days_to_err'] = (df.groupby(['id', df['has_err'].iloc[::-1].cumsum()])['date']
                       .transform(lambda x: x.diff().dt.total_seconds().div(60).cumsum())
                       .fillna(0)
                       .to_numpy()[::-1])
print(df)


     id                date  info_a_cnt  info_b_cnt  has_err  days_to_err
0   123 2020-01-01 12:00:00         123          32        0         30.0
1   123 2020-01-01 12:15:00           2          43        0         15.0
2   123 2020-01-01 12:30:00          43           4        1          0.0
3   123 2020-01-01 12:45:00          43           4        0         45.0
4   123 2020-01-01 13:00:00          43           4        0         30.0
5   123 2020-01-01 13:15:00          43           4        0         15.0
6   123 2020-01-01 13:30:00          43           4        1          0.0
7   123 2020-01-01 13:45:00          43           4        0          0.0
8   232 2020-01-01 17:00:00          56           4        0         45.0
9   232 2020-01-01 17:15:00          97           1        0         30.0
10  232 2020-01-01 17:30:00          23          74        0         15.0
11  232 2020-01-01 17:45:00          91          85        1          0.0
12  232 2020-01-01 18:00:00          91          85        0         30.0
13  232 2020-01-01 18:15:00          91          85        0         15.0
14  232 2020-01-01 18:30:00          91          85        1          0.0
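
Since the question states that logs are batched in strict 15-minute intervals, the minute count can also be obtained without any date arithmetic by reusing the cumcount idea from the first snippet and scaling by 15. A short sketch, using mins_to_err to match the desired column name and assuming the timestamped sample from the edit:

import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/YZukAhBz", parse_dates=['date'])

# one run of rows per upcoming error, exactly as in the daily solution
g = df['has_err'].iloc[::-1].cumsum().iloc[::-1]

# rows remaining until the next error, converted to minutes under the
# 15-minute batching assumption
df['mins_to_err'] = df.groupby(['id', g])['has_err'].cumcount(ascending=False) * 15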
