Pandas - Sort, group, and add columns that look back over a moving date window

Problem description

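I'm having a hard time understanding how to implement a moving window and create new columns with a binary value that checks a condition (i.e. whether each employee had a 1 in the past 1 day, 2 days, 3 days, and so on). I'm comfortable with sorting and grouping, but I'm not sure how to use the date column as a sliding window, given that some of the dates are formatted differently (e.g. "2nd Shift").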

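What I want is for it to look at the 'outcome' column and enter a 1 if there was at least one 1 in the past x days. It needs to look back x days in the correct date order, grouped by 'employee_id' (or, I suppose, look back x rows if the data is sorted correctly, which it appears to be).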

Here is a sample dataset:

import pandas as pd

employee_id = ['405395'] * 15 + ['405399'] * 15

time_et = ['04-03-2021 09:07 PM','04-04-2021 08:37 PM','04-05-2021 09:38 PM',
'04-10-2021 07:07 PM','04-11-2021 01:07 PM','04-12-2021 08:10 PM',
'04-06-2021 04:07 PM','04-08-2021 07:07 PM','04-09-2021 07:07 PM',
'04-13-2021 08:10 PM','04-14-2021 02:10 PM','04-14-2021 2nd Shift',
'04-17-2021 09:07 PM','04-18-2021 04:07 PM','04-19-2021 09:38 PM',
'04-03-2021 09:07 PM','04-04-2021 08:37 PM','04-05-2021 09:38 PM',
'04-06-2021 04:07 PM','04-08-2021 2nd Shift','04-08-2021 02:07 PM',
'04-10-2021 07:07 PM','04-11-2021 01:07 PM','04-12-2021 08:10 PM',
'04-13-2021 08:10 PM','04-14-2021 02:10 PM','04-16-2021 09:38 PM',
'04-17-2021 09:07 PM','04-18-2021 04:07 PM','04-19-2021 09:38 PM']

outcome =[1,1,0,0,0,1,0,1,0,0,0,1,0,0,0,
1,1,0,1,0,0,0,0,0,1,0,1,1,1,0]

df = pd.DataFrame({'employee_id':employee_id, 'time_et':time_et, 'outcome':outcome})

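sort_values only seems to put the rows partly in order, so there's that issue. I'm stuck on the groupby followed by the sliding window. I could do this by iterating, but that doesn't seem efficient for large datasets.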

df = df.sort_values(['employee_id', 'time_et']).reset_index(drop=True) seems to work:

print(df)
   employee_id               time_et  outcome
0       405395   04-03-2021 09:07 PM        1
1       405395   04-04-2021 08:37 PM        1
2       405395   04-05-2021 09:38 PM        0
3       405395   04-06-2021 04:07 PM        0
4       405395   04-08-2021 07:07 PM        1
5       405395   04-09-2021 07:07 PM        0
6       405395   04-10-2021 07:07 PM        0
7       405395   04-11-2021 01:07 PM        0
8       405395   04-12-2021 08:10 PM        1
9       405395   04-13-2021 08:10 PM        0
10      405395   04-14-2021 02:10 PM        0
11      405395  04-14-2021 2nd Shift        1
12      405395   04-17-2021 09:07 PM        0
13      405395   04-18-2021 04:07 PM        0
14      405395   04-19-2021 09:38 PM        0
15      405399   04-03-2021 09:07 PM        1
16      405399   04-04-2021 08:37 PM        1
17      405399   04-05-2021 09:38 PM        0
18      405399   04-06-2021 04:07 PM        1
19      405399   04-08-2021 02:07 PM        0
20      405399  04-08-2021 2nd Shift        0
21      405399   04-10-2021 07:07 PM        0
22      405399   04-11-2021 01:07 PM        0
23      405399   04-12-2021 08:10 PM        0
24      405399   04-13-2021 08:10 PM        1
25      405399   04-14-2021 02:10 PM        0
26      405399   04-16-2021 09:38 PM        1
27      405399   04-17-2021 09:07 PM        1
28      405399   04-18-2021 04:07 PM        1
29      405399   04-19-2021 09:38 PM        0

How can I get the next set of columns, to end up with something like this?

Note: see my attempt at the end. I almost have it.

print(df.to_string())
   employee_id               time_et  outcome  1_day  2_day  3_day  4_day  5_day
0       405395   04-03-2021 09:07 PM        1    NaN    NaN    NaN    NaN    NaN
1       405395   04-04-2021 08:37 PM        1    1.0    1.0    1.0    1.0    1.0
2       405395   04-05-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0
3       405395   04-06-2021 04:07 PM        0    0.0    1.0    1.0    1.0    1.0
4       405395   04-08-2021 07:07 PM        1    0.0    0.0    1.0    1.0    1.0
5       405395   04-09-2021 07:07 PM        0    1.0    1.0    1.0    1.0    1.0
6       405395   04-10-2021 07:07 PM        0    0.0    1.0    1.0    1.0    1.0
7       405395   04-11-2021 01:07 PM        0    0.0    0.0    1.0    1.0    1.0
8       405395   04-12-2021 08:10 PM        1    0.0    0.0    0.0    1.0    1.0
9       405395   04-13-2021 08:10 PM        0    1.0    1.0    1.0    1.0    1.0
10      405395   04-14-2021 02:10 PM        0    0.0    1.0    1.0    1.0    1.0
11      405395  04-14-2021 2nd Shift        1    0.0    0.0    1.0    1.0    1.0
12      405395   04-17-2021 09:07 PM        0    1.0    1.0    1.0    1.0    1.0
13      405395   04-18-2021 04:07 PM        0    0.0    1.0    1.0    1.0    1.0
14      405395   04-19-2021 09:38 PM        0    0.0    0.0    0.0    1.0    1.0
15      405399   04-03-2021 09:07 PM        1    NaN    NaN    NaN    NaN    NaN
16      405399   04-04-2021 08:37 PM        1    1.0    1.0    1.0    1.0    1.0
17      405399   04-05-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0
18      405399   04-06-2021 04:07 PM        1    0.0    1.0    1.0    1.0    1.0
19      405399   04-08-2021 02:07 PM        0    1.0    1.0    1.0    1.0    1.0
20      405399  04-08-2021 2nd Shift        0    0.0    1.0    1.0    1.0    1.0
21      405399   04-10-2021 07:07 PM        0    0.0    0.0    1.0    1.0    1.0
22      405399   04-11-2021 01:07 PM        0    0.0    0.0    0.0    1.0    1.0
23      405399   04-12-2021 08:10 PM        0    0.0    0.0    0.0    1.0    1.0
24      405399   04-13-2021 08:10 PM        1    0.0    0.0    0.0    0.0    1.0
25      405399   04-14-2021 02:10 PM        0    1.0    1.0    1.0    1.0    1.0
26      405399   04-16-2021 09:38 PM        1    0.0    1.0    1.0    1.0    1.0
27      405399   04-17-2021 09:07 PM        1    1.0    1.0    1.0    1.0    1.0
28      405399   04-18-2021 04:07 PM        1    1.0    1.0    1.0    1.0    1.0
29      405399   04-19-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0

My attempt: this almost gets it, except that the first row for each employee_id is not null. How can I make those null?

df = pd.DataFrame({'employee_id':employee_id, 'time_et':time_et, 'outcome':outcome})
df = df.sort_values(['employee_id','time_et']).reset_index(drop=True)

df['bool'] = df['outcome'].map({0: False, 1: True})

for x in [1,2,3,4,5]:
    df['%s_day' %x] = (df.groupby('employee_id')['bool'].rolling(x, min_periods=x).sum().shift(1).astype(bool).reset_index(0, drop=True))    
    df['%s_day' %x] = df['%s_day' %x].map({False: 0, True: 1})
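
A note on why the first rows are not null in the attempt above: NaN is truthy, so .astype(bool) turns the missing values left by rolling and shift into True before they are mapped to 1. In addition, .shift(1) is applied to the concatenated result rather than per employee, so values can carry over from one employee to the next. The first effect can be seen in a minimal sketch on a single illustrative Series (the values and the `kept` variable are assumptions for demonstration, not part of the original code):

```python
import pandas as pd

# A rolling sum with min_periods=2 leaves NaN where the window is incomplete,
# and shift(1) adds one more NaN at the top.
s = pd.Series([1, 0, 0, 1], dtype=float)
rolled = s.rolling(2, min_periods=2).sum().shift(1)

# NaN is truthy, so astype(bool) silently turns the missing values into True.
as_bool = rolled.astype(bool)

# One way to keep the gaps: compare against 0, then mask the missing positions.
kept = rolled.gt(0).astype(float).where(rolled.notna())

print(as_bool.tolist())
print(kept.tolist())
```

Here as_bool starts with True where rolled was NaN, while kept preserves the NaN entries.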

Tags: python, pandas, pandas-groupby, sliding-window

Solution


To use a rolling time frame, first set df.index to the time values so that the dataframe can be sorted chronologically.

df.index = pd.to_datetime(
  df.time_et.str.replace('2nd Shift', '11:59 PM', regex=False).rename('idx')
)
df = df.sort_values(['employee_id', 'idx'])
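
As a side note on the parsing step, the sketch below shows why sorting the raw strings is not reliable and how an explicit format could be passed to pd.to_datetime. The three sample values and the format string '%m-%d-%Y %I:%M %p' are assumptions based on the layout of time_et in the question; mapping "2nd Shift" to 11:59 PM follows the code above.

```python
import pandas as pd

# Small sample in the same layout as time_et; the values are illustrative.
raw = pd.Series([
    '04-08-2021 2nd Shift',
    '04-08-2021 02:07 PM',
    '04-08-2021 11:30 AM',
])

# Plain string sorting is lexicographic, so '02:07 PM' lands before '11:30 AM'.
string_order = raw.sort_values().tolist()

# Map the shift label to a clock time, then parse with an explicit format
# so that a malformed value raises instead of being guessed.
parsed = pd.to_datetime(
    raw.str.replace('2nd Shift', '11:59 PM', regex=False),
    format='%m-%d-%Y %I:%M %p',
)

# Reorder the original strings by their parsed timestamps.
time_order = raw.loc[parsed.sort_values().index].tolist()

print(string_order)
print(time_order)
```

Only the parsed order places the 11:30 AM entry before the 02:07 PM entry.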

Then use rolling and groupby to assign the additional columns in a loop.

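The relevant rolling arguments are:

  1. window: an integer specifying how many rows to look back for the start of the window
  2. closed: specifies whether the end of the window is included (in your case, the current row is excluded)
  3. min_periods: the minimum number of observations in the window required to produce a value. This defaults to the window value, so we have to set it manually.
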
for i in [1,2,3,4,5]:
  df[f'{i}_day'] = df.groupby('employee_id').rolling(
    window=i,
    min_periods=1,
    closed='left'
  ).outcome.max().values
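
To make the effect of closed='left' concrete, here is a minimal sketch on a single illustrative Series (the values are made up for demonstration). It assumes a pandas version that accepts closed on fixed-size windows, which to my knowledge means 1.2 or later.

```python
import pandas as pd

outcome = pd.Series([1, 0, 0, 1, 0])

# Default (closed='right'): the window ends at, and includes, the current row.
incl = outcome.rolling(window=2, min_periods=1).max()

# closed='left': the window covers only the 2 rows before the current row,
# so the first row has nothing to look at and stays NaN.
excl = outcome.rolling(window=2, min_periods=1, closed='left').max()

print(incl.tolist())  # [1.0, 1.0, 0.0, 1.0, 1.0]
print(excl.tolist())  # [nan, 1.0, 1.0, 0.0, 1.0]
```

With min_periods=1 the second row already gets a value from a single prior row; only the very first row of each group remains NaN.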

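This yields the following dataframe. You can drop the index afterwards, since the rolling computation has already been performed.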

                    employee_id               time_et  outcome  1_day  2_day  3_day  4_day  5_day
idx
2021-04-03 21:07:00      405395   04-03-2021 09:07 PM        1    NaN    NaN    NaN    NaN    NaN
2021-04-04 20:37:00      405395   04-04-2021 08:37 PM        1    1.0    1.0    1.0    1.0    1.0
2021-04-05 21:38:00      405395   04-05-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-06 16:07:00      405395   04-06-2021 04:07 PM        0    0.0    1.0    1.0    1.0    1.0
2021-04-08 19:07:00      405395   04-08-2021 07:07 PM        1    0.0    0.0    1.0    1.0    1.0
2021-04-09 19:07:00      405395   04-09-2021 07:07 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-10 19:07:00      405395   04-10-2021 07:07 PM        0    0.0    1.0    1.0    1.0    1.0
2021-04-11 13:07:00      405395   04-11-2021 01:07 PM        0    0.0    0.0    1.0    1.0    1.0
2021-04-12 20:10:00      405395   04-12-2021 08:10 PM        1    0.0    0.0    0.0    1.0    1.0
2021-04-13 20:10:00      405395   04-13-2021 08:10 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-14 14:10:00      405395   04-14-2021 02:10 PM        0    0.0    1.0    1.0    1.0    1.0
2021-04-14 23:59:00      405395  04-14-2021 2nd Shift        1    0.0    0.0    1.0    1.0    1.0
2021-04-17 21:07:00      405395   04-17-2021 09:07 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-18 16:07:00      405395   04-18-2021 04:07 PM        0    0.0    1.0    1.0    1.0    1.0
2021-04-19 21:38:00      405395   04-19-2021 09:38 PM        0    0.0    0.0    1.0    1.0    1.0
2021-04-03 21:07:00      405399   04-03-2021 09:07 PM        1    NaN    NaN    NaN    NaN    NaN
2021-04-04 20:37:00      405399   04-04-2021 08:37 PM        1    1.0    1.0    1.0    1.0    1.0
2021-04-05 21:38:00      405399   04-05-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-06 16:07:00      405399   04-06-2021 04:07 PM        1    0.0    1.0    1.0    1.0    1.0
2021-04-08 14:07:00      405399   04-08-2021 02:07 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-08 23:59:00      405399  04-08-2021 2nd Shift        0    0.0    1.0    1.0    1.0    1.0
2021-04-10 19:07:00      405399   04-10-2021 07:07 PM        0    0.0    0.0    1.0    1.0    1.0
2021-04-11 13:07:00      405399   04-11-2021 01:07 PM        0    0.0    0.0    0.0    1.0    1.0
2021-04-12 20:10:00      405399   04-12-2021 08:10 PM        0    0.0    0.0    0.0    0.0    1.0
2021-04-13 20:10:00      405399   04-13-2021 08:10 PM        1    0.0    0.0    0.0    0.0    0.0
2021-04-14 14:10:00      405399   04-14-2021 02:10 PM        0    1.0    1.0    1.0    1.0    1.0
2021-04-16 21:38:00      405399   04-16-2021 09:38 PM        1    0.0    1.0    1.0    1.0    1.0
2021-04-17 21:07:00      405399   04-17-2021 09:07 PM        1    1.0    1.0    1.0    1.0    1.0
2021-04-18 16:07:00      405399   04-18-2021 04:07 PM        1    1.0    1.0    1.0    1.0    1.0
2021-04-19 21:38:00      405399   04-19-2021 09:38 PM        0    1.0    1.0    1.0    1.0    1.0
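
Note that the integer windows above count rows, not calendar days. If "past x days" should mean elapsed time, one possible variation (an assumption on my part, not something the question confirms) is to pass an offset string such as '2D' as the window, which works on a DatetimeIndex. A minimal sketch with illustrative data for a single employee:

```python
import pandas as pd

# Illustrative data: one employee, with gaps between some of the dates.
df = pd.DataFrame(
    {'employee_id': ['A'] * 4, 'outcome': [1, 0, 0, 0]},
    index=pd.to_datetime([
        '2021-04-03 21:07', '2021-04-04 20:37',
        '2021-04-06 16:07', '2021-04-08 19:07',
    ]).rename('idx'),
)

# Row-based: the previous 2 rows, whatever their dates.
by_rows = df.groupby('employee_id').outcome.rolling(
    window=2, min_periods=1, closed='left'
).max().values

# Time-based: rows within the 2 days before the current timestamp
# (closed='left' excludes the current row itself).
by_days = df.groupby('employee_id').outcome.rolling(
    window='2D', min_periods=1, closed='left'
).max().values

print(by_rows)
print(by_days)
```

The two results differ wherever the previous rows are more than two days old: the row-based window still sees them, while the time-based window does not and yields NaN when no row falls inside it.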
