首页 > 解决方案 > Groupby 序列计数和序列持续时间

问题描述

我有如下所示的数据框。

   ID   Status      Date    Cost    Duration
0   1   F       2017-06-22  500     nan
1   1   M       2017-07-22  100     30.00
2   1   P       2017-10-22  100     92.00
3   1   F       2018-06-22  600     243.00
4   1   M       2018-08-22  150     61.00
5   1   P       2018-10-22  120     61.00
6   1   F       2019-03-22  750     151.00
7   2   M       2017-06-29  200     nan
8   2   F       2017-09-29  600     92.00
9   2   F       2018-01-29  500     122.00
10  2   M       2018-03-29  100     59.00
11  2   P       2018-08-29  100     153.00
12  2   M       2018-10-29  100     61.00
13  2   F       2018-12-29  500     61.00
14  3   M       2017-03-20  300     nan
15  3   F       2018-06-20  700     457.00
16  3   P       2018-08-20  100     61.00
17  3   M       2018-10-20  250     61.00
18  3   F       2018-11-20  100     31.00
19  3   P       2018-12-20  100     30.00
20  3   F       2019-03-20  600     90.00
22  4   M       2017-08-10  800     nan
23  4   F       2018-06-10  100     304.00
24  4   P       2018-08-10  120     61.00
25  4   F       2018-10-10  500     61.00
26  4   M       2019-01-10  200     92.00
27  4   F       2019-06-10  600     151.00

数据框已根据 ID 和日期排序

持续时间 = 该 ID 从上一个状态到当前状态所用的天数。

我想使用 pandas 函数准备下面的数据框。

ID  Nof_F-F  Nof_F-M  Nof_F-P Nof_M-F Nof_M-M  Nof_M-P Nof_P-F Nof_P-M
1   0        2        0       0       0        2       2       0
2   1        0        0       2       0        1       0       1
3   0        0        2       2       0        0       1       1
4   0        1        1       2       0        0       1       0

其中 Nof_F-F = 连续 FF 的数量,依此类推。

我也想准备以下数据框

    ID  Avg_F-F  Avg_F-M  Avg_F-P Avg_M-F Avg_M-M  Avg_M-P Avg_P-F Avg_P-M
    1   nan      45.5     nan     nan     nan      76.5    197     nan
    2   122      nan      nan     76.5    nan      153     nan     61
    3   nan      nan      45.5    244     nan      nan     90      61
    4   nan      92       61      227.5   nan      nan     61      nan

Avg_F-F = 连续 FF 的平均天数,所有其他列也是如此。

标签: pythonpandaspandas-groupby

解决方案


由于已经排序,请使用shift+where创建链接。然后groupby+unstack

df['link'] = df.Status.shift().where(df.ID.eq(df.ID.shift())) + '-' + df.Status

df.groupby(['ID', 'link']).agg('size').unstack().fillna(0)
#link  F-F  F-M  F-P  M-F  M-P  P-F  P-M
#ID                                     
#1     0.0  2.0  0.0  0.0  2.0  2.0  0.0
#2     1.0  1.0  0.0  2.0  1.0  0.0  1.0
#3     0.0  0.0  2.0  2.0  0.0  1.0  1.0
#4     0.0  1.0  1.0  2.0  0.0  1.0  0.0

df.groupby(['ID', 'link']).Duration.mean().unstack()
#link    F-F   F-M   F-P    M-F    M-P    P-F   P-M
#ID                                                
#1       NaN  45.5   NaN    NaN   76.5  197.0   NaN
#2     122.0  59.0   NaN   76.5  153.0    NaN  61.0
#3       NaN   NaN  45.5  244.0    NaN   90.0  61.0
#4       NaN  92.0  61.0  227.5    NaN   61.0   NaN

为确保所有链接都存在于输出中,请在取消reindex堆叠后使用。要么自己形成列表,要么使用unique来获取所有存在的值。

u = df.Status.unique()
cols = [f'{x}-{y}' for x in u for y in u]
#['F-F', 'F-M', 'F-P', 'M-F', 'M-M', 'M-P', 'P-F', 'P-M', 'P-P']

df.groupby(['ID', 'link']).Duration.mean().unstack().reindex(cols, axis=1)

#link    F-F   F-M   F-P    M-F  M-M    M-P    P-F   P-M  P-P
#ID                                                          
#1       NaN  45.5   NaN    NaN  NaN   76.5  197.0   NaN  NaN
#2     122.0  59.0   NaN   76.5  NaN  153.0    NaN  61.0  NaN
#3       NaN   NaN  45.5  244.0  NaN    NaN   90.0  61.0  NaN
#4       NaN  92.0  61.0  227.5  NaN    NaN   61.0   NaN  NaN

推荐阅读