首页 > 解决方案 > 使用 groupby 重采样聚合函数的正确方法

问题描述

我有一些数据,我试图先按“名称”分组,然后按“transaction_date”重新采样

transaction_date    name    revenue
01/01/2020          ADIB    30419
01/01/2020          ADIB    1119372
01/01/2020          ADIB    1272170
01/01/2020          ADIB    43822
01/01/2020          ADIB    24199

我遇到的问题是以两种不同的方式编写 groupby resample 返回两个不同的结果

1-- df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)

2-- df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)

第一种方法返回我正在寻找的值。

我不明白为什么这两种方法返回不同的结果。这是一个错误吗?

Result 1
name    transaction_date    revenue 
ADIB    2020-01-31          39170943.0
        2020-02-29          48003966.0
        2020-03-31          32691641.0
        2020-04-30          11979337.0
        2020-05-31          35510726.0
        2020-06-30          25677857.0
        2020-07-31          12437122.0
        2020-08-31          4348936.0
        2020-09-30          10547188.0
        2020-10-31          5287406.0
        2020-11-30          4288930.0
        2020-12-31          17066105.0

Result 2
name    transaction_date    revenue
ADIB    2020-01-31          64128331.0
        2020-02-29          54450014.0
        2020-03-31          45636192.0
        2020-04-30          25016777.0
        2020-05-31          11941744.0
        2020-06-30          15703151.0
        2020-07-31          5517526.0
        2020-08-31          4092618.0
        2020-09-30          4333433.0
        2020-10-31          3944117.0
        2020-11-30          6528058.0
        2020-12-31          5718196.0

标签: pythonpandaspandas-groupby

解决方案


Indeed, it's either a bug or an extremely strange behavior. Consider the following data:

input: 

        date   revenue name
0 2020-10-27  0.744045  n_1
1 2020-10-29  0.074852  n_1
2 2020-11-21  0.560182  n_2
3 2020-12-29  0.208616  n_2
4 2020-05-03  0.325044  n_0

gb = df.groupby("name").resample("M", on="date")

gb.aggregate({'revenue':'sum'})

==>
              revenue
name date                
n_0  2020-12-31  0.325044
n_1  2020-05-31  0.744045
     2020-06-30  0.000000
     2020-07-31  0.000000
     2020-08-31  0.000000
     2020-09-30  0.000000
     2020-10-31  0.074852
n_2  2020-10-31  0.560182
     2020-11-30  0.208616


print(gb.sum()[['revenue']])
==>
                  revenue
name date                
n_0  2020-05-31  0.325044
n_1  2020-10-31  0.818897
n_2  2020-11-30  0.560182
     2020-12-31  0.208616

As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.

Here's an even simpler example:

Data frame:

        date  revenue name
0 2020-02-24        9  n_1
1 2020-05-12        8  n_2
2 2020-03-28        9  n_2
3 2020-01-14        2  n_0

gb = df.groupby("name").resample("M", on="date")

res1 = gb.sum()[['revenue']]

==>
name date               
n_0  2020-01-31        2
n_1  2020-02-29        9
n_2  2020-03-31        9
     2020-04-30        0
     2020-05-31        8

res2 = gb.aggregate({'revenue':'sum'})

==>
name date               
n_0  2020-05-31        2
n_1  2020-01-31        9
n_2  2020-02-29        8
     2020-03-31        9

I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173


推荐阅读