python - 使用 groupby 重采样聚合函数的正确方法
问题描述
我有一些数据,我试图先按“名称”分组,然后按“transaction_date”重新采样
transaction_date name revenue
01/01/2020 ADIB 30419
01/01/2020 ADIB 1119372
01/01/2020 ADIB 1272170
01/01/2020 ADIB 43822
01/01/2020 ADIB 24199
我遇到的问题是以两种不同的方式编写 groupby resample 返回两个不同的结果
1-- df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)
2-- df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)
第一种方法返回我正在寻找的值。
我不明白为什么这两种方法返回不同的结果。这是一个错误吗?
Result 1
name transaction_date revenue
ADIB 2020-01-31 39170943.0
2020-02-29 48003966.0
2020-03-31 32691641.0
2020-04-30 11979337.0
2020-05-31 35510726.0
2020-06-30 25677857.0
2020-07-31 12437122.0
2020-08-31 4348936.0
2020-09-30 10547188.0
2020-10-31 5287406.0
2020-11-30 4288930.0
2020-12-31 17066105.0
Result 2
name transaction_date revenue
ADIB 2020-01-31 64128331.0
2020-02-29 54450014.0
2020-03-31 45636192.0
2020-04-30 25016777.0
2020-05-31 11941744.0
2020-06-30 15703151.0
2020-07-31 5517526.0
2020-08-31 4092618.0
2020-09-30 4333433.0
2020-10-31 3944117.0
2020-11-30 6528058.0
2020-12-31 5718196.0
解决方案
Indeed, it's either a bug or an extremely strange behavior. Consider the following data:
input:
date revenue name
0 2020-10-27 0.744045 n_1
1 2020-10-29 0.074852 n_1
2 2020-11-21 0.560182 n_2
3 2020-12-29 0.208616 n_2
4 2020-05-03 0.325044 n_0
gb = df.groupby("name").resample("M", on="date")
gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-12-31 0.325044
n_1 2020-05-31 0.744045
2020-06-30 0.000000
2020-07-31 0.000000
2020-08-31 0.000000
2020-09-30 0.000000
2020-10-31 0.074852
n_2 2020-10-31 0.560182
2020-11-30 0.208616
print(gb.sum()[['revenue']])
==>
revenue
name date
n_0 2020-05-31 0.325044
n_1 2020-10-31 0.818897
n_2 2020-11-30 0.560182
2020-12-31 0.208616
As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.
Here's an even simpler example:
Data frame:
date revenue name
0 2020-02-24 9 n_1
1 2020-05-12 8 n_2
2 2020-03-28 9 n_2
3 2020-01-14 2 n_0
gb = df.groupby("name").resample("M", on="date")
res1 = gb.sum()[['revenue']]
==>
name date
n_0 2020-01-31 2
n_1 2020-02-29 9
n_2 2020-03-31 9
2020-04-30 0
2020-05-31 8
res2 = gb.aggregate({'revenue':'sum'})
==>
name date
n_0 2020-05-31 2
n_1 2020-01-31 9
n_2 2020-02-29 8
2020-03-31 9
I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
推荐阅读
- asp.net-core-mvc - 从 wwwroot 文件夹显示 word 文档而不是下载:ASP.Net Core
- python - 在 python 中使用 matplotlib-basemap 制作地球仪
- javascript - 如何在生产中使用 deno
- ios - 通过 URL 在 Whatsapp 上分享图像
- javascript - 如何在反引号或反引号内转义嵌套的单引号
- kong - ruarocks 安装 kong 后未找到 kong 命令
- c# - 阅读某些网站时出现 HtmlAgilityPack 问题
- vba - 使用“isempty”删除表格行
- android - 无法添加到可变列表中
- spring-security - Jwt 弹簧安全匹配器