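python - Python: Calculating cumulative amounts over a period of time in a Pandas DataFrame
Problem description
Goal: compute the cumulative revenue since 2020-01-01.
I have a list of Python dictionaries, as shown below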
data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
{"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
{"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
{"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
{"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
{"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
{"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
{"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]
The same data viewed as a table
game_id user_id amt activity date
'Racing', 'ABC123', 5, '2020-01-01'
'Racing', 'ABC123', 1, '2020-01-04'
'Racing', 'CDE123', 1, '2020-01-04'
'DH', 'CDE123', 100, '2020-01-03'
'DH', 'CDE456', 10, '2020-01-02'
'DH', 'CDE789', 5, '2020-01-02'
'DH', 'CDE456', 1, '2020-01-03'
'DH', 'CDE456', 1, '2020-01-03'
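Age is the number of days between the transaction date and 2020-01-01. Total unique payers is the number of distinct paying users per game.
I am trying to create a data frame that holds the cumulative result for every day from the day of the first transaction through the day of the last transaction. For example: for game_id Racing, we start on 2020-01-01 with an amount of 5, so the age is 0. On 2020-01-02 the amount is still 5, because there were no transactions that day. On 2020-01-03 the amount is still 5. On 2020-01-04 the amount is 7, because there are two transactions of 1 each on that day.
Expected output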
Game Age Cum_rev Total_unique_payers_per_game
Racing 0 5 2
Racing 1 5 2
Racing 2 5 2
Racing 3 7 2
DH 0 0 3
DH 1 15 3
DH 2 117 3
DH 3 117 3
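How can I use window functions in Python the way we do in SQL? Is there a better way to solve this?
Solution
The tricky part here is filling in the missing dates. I used apply, but I am not sure it is the best approach.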
import pandas as pd
data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
{"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
{"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
{"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
{"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
{"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
{"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
{"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]
df = pd.DataFrame(data)
# we want datetime not object
df["date"] = pd.to_datetime(df["date"])
# we will need to merge this at the end
grp = df.groupby("game_id")['user_id']\
        .nunique()\
        .reset_index(name="Total_unique_payers_per_game")
# sum amt per game_id date
df = df.groupby(["game_id", "date"])["amt"].sum().reset_index()
# dates from 2020-01-01 till the max date in df
dates = pd.DataFrame({"date": pd.date_range("2020-01-01", df["date"].max())})
# add missing dates
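def expand_dates(x):
    # left-join onto the full calendar so that missing days show up as rows;
    # errors="ignore" covers pandas versions where apply does not pass game_id
    x = pd.merge(dates, x.drop(columns="game_id", errors="ignore"), how="left")
    # days without transactions contribute 0 revenue
    x["amt"] = x["amt"].fillna(0)
    return x
df = df.groupby("game_id")\
       .apply(expand_dates)\
       .reset_index().drop("level_1", axis=1)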
df["Cum_rev"] = df.groupby("game_id")['amt'].transform("cumsum")
# equivalent shorthand; both rely on rows being in date order within each game
# df["Cum_rev"] = df.groupby("game_id")['amt'].cumsum()
# merge unique payers per game
df = pd.merge(df, grp, how="left")
# Age = number of days elapsed since 2020-01-01
df["Age"] = (df["date"] - pd.Timestamp("2020-01-01")).dt.days
# finally, keep only the columns of interest
df = df[["game_id", "Age",
         "Cum_rev", "Total_unique_payers_per_game"]]\
       .rename(columns={"game_id":"Game"})
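Since I was unsure whether apply is the best way to fill in the dates, here is a minimal sketch of a variant that avoids it. It builds every (game, day) pair with pd.MultiIndex.from_product and reindexes the daily totals onto it. The names START, daily, full_index, payers, out and result are my own and not part of the question, and the groupby cumsum is only a rough analogue of SQL's SUM(...) OVER (PARTITION BY ... ORDER BY ...), not an exact equivalent of every window function.

```python
import pandas as pd

data = [
    {"game_id": "Racing", "user_id": "ABC123", "amt": 5, "date": "2020-01-01"},
    {"game_id": "Racing", "user_id": "ABC123", "amt": 1, "date": "2020-01-04"},
    {"game_id": "Racing", "user_id": "CDE123", "amt": 1, "date": "2020-01-04"},
    {"game_id": "DH", "user_id": "CDE123", "amt": 100, "date": "2020-01-03"},
    {"game_id": "DH", "user_id": "CDE456", "amt": 10, "date": "2020-01-02"},
    {"game_id": "DH", "user_id": "CDE789", "amt": 5, "date": "2020-01-02"},
    {"game_id": "DH", "user_id": "CDE456", "amt": 1, "date": "2020-01-03"},
    {"game_id": "DH", "user_id": "CDE456", "amt": 1, "date": "2020-01-03"},
]

# Assumed reference date, taken from the question's stated goal.
START = pd.Timestamp("2020-01-01")

df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])

# Daily revenue per game, indexed by (game_id, date).
daily = df.groupby(["game_id", "date"])["amt"].sum()

# Every (game, day) pair from START to the last transaction date overall.
full_index = pd.MultiIndex.from_product(
    [df["game_id"].unique(), pd.date_range(START, df["date"].max())],
    names=["game_id", "date"],
)

# Days without transactions get 0, so the running total carries forward.
out = daily.reindex(full_index, fill_value=0).reset_index()

# Rough analogue of SUM(amt) OVER (PARTITION BY game_id ORDER BY date).
out["Cum_rev"] = out.groupby("game_id")["amt"].cumsum()
out["Age"] = (out["date"] - START).dt.days

# Distinct payers per game, mapped back onto every row of that game.
payers = df.groupby("game_id")["user_id"].nunique()
out["Total_unique_payers_per_game"] = out["game_id"].map(payers)

result = out.rename(columns={"game_id": "Game"})[
    ["Game", "Age", "Cum_rev", "Total_unique_payers_per_game"]
]
print(result)
```

On the sample data this gives the same Cum_rev, Age and payer counts as the expected output in the question. Because missing days are filled with fill_value=0 rather than NaN, amt stays an integer column. Whether this is faster than the apply version on large data is something you would need to measure.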