python - Counting values that are not NaN in pandas groups
Problem description
This is a follow-up to this question.
I have a df like this:
country product date_install date_purchase date_authentication user_id
BR yearly 2020-11-01 2020-11-01 2020-11-01 10660236
CA monthly 2020-11-01 trialed trialed 0649441
US yearly 2020-11-01 trialed 2020-11-01 10660272
IT monthly 2020-11-01 2020-11-01 2020-11-01 10657634
AE monthly 2020-11-01 2020-11-01 2020-11-01 10661442
IT monthly 2020-11-01 trialed trialed 10657634
AE monthly 2020-11-01 trialed 2020-11-05 10661442
I am trying to get:
country product date_install installs purchases registrations ratio
US daily 2021-02-05 100 20 30 0.2
US monthly 2021-02-05 100 50 40 0.5
US yearly 2021-02-05 100 50 20 0.5
US trialed 2021-02-05 100 0 45 0
# the next day
US daily 2021-02-06 500 50 300 0.1
US monthly 2021-02-06 500 100 267 0.2
US yearly 2021-02-06 500 250 123 0.5
US trialed 2021-02-06 500 0 312 0
# the rest of the countries & the rest of the days
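I am trying to get the purchases/installs ratio, as well as the actual number of installs, registrations and purchases, per country, product and date. date_install is the install date, date_authentication is the registration date, and date_purchase gives the purchase date and shows that a purchase has happened. A trialed value in date_purchase means that the user with that user_id made no purchase; trialed in date_authentication means that the user has not registered.
The installs count needs to be the sum of all installs for that day, and the same goes for registrations.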
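The column semantics above can be checked on the seven sample rows from the question. This is only a minimal sketch: the names `sample`, `cleaned` and `event_counts` are made up for illustration, and `user_id` is kept as a string so the leading zero in `0649441` survives.

```python
import pandas as pd

# The seven sample rows from the question, typed in by hand.
cols = ["country", "product", "date_install",
        "date_purchase", "date_authentication", "user_id"]
rows = [
    ("BR", "yearly", "2020-11-01", "2020-11-01", "2020-11-01", "10660236"),
    ("CA", "monthly", "2020-11-01", "trialed", "trialed", "0649441"),
    ("US", "yearly", "2020-11-01", "trialed", "2020-11-01", "10660272"),
    ("IT", "monthly", "2020-11-01", "2020-11-01", "2020-11-01", "10657634"),
    ("AE", "monthly", "2020-11-01", "2020-11-01", "2020-11-01", "10661442"),
    ("IT", "monthly", "2020-11-01", "trialed", "trialed", "10657634"),
    ("AE", "monthly", "2020-11-01", "trialed", "2020-11-05", "10661442"),
]
sample = pd.DataFrame(rows, columns=cols)

# Turn the 'trialed' marker into a missing value so that count() skips it.
marker_cols = ["date_purchase", "date_authentication"]
cleaned = sample.copy()
cleaned[marker_cols] = cleaned[marker_cols].mask(cleaned[marker_cols].eq("trialed"))

# count() returns the number of non-missing cells per column.
event_counts = cleaned[["date_install", "date_authentication", "date_purchase"]].count()
print(event_counts)
```

On this sample every row is an install, five rows carry a registration date and three carry a purchase date, which matches the expected ordering installs > registrations > purchases.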
After using jezrael's answer and trying to adapt it:
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
df['date_authentication'] = df['date_authentication'].replace('trialed', np.nan)
print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())
# 2496159
# 112535
# 24311
exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count'),registrations = ('date_authentication','count')))
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp
But exp has the same count for every metric, while clearly installs > registrations > purchases:
print(exp['installs'].count())
print(exp['purchases'].count())
print(exp['registrations'].count())
# 5035
# 5035
# 5035
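A likely reading of these identical numbers: `Series.count()` returns the number of non-missing rows in a column, and after `groupby(...).agg(...)` every column has exactly one value per group, so all three calls report the number of groups rather than the number of events. The sketch below uses a made-up three-row frame named `agg_demo` (not data from the question) to show the difference between `count()` and `sum()`.

```python
import pandas as pd

# Hypothetical miniature of an aggregated frame: one row per group,
# each cell already holding a per-group count.
agg_demo = pd.DataFrame({
    "installs": [100, 100, 500],
    "purchases": [20, 0, 50],
    "registrations": [30, 45, 300],
})

row_counts = agg_demo.count()  # non-missing rows per column: the number of groups
totals = agg_demo.sum()        # adds up the per-group counts themselves

print(row_counts)
print(totals)
```

Here `count()` gives 3 for every column because there are three groups, while `sum()` gives a different total per metric.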
Where is my mistake? For each country, product, date_install, I am trying to count the number of date_x events (install, registration, purchase), as recorded by the values in date_install, date_authentication and date_purchase, where the value is a date and not nan / trialed.
Update
print(exp.isna().sum())
country 0
product 0
date_install 0
date_authentication 0
installs 0
purchases 0
registrations 0
ratio_from_install 0
ratio_from_registration 0
sum returns what is expected:
print(exp['installs'].sum())
print(exp['registrations'].sum())
print(exp['purchases'].sum())
# 143090
# 95860
# 13136
exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','sum'), purchases = ('date_purchase','sum'),registrations = ('date_authentication','sum')))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
How can I get the real numbers with sum?
Solution
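No answer text survives here, so what follows is only one possible approach, not the accepted answer. The TypeError most likely arises because 'sum' tries to add the date strings themselves. One way around it is to turn each date column into a boolean "event happened" flag and sum the flags per group. The frame `df_small`, the helper `is_event` and the flag columns `purchased` and `registered` are hypothetical names introduced for this sketch.

```python
import pandas as pd

# Hypothetical miniature of the question's data (one country, one install day).
df_small = pd.DataFrame({
    "country": ["US", "US", "US", "US"],
    "product": ["yearly", "yearly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 4,
    "date_purchase": ["2020-11-01", "trialed", "trialed", "2020-11-02"],
    "date_authentication": ["2020-11-01", "2020-11-01", "trialed", "2020-11-02"],
})

def is_event(col: pd.Series) -> pd.Series:
    """True where the cell holds a real value, i.e. neither 'trialed' nor NaN."""
    return col.ne("trialed") & col.notna()

flags = df_small.assign(
    purchased=is_event(df_small["date_purchase"]),
    registered=is_event(df_small["date_authentication"]),
)

# Summing booleans counts the True values, so no string addition is involved.
out = flags.groupby(["country", "product", "date_install"]).agg(
    installs=("purchased", "size"),
    purchases=("purchased", "sum"),
    registrations=("registered", "sum"),
)

# As in the question, installs and registrations are totals per country and day.
per_day = ["country", "date_install"]
out["installs"] = out.groupby(per_day)["installs"].transform("sum")
out["registrations"] = out.groupby(per_day)["registrations"].transform("sum")
out["ratio"] = out["purchases"].div(out["installs"])
out = out.reset_index()
print(out)
```

Because installs and registrations are repeated on every product row of the same day, calling `out['installs'].sum()` afterwards would count each day's total once per product; `out['purchases'].sum()` is not affected by this.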