Counting values that are not NaN in pandas groups

Problem description

This is a follow-up to this question.

I have a df like this:

country  product  date_install  date_purchase  date_authentication  user_id
BR       yearly   2020-11-01    2020-11-01     2020-11-01           10660236
CA       monthly  2020-11-01    trialed        trialed              0649441
US       yearly   2020-11-01    trialed        2020-11-01           10660272
IT       monthly  2020-11-01    2020-11-01     2020-11-01           10657634
AE       monthly  2020-11-01    2020-11-01     2020-11-01           10661442
IT       monthly  2020-11-01    trialed        trialed              10657634
AE       monthly  2020-11-01    trialed        2020-11-05           10661442

I am trying to get:

country  product  date_install  installs  purchases  registrations  ratio
US       daily    2021-02-05    100       20         30             0.2
US       monthly  2021-02-05    100       50         40             0.5
US       yearly   2021-02-05    100       50         20             0.5
US       trialed  2021-02-05    100       0          45             0
# the next day
US       daily    2021-02-06    500       50         300            0.1
US       monthly  2021-02-06    500       100        267            0.2
US       yearly   2021-02-06    500       250        123            0.5
US       trialed  2021-02-06    500       0          312            0
# the rest of the countries & the rest of the days

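I am trying to get the purchases/installs ratio, together with the actual number of installs, registrations, and purchases, for each country, product, and date. date_install is the install date, date_authentication is the registration date, and date_purchase gives the purchase date and indicates that a purchase took place. A trialed value in date_purchase means the user with that user_id made no purchase, and trialed in date_authentication means the user has not registered.

The installs count needs to be the total number of installs on that day, and the same applies to registrations.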
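
To make the "total for the day" requirement concrete, here is a minimal sketch with made-up numbers. The frame per_product and the column installs_day_total are hypothetical names used only for illustration; the point is that transform('sum') repeats the daily total on every product row, as in the desired output above.

```python
import pandas as pd

# Hypothetical per-product install counts (illustrative values only).
per_product = pd.DataFrame({
    "country": ["US", "US", "US", "CA"],
    "product": ["daily", "monthly", "yearly", "monthly"],
    "date_install": ["2021-02-05"] * 4,
    "installs": [10, 30, 60, 5],
})

# transform('sum') returns one value per input row, so the total for each
# (country, date_install) pair is repeated on every product row.
per_product["installs_day_total"] = (
    per_product.groupby(["country", "date_install"])["installs"].transform("sum")
)

print(per_product)
```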

After using jezrael's answer and trying to adapt it:

import numpy as np

# Treat the 'trialed' placeholder as a missing value
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
df['date_authentication'] = df['date_authentication'].replace('trialed', np.nan)

print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())

# 2496159
# 112535
# 24311

exp = (
    df.groupby(['country', 'product', 'date_install'])
      .agg(
          installs=('date_purchase', 'size'),
          purchases=('date_purchase', 'count'),
          registrations=('date_authentication', 'count'),
      )
)
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp

But exp has the same count for every metric, whereas clearly installs > registrations > purchases:

print(exp['installs'].count())
print(exp['purchases'].count())
print(exp['registrations'].count())

# 5035
# 5035
# 5035
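
A likely explanation, offered as a sketch rather than a diagnosis of the real data: calling .count() on a column of the aggregated frame counts its non-null rows, that is, the number of groups, so every column reports the same figure. The per-group values themselves can still differ, which is what .sum() shows. The frame exp_demo below is a hypothetical stand-in for exp.

```python
import pandas as pd

# Hypothetical aggregated frame shaped like exp (three groups, made-up values).
exp_demo = pd.DataFrame({
    "installs": [3, 3, 2],
    "purchases": [1, 0, 1],
    "registrations": [2, 2, 1],
})

metrics = ["installs", "purchases", "registrations"]

# count() counts non-null rows per column: one row per group, so all equal.
row_counts = exp_demo[metrics].count()

# sum() adds up the values stored in each column, so the metrics differ.
totals = exp_demo[metrics].sum()

print(row_counts.tolist())
print(totals.tolist())
```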

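Where is my mistake? I am trying to count the number of install, registration, and purchase events for each country, product, date_install, where an event is recorded by a value in date_install, date_authentication, or date_purchase that is a date rather than nan / trialed.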

Update

print(exp.isna().sum())
country                    0
product                    0
date_install               0
date_authentication        0
installs                   0
purchases                  0
registrations              0
ratio_from_install         0
ratio_from_registration    0

sum returns what I expect:

print(exp['installs'].sum())
print(exp['registrations'].sum())
print(exp['purchases'].sum())

# 143090
# 95860
# 13136

exp = (
    df.groupby(['country', 'product', 'date_install'])
      .agg(
          installs=('date_purchase', 'sum'),
          purchases=('date_purchase', 'sum'),
          registrations=('date_authentication', 'sum'),
      )
)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

How can I get the real numbers with sum?

Tags: python, pandas

Solution
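
One possible approach, sketched under assumptions and not taken from an accepted answer: the 'sum' aggregation in the question is applied to columns that hold date strings, so pandas tries to add the strings instead of counting them, which would explain the TypeError. Converting each date column to a boolean "event happened" flag first makes 'sum' count the events. The sample rows, df_demo, and the flag columns purchased and registered are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample rows mirroring the question's columns.
df_demo = pd.DataFrame({
    "country": ["BR", "CA", "US", "IT", "IT"],
    "product": ["yearly", "monthly", "yearly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 5,
    "date_purchase": ["2020-11-01", "trialed", "trialed", "2020-11-01", "trialed"],
    "date_authentication": ["2020-11-01", "trialed", "2020-11-01", "2020-11-01", "trialed"],
})

# Turn the 'trialed' placeholder into NaN, as in the question.
cols = ["date_purchase", "date_authentication"]
df_demo[cols] = df_demo[cols].replace("trialed", np.nan)

# Boolean flags: True where an event date exists. Summing booleans counts Trues.
flags = df_demo.assign(
    purchased=df_demo["date_purchase"].notna(),
    registered=df_demo["date_authentication"].notna(),
)

out = (
    flags.groupby(["country", "product", "date_install"])
    .agg(
        installs=("date_install", "size"),    # every row is one install
        purchases=("purchased", "sum"),       # rows with a purchase date
        registrations=("registered", "sum"),  # rows with a registration date
    )
    .reset_index()
)
out["ratio"] = out["purchases"] / out["installs"]

print(out)
```

In this sketch installs and registrations are counted per country, product, and date_install. If they should instead be totals per country and day, the transform('sum') step from the question can be applied to out afterwards.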

