首页 > 解决方案 > Dplyr 将活动日期数据重塑为每月级别

问题描述

我有一个包含用户 ID、订阅开始和当前月份、活动日期和活动编号的 df。如果用户有多个活动,他们可以出现多次。下面是一个简短的玩具示例:

USER_ID      SUB_START     CURRENT_MONTH     ACTIVITY_DATE    ACTIVITY_NUMBER
  0102       2020-04-01     2020-08-01        2020-02-05            1
  0102       2020-04-01     2020-08-01        2020-03-10            2
  0102       2020-04-01     2020-08-01        2020-07-01            3
  2190       2019-05-10     2020-08-01        2017-01-02            1
  2190       2019-05-10     2020-08-01        2017-10-02            2
  0121       2020-07-13     2020-08-01        2018-01-04            1
  0121       2020-07-13     2020-08-01        2019-02-10            2
  0121       2020-07-13     2020-08-01        2020-01-02            3
  0121       2020-07-13     2020-08-01        2020-04-10            4

我想要完成的是按月分组,然后显示该月具有有效订阅的唯一 ID 的数量以及该月前 13 个月内具有活动日期的唯一 ID 的数量。所以这个玩具数据集的输出看起来像:

  MONTH       ACTIVE_COUNT    ACTIVITY_COUNT
2019-05-01         1               0          *user 2190 active with no activity within past 13 mo
2019-06-01         1               0          *user 2190 active with no activity within past 13 mo
2019-07-01         1               0          *user 2190 active with no activity within past 13 mo
2019-08-01         1               0          *user 2190 active with no activity within past 13 mo
2019-09-01         1               0          *user 2190 active with no activity within past 13 mo
2019-10-01         1               0          *user 2190 active with no activity within past 13 mo
2019-11-01         1               0          *user 2190 active with no activity within past 13 mo
2019-12-01         1               0          *user 2190 active with no activity within past 13 mo
2020-01-01         1               0          *user 2190 active with no activity within past 13 mo
2020-02-01         1               0          *user 2190 active with no activity within past 13 mo
2020-03-01         1               0          *user 2190 active with no activity within past 13 mo
2020-04-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-05-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-06-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-07-01         3               2          *user 2190,0102,0121 all active and 0102 and 0121 have qualifying activities

到目前为止,我已经根据以前的项目汇总了以下代码,该项目为我提供了每个用户以及每个月在他们的 SUB_START 和 CURRENT_MONTH 之间的一行。问题是它为每个 ACTIVITY_DATE 重复该过程,因此每个 USER_ID 都有多个活动月份组。我希望尝试为每个用户活跃的每个月设置一行,然后添加一列以说明该用户在该月的 13 个月内是否有 ACTIVITY_DATE。

df_monthly <- df %>%
                  select(USER_ID,SUB_START, CURRENT_MONTH, ACTIVITY_DATE) %>%
                  mutate(across(where(is.character), ~ floor_date(as.Date(.x) - 1, "months") + 1)) %>%
                  rowwise() %>%
                  mutate(MONTH = list(seq(SUB_START,CURRENT_MONTH, by = "+1 month"))) %>%
                  unnest(MONTH) %>%
                  mutate(MONTH2 = floor_date(MONTH, unit="month"))

标签: rdplyr

解决方案


您需要循环“当前月份”,然后对于其中的每一个,计算每个用户的最后一次活动,最后计算订阅和最近活动的用户数。

这应该做大约你想要的:

library(tidyverse)
library(lubridate)
# recreate your dataframe
df <- "USER_ID,      SUB_START,     CURRENT_MONTH ,   ACTIVITY_DATE ,   ACTIVITY_NUMBER
0102,       2020-04-01,     2020-08-01,        2020-02-05            ,1
0102,       2020-04-01,     2020-08-01 ,       2020-03-10            ,2
0102,       2020-04-01,     2020-08-01,      2020-07-01            ,3
2190,       2019-05-10,     2020-08-01,        2017-01-02            ,1
2190,       2019-05-10,     2020-08-01,        2017-10-02            ,2
0121,       2020-07-13,     2020-08-01,        2018-01-04            ,1
0121,       2020-07-13,     2020-08-01,        2019-02-10            ,2
0121,       2020-07-13,     2020-08-01,        2020-01-02            ,3
0121,       2020-07-13,     2020-08-01,        2020-04-10            ,4" %>%
  str_remove_all(" ") %>%
  read_csv()

seq(min(df$SUB_START), max(df$CURRENT_MONTH), by = "+1 month") %>%
  map_dfr(~ df %>%
            group_by(USER_ID, SUB_START) %>%
            summarize(LAST_ACTIVITY = max(ACTIVITY_DATE), .groups="drop") %>%
            mutate(TIME_SINCE_LAST = .x - LAST_ACTIVITY) %>%
            summarize(n_users_subscribed = sum(SUB_START <= .x),
                      n_recently_active = sum(TIME_SINCE_LAST < dmonths(13) &
                                                TIME_SINCE_LAST >= 0)) %>%
            add_column(month = .x)
  )

与您的示例数据的一个区别是,我不计算 2020-07-01 的用户 0121,因为他们在 13 日加入,您可能需要进行四舍五入(floor_date也许在处理之前应用您的?)。

注意:您的嵌套方法也应该有效,我无法尝试(可能是因为您在读取数据框时有字符),但您可能只需要在取消嵌套之前预处理嵌套的数据框,以仅保留每个用户的最后一个活动日期。


推荐阅读