Summarise a tibble to multi-row output

Question

Suppose I have the following tibble in R:

activation_date | country | campaign | revenue | users
======================================================
1               | 1       | 1        | R_1     | U_1
2               | 1       | 1        | R_2     | U_2
3               | 1       | 1        | R_3     | U_3
1               | 1       | 2        | R_4     | U_4
2               | 1       | 2        | R_5     | U_5
3               | 1       | 2        | R_6     | U_6
1               | 2       | 3        | R_7     | U_7
2               | 2       | 3        | R_8     | U_8
3               | 2       | 3        | R_9     | U_9

I want to group this tibble by country and summarise its data so that I get this tibble as the output:

country | campaign | ltv
==========================
1       | 1        | ltv_1
1       | 2        | ltv_2
2       | 3        | ltv_3

However, I want ltv_1 and ltv_2 to be computed jointly, each using all of R_1 through R_6 and U_1 through U_6, and ltv_3 to be computed using R_7 through R_9 and U_7 through U_9.

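I can't group_by "country" and summarise, because that drops the "campaign" column, which I want to keep. But I also can't group_by both "country" and "campaign", because then I couldn't use the first three rows to help compute ltv_2, nor rows four to six to help compute ltv_1.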
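To make the dilemma concrete, here is a small sketch using the same values as the example input tibble shown later in this question. It assumes the dplyr package is installed; the names by_country and by_both are made up for illustration.

```r
library(dplyr)

# Same values as the example input tibble in this question
test_tibble <- tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19
)

# Grouping by country alone: one row per country, the campaign column is gone
by_country <- test_tibble %>%
  group_by(country) %>%
  summarise(ltv = sum(revenue) / sum(users))

# Grouping by both columns: campaign is kept, but each ratio
# only sees the rows of its own campaign
by_both <- test_tibble %>%
  group_by(country, campaign) %>%
  summarise(ltv = sum(revenue) / sum(users)) %>%
  ungroup()

names(by_country)
names(by_both)
```

Neither result is what is wanted: the first has lost "campaign", and the second cannot mix in the country-wide totals.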

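One possible approach is to group by "country" and use the group_modify function to generate the output tibble for each group. However, that function is in the "experimental" lifecycle stage, so I would rather not rely on it too heavily. Is there a different, established way to do this?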


An example input tibble is:

# A tibble: 9 x 5
  activation_date country campaign revenue users
            <dbl>   <dbl>    <dbl>   <dbl> <dbl>
1               1       1        1       1    11
2               2       1        1       2    12
3               3       1        1       3    13
4               1       1        2       4    14
5               2       1        2       5    15
6               3       1        2       6    16
7               1       2        3       7    17
8               2       2        3       8    18
9               3       2        3       9    19

Its output is:

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444

The code that generates it using the group_modify function is:

library (dplyr)  # provides %>%, group_by and group_modify; also re-exports tribble and tibble

test_tibble = tribble (~ activation_date, ~ country, ~ campaign, ~ revenue, ~ users,
                       1, 1, 1, 1, 11,
                       2, 1, 1, 2, 12,
                       3, 1, 1, 3, 13,
                       1, 1, 2, 4, 14,
                       2, 1, 2, 5, 15,
                       3, 1, 2, 6, 16,
                       1, 2, 3, 7, 17,
                       2, 2, 3, 8, 18,
                       3, 2, 3, 9, 19)

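# Called once per country; returns one row per campaign in that country.
# activation_date is accepted but not used in the calculation.
test_function = function (activation_date, campaign, revenue, users) {
    # revenue per user across every campaign of the country
    total_ltv = sum (revenue) / sum (users)
    campaign_ltv = double (0)
    campaign_names = unique (campaign)
    # revenue per user of each individual campaign
    for (c in campaign_names) {
        campaign_ltv = c (campaign_ltv, sum (revenue [campaign == c]) / sum (users [campaign == c]))
    }

    # final ltv: equal-weight average of the campaign ratio and the country ratio
    return (tibble (campaign = campaign_names,
                    ltv = campaign_ltv / 2 + total_ltv / 2))
}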


test_tibble %>%
    group_by (country) %>%
    group_modify (~ test_function (.x$activation_date, .x$campaign, .x$revenue, .x$users)) %>%
    ungroup
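
As a sanity check on what test_function computes, the arithmetic for country 1 can be reproduced by hand with the example numbers: each campaign's ltv is the average of its own revenue-per-user ratio and the country-wide ratio. This is only a base R sketch, and the variable names are made up for illustration.

```r
# Country 1, campaign 1: revenue 1, 2, 3 and users 11, 12, 13
# Country 1, campaign 2: revenue 4, 5, 6 and users 14, 15, 16
campaign_1_ltv <- sum(1:3) / sum(11:13)   # 6 / 36
campaign_2_ltv <- sum(4:6) / sum(14:16)   # 15 / 45
total_ltv      <- sum(1:6) / sum(11:16)   # 21 / 81

# Equal-weight blend, as in test_function
ltv_1 <- campaign_1_ltv / 2 + total_ltv / 2
ltv_2 <- campaign_2_ltv / 2 + total_ltv / 2

round(c(ltv_1, ltv_2), 3)
# [1] 0.213 0.296
```

Country 2 has a single campaign, so its campaign ratio and country ratio coincide at 24 / 54, which rounds to 0.444.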

Tags: r, dplyr, tibble

Answer


Option 1) -

A somewhat verbose but transparent approach is to use joins. That said, it is not all that verbose considering the code in test_function -

test_tibble %>% 
  group_by(country, campaign) %>% 
  summarize(campaign_ltv = sum(revenue)/sum(users)) %>% 
  inner_join(
    test_tibble %>% 
      group_by(country) %>% 
      summarise(total_ltv = sum(revenue)/sum(users)),
    by = "country"
  ) %>% 
  mutate(ltv = (total_ltv + campaign_ltv)/2) %>% 
  ungroup()

# A tibble: 3 x 5
  country campaign campaign_ltv total_ltv   ltv
    <dbl>    <dbl>        <dbl>     <dbl> <dbl>
1       1        1        0.167     0.259 0.213
2       1        2        0.333     0.259 0.296
3       2        3        0.444     0.444 0.444
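
A join-free variant of the same idea, offered as a sketch rather than as part of the original answer: compute the country-wide ratio with a grouped mutate (which keeps every row), then summarise by country and campaign. It assumes dplyr is installed and re-creates the question's example data so that the block runs on its own; the name result is made up.

```r
library(dplyr)

# Example data from the question
test_tibble <- tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19
)

result <- test_tibble %>%
  group_by(country) %>%
  # grouped mutate: every row receives its country's overall revenue per user
  mutate(total_ltv = sum(revenue) / sum(users)) %>%
  group_by(country, campaign) %>%
  summarise(
    campaign_ltv = sum(revenue) / sum(users),
    total_ltv = first(total_ltv)  # constant within a country, so first() is enough
  ) %>%
  ungroup() %>%
  mutate(ltv = (total_ltv + campaign_ltv) / 2)

result
```

The trade-off compared with the join is that the country-level value is carried on every row before being collapsed, which is convenient here but can be wasteful on very large inputs.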

Option 2) -

Wrap the test_function output in a list so that it is stored as a nested tibble, then use unnest.

test_tibble %>%
  group_by (country) %>%
  mutate(
    ltv = list(test_function(activation_date, campaign, revenue, users))
  ) %>%
  select(country, ltv) %>% 
  filter(row_number() == 1) %>% 
  tidyr::unnest(ltv) %>%  # name the list column; a bare unnest() warns in tidyr >= 1.0.0
  ungroup()

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444
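
Option 2 can probably be shortened by building the list column with summarise instead of mutate, which yields one row per country directly and makes the select and filter steps unnecessary. This is a sketch under the assumption that dplyr and tidyr are installed; test_function below is a condensed restatement of the question's helper (vapply instead of the for loop) so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data from the question
test_tibble <- tribble(
  ~activation_date, ~country, ~campaign, ~revenue, ~users,
  1, 1, 1, 1, 11,
  2, 1, 1, 2, 12,
  3, 1, 1, 3, 13,
  1, 1, 2, 4, 14,
  2, 1, 2, 5, 15,
  3, 1, 2, 6, 16,
  1, 2, 3, 7, 17,
  2, 2, 3, 8, 18,
  3, 2, 3, 9, 19
)

# Condensed restatement of the question's test_function
test_function <- function(activation_date, campaign, revenue, users) {
  total_ltv <- sum(revenue) / sum(users)
  campaign_names <- unique(campaign)
  campaign_ltv <- vapply(
    campaign_names,
    function(c) sum(revenue[campaign == c]) / sum(users[campaign == c]),
    numeric(1)
  )
  tibble(campaign = campaign_names, ltv = campaign_ltv / 2 + total_ltv / 2)
}

result <- test_tibble %>%
  group_by(country) %>%
  # one list element (a small tibble) per country
  summarise(ltv = list(test_function(activation_date, campaign, revenue, users))) %>%
  # expand each nested tibble into rows with campaign and ltv columns
  unnest(ltv)

result
```

Because summarise already drops the single grouping level, no trailing ungroup is needed in this variant.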

Option 3) -

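# df is the symbolic tibble from the question (revenue R_1..R_9, users U_1..U_9)
df %>% 
  group_by(country) %>% 
  tidyr::complete(tidyr::nesting(country, campaign), tidyr::nesting(revenue, users)) %>% 
  group_by(campaign, .add = TRUE)  # dplyr >= 1.0.0; older versions spell this add = TRUE
  # now you have all revenue and users for each country-campaign
  # for total_ltv: use revenue and users as is
  # for campaign_ltv: use revenue and users where activation_date is not NA
  # caveat (unverified here): recent tidyr releases, 1.2.0 and later as far as I know,
  # refuse to complete on a grouping column; there,
  # complete(campaign, nesting(revenue, users)) should produce the same rows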

# A tibble: 15 x 5
# Groups:   country, campaign [3]
   country campaign revenue users activation_date
     <int>    <int> <chr>   <chr>           <int>
 1       1        1 R_1     U_1                 1
 2       1        1 R_2     U_2                 2
 3       1        1 R_3     U_3                 3
 4       1        1 R_4     U_4                NA
 5       1        1 R_5     U_5                NA
 6       1        1 R_6     U_6                NA
 7       1        2 R_1     U_1                NA
 8       1        2 R_2     U_2                NA
 9       1        2 R_3     U_3                NA
10       1        2 R_4     U_4                 1
11       1        2 R_5     U_5                 2
12       1        2 R_6     U_6                 3
13       2        3 R_7     U_7                 1
14       2        3 R_8     U_8                 2
15       2        3 R_9     U_9                 3

Demonstrated on test_tibble -

test_tibble %>% 
  group_by(country) %>% 
  tidyr::complete(nesting(country, campaign), nesting(revenue, users)) %>% 
  group_by(campaign, .add = TRUE) %>% 
  summarise(
    ltv = sum(revenue)/sum(users)/2 + 
      sum(revenue[!is.na(activation_date)])/sum(users[!is.na(activation_date)])/2
  ) %>% 
  ungroup()

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444
