Is it possible to compute a conditional cumsum with dplyr?

Problem description

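I am trying to compute a player's historical hit totals in day versus night games. For example, given a player with 5 games ordered from oldest to most recent, the dn_hits column in the first row (a day game) will be zero because it is the first game. For the second row, dn_hits checks whether the second game is a day game or a night game, then performs a backward-looking cumsum() on the hits column, summing all hits from earlier day games or earlier night games, as the case may be. This happens for every row in the group.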

I have a sample data frame and sample output below. I have also included some pseudocode for the calculation I think I need.

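You can see in the output below:

Row 1: dn_hits for player AJ's first row is 0 (there are no previous games or hits to cumsum);

Row 2: player AJ's second row is 2 (AJ's second game is a day game, and so was his first, so we conditionally accumulate the first game's hits = 2 where dn = "Day");

Row 3: player AJ's third row is 0 (the third game is a night game, and before game 3 player AJ had only played (2) day games, so the conditional hit total where dn = "Night" is 0).

Can this be done with dplyr, or is it a job for purrr?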

library(tidyverse)

# Sample data: five games per player, ordered oldest to most recent
df <- tibble(
  game = c(1:5, 1:5),
  name = c("AJ", "AJ", "AJ", "AJ", "AJ", "CJ", "CJ", "CJ", "CJ", "CJ"),
  hits = c(2, 1, 0, 1, 3, 2, 1, 4, 1, 0),
  dn   = c("Day", "Day", "Night", "Night", "Night", "Night", "Day", "Night", "Night", "Day")
)

# Desired output: the same data plus the expected dn_hits column
output <- df %>%
  mutate(dn_hits = c(0, 2, 0, 0, 1, 0, 0, 2, 6, 1))

# Original tibble

df
#> # A tibble: 10 x 4
#>     game name   hits dn   
#>    <int> <chr> <dbl> <chr>
#>  1     1 AJ        2 Day  
#>  2     2 AJ        1 Day  
#>  3     3 AJ        0 Night
#>  4     4 AJ        1 Night
#>  5     5 AJ        3 Night
#>  6     1 CJ        2 Night
#>  7     2 CJ        1 Day  
#>  8     3 CJ        4 Night
#>  9     4 CJ        1 Night
#> 10     5 CJ        0 Day

# Desired Output

output
#> # A tibble: 10 x 5
#>     game name   hits dn    dn_hits
#>    <int> <chr> <dbl> <chr>   <dbl>
#>  1     1 AJ        2 Day         0
#>  2     2 AJ        1 Day         2
#>  3     3 AJ        0 Night       0
#>  4     4 AJ        1 Night       0
#>  5     5 AJ        3 Night       1
#>  6     1 CJ        2 Night       0
#>  7     2 CJ        1 Day         0
#>  8     3 CJ        4 Night       2
#>  9     4 CJ        1 Night       6
#> 10     5 CJ        0 Day         1


# This is what I think needs to happen but not sure how to implement it. 

#df %>% 
#group_by(name) %>%
#arrange(name, desc(game)) %>% 
#mutate(dn_hits = cumsum(dn = [dn on current row],hits, 0))

Edit: I also tried the following:

df %>% 
  group_by(name) %>%
  arrange(name, desc(game)) %>% 
  mutate(dn_hits = map_int(dn, ~ cumsum(if_else(.x == dn, hits, 0L))))

But I get the following error:

Error: Problem with `mutate()` input `dn_hits`.
x `false` must be a double vector, not an integer vector.
i Input `dn_hits` is `map_int(dn, ~cumsum(if_else(.x == dn, hits, 0L)))`.
i The error occurred in group 1: name = "AJ".

Tags: r, dplyr, tidyverse, purrr

Solution


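You can make the cumulative sum "conditional" by adding dn to group_by. Subtracting hits removes the current game from the running total, so each row counts only earlier games of the same type: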

df %>% 
  group_by(name, dn) %>% 
  mutate(dn_hits = cumsum(hits)-hits)
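
As a quick check, the grouped approach can be run against the sample data from the question. This is a minimal sketch: the explicit arrange() and ungroup() calls and the name result are additions made here, on the assumption that rows should be ordered oldest game first within each player before the running total is taken.

```r
library(dplyr)

# Sample data from the question
df <- tibble(
  game = c(1:5, 1:5),
  name = rep(c("AJ", "CJ"), each = 5),
  hits = c(2, 1, 0, 1, 3, 2, 1, 4, 1, 0),
  dn   = c("Day", "Day", "Night", "Night", "Night",
           "Night", "Day", "Night", "Night", "Day")
)

result <- df %>%
  arrange(name, game) %>%                      # assumption: oldest game first per player
  group_by(name, dn) %>%                       # one running total per player and game type
  mutate(dn_hits = cumsum(hits) - hits) %>%    # running total minus the current game
  ungroup()

result$dn_hits
#> [1] 0 2 0 0 1 0 0 2 6 1
```

The values match the dn_hits column in the desired output. Because cumsum() depends on row order, the result is only meaningful when the games are sorted chronologically within each group.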

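Since the question also asks whether this is a job for purrr, a row-wise version closer to the attempted code might look like the sketch below. It is an illustration rather than a recommended replacement: it uses map_dbl() instead of map_int() because hits is a double column and each row needs a single summed value, and the name result_purrr is introduced here only for the example.

```r
library(dplyr)
library(purrr)

# Sample data from the question
df <- tibble(
  game = c(1:5, 1:5),
  name = rep(c("AJ", "CJ"), each = 5),
  hits = c(2, 1, 0, 1, 3, 2, 1, 4, 1, 0),
  dn   = c("Day", "Day", "Night", "Night", "Night",
           "Night", "Day", "Night", "Night", "Day")
)

result_purrr <- df %>%
  arrange(name, game) %>%          # assumption: oldest game first per player
  group_by(name) %>%
  mutate(
    dn_hits = map_dbl(
      seq_along(hits),
      function(i) {
        earlier <- seq_len(i - 1)  # indices of games before game i (empty for i = 1)
        # sum hits from earlier games whose dn matches the current game's dn
        sum(hits[earlier][dn[earlier] == dn[i]])
      }
    )
  ) %>%
  ungroup()

result_purrr$dn_hits
#> [1] 0 2 0 0 1 0 0 2 6 1
```

This does more work per row than the grouped cumsum, so the group_by(name, dn) version is likely the simpler choice when the condition is plain equality on one column.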