dplyr behavior in case_when and lag

Problem description

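I have a dataset containing studyid, year, and two flags: incident and prevalent. I want the prevalent variable to be true (1) in every year after the year in which the incident flag is true (the incident variable can be true only once per studyid). case_when and lag seemed like the perfect combination, but if incident is set to 1 in year N, prevalent is set to 1 only in year N+1 and reverts to 0 in year N+2. This is not the expected behavior.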

Here is the sample code:

library(tidyverse)

# make a fake dataset
testdat <- tribble(
  ~studyid, ~datestring, ~incident,
  "1", "2000-01-01", 0,
  "1", "2001-01-01", 1,
  "1", "2002-01-01", 0,
  "1", "2003-01-01", 0,
  "2", "2003-01-01", 0,
  "2", "2004-01-01", 1,
  "2", "2005-01-01", 0,
  "2", "2006-01-01", 0
) %>% mutate(
  prevalent = 0,
  date = lubridate::ymd(datestring)
) %>% group_by(studyid) %>% 
  arrange(studyid, date) %>% 
  mutate(prevalent = case_when(
    # logic is: if prevalent in year N-1, then prevalent in year N
    # if incident in year N-1, then prevalent in year N
    # otherwise not prevalent (because never incident)
    dplyr::lag(prevalent, 1L)==1 ~1,
    dplyr::lag(incident, 1L)==1 ~1,
    TRUE ~ 0
  ) #close case_when
  ) #close mutate
testdat

The output is:

# A tibble: 8 x 5
# Groups:   studyid [2]
  studyid datestring incident prevalent date      
  <chr>   <chr>         <dbl>     <dbl> <date>    
1 1       2000-01-01        0         0 2000-01-01
2 1       2001-01-01        1         0 2001-01-01
3 1       2002-01-01        0         1 2002-01-01
4 1       2003-01-01        0         0 2003-01-01
5 2       2003-01-01        0         0 2003-01-01
6 2       2004-01-01        1         0 2004-01-01
7 2       2005-01-01        0         1 2005-01-01
8 2       2006-01-01        0         0 2006-01-01

The desired output is:

studyid=1, year=2003  prevalent ==1 (not 0)
studyid=2, year=2006  prevalent ==1 (not 0)

I suspect this has to do with how case_when interacts with dplyr::lag. How can I improve the logic/syntax to get the desired result?

Many thanks,

Tags: r, dplyr, lag

Solution


You are looking for something like last observation carried forward, e.g. zoo::na.locf or tidyr::fill, but I would use something simpler, such as:

library(dplyr)
testdat %>%
  mutate(date = lubridate::ymd(datestring)) %>%
  group_by(studyid) %>%
  arrange(studyid, date) %>%
  mutate(prevalent = cumsum(lag(incident, default = 0) == 1))

# A tibble: 8 x 5
# Groups:   studyid [2]
  studyid datestring incident date       prevalent
  <chr>   <chr>         <dbl> <date>         <int>
1 1       2000-01-01        0 2000-01-01         0
2 1       2001-01-01        1 2001-01-01         0
3 1       2002-01-01        0 2002-01-01         1
4 1       2003-01-01        0 2003-01-01         1
5 2       2003-01-01        0 2003-01-01         0
6 2       2004-01-01        1 2004-01-01         0
7 2       2005-01-01        0 2005-01-01         1
8 2       2006-01-01        0 2006-01-01         1
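
As for why the original attempt fails: inside a single mutate, lag(prevalent, 1L) is evaluated once over the whole column as it existed before the mutate (all zeros here), not row by row against the values being computed. The first case_when condition is therefore never true, and only the lag(incident) condition fires, for the single row right after the incident. Below is a minimal sketch of that, plus a cummax variant of the cumulative approach that stays at 0/1 even if incident were flagged more than once. The names stale, prev_incident and result are illustrative, not from the question, and as.Date is used here in place of lubridate::ymd only to keep the example's dependencies small.

```r
library(dplyr)

# Same fake data as in the question
testdat <- tibble::tribble(
  ~studyid, ~datestring, ~incident,
  "1", "2000-01-01", 0,
  "1", "2001-01-01", 1,
  "1", "2002-01-01", 0,
  "1", "2003-01-01", 0,
  "2", "2003-01-01", 0,
  "2", "2004-01-01", 1,
  "2", "2005-01-01", 0,
  "2", "2006-01-01", 0
)

# Why the case_when attempt fails: lag() sees the old, all-zero column,
# so the lagged value is only ever NA (first row of a group) or 0.
stale <- testdat %>%
  mutate(prevalent = 0) %>%
  group_by(studyid) %>%
  mutate(lag_prev = lag(prevalent, 1L)) %>%
  ungroup()

# Cumulative alternative: once the previous year's incident flag is 1,
# the running maximum keeps prevalent at 1 for all later years.
result <- testdat %>%
  mutate(date = as.Date(datestring)) %>%
  arrange(studyid, date) %>%
  group_by(studyid) %>%
  mutate(
    prev_incident = lag(incident, default = 0),  # incident in year N-1
    prevalent = cummax(prev_incident)            # carried forward
  ) %>%
  ungroup()

result
```

With incident flagged at most once per studyid, as the question assumes, this should give the same 0/1 pattern as the cumsum version above; the difference only matters if a studyid could have several incident rows.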
