R: Extracting the number of IDs that meet multiple conditions

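Problem description

I want to identify the IDs in my dataset that have newly developed a disease. The dataset takes the form of a diary in which people answer a yes/no question every day about whether they have the disease.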

ID <- c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
Date <- c("2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14","2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18", "2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19","2020-03-20")
Disease <- c("No","No","Yes","Yes","Yes","No","No","No", "Yes","Yes","Yes","No","Yes","Yes","No","No","No","Yes","Yes","Yes","Yes")

df <- data.frame(ID, Date, Disease)

df
ID   Date         Disease
1    2020-03-10   No
1    2020-03-11   No
1    2020-03-12   Yes
1    2020-03-13   Yes
1    2020-03-14   Yes
2    2020-03-12   No
2    2020-03-13   No
2    2020-03-14   No
2    2020-03-15   Yes
2    2020-03-16   Yes
2    2020-03-17   Yes
2    2020-03-18   No
3    2020-03-12   Yes
3    2020-03-13   Yes
3    2020-03-14   No
3    2020-03-15   No
3    2020-03-16   No
3    2020-03-17   Yes
3    2020-03-18   Yes
3    2020-03-19   Yes
3    2020-03-20   Yes

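However, to count as "newly developed disease", a person must meet the following conditions:
1. The person must answer "Yes" on at least two consecutive days.
2. The person must answer "No" on at least 3 consecutive days before the first "Yes".

As output, I would like the number of people who meet these conditions. In the extract of the dataset above, that would be two (IDs 2 and 3).

Does anyone know how to achieve this? Thanks in advance for your time!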

Tags: r

Solution

A slightly messy way to do this is to use the dplyr::lag() function.

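 library(tidyverse)
 library(lubridate)
 df %>% 
     mutate(Date = ymd(Date)) %>%
     group_by(ID) %>% 
     # Disease answers from 1 to 4 diary entries before the current row, per ID
     mutate(day_1 = lag(Disease, 1, order_by = Date), 
            day_2 = lag(Disease, 2, order_by = Date), 
            day_3 = lag(Disease, 3, order_by = Date), 
            day_4 = lag(Disease, 4, order_by = Date)) %>% 
     # Keep rows that are the second "Yes" in a row, directly preceded by three "No"
     filter(Disease == "Yes" & day_1 == "Yes" &
            day_2 == "No" & day_3 == "No" & day_4 == "No") %>%
     # Drop the grouping so that n() counts all matching IDs, not one per group
     ungroup() %>%
     distinct(ID) %>% 
     summarise("Number of patients matching the condition" = n())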

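This groups the rows by ID so that all calculations are done separately for each person. It then stores the Disease values from the previous four diary entries in the columns day_1 to day_4, checks every row against the conditions, keeps the unique IDs that match, and counts them. Note that lag() shifts by rows rather than by calendar days, so this relies on each person having exactly one entry per day with no gaps.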
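As an alternative, the same two conditions can be sketched in base R with run-length encoding (rle()), which handles "at least" thresholds directly. This is only a minimal sketch, not the approach from the answer above: the helper name has_new_onset and its parameters min_no and min_yes are hypothetical, and, like the lag() version, it assumes one diary entry per person per day with no gaps.

```r
# Example data copied from the question
ID <- c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
Date <- c("2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14",
          "2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16",
          "2020-03-17","2020-03-18","2020-03-12","2020-03-13","2020-03-14",
          "2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19",
          "2020-03-20")
Disease <- c("No","No","Yes","Yes","Yes","No","No","No","Yes","Yes","Yes",
             "No","Yes","Yes","No","No","No","Yes","Yes","Yes","Yes")
df <- data.frame(ID, Date, Disease)

# Hypothetical helper: TRUE if a run of at least `min_no` "No" answers is
# immediately followed by a run of at least `min_yes` "Yes" answers.
has_new_onset <- function(disease, min_no = 3, min_yes = 2) {
  runs <- rle(as.character(disease))
  n <- length(runs$values)
  if (n < 2) return(FALSE)          # a single run cannot contain No -> Yes
  prev <- seq_len(n - 1)            # index of each run that has a successor
  nxt  <- prev + 1                  # index of the run that follows it
  any(runs$values[prev] == "No"  & runs$lengths[prev] >= min_no &
      runs$values[nxt]  == "Yes" & runs$lengths[nxt]  >= min_yes)
}

# Sort by person and date so that runs follow diary order
df <- df[order(df$ID, as.Date(df$Date)), ]

# One TRUE/FALSE flag per ID, then count the TRUE ones
flags <- vapply(split(df$Disease, df$ID), has_new_onset, logical(1))
n_new_onset <- sum(flags)
n_new_onset           # 2 for the example data
names(flags)[flags]   # the matching IDs: "2" "3"
```

Because the thresholds are function arguments, the definition can be tightened (for example min_no = 5) without adding more lag columns.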

