首页 > 解决方案 > 在R中按几个条件组合数据

问题描述

我想根据两个条件合并两个数据:

  1. 按相同ID(仅保留第一个数据中的ID)
  2. 如果 date_mid(来自 dat2)在 date_begin 和 date_end(都来自 dat1)之间,则粘贴结果(来自 dat2),如果不是,记为“NA”

此外,如果组合数据中的 ID 已经有结果(健康或生病),我想删除这些行。在下面的示例中,我想删除第 3 行和第 12 行。

第一个数据(dat1):

dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"), 
           date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12), 
           date_end = date_begin + 365)
dat1

# A tibble: 12 x 3
   ID    date_begin date_end  
   <chr> <date>     <date>    
 1 A1    2020-01-01 2020-12-31
 2 A2    2020-02-01 2021-01-31
 3 A3    2020-03-01 2021-03-01
 4 A4    2020-04-01 2021-04-01
 5 A5    2020-05-01 2021-05-01
 6 A6    2020-06-01 2021-06-01
 7 A7    2020-07-01 2021-07-01
 8 A8    2020-08-01 2021-08-01
 9 A9    2020-09-01 2021-09-01
10 A10   2020-10-01 2021-10-01
11 A2    2020-11-01 2021-11-01
12 A10   2020-12-01 2021-12-01

第二个数据(dat2):

dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"), 
               date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100, 
               result = rep(c("healthy", "sick"), length = 12))
dat2

# A tibble: 12 x 3
   ID    date_mid   result 
   <chr> <date>     <chr>  
 1 A1    2020-04-10 healthy
 2 A2    2020-05-11 sick   
 3 A3    2020-06-09 healthy
 4 A4    2020-07-10 sick   
 5 A9    2020-08-09 healthy
 6 A10   2020-09-09 sick   
 7 A11   2020-10-09 healthy
 8 A12   2020-11-09 sick   
 9 A13   2020-12-10 healthy
10 A14   2021-01-09 sick   
11 A15   2021-02-09 healthy
12 A2    2021-03-11 sick 

我试过left_join如下:

left_join(dat1, dat2, by = "ID") %>% 
  mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))

# A tibble: 14 x 5
   ID    date_begin date_end   date_mid   result 
   <chr> <date>     <date>     <date>     <chr>  
 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
 3 A2    2020-02-01 2021-01-31 2021-03-11 NA     
 4 A3    2020-03-01 2021-03-01 2020-06-09 healthy
 5 A4    2020-04-01 2021-04-01 2020-07-10 sick   
 6 A5    2020-05-01 2021-05-01 NA         NA     
 7 A6    2020-06-01 2021-06-01 NA         NA     
 8 A7    2020-07-01 2021-07-01 NA         NA     
 9 A8    2020-08-01 2021-08-01 NA         NA     
10 A9    2020-09-01 2021-09-01 2020-08-09 NA     
11 A10   2020-10-01 2021-10-01 2020-09-09 NA     
12 A2    2020-11-01 2021-11-01 2020-05-11 NA     
13 A2    2020-11-01 2021-11-01 2021-03-11 sick   
14 A10   2020-12-01 2021-12-01 2020-09-09 NA  

正如我所提到的,我想删除 ID A2 的第 3 行和第 12 行,因为 A2 在第 2 行和第 13 行中已经有健康或生病的结果。

我想要的确切结果是这样的(只有 2 行 A2):

# A tibble: 12 x 5
   ID    date_begin date_end   date_mid   result 
   <chr> <date>     <date>     <date>     <chr>  
 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
 3 A3    2020-03-01 2021-03-01 2020-06-09 healthy
 4 A4    2020-04-01 2021-04-01 2020-07-10 sick   
 5 A5    2020-05-01 2021-05-01 NA         NA     
 6 A6    2020-06-01 2021-06-01 NA         NA     
 7 A7    2020-07-01 2021-07-01 NA         NA     
 8 A8    2020-08-01 2021-08-01 NA         NA     
 9 A9    2020-09-01 2021-09-01 2020-08-09 NA     
10 A10   2020-10-01 2021-10-01 2020-09-09 NA     
11 A2    2020-11-01 2021-11-01 2021-03-11 sick   
12 A10   2020-12-01 2021-12-01 2020-09-09 NA  

任何指针表示赞赏,谢谢。

标签: rmergeleft-join

解决方案


如果加入后的结果中有不止一行,ID则只保留非 NA 行。这可以写成dplyr——

library(dplyr)
library(lubridate)

left_join(dat1, dat2, by = "ID") %>% 
  mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
  group_by(ID, date_begin, date_end)  %>%
  filter(if(n() > 1) !is.na(result) else TRUE) %>%
  ungroup


#   ID    date_begin date_end   date_mid   result 
#   <chr> <date>     <date>     <date>     <chr>  
# 1 A1    2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2    2020-02-01 2021-01-31 2020-05-11 sick   
# 3 A3    2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4    2020-04-01 2021-04-01 2020-07-10 sick   
# 5 A5    2020-05-01 2021-05-01 NA         NA     
# 6 A6    2020-06-01 2021-06-01 NA         NA     
# 7 A7    2020-07-01 2021-07-01 NA         NA     
# 8 A8    2020-08-01 2021-08-01 NA         NA     
# 9 A9    2020-09-01 2021-09-01 2020-08-09 NA     
#10 A10   2020-10-01 2021-10-01 2020-09-09 NA     
#11 A2    2020-11-01 2021-11-01 2021-03-11 sick   
#12 A10   2020-12-01 2021-12-01 2020-09-09 NA     

推荐阅读