r - 在R中按几个条件组合数据
问题描述
我想根据两个条件合并两个数据:
- 按相同ID(仅保留第一个数据中的ID)
- 如果 date_mid(来自 dat2)在 date_begin 和 date_end(都来自 dat1)之间,则粘贴结果(来自 dat2),如果不是,记为“NA”
此外,如果组合数据中的 ID 已经有结果(健康或生病),我想删除这些行。在下面的示例中,我想删除第 3 行和第 12 行。
第一个数据(dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
第二个数据(dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
我试过left_join
如下:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
正如我所提到的,我想删除 ID A2 的第 3 行和第 12 行,因为 A2 在第 2 行和第 13 行中已经有健康或生病的结果。
我想要的确切结果是这样的(只有 2 行 A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
任何指针表示赞赏,谢谢。
解决方案
如果加入后的结果中有不止一行,ID
则只保留非 NA 行。这可以写成dplyr
——
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA
推荐阅读
- javascript - javascript中的http请求处理
- excel - 通过 Matlab GUI 在 Excel 中提取特定变量
- git - 在合并到 master 时,我从本地分支中丢失了一个完整的目录。如何找回那些丢失的文件?
- reference - 可变引用生命周期与不可变引用生命周期
- c# - 调用一个在后台运行的方法,仍然让用户在表单中工作
- php - Wordpress 产品数据刮板/API
- bash - Bash fastCGI 在没有给出参数时显示很多信息
- vb.net - VB ASP.NET Web 应用程序属性创建为类库?
- svn - SVN 合并冲突标记没有共同祖先(“diff3”)
- c# - 代码运行时未设置变量值。在代码如下