Find all rows in one data frame whose dates fall within the date ranges of another data frame in R

Problem description

I have 2 data frames, as shown below:

df1 <- data.frame(ID=c(1,1,1,2,3,4,4), actual.date=c('10/01/1996','10/02/1996','5/01/2002','7/01/1999','9/01/2005','5/01/2006','2/03/2003'),
val=c(5,10,15,20,25,30,35))
dcis <- grep('date$',names(df1))
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y')
df1

df2 <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004'))
dcis <- grep('date$',names(df2))
df2[dcis] <- lapply(df2[dcis],as.Date,'%m/%d/%Y')
df2

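Requirement -> For each row of df2, I want to find the rows of df1 with the same ID whose actual.date falls within that row's date range (before.date to after.date, inclusive).

For example: for row 1 of df2, there are 2 rows in df1 (rows 1 and 2) that have ID=1 and fall within the date range of row 1 of df2. In the final output, I would sum 5+10 (from the "val" column) and put the result in df2.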

Tags: r

Solution


We can use a non-equi join with data.table:

library(data.table)
val1 <- setDT(df1)[df2, sum(val), on = .(ID, actual.date >= before.date, 
           actual.date <= after.date), by = .EACHI]$V1
df2$val <- val1
df2$val
#[1] 15 15 15 NA 25 35 NA NA

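Another option is fuzzy_right_join from the fuzzyjoin package. Note that rows of df2 with no match show 0 here rather than NA, because sum(val, na.rm = TRUE) over only missing values returns 0: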

library(fuzzyjoin)
library(dplyr)
fuzzy_right_join(df1, df2, by = c("ID",
    "actual.date" = "before.date", "actual.date"= "after.date"), 
      match_fun = list(`==`, `>=`, `<=`)) %>%
 group_by(ID = ID.y, before.date, after.date) %>% 
 summarise(val = sum(val, na.rm = TRUE))
# A tibble: 8 x 4
# Groups:   ID, before.date [8]
#     ID before.date after.date   val
#  <dbl> <date>      <date>     <dbl>
#1     1 1996-10-01  1996-12-01    15
#2     1 1998-01-01  2003-09-30    15
#3     1 2000-01-01  2004-12-31    15
#4     2 2001-01-01  2006-03-31     0
#5     3 2001-01-01  2006-09-30    25
#6     4 2001-01-01  2005-09-30    35
#7     4 2004-10-01  2004-12-30     0
#8     4 2004-10-03  2004-11-28     0
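
For comparison, the same per-row sum can be sketched in base R without any extra packages. This is only a cross-check under my own assumptions, not part of the original answer: it rebuilds the question's example data and loops over the rows of df2 explicitly, so it will be slower than the join-based approaches on large data. Returning NA for rows with no match is a choice made here to mirror the data.table output.

```r
# Rebuild the example data from the question.
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 3, 4, 4),
  actual.date = as.Date(c("10/01/1996", "10/02/1996", "5/01/2002", "7/01/1999",
                          "9/01/2005", "5/01/2006", "2/03/2003"), "%m/%d/%Y"),
  val = c(5, 10, 15, 20, 25, 30, 35)
)
df2 <- data.frame(
  ID = c(1, 1, 1, 2, 3, 4, 4, 4),
  before.date = as.Date(c("10/1/1996", "1/1/1998", "1/1/2000", "1/1/2001",
                          "1/1/2001", "1/1/2001", "10/1/2004", "10/3/2004"),
                        "%m/%d/%Y"),
  after.date = as.Date(c("12/1/1996", "9/30/2003", "12/31/2004", "3/31/2006",
                         "9/30/2006", "9/30/2005", "12/30/2004", "11/28/2004"),
                       "%m/%d/%Y")
)

# For each row i of df2, flag the df1 rows with the same ID whose date lies
# inside [before.date, after.date] (both ends inclusive), then sum their val.
# Rows with no match get NA (an assumption, chosen to match data.table).
df2$val <- vapply(seq_len(nrow(df2)), function(i) {
  hit <- df1$ID == df2$ID[i] &
    df1$actual.date >= df2$before.date[i] &
    df1$actual.date <= df2$after.date[i]
  if (any(hit)) sum(df1$val[hit]) else NA_real_
}, numeric(1))

df2$val
#[1] 15 15 15 NA 25 35 NA NA
```

If you prefer 0 for unmatched rows, as in the fuzzyjoin output, replace NA_real_ with 0 in the sketch.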
