How can I use group_by in R to compare every value with the next one

Problem description

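I have a data frame with dates that I want to compare using group_by, but I need to compare each date with all the other dates in its group to build a timeline with no gaps or overlaps, so that I end up with only the maximum end date and the minimum start date for each ID. A reproducible example: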

ID <- c(1,1,1,3,3,7,7,7,22,22,32,32,173,173,213,213,230,330,330,330,330,150579)
EndDate <- c("9999-12-31","2018-04-30","2015-07-31","9999-12-31","2008-07-26","9999-12-31","9999-12-31","2011-08-31","9999-12-31","2006-11-30","9999-12-31","2007-06-30","9999-12-31","2010-09-30","9999-12-31","2013-04-30","9999-12-31","9999-12-31","2016-12-31","2016-09-30","2015-08-31","9999-12-31")
BegDate <- c("2015-08-01","2017-10-23","1983-12-05","2015-11-12","2003-02-24","2017-04-01","2014-07-15","1991-11-04","2006-12-01","1979-08-01","2007-07-01","1979-08-01","2010-10-01","1987-04-01","1980-10-20","2008-05-01","1983-02-14","1982-01-01","2016-10-01","2015-09-01","2014-02-01","1982-09-01")

df_dates <- data.frame(ID,EndDate,BegDate)

ID       EndDate       BegDate
1        9999-12-31    2015-08-01
1        2018-04-30    2017-10-23
1        2015-07-31    1983-12-05
3        9999-12-31    2015-11-12
3        2008-07-26    2003-02-24
7        9999-12-31    2017-04-01
7        9999-12-31    2014-07-15
7        2011-08-31    1991-11-04
22       9999-12-31    2006-12-01
22       2006-11-30    1979-08-01
32       9999-12-31    2007-07-01
32       2007-06-30    1979-08-01
173      9999-12-31    2010-10-01
173      2010-09-30    1987-04-01
213      9999-12-31    1980-10-20
213      2013-04-30    2008-05-01
233      9999-12-31    2016-06-01
233      2016-05-31    1998-10-01
330      9999-12-31    1982-01-01
330      2016-12-31    2016-10-01
330      2016-09-30    2015-09-01
330      2015-08-31    2014-02-01
150579   9999-12-31    1982-09-01

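I tried dplyr, but I don't know how to compare all the elements of a group with one another. I used for loops, but the data frame is large and speed is a must.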

v_result <- c()

for (i in unique(df_dates$ID)) {
   # Rows belonging to the current ID
   df_temp <- df_dates[df_dates$ID == i, ]

   df_temp$EndDate <- as.Date(df_temp$EndDate, "%Y-%m-%d")
   df_temp$BegDate <- as.Date(df_temp$BegDate, "%Y-%m-%d")

   v_row <- seq_len(nrow(df_temp))

   for (j in v_row) {
      # Every other row of the same ID
      elm <- v_row[!v_row %in% j]
      findNext <- FALSE

      for (h in elm) {
        # Row j fully covers row h, or row j starts the day after row h ends
        if ((df_temp$EndDate[j] >= df_temp$EndDate[h] &&
             df_temp$BegDate[j] <= df_temp$BegDate[h]) ||
            df_temp$BegDate[j] - 1 == df_temp$EndDate[h]) {
          findNext <- TRUE
        }
      }
      v_result <- c(v_result, findNext)
   }
}

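As you can see, that is a lot of for loops. I'm not very familiar with the apply family of functions, and there may be more than 150k IDs, so this is not a viable option. My idea was to flag the rows that overlap and the ones that represent gaps as FALSE and filter them out, which would let me take the maximum and the minimum: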

df_final <- df_final%>%
   group_by(ID)%>%
   mutate(
    Biggest = max(EndDate),
    Lowest = min(BegDate)
   )

to produce something like this:

ID       EndDate       BegDate
1        9999-12-31    1983-12-05
3        9999-12-31    2015-11-12
7        9999-12-31    2014-07-15
22       9999-12-31    1979-08-01
32       9999-12-31    1979-08-01
173      9999-12-31    2017-07-01
213      9999-12-31    1980-10-20
233      9999-12-31    1998-10-01
330      9999-12-31    1982-01-01
150579   9999-12-31    1982-09-01

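The end date will not always be 9999-12-31; it only has to be the maximum date for the ID, and it must correspond to a period with no gaps, ignoring overlaps. I have been struggling with this for days and can't make any progress.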

Is there an efficient way to do this with dplyr on a large data frame?

Tags: r, date, group-by, dplyr, iteration

Solution


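The logic behind your final output is not entirely clear. As a starting point, let's do this with data.table (which should be efficient for larger data frames) and magrittr (for better readability):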

library(data.table)
library(magrittr)

calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>% 
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>% 
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>% 
  #.[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]

The output here is:

        ID    BegDate    EndDate
 1:      1 1983-12-05 9999-12-31
 2:      3 2003-02-24 2008-07-26
 3:      3 2015-11-12 9999-12-31
 4:      7 1991-11-04 2011-08-31
 5:      7 2014-07-15 9999-12-31
 6:     22 1979-08-01 9999-12-31
 7:     32 1979-08-01 9999-12-31
 8:    173 1987-04-01 9999-12-31
 9:    213 1980-10-20 9999-12-31
10:    230 1983-02-14 9999-12-31
11:    330 1982-01-01 9999-12-31
12: 150579 1982-09-01 9999-12-31

If you look at rows 2 and 4, you will see that, according to your expected output, they should not be there.

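However, each of them is separated from the next period by a gap, so we can't simply take the lowest BegDate per ID; we need to reach this intermediate step before we can produce the final output.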
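To make the gap detection concrete, here is a minimal base R sketch for a single ID (the three rows of ID 7 from the example data). It is a simplified illustration of the idea, not the exact data.table pipeline above; the names prev_max_end, is_gap and period are made up for this example.

```r
# Rows for ID 7 from the example data, sorted by start date.
beg <- as.Date(c("1991-11-04", "2014-07-15", "2017-04-01"))
end <- as.Date(c("2011-08-31", "9999-12-31", "9999-12-31"))

# Latest end date seen among the previous rows (NA for the first row).
# cummax() runs on the underlying day counts, then is shifted by one row.
end_num <- as.numeric(end)
prev_max_end <- c(NA, head(cummax(end_num), -1))

# A row opens a new period when it starts more than one day after
# everything before it has ended; the first row is never a gap.
is_gap <- !is.na(prev_max_end) & as.numeric(beg) > prev_max_end + 1

# The running total of gap flags labels each contiguous period.
period <- cumsum(is_gap)

# Collapse each period to its earliest start and latest end.
merged <- data.frame(
  period  = unique(period),
  BegDate = as.Date(as.vector(tapply(as.numeric(beg), period, min)),
                    origin = "1970-01-01"),
  EndDate = as.Date(as.vector(tapply(as.numeric(end), period, max)),
                    origin = "1970-01-01")
)
print(merged)
```

The first row ends on 2011-08-31 and the second starts on 2014-07-15, so the second row is flagged as a gap and ID 7 is split into two periods, which is why it appears twice in the output above.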

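For your final output, I am assuming that you want to discard everything that comes before a gap (i.e. keep only the last record of each ID). You can do that by simply uncommenting the second-to-last line, i.e.: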

library(data.table)
library(magrittr)

calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>% 
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>% 
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>% 
  .[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]

which produces:

        ID    BegDate    EndDate
 1:      1 1983-12-05 9999-12-31
 2:      3 2015-11-12 9999-12-31
 3:      7 2014-07-15 9999-12-31
 4:     22 1979-08-01 9999-12-31
 5:     32 1979-08-01 9999-12-31
 6:    173 1987-04-01 9999-12-31
 7:    213 1980-10-20 9999-12-31
 8:    230 1983-02-14 9999-12-31
 9:    330 1982-01-01 9999-12-31
10: 150579 1982-09-01 9999-12-31
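
Since the question asks about dplyr, the same idea can probably be expressed there as well. The following is only a sketch under the same assumption (keep the last contiguous period per ID), run on a small subset of the example data; it is not the code from the answer, and the helper columns prev_max_end, is_gap and period are illustrative names. It assumes dplyr 1.0.0 or later for slice_tail() and the .groups argument.

```r
library(dplyr)

# Small subset of the example data (IDs 1, 3 and 330).
df_dates <- data.frame(
  ID      = c(1, 1, 1, 3, 3, 330, 330, 330, 330),
  EndDate = c("9999-12-31", "2018-04-30", "2015-07-31",
              "9999-12-31", "2008-07-26",
              "9999-12-31", "2016-12-31", "2016-09-30", "2015-08-31"),
  BegDate = c("2015-08-01", "2017-10-23", "1983-12-05",
              "2015-11-12", "2003-02-24",
              "1982-01-01", "2016-10-01", "2015-09-01", "2014-02-01")
)

df_final <- df_dates %>%
  mutate(BegDate = as.Date(BegDate), EndDate = as.Date(EndDate)) %>%
  arrange(ID, BegDate) %>%
  group_by(ID) %>%
  mutate(
    # Latest end date among earlier rows of the same ID (NA on the first row).
    prev_max_end = lag(cummax(as.numeric(EndDate))),
    # TRUE when this row starts more than one day after all earlier rows ended.
    is_gap = !is.na(prev_max_end) & as.numeric(BegDate) > prev_max_end + 1,
    # Running count of gaps labels each contiguous period within the ID.
    period = cumsum(is_gap)
  ) %>%
  group_by(ID, period) %>%
  summarise(BegDate = min(BegDate), EndDate = max(EndDate), .groups = "drop") %>%
  # Keep only the most recent contiguous period per ID.
  group_by(ID) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  select(ID, BegDate, EndDate)

print(df_final)
```

For these three IDs the result agrees with the data.table output above. Whether it is fast enough for 150k IDs would need to be measured on the real data.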
