Counting rows within a time period in R using data.table

Problem description

library(data.table)

dt <- fread(" ID  DATE    
              A1 20170220
              A1 20170308
              A1 20170311
              A1 20170410
              A1 20170411
              A1 20170413
              A1 20170415
              A1 20170416
              A1 20170420
              A1 20170430
              A2 20170120
              A2 20170121
              A2 20170123
              A2 20170125
              A2 20170202 ")

I am trying to compute a count N like this:

 ID  DATE     count30day (rows of the same ID from DATE up to 30 days later)
 A1 20170220      3 (count row until 20170322)
 A1 20170308      2 (count row until 20170407)
 A1 20170311      2 (count row until 20170410)
 A1 20170410      7 (count row until 20170510)
 A1 20170411      6 (count row until 20170511)
 A1 20170413      5 (count row until 20170513)
 A1 20170415      4 (count row until 20170515)
 A1 20170416      3 (count row until 20170516)
 A1 20170420      2 (count row until 20170520)
 A1 20170430      1 (count row until 20170530)
 A2 20170120      5 (count row until 20170219)
 A2 20170121      4 (count row until 20170220)
 A2 20170123      3 (count row until 20170222)
 A2 20170125      2 (count row until 20170224)
 A2 20170202      1 (count row until 20170304)

I tried this:

dt[,N:=sapply(DATE, function(x) nrow(dt[x<=DATE&DATE < (x + months(1))]))]

This runs, but the last 5 values are wrong. They should be 5, 4, 3, 2, 1, but the result is 5, 5, 4, 3, 2.

The real data I am working with has about 2,500,000 rows, so this takes a long time.

Is there a way to reduce the running time and also fix the problem with the last values?

Tags: r, data.table

Solution


Use a non-equi self-join:

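# DATE must be of class Date (as in the data setup below) so that DATE + 30 means 30 days later.
# For each row, count the rows with the same ID whose DATE lies in [DATE, DATE + 30].
dt[, N := 
    dt[.(ID=ID, stt=DATE, end=DATE+30), on=.(ID, DATE>=stt, DATE<=end), .N, by=.EACHI]$N
]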

Output:

    ID       DATE N
 1: A1 2017-02-20 3
 2: A1 2017-03-08 2
 3: A1 2017-03-11 2
 4: A1 2017-04-10 7
 5: A1 2017-04-11 6
 6: A1 2017-04-13 5
 7: A1 2017-04-15 4
 8: A1 2017-04-16 3
 9: A1 2017-04-20 2
10: A1 2017-04-30 1
11: A2 2017-01-20 5
12: A2 2017-01-21 4
13: A2 2017-01-23 3
14: A2 2017-01-25 2
15: A2 2017-02-02 1
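
The wrong tail in the question (5, 5, 4, 3, 2) is consistent with the `sapply` filter ignoring `ID`, so that the A1 row dated 2017-02-20 is also counted in the A2 windows. As a cross-check on small data, the brute-force count can be repeated within each `ID` and compared with the join. This is only a sketch: the column name `N_check` is made up for illustration, and the window is assumed to be the inclusive 30-day range used in the join above.

```r
library(data.table)

# Same sample data as in the question, with DATE already parsed as Date.
dt <- data.table(
  ID   = rep(c("A1", "A2"), times = c(10, 5)),
  DATE = as.Date(c(
    "2017-02-20", "2017-03-08", "2017-03-11", "2017-04-10", "2017-04-11",
    "2017-04-13", "2017-04-15", "2017-04-16", "2017-04-20", "2017-04-30",
    "2017-01-20", "2017-01-21", "2017-01-23", "2017-01-25", "2017-02-02"
  ))
)

# Non-equi self-join, as in the answer above.
dt[, N := dt[.(ID = ID, stt = DATE, end = DATE + 30),
             on = .(ID, DATE >= stt, DATE <= end), .N, by = .EACHI]$N]

# Brute-force check (N_check is a hypothetical name): within each ID,
# count the rows whose DATE falls in [x, x + 30] for every row date x.
dt[, N_check := {
  d <- as.integer(DATE)  # days since 1970-01-01
  vapply(d, function(x) sum(d >= x & d <= x + 30L), integer(1))
}, by = ID]

# Both columns should agree row by row.
stopifnot(identical(dt$N, dt$N_check))
```

The brute-force version is quadratic within each group, so it is meant for verifying a sample rather than for the full 2,500,000 rows.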

Data:

library(data.table)    
dt <- fread(" ID  DATE    
              A1 20170220
              A1 20170308
              A1 20170311
              A1 20170410
              A1 20170411
              A1 20170413
              A1 20170415
              A1 20170416
              A1 20170420
              A1 20170430
              A2 20170120
              A2 20170121
              A2 20170123
              A2 20170125
              A2 20170202 ")
dt[, DATE := as.Date(as.character(DATE), "%Y%m%d")]
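
Since the question is also about running time, here is a possible alternative to the join that uses base R's `findInterval()` on dates sorted within each `ID`. It is a different technique from the answer above, not a restatement of it, and it has not been benchmarked here. The column name `N_fi` is made up for illustration, and the sketch assumes the same inclusive 30-day window.

```r
library(data.table)

dt <- data.table(
  ID   = rep(c("A1", "A2"), times = c(10, 5)),
  DATE = as.Date(c(
    "2017-02-20", "2017-03-08", "2017-03-11", "2017-04-10", "2017-04-11",
    "2017-04-13", "2017-04-15", "2017-04-16", "2017-04-20", "2017-04-30",
    "2017-01-20", "2017-01-21", "2017-01-23", "2017-01-25", "2017-02-02"
  ))
)

# Assumption: findInterval() needs DATE sorted ascending within each ID.
# setorder() reorders the table in place.
setorder(dt, ID, DATE)

dt[, N_fi := {
  d <- as.integer(DATE)  # days since 1970-01-01
  # findInterval(d + 30L, d): number of rows in the group with DATE <= d + 30
  # match(d, d) - 1L:         number of rows in the group with DATE <  d
  findInterval(d + 30L, d) - match(d, d) + 1L
}, by = ID]
```

On the sample data this gives the same counts as the join (3, 2, 2, 7, 6, 5, 4, 3, 2, 1 for A1 and 5, 4, 3, 2, 1 for A2). Using `match(d, d)` rather than the row position keeps rows with duplicate dates counted the same way the join counts them.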
