首页 > 解决方案 > Subset the first three years of each group of a data.table

问题描述

I have a data.table, dt, that spans multiple years. The data is grouped so each group has different length of years. I only want to keep the first three years of each group. How can I do this with data.table? Here is some data to test:

dates <- c(seq(as.Date('2010-01-03'),as.Date('2019-12-31'),by = 1),
           seq(as.Date('2013-01-02'),as.Date('2018-12-31'),by = 1),
           seq(as.Date('2015-01-02'),as.Date('2020-07-31'),by = 1))

set.seed(1995)
value <- rnorm(length(dates), mean = 100, sd = 50)

IDs <- c(rep(c("ACG"),length.out = length(seq(as.Date('2010-01-03'),as.Date('2019-12-31'),by = 1))),
         rep(c("MKD"),length.out = length(seq(as.Date('2013-01-02'),as.Date('2018-12-31'),by = 1))),
         rep(c("ZED"),length.out = length(seq(as.Date('2015-01-02'),as.Date('2020-07-31'),by = 1)))
)

dt <- data.table(Date = dates,
                 Value = value,
                 ID = IDs
                )
dt

            Date     Value  ID
   1: 2010-01-03 153.03816 ACG
   2: 2010-01-04  83.22491 ACG
   3: 2010-01-05 107.26521 ACG
   4: 2010-01-06 119.70395 ACG
   5: 2010-01-07 183.24604 ACG
  ---                         
7874: 2020-07-27 184.45801 ZED
7875: 2020-07-28  91.53373 ZED
7876: 2020-07-29  67.42443 ZED
7877: 2020-07-30 125.62496 ZED
7878: 2020-07-31  89.02373 ZED

The final data.table should only have the first three years of data like this:

finalDT <- dt[c(1:1094,3651:4744,5841:6935),]
finalDT

            Date     Value  ID
   1: 2010-01-03 153.03816 ACG
   2: 2010-01-04  83.22491 ACG
   3: 2010-01-05 107.26521 ACG
   4: 2010-01-06 119.70395 ACG
   5: 2010-01-07 183.24604 ACG
  ---                         
3279: 2017-12-27 102.10622 ZED
3280: 2017-12-28  94.97718 ZED
3281: 2017-12-29 131.47358 ZED
3282: 2017-12-30 112.83836 ZED
3283: 2017-12-31 184.54966 ZED

The approach I've use works fine on a smallish dataset, but I have over a hundred IDs, some with 20 years of data. I need a programmatic approach in data.table

标签: rdataframedatedata.tabledata-manipulation

解决方案


dt[, .SD[year(Date) %in% unique(year(Date))[1:3]], by = ID]

Or

dt[, .SD[year(Date) <= unique(year(Date))[3]], by = ID]

Make sure to order by date first.


推荐阅读