R: group by, then filter, to add a value to each row

Problem description

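I have a data.table with 100,000 rows. For each row I want to create a column, horse_win_count, that counts how many times that horse won a race (ind_win = 1) within the preceding two years, that is, over the rows of the same horse whose rdate satisfies rdate < the current row's rdate and rdate >= the current row's rdate - years(2). I know I could use apply to filter for each row and then subset the whole data to compute the value, but how can I do this quickly?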

Input


structure(list(index = c(2501L, 3415L, 19740L, 20566L, 22604L, 
24622L, 66025L, 67207L, 87018L), rdate = structure(c(13845, 13873, 
14531, 14559, 14622, 14685, 16200, 16236, 16974), class = "Date"), 
    horsenum = c("E268", "E268", "E268", "E268", "E268", "E268", 
    "P178", "P178", "P178"), ind_win = c(0L, 1L, 0L, 1L, 0L, 
    1L, 0L, 1L, 0L)), row.names = c(NA, -9L), class = c("data.table", 
"data.frame"))

Desired output

[image: desired output table]

Tags: r, group-by, data.table

Solution


You can use a non-equi join:

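setDT(DT)
# Start of each row's look-back window: rdate minus two calendar years
DT[, twoyago := as.IDate(sapply(rdate, function(d) seq(d, by="-2 years", length.out=2L)[2L]))]
# Non-equi self-join: for each row i, sum the wins of the same horse on earlier
# dates inside the window, then add the row's own ind_win
DT[, horse_win_count := 
        DT[DT, on=.(horsenum, rdate<rdate, rdate>=twoyago), 
            i.ind_win + sum(x.ind_win, na.rm=TRUE), by=.EACHI]$V1
    ]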

Output:

   index      rdate horsenum ind_win    twoyago horse_win_count
1:  2501 2007-11-28     E268       0 2005-11-28               0
2:  3415 2007-12-26     E268       1 2005-12-26               1
3: 19740 2009-10-14     E268       0 2007-10-14               1
4: 20566 2009-11-11     E268       1 2007-11-11               2
5: 22604 2010-01-13     E268       0 2008-01-13               1
6: 24622 2010-03-17     E268       1 2008-03-17               2
7: 66025 2014-05-10     P178       0 2012-05-10               0
8: 67207 2014-06-15     P178       1 2012-06-15               1
9: 87018 2016-06-22     P178       0 2014-06-22               0
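
The twoyago column in the table above comes from stepping back two calendar years with seq(). A minimal base-R sketch of that one step for a single date (the variable names `d` and `window` are only illustrative):

```r
# One race date taken from the sample data (row 3)
d <- as.Date("2009-10-14")

# seq() with by = "-2 years" returns the date itself and the date two years earlier
window <- seq(d, by = "-2 years", length.out = 2L)

# The second element is the start of the two-year look-back window
twoyago <- window[2L]
print(twoyago)  # 2007-10-14, matching twoyago in row 3 of the table
```

Going through seq() rather than subtracting a fixed number of days keeps the window aligned to calendar years, which is presumably why the answer takes this route.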

Data:

library(data.table)
DT <- structure(list(index = c(2501L, 3415L, 19740L, 20566L, 22604L, 
    24622L, 66025L, 67207L, 87018L), rdate = structure(c(13845, 13873, 
        14531, 14559, 14622, 14685, 16200, 16236, 16974), class = "Date"), 
    horsenum = c("E268", "E268", "E268", "E268", "E268", "E268", 
        "P178", "P178", "P178"), ind_win = c(0L, 1L, 0L, 1L, 0L, 
            1L, 0L, 1L, 0L)), row.names = c(NA, -9L), class = c("data.table", 
                "data.frame"))
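
As a cross-check on the join, here is a rough base-R sketch of the slow row-by-row approach the question mentions. It assumes the same window definition as the answer (same horse, strictly earlier date, not more than two calendar years back) and, like the answer's output, adds the current row's own ind_win. The names `df`, `two_years_before` and `brute_count` are hypothetical helpers, not part of data.table.

```r
# Sample data from the question as a plain data.frame (no packages needed)
df <- data.frame(
  index    = c(2501L, 3415L, 19740L, 20566L, 22604L, 24622L, 66025L, 67207L, 87018L),
  rdate    = as.Date(c(13845, 13873, 14531, 14559, 14622, 14685, 16200, 16236, 16974),
                     origin = "1970-01-01"),
  horsenum = c("E268", "E268", "E268", "E268", "E268", "E268", "P178", "P178", "P178"),
  ind_win  = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the date two calendar years before d
two_years_before <- function(d) seq(d, by = "-2 years", length.out = 2L)[2L]

# Hypothetical helper: brute-force count for row i
brute_count <- function(i) {
  start <- two_years_before(df$rdate[i])
  # Earlier races of the same horse that fall inside the two-year window
  in_window <- df$horsenum == df$horsenum[i] &
    df$rdate < df$rdate[i] & df$rdate >= start
  # Own result plus wins inside the window (assumption: mirrors the answer's output)
  df$ind_win[i] + sum(df$ind_win[in_window])
}

df$horse_win_count <- vapply(seq_len(nrow(df)), brute_count, integer(1))
print(df)
```

On the nine sample rows this gives 0, 1, 1, 2, 1, 2, 0, 1, 0, the same horse_win_count values as the output table. It scans the whole table once per row, so it is only suitable for checking results on small samples, not for the full 100,000 rows.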
