How do I refer to the whole row when creating a new column in a data.table?

Question

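I have a data.table with more than 200 variables, all of them binary. I want to create a new column that counts, for each row, how many values differ from a reference vector: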

# Example
library(data.table)

dt = data.table(
  "V1" = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
  "V2" = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
  "V3" = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
  "V4" = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
  "V5" = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
)

reference = c(1,1,0,1,0)

I can do this with a small for loop, for example:

distance = NULL
for(i in 1:nrow(dt)){      
  distance[i] = sum(reference != dt[i,])  
}

But this is rather slow, and it is surely not the best way to do it. I have tried:

dt[,"distance":= sum(reference != c(V1,V2,V3,V4,V5))]
dt[,"distance":= sum(reference != .SD)]

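But neither works: both return the same value for every row, because without grouping the expression is evaluated once over whole columns and the single sum is recycled. A solution that does not require typing out every variable name would also be much better, since the real data.table has more than 200 columns.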

Tags: r, data.table

Answer


You can use sweep() with rowSums(), i.e.

rowSums(sweep(dt, 2, reference) != 0)
 #[1] 2 2 2 2 4 4 3 2 4 3 2 1 3 4 1 3
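The one-liner above returns a plain vector, while the question asks for a new column. A minimal sketch of how the same idea might be written with `:=` follows. The helper variable `cols` and the explicit `as.matrix()` conversion are assumptions added for illustration, not part of the original answer; `cols` keeps the new column out of the comparison if the line is run again.

```r
library(data.table)

# Example data and reference vector from the question
dt <- data.table(
  V1 = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
  V2 = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
  V3 = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
  V4 = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
  V5 = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
)
reference <- c(1, 1, 0, 1, 0)

# Hypothetical helper: the columns to compare, captured before the new
# column exists. copy() is used defensively so this vector cannot be
# affected when := later adds a column by reference.
cols <- copy(names(dt))

# sweep(..., 2, reference) subtracts reference[j] from column j, so any
# non-zero entry marks a mismatch; rowSums() counts mismatches per row.
dt[, distance := rowSums(sweep(as.matrix(.SD), 2, reference) != 0),
   .SDcols = cols]

print(dt$distance)
```

Because `.SDcols` takes a character vector, no column names have to be typed by hand, which is what matters for the 200-column case.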

Benchmark

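HUGH <- function(dt) {
    # Work on a copy: := modifies by reference and would otherwise add
    # column I to the caller's table (and to the input of Sotos()).
    dt <- copy(dt)
    dt[, I := .I]
    # Reshape to long format, then count mismatches within each original row
    distance_by_I <- melt(dt, id.vars = "I")[, .(distance = sum(reference != value)), keyby = "I"]
    return(dt[distance_by_I, on = "I"])
}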

Sotos <- function(dt) {
    return(rowSums(sweep(dt, 2, reference) != 0))
}

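library(data.table)
library(microbenchmark)

# 100,000 rows of random binary data in 5 columns, matching length(reference)
dt1 <- as.data.table(replicate(5, sample(c(0, 1), 100000, replace = TRUE)))
microbenchmark(HUGH(dt1), Sotos(dt1))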

#Unit: milliseconds
#       expr       min        lq      mean   median        uq       max neval cld
#  HUGH(dt1) 112.71936 117.03380 124.05758 121.6537 128.09904 155.68470   100   b
# Sotos(dt1)  23.66799  31.11618  33.84753  32.8598  34.02818  68.75044   100  a 
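A timing comparison only means something if both approaches compute the same distances. The following is a rough sketch of such a check on a smaller random table. The names `dist_melt` and `dist_sweep` are hypothetical stand-ins that mirror the two benchmarked functions, and the seed and row count are arbitrary choices.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the check is reproducible
reference <- c(1, 1, 0, 1, 0)
dt1 <- as.data.table(replicate(5, sample(c(0, 1), 1000, replace = TRUE)))

# Hypothetical helper mirroring the melt-based approach. It works on a
# copy so that the row-index column I is not added to dt1 by reference.
dist_melt <- function(dt) {
  dt <- copy(dt)
  dt[, I := .I]
  long <- melt(dt, id.vars = "I")
  long[, .(distance = sum(reference != value)), keyby = "I"]$distance
}

# Hypothetical helper mirroring the sweep()/rowSums() approach
dist_sweep <- function(dt) {
  rowSums(sweep(as.matrix(dt), 2, reference) != 0)
}

# TRUE when both approaches give the same distance for every row
agree <- all(dist_melt(dt1) == dist_sweep(dt1))
print(agree)
```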
