r - data.table: grouping column has length 1 inside "j"
Problem description
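While learning data.table, I ran into a situation that I cannot solve elegantly.
Up front: the absurdity of the formula is obvious. I am trying to determine whether this nuance of lm can be worked around easily using keywords or special operators from within the data.table ecosystem.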
library(data.table)
mt <- as.data.table(mtcars)
mt[, list(model = list(lm(mpg ~ disp))), by = "cyl"]
# cyl model
# 1: 6 <lm>
# 2: 4 <lm>
# 3: 8 <lm>
mt[, list(model = list(lm(mpg ~ disp + cyl))), by = "cyl"]
# Error in model.frame.default(formula = mpg ~ disp + cyl, drop.unused.levels = TRUE) :
# variable lengths differ (found for 'cyl')
This happens because, inside the block, cyl is a vector of length 1 rather than a full column like the other variables:
mt[, list(model = { browser(); list(lm(mpg ~ cyl+disp)); }), by = "cyl"]
# Called from: `[.data.table`(mt, , list(model = {
# browser()
# list(lm(mpg ~ cyl + disp))
# ...
# Browse[1]>
# debug at #1: list(lm(mpg ~ cyl + disp))
# Browse[2]>
disp
# [1] 160.0 160.0 258.0 225.0 167.6 167.6 145.0
# Browse[2]>
cyl
# [1] 6
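The same observation can be checked without entering the browser. The following is a minimal sketch that compares vector lengths per group; the names lens, n_rows, len_cyl and len_disp are made up for illustration.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Inside j, the grouping column is a single value per group,
# while ordinary columns hold one value per row of the group.
lens <- mt[, .(n_rows   = .N,
               len_cyl  = length(cyl),
               len_disp = length(disp)),
           by = "cyl"]
print(lens)
```

In every group len_cyl should be 1 while len_disp matches n_rows, which is the mismatch that model.frame complains about.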
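The most direct approach seems to be to lengthen it manually, either as a temporary variable inside the block or inline where it is needed: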
mt[, list(model = { cyl2 <- rep(cyl, nrow(.SD)); list(lm(mpg ~ cyl2+disp)); }), by = "cyl"]
mt[, list(model = list(lm(mpg ~ rep(cyl, nrow(.SD))+disp))), by = "cyl"]
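A small variation on the manual approach, sketched under the assumption that the special symbol .N (the number of rows in the current group) is an acceptable stand-in for nrow(.SD). The names cyl_full, fits and cyl_coefs are illustrative only.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Expand the length-1 grouping value to one value per row of the group,
# using .N instead of nrow(.SD).
fits <- mt[, .(model = {
  cyl_full <- rep(cyl, .N)
  list(lm(mpg ~ cyl_full + disp))
}), by = "cyl"]

# The expanded regressor is constant within each group, so lm() cannot
# estimate it separately from the intercept and reports NA for it.
cyl_coefs <- sapply(fits$model, function(m) coef(m)[["cyl_full"]])
print(cyl_coefs)
```

The NA coefficients are the "absurdity" mentioned at the top: the fit runs, but the grouping variable carries no information within its own group.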
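Q: Is there a more elegant way to handle this?
My curiosity was sparked by various loosely related questions about embedding arbitrary objects in a data.table.
So far there are quite a few candidates: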
mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
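One plausible reading of why the second candidate works: when the grouping expression is given a different name, cyl is no longer treated as a grouping column and stays available at full length inside j. The sketch below checks this by comparing lengths. The names renamed, by_len, len_cyl and n_rows are illustrative, and cylgroup is taken from the candidate above.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Group by the values of cyl, but call the grouping column cylgroup.
# cyl itself then behaves like any other column inside j.
renamed <- mt[, .(n_rows = .N, len_cyl = length(cyl)),
              by = .(cylgroup = cyl)]
print(renamed)

# For contrast: with by = "cyl", the special symbol .BY holds the
# grouping value as a list of length-1 vectors.
by_len <- mt[, .(len_by = length(.BY$cyl)), by = "cyl"]
print(by_len)
```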
Solution
Thanks for all the candidates.
Performance (with this small model) seems to show some small differences:
library(microbenchmark)
microbenchmark(
c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# c1 3.7328 4.21745 4.584591 4.43485 4.57465 9.8924 100
# c2 2.6740 3.11295 3.244856 3.21655 3.28975 5.6725 100
# c3 2.8219 3.30150 3.618646 3.46560 3.81250 6.8010 100
# c4 2.9084 3.27070 3.620761 3.44120 3.86935 6.3447 100
# c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931 100
With larger data:
mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
microbenchmark(
c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# c1 27.1635 30.54040 33.98210 32.2859 34.71505 76.5064 100
# c2 23.9612 25.83105 28.97927 27.5059 30.02720 67.9793 100
# c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742 100
# c4 25.6469 27.84185 30.52403 29.8286 32.60805 37.8675 100
# c5 29.2477 32.32465 35.67090 35.0291 37.90410 68.5017 100
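(I would guess the relative performance scales similarly. A better verdict would probably also include wider data.)
Going by median run time alone, the fastest (by a very small margin) appears to be: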
mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
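As a follow-up, here is a hedged sketch of how the resulting list column might be used: it pulls the disp slope out of each stored model. It runs on the small mt table rather than mtbigger, and the names models, slopes and disp_slope are illustrative.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Fit one model per group and keep each lm object in a list column.
models <- mt[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup = cyl)]

# Within each group, model is a list of length 1, so model[[1]] is the lm.
slopes <- models[, .(disp_slope = coef(model[[1]])[["disp"]]), by = cylgroup]
print(slopes)
```

Because cyl is constant within each group, the disp slope should match what a plain mpg ~ disp fit on the same subset would give.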