r - 如何按 R data.table 中的不同列分组?
问题描述
我正在编写一个函数,该函数将按不同的分组列按 data.table 分组,并在同一 data.table 的新列中返回指定的汇总统计信息(对于每个分组)
我尝试了两种使用 for 循环的方法,两者都给了我预期的结果,但我想知道是否有一种有效的方法来做到这一点
这可能是另一种方法,但我还没有设法让它工作
a = sample(c("a1","a2","a3","a4","a5"), 5000000, replace = TRUE)
b = sample(c("b1","b2","b3","b4","b5","b6","b7"), 5000000, replace = TRUE)
c = sample(c("c1","c2","c3","c4","c5","c6"), 5000000, replace = TRUE)
d = sample(1:100000, 5000000, replace = TRUE)
DT = data.table(a = a, b = b,c= c, d = d)
#columns to group by
grp_by <- list(c("a","b"),c("a","c"),c("b","c"))
# required summary stats
FUNs <- c("mean")
#stats required on this column
measure_col <- "d"
#Method 1 - Using .SD
fn_agg_1 <- function(DT,grp_by,FUNs,measure_col) {
for(i in grp_by) {
for(j in FUNs) {
new_col_name <- paste0(paste0(unlist(i),collapse ="_"),"_",eval(j))
DT[,(new_col_name) := lapply(.SD,get(j)), by = i, .SDcols = measure_col]
}
}
}
#Method 2 - Using eval-parse
fn_agg_2 <- function(DT,grp_by,FUNs,measure_col) {
for(i in grp_by) {
for(j in FUNs) {
new_col_name <- paste0(paste0(unlist(i),collapse ="_"),"_",eval(j))
measure <- paste0(eval(j),"(",eval(measure_col),")")
DT[,(new_col_name) := eval(parse(text=measure)), by = i]
}
}
}
system.time(fn_agg_1(DT,grp_by,FUNs,measure_col))
DT = data.table(a = a, b = b,c= c, d = d)
system.time(fn_agg_2(DT,grp_by,FUNs,measure_col))
#Result as expected
DT[,.N, by = .(a,b,a_b_mean)][,-"N",with=FALSE]
a b a_b_mean
1: a5 b3 49880.27
2: a2 b7 49993.50
3: a3 b3 49823.01
4: a5 b7 50023.66
5: a3 b6 50017.05
解决方案
另一个选项使用base::Map
:
#columns to group by
grp_by <- list(c("a","b"),c("a","c"),c("b","c"))
# required summary stats
funcs <- list(mean)
#stats required on this column
measure_col <- "d"
DT[, .(Map(function(f, g) .SD[, lapply(.SD, f), by=g, .SDcols=measure_col],
rep(funcs, each=length(grp_by)),
rep(grp_by, times=length(funcs))
)),
]$V1
输出:
[[1]]
a b d
1: a5 b6 2.000000
2: a2 b7 4.000000
3: a2 b2 2.333333
4: a3 b5 3.000000
5: a5 b1 2.000000
6: a5 b3 2.000000
7: a4 b3 4.000000
8: a4 b7 1.000000
9: a1 b3 4.000000
10: a2 b4 2.000000
11: a1 b5 4.000000
12: a4 b4 2.000000
13: a4 b6 3.000000
14: a2 b6 4.000000
[[2]]
a c d
1: a5 c3 2
2: a2 c5 4
3: a2 c4 3
4: a3 c5 2
5: a5 c4 2
6: a5 c5 2
7: a5 c1 2
8: a4 c3 3
9: a4 c5 1
10: a1 c5 4
11: a2 c3 2
12: a1 c6 4
13: a2 c2 2
14: a4 c1 2
15: a3 c1 4
16: a4 c2 4
[[3]]
b c d
1: b6 c3 2.0
2: b7 c5 2.5
3: b2 c4 2.5
4: b5 c5 2.0
5: b1 c4 2.0
6: b3 c5 3.0
7: b1 c1 2.0
8: b3 c3 4.0
9: b4 c3 2.0
10: b5 c6 4.0
11: b2 c2 2.0
12: b6 c1 2.0
13: b5 c1 4.0
14: b6 c2 4.0
15: b6 c4 4.0
编辑:在评论中添加 jangorecki 的替代方法
for (f in funcs)
groupingsets(DT, j=lapply(.SD, f), by=Reduce(union, grp_by),
sets=grp_by,
.SDcols=measure_col)
数据:
library(data.table)
set.seed(0L)
nr <- 20
a = sample(c("a1","a2","a3","a4","a5"), nr, replace = TRUE)
b = sample(c("b1","b2","b3","b4","b5","b6","b7"), nr, replace = TRUE)
c = sample(c("c1","c2","c3","c4","c5","c6"), nr, replace = TRUE)
d = sample(nr/5, nr, replace = TRUE)
DT = data.table(a = a, b = b,c= c, d = d)
推荐阅读
- php - 无效的语言:“hreflang”上的“fil”
- android - Jetpack ComposeleadingIcon 不可见
- python - 获取文件最后修改日期和文件名 pyspark 的脚本
- autodesk-forge - 使用 Visual Code Forge 扩展 Forge SVF2 翻译问题
- windows - 如何让“使用 pyinstaller 构建 python 应用程序”工作?
- java - SQL 错误:0,SQLState:空 HikariPool-1 - 连接不可用,请求在 30000 毫秒后超时
- kubernetes - 如何使用默认 Ingress 使用 Kubernetes microk8s 设置 Letsencrypt?
- python - 如何更改代码以便在重新启动后它可以工作并且在终端中可见?
- mysql - docker mysql local-inline db设置问题
- fiddler - 如何在 Android TV 上安装证书?