r - R data.table - Apply function A to some columns and function B to other columns
Problem description
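I want to aggregate the rows of a data.table, but the aggregation function depends on the column name.
For example, if the column name is:
- variable1 or variable2, apply mean().
- variable3, apply max().
- variable4, apply sd().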
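My data.table always has a datetime column: I want to aggregate rows by time. However, the number of "data" columns can vary.
I know how to apply the same aggregation function (e.g. mean()) to all columns: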
dt <- dt[, lapply(.SD, mean),
by = .(datetime = floor_date(datetime, timeStep))]
Or only to a subset of columns:
cols <- c("variable1", "variable2")
dt <- dt[ ,(cols) := lapply(.SD, mean),
by = .(datetime = floor_date(datetime, timeStep)),
.SDcols = cols]
What I would like to do is:
colsToMean <- c("variable1", "variable2")
colsToMax <- c("variable3")
colsToSd <- c("variable4")
dt <- dt[ ,{(colsToMean) := lapply(.SD???, mean),
(colsToMax) := lapply(.SD???, max),
(colsToSd) := lapply(.SD???, sd)},
by = .(datetime = floor_date(datetime, timeStep)),
.SDcols = (colsToMean, colsToMax, colsToSd)]
I looked at "data.table in R - apply multiple functions to multiple columns", which gave me the idea of using a custom function:
myAggregate <- function(x, columnName) {
FUN = getAggregateFunction(columnName) # Return mean() or max() or sd()
return(FUN(x))
}
dt <- dt[, lapply(.SD, myAggregate, ???columName???),
by = .(datetime = floor_date(datetime, timeStep))]
But I don't know how to pass the current column name to myAggregate()...
Solution
Here is one way to do it with Map or mapply:
Let's make some toy data first:
library(data.table)

dt <- data.table(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  variable3 = rnorm(100),
  variable4 = rnorm(100),
  grp = sample(letters[1:5], 100, replace = TRUE)
)
colsToMean <- c("variable1", "variable2")
colsToMax <- c("variable3")
colsToSd <- c("variable4")
Then,
scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c(mean, max, sd), lengths(scols))
# summary
dt[, Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]
# or replace the original values with summary statistics as in OP
dt[, unlist(scols) := Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]
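One detail about the summary form: Map() takes its result names from its first vectorised argument, and an unnamed list of functions has none, so the summary columns most likely come back as V1, V2, and so on. A minimal sketch of a variant that keeps the column names is below. It passes .SD first and looks each function up by column name, which is also one way to get the "function depends on the column name" behaviour asked for in the question. The named list `funs` and the small toy table are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the toy data is reproducible
dt <- data.table(
  variable1 = rnorm(20),
  variable3 = rnorm(20),
  grp = rep(c("a", "b"), each = 10)
)

# Hypothetical lookup table: column name -> aggregation function
funs <- list(variable1 = mean, variable3 = max)

# .SD is the first vectorised argument, so Map() copies its column names
# onto the result; funs[names(.SD)] aligns each function with its column.
res <- dt[, Map(function(x, f) f(x), .SD, funs[names(.SD)]),
          by = grp, .SDcols = names(funs)]
```

Because the lookup is keyed by name, adding another data column only requires adding one entry to `funs`.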
Another option with GForce on:
scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c('mean', 'max', 'sd'), lengths(scols))
jexp <- paste0('list(', paste0(funs, '(', unlist(scols), ')', collapse = ', '), ')')
dt[, eval(parse(text = jexp)), by = grp, verbose = TRUE]
# Detected that j uses these columns: variable1,variable2,variable3,variable4
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# Getting back original order ... 0.000sec
# lapply optimization is on, j unchanged as 'list(mean(variable1), mean(variable2), max(variable3), sd(variable4))'
# GForce optimized j to 'list(gmean(variable1), gmean(variable2), gmax(variable3), gsd(variable4))'
# Making each group and running j (GForce TRUE) ... 0.000sec
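The same j expression can also be assembled from language objects rather than by pasting strings and calling parse(text = ). The sketch below is one possible way to do that, under the assumption that the column names and function names are held in plain character vectors (`cols` and `fnames` are illustrative names). Whether GForce is applied to the resulting call is not verified here; running it with verbose = TRUE would show.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the toy data is reproducible
dt <- data.table(
  variable1 = rnorm(100),
  variable2 = rnorm(100),
  variable3 = rnorm(100),
  variable4 = rnorm(100),
  grp = rep(letters[1:5], each = 20)
)

cols   <- c("variable1", "variable2", "variable3", "variable4")
fnames <- c("mean", "mean", "max", "sd")

# One unevaluated call per column, e.g. mean(variable1).
# cols goes first so Map() names each call after its column.
calls <- Map(function(col, f) call(f, as.name(col)), cols, fnames)

# Wrap the named calls in a single list(...) call
jexp <- as.call(c(as.name("list"), calls))

res <- dt[, eval(jexp), by = grp]
```

Naming the calls after their columns means the result keeps variable1 through variable4 as column names instead of default ones.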