r - How to optimize memory usage with dplyr + purrr
Problem description
I have a problem: after replicating data for my training and test sets, RStudio shows a large amount of memory allocated to my user that is not actually in use in my R session. I created a small example to reproduce my situation :)
This code runs a bunch of models based on the different formulas, algorithms, and parameter sets I give it. In reality it is a function, but I wrote it as a simple script for the reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)

# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
              test2 = mpg ~ cyl + wt)
params <- list()
methods <- "lm"
n <- 20      # num of cv splits
mult <- 10   # number of times I want to replicate some of the data
frac <- .25  # how much I want to cut down other data (fractional)

### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
  cross_params <- params %>%
    map(cross) %>%
    map_df(enframe, name = "param_set", .id = "method") %>%
    list
} else cross_params <- NULL

methods_df <- tibble(method = methods) %>%
  list %>%
  append(cross_params) %>%
  reduce(left_join, by = "method") %>%
  split(1:nrow(.))

# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
                      name = names(formulas)) %>%
  split(.$name)

# split out the data into n random train-test combos
cv_data <- data %>%
  crossv_kfold(n) %>% # rsample?
  mutate_at(vars(train:test), ~map(.x, as_tibble))

# sample out if needed
cv_data_samp <- cv_data %>%
  mutate(train = modify(train,
                        ~ .x %>%
                          split(.$gear == 4) %>%
                          # take a sample of the non-vo data
                          modify_at("FALSE", sample_frac, frac) %>%
                          # multiply out the vo-on data
                          modify_at("TRUE", function(.df) {
                            map_df(seq_len(mult), ~ .df)
                          }) %>%
                          bind_rows))

# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
                     form = formulas_df,
                     meth = methods_df) %>%
  cross %>%
  map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
  unnest(data, .preserve = matches("formula|param|value")) %>%
  {if ("value" %in% names(.)) . else mutate(., value = list(NULL))}

# run the models
model_combos %>%
  # put all arguments into a single params column
  mutate(params = pmap(list(formula = formula, data = train), list)) %>%
  mutate(params = map2(params, value, ~ append(.x, .y))) %>%
  mutate(params = modify(params, discard, is.null)) %>%
  # run the models
  mutate(model = invoke_map(method, params))

mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now, after I run this, mem_used() comes in at 77.3 MB, but I see roughly double that (160 MB) allocated to my R user. This really explodes when my data is 3 GB, which is my real case: I end up using 100 GB and taking down a whole server :(.
What is going on, and how can I optimize?
Any help is appreciated!
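(As background for the question: the gap between what pryr reports and what the OS shows can be inspected with base R's gc(). R's garbage collector keeps freed pages in its own heap and does not necessarily hand them back to the operating system, so RStudio can report more memory than mem_used(). A minimal check, separate from the reprex above:)

```r
# allocate roughly 80 MB of doubles, then free them
big <- replicate(10, rnorm(1e6), simplify = FALSE)
rm(big)

# gc() returns a 2-row matrix (Ncells/Vcells) with "used" and
# "max used" columns; "max used" records the high-water mark even
# after the objects themselves are gone
g <- gc()
g
```

The "used" column drops after rm(), but the OS-level allocation may stay near the "max used" high-water mark.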
Solution
I figured it out! The problem was that I was converting my list of modelr resample objects into tibbles, and even though I sampled them down afterwards, that conversion caused the memory explosion. The solution? Write methods that handle resample objects directly, so I never have to convert a resample object into a tibble. They look like this:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
  data$idx <- sample(data$idx, frac * length(data$idx))
  data
}

# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
  data$idx <- unlist(map(seq_len(n), ~ data$idx))
  data
}

# This function does simple splitting (logical only) of resample objects
split.resample <- function(data, .p) {
  pos <- list(data = data$data, idx = which(.p))
  neg <- list(data = data$data, idx = which(!.p))
  class(pos) <- "resample"
  class(neg) <- "resample"
  list("TRUE" = pos,
       "FALSE" = neg)
}

# This function does the equivalent of `bind_rows` for resample objects.
# Since bind_rows does not call `UseMethod` I had to call it something else
bind <- function(data) {
  out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
  class(out) <- "resample"
  out
}
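(For context, not part of the original answer: a modelr resample object stores a reference to the full data plus a vector of row indices, which is why manipulating idx is nearly free compared with materializing a tibble per fold. A quick way to see this:)

```r
library(modelr)

# a resample: a pointer to mtcars plus 10 row indices
rs <- resample(mtcars, 1:10)

# the underlying data is not copied, only referenced
identical(rs$data, mtcars)  # TRUE
length(rs$idx)              # 10

# materializing makes an actual copy of those rows
df <- as.data.frame(rs)
dim(df)                     # 10 rows, 11 columns
```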
Then I only convert to a tibble inside the same purrr closure that runs the model for that CV fold. Problem solved! My memory usage is now very low.
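(A minimal sketch of that last step, with an illustrative formula rather than the production pipeline: keep the resample objects intact and only call as.data.frame() inside the closure that fits one model, so the materialized copy becomes collectable as soon as the closure returns.)

```r
library(modelr)
library(purrr)

cv <- crossv_kfold(mtcars, 5)

models <- map(cv$train, function(rs) {
  df <- as.data.frame(rs)        # materialize this fold only, inside the closure
  lm(mpg ~ cyl + wt, data = df)  # df is freed once the closure returns
})

length(models)  # one lm fit per fold
```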