Avoiding larger (bloated) sizes when saving (regression) models in R (environments)

Problem description

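I want to create a regression model inside another function. My problem is that the saved model becomes very large, because other data from the environment is saved along with it. I think the solution is to work with a different environment, and I would appreciate help understanding this better. The steps below illustrate the problem.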

# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function (object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

# Subset of rows to be used (lm's subset argument selects observations)
subset <- 1:4

# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965

# Function where there are other data that should NOT be saved. 
Function2 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2)
# Size = 1148 ; Problematic that size is larger than model 1.

# Solution to above is to create a new environment
Function3 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  # New environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3) 
# 1002 # Success: considerably smaller than in Function 2. 



# PROBLEM: Getting solution in Function 3 to work within another function. 

# This function runs but results in a large object again.
# Also note that I do not want to refer to the iris dataset directly within the lm call.
Function5 <- function (subset){
  
  data_not_to_be_saved <- 1:1e+15
  
  Function5 <- function (subset) {
    
    env <- new.env(parent = globalenv())
    env$subset <- subset
    env$datainenvorment <- iris
    
    with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
  }
  model5 <- Function5(subset)
}

model5 <- Function5(subset)
saveSize(model5) 

Thanks in advance.

Tags: r

Solution


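The solution you are using works. You do not see the effect because, in recent versions of R (3.5.0 and later), a sequence such as 1:1e+15 is stored in a compact form and takes almost no space. The small remaining difference in size is the overhead of the extra variables, such as the ones stored in env. What matters is that the data_not_to_be_saved variable is left out.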

The effect is much easier to see with data that cannot be stored compactly.

data_not_to_be_saved <- rnorm(10**5)
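As a minimal sketch of that comparison, the code below fits the same model twice with a random vector in the enclosing function. The names fit_in_function, fit_in_new_env, and rows are illustrative and do not come from the question; saveSize is the helper from the question, repeated so the block runs on its own.

```r
# Helper from the question: size in bytes of the object once saved.
saveSize <- function(object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

rows <- 1:4

# The formula is created in the function's own environment, so that
# environment (including the large random vector) is saved with the model.
fit_in_function <- function(rows) {
  data_not_to_be_saved <- rnorm(10^5)
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)
}

# The formula is created in a small environment whose parent is the
# global environment, so only the contents of env are saved with the model.
fit_in_new_env <- function(rows) {
  data_not_to_be_saved <- rnorm(10^5)
  env <- new.env(parent = globalenv())
  env$rows <- rows
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows))
}

size_big   <- saveSize(fit_in_function(rows))
size_small <- saveSize(fit_in_new_env(rows))
c(in_function = size_big, in_new_env = size_small)
```

With random data the first size should be far larger than the second, because random doubles neither compress well nor have a compact representation.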

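Where does the problem come from? lm returns an object that holds references to other environments (for example, a function's environment gives access to every variable visible from where the function was defined). In addition, save with its default arguments writes out those referenced environments together with the variables they contain.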

str(model5)
# like   .. .. ..- attr(*, ".Environment")=<environment: 0x7fdc9e6c2b68> 
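The environment reported by str can also be inspected directly. The sketch below is one way to do so, assuming a small illustrative function f that is not part of the question; it lists what the model's terms object carries along.

```r
# Illustrative function: the model is fitted next to an unrelated variable.
f <- function() {
  data_not_to_be_saved <- rnorm(10)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

model <- f()

# The terms component of an lm object stores the formula's environment
# in its ".Environment" attribute.
model_env <- attr(terms(model), ".Environment")

# Everything listed here is written to disk when the model is saved.
ls(model_env)
```

Any name that appears in this listing travels with the model, which is why the unrelated variable ends up in the saved file.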

Another option is the lm.fit function, which returns only the basic structure of the fit. It is not covered in detail here.

model_fit <- lm.fit(cbind(1,iris$Sepal.Width[subset]), iris$Sepal.Length[subset])
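As a rough check of this alternative, the sketch below compares the lm.fit result with the lm result on the same rows. The names rows, x, y, and model_lm are illustrative, and the column names given to the design matrix are an assumption made only for readability.

```r
rows <- 1:4

# Design matrix with an explicit intercept column, and the response vector.
x <- cbind(Intercept = 1, Sepal.Width = iris$Sepal.Width[rows])
y <- iris$Sepal.Length[rows]

model_fit <- lm.fit(x, y)
model_lm  <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)

# Both approaches estimate the same coefficients.
same_coef <- all.equal(unname(model_fit$coefficients), unname(coef(model_lm)))
same_coef

# The lm.fit result is a plain list: no terms component, hence no
# reference to any environment.
is.null(model_fit$terms)
class(model_fit)
```

The trade-off is that the result does not have class "lm", so methods such as summary and predict for lm objects do not apply to it.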
