glmnet / glmnetUtils: Repeated cross-validation

Problem description

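I am trying to use glmnet / glmnetUtils to run repeated 10-fold CV (over alpha and lambda). My proposed workflow is:

a) Fit the proposed model at 11 values of alpha,

b) Run the process X (in this case 10) times,

c) Average the results, and

d) Fit the final model with the optimal combination of alpha and lambda (s = "lambda.1se").

To address a-c, I used the code below; however, the results are exactly the same for all 10 iterations.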

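library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # ldply()
library(dplyr)  # bind_rows()

data(BinomialExample)
# Newer glmnet releases ship BinomialExample as a list; older ones load x and y directly
if (exists("BinomialExample")) { x <- BinomialExample$x; y <- BinomialExample$y }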


# Create alpha sequence; fix folds

alpha <- seq(.5, 1, .05)

set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)


# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha

extractGlmnetInfo <- function(object)
{
  # Find lambdas
  lambda1se <- object$lambda.1se

  # Determine where lambdas fall in path
  which1se <- which(object$lambda == lambda1se)

  # Create data frame with selected lambdas and corresponding error
  data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}


#Run glmnet

cl <- makeCluster(detectCores())
registerDoParallel(cl)

enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar%
  {
    cv <- cva.glmnet(x, y,
                     foldid = folds,
                     alpha = alpha,
                     family = "binomial",
                     parallel = TRUE)
    }

stopCluster(cl)


# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha

cv.rep <- bind_rows(lapply(enet, function(fit) ldply(fit$modlist, extractGlmnetInfo)))

cv.rep <- data.frame(cbind(alpha, cv.rep))

Questions

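  1. My understanding is that the folds should be held fixed when cross-validating over alpha. Should I therefore call set.seed() multiple times to generate different folds for each iteration and run each iteration separately, rather than looping through them? For example: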

    # Set folds for first iteration
    
    set.seed(1)
    folds1 <- sample(1:10, size = length(y), replace = TRUE)
    
    
    # Run first iteration
    
    enet1 <- cva.glmnet(x, y,
                    foldid = folds1,
                    alpha = alpha,
                    family = "binomial")
    
    
    # Set folds for second iteration
    
    set.seed(2)
    folds2 <- sample(1:10, size = length(y), replace = TRUE)
    
    
    # Run second iteration
    
    enet2 <- cva.glmnet(x, y,
                    foldid = folds2,
                    alpha = alpha,
                    family = "binomial")
    
  2. Or is there a way to fix folds and still loop through the iterations, thereby taking advantage of parallel processing?

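  3. Re: the option in 1., how do I determine which configuration of folds to use when fitting the final model with the optimal combination of alpha and lambda? Is the decision arbitrary?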

N.B. I am not using caret for this particular task.

Tags: r, glmnet

Solution
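A likely reason the 10 iterations come out identical is that every iteration receives the same foldid vector, so each one partitions the data in exactly the same way. One possible approach, sketched below, is to draw all fold assignments up front (one column per repetition) and hand a different column to each repetition, so that the folds stay fixed across alpha within a repetition but differ between repetitions. The helper names make_repeated_folds and run_repeated_cva are hypothetical, and n_obs = 100 is only a stand-in for length(y). This is a rough sketch under those assumptions, not a definitive implementation.

```r
# Hypothetical helper (not part of glmnet or glmnetUtils): build one balanced
# fold assignment per repetition. Column r holds the fold ids for repetition r.
make_repeated_folds <- function(n, n_folds = 10, n_reps = 10, seed = 1) {
  set.seed(seed)  # a single seed makes the whole set of repetitions reproducible
  replicate(n_reps, sample(rep(seq_len(n_folds), length.out = n)))
}

n_obs <- 100  # stand-in for length(y)
fold_mat <- make_repeated_folds(n = n_obs, n_folds = 10, n_reps = 10)

# Hypothetical wrapper: one cva.glmnet() fit per repetition, each with its own
# fold column. Assumes glmnetUtils is installed and that x, y and alpha are the
# objects from the question; nothing is fitted until the function is called.
run_repeated_cva <- function(x, y, alpha, fold_mat) {
  lapply(seq_len(ncol(fold_mat)), function(r) {
    glmnetUtils::cva.glmnet(x, y,
                            foldid = fold_mat[, r],
                            alpha = alpha,
                            family = "binomial")
  })
}

# Usage with the question's data (left commented out so the sketch runs anywhere):
# enet <- run_repeated_cva(x, y, alpha, fold_mat)
```

Because the fold vectors are generated before any fitting starts, the lapply() call could presumably be swapped for the foreach / %dopar% loop from the question (indexing fold_mat[, i] inside the loop body) without the workers' random number state affecting the folds.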


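On the third question, one reading is that no fold configuration has to be chosen for the final model at all: the folds are only used to estimate CV error, while the final glmnet() fit uses the full data set. A minimal sketch of steps c) and d), assuming a data frame shaped like the question's cv.rep (columns alpha, lambda.1se, cv.1se), could look like the following. The numbers are made-up toy values that only illustrate the mechanics, and fit_final is a hypothetical helper.

```r
# Toy stand-in for the cv.rep data frame built in the question: two
# repetitions at three alpha values. Values are invented for illustration.
cv.rep <- data.frame(
  alpha      = rep(c(0.5, 0.75, 1), times = 2),
  lambda.1se = c(0.10, 0.08, 0.06, 0.12, 0.06, 0.04),
  cv.1se     = c(0.90, 0.85, 0.88, 0.92, 0.83, 0.90)
)

# Step c): average lambda.1se and its CV error across repetitions, per alpha
cv.mean <- aggregate(cbind(lambda.1se, cv.1se) ~ alpha, data = cv.rep, FUN = mean)

# Pick the alpha with the lowest mean CV error, with its averaged lambda
best <- cv.mean[which.min(cv.mean$cv.1se), ]

# Step d): hypothetical helper that refits on the full data at the chosen alpha.
# Assumes glmnet is installed; x and y are the objects from the question.
fit_final <- function(x, y, best) {
  glmnet::glmnet(x, y, family = "binomial", alpha = best$alpha)
}

# Usage with the question's data (left commented out so the sketch runs anywhere):
# final <- fit_final(x, y, best)
# coef(final, s = best$lambda.1se)
```

The averaged lambda will generally not coincide with a value on the fitted path; with the default settings, coef() and predict() interpolate between neighbouring path values when given such an s. Averaging lambda.1se directly is only one option, and averaging the full CV error curves over a shared lambda sequence would be another.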