Benchmarking stacked and tuned models (AutoTuner) in parallel

Problem description

I want to compare the performance of tuned models (via AutoTuner objects) and a stacked learner (which also contains AutoTuner objects) using benchmark() from the mlr3 framework. Since the task at hand and the search spaces are computationally intensive, I would like to run whatever I can in parallel. I have read the documentation, but I am still not sure which parts of the process are fully parallelized (i.e., whether I am under- or over-using my resources).

I know that some of the learners I want to compare can be parallelized themselves (e.g., tree growing for classif.ranger or classif.xgboost takes much less time if you use their implicit parallelization via num.threads), while for others (e.g., classif.glmnet) this matters little (i.e., you can only parallelize the resampling iterations, not make the algorithm itself "faster").
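For context, here is a minimal sketch (my own illustration, not part of the workflow below) of how I understand learner-level threading can be controlled in mlr3: the num.threads / nthread hyperparameters, or the set_threads() helper, cap how many threads each model fit may use, so that it does not collide with resampling-level parallelization.

library(mlr3)
library(mlr3learners)

# ranger and xgboost expose their internal thread pools as hyperparameters
lrn_rf  = lrn("classif.ranger",  num.threads = 4)  # trees grown on 4 threads
lrn_xgb = lrn("classif.xgboost", nthread = 4)      # xgboost's own thread pool
lrn_glm = lrn("classif.glmnet")                    # no useful internal threading

# equivalently, set_threads() sets whichever hyperparameter is tagged "threads"
lrn_rf = set_threads(lrn("classif.ranger"), n = 4)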

Here is my workflow template (I will skip the task creation part):

library(mlr3)
library(mlr3learners)    # classif.ranger, classif.xgboost, classif.glmnet
library(mlr3tuning)      # AutoTuner, trm(), tnr()
library(mlr3pipelines)   # po(), gunion(), %>>%
library(paradox)         # ps(), p_int(), p_dbl()

### Create AutoTuner objects ###

# Tuning
terminator = trm("evals", n_evals = 2)
tuner = tnr("grid_search", resolution = 5)

# Creating learners
lrn.rf = lrn("classif.ranger", predict_type = "prob", num.trees = 85)
lrn.xgb = lrn("classif.xgboost", predict_type = "prob", nrounds = 10)
lrn.rp = lrn("classif.rpart",  predict_type = "prob")

# search space
rf.ps = ps(
  mtry = p_int(lower = 4, upper = 120)
)

at_rng = AutoTuner$new(
  learner = lrn.rf,
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.logloss"),
  terminator = terminator,
  search_space = rf.ps,
  tuner = tuner,
  store_models = TRUE
)

xgb.ps = ps(
  max_depth = p_int(5, 10),
  eta = p_dbl(0.5, 0.8),
  subsample = p_dbl(0.9, 1),
  min_child_weight = p_int(8, 10),
  colsample_bytree = p_dbl(0.5, 1)
)

at_xgb = AutoTuner$new(
  learner = lrn.xgb,
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.logloss"),
  terminator = terminator,
  search_space = xgb.ps,
  tuner = tuner,
  store_models = TRUE
)

rp.ps = ps(
  minsplit = p_int(lower = 20, upper = 25),
  minbucket = p_int(lower = 5, upper = 10),
  cp = p_dbl(lower = 0.001, upper = 0.1),
  maxdepth = p_int(lower = 5, upper = 15)
)

at_rp = AutoTuner$new(
  learner = lrn.rp,
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.logloss"),
  terminator = terminator,
  search_space = rp.ps,
  tuner = tuner,
  store_models = TRUE
)

### Create stacked learner ###
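# learner_cv cross-validates each base learner on the training data and passes
# its out-of-fold predictions forward as features; featureunion binds these
# prediction columns together for the classif.log_reg meta-learner.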
stacked_graph = gunion(list(
  po("learner_cv", at_rng),
  po("learner_cv", at_xgb),
  po("learner_cv", lrn("classif.glmnet", predict_type = "prob"))
)) %>>%
  po("featureunion") %>>% lrn("classif.log_reg", predict_type = "prob")

stacked_graph$keep_results = TRUE
stacked_learner = as_learner(stacked_graph)

### Benchmarking design ###

design = benchmark_grid(
  tasks = list(task),
  learners = list(
    at_rng,
    at_xgb,
    at_rp,
    stacked_learner),   # the graph must be wrapped as a learner
  resamplings = list(rsmp("cv", folds = 5)$instantiate(task))
)

# store_models is an argument of benchmark(), not benchmark_grid()
bmr = benchmark(design, store_models = TRUE)

Let's say I have 80 cores available. I know that, with this configuration, training the stacked model takes 30 minutes if I use all 80 cores, and the other AutoTuner objects take 10, 20 and 30 minutes respectively. If I set things up as in the documentation:

future::plan(list(future::tweak("multisession", workers = 5),
                  future::tweak("multisession", workers = 16)))

I would expect each inner resampling of the AutoTuner objects (plus the ones inside the stacked learner) to have 16 cores available, while the outer resampling (i.e., each benchmark iteration) runs on 5. Now the questions:

  1. Is my assumption correct?
  2. If so, how much of an increase in computation time should I expect? I tried a similar setup in mlr using parallelMap::parallelStartSocket() and it took about 6 hours, whereas when I tried it with mlr3 it ran for 24 hours and I had to kill the process.
  3. Am I using my resources optimally with this setup (i.e., no memory leaks / hanging processes), or is there a better way to set it up? Maybe running parts of the process separately with all cores (see the sketch after this list for roughly what I mean)?
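To illustrate what I mean in question 3 (a sketch of an assumed alternative, not something I have verified): the outer level could be left sequential so that all cores go to the inner tuning level, and each learner could then be benchmarked in its own call.

future::plan(list(
  "sequential",                                  # outer benchmark folds run one at a time
  future::tweak("multisession", workers = 80)    # all 80 cores for the inner tuning
))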

Tags: r, optimization, parallel-processing, benchmarking, mlr3

Solution

