Combining rpart hyperparameter tuning with downsampling in mlr3

Problem description

I was working through the excellent examples for the mlr3 package (mlr3gallery: imbalanced data examples), and I would like to see an example that combines hyperparameter tuning with imbalance correction.

From the link above, as a description of what I want to achieve:

To keep the runtime low, we define the search space only for the imbalance correction method. However, one could also jointly tune the hyperparameters of the learner and the imbalance correction method by extending the search space with the learner's hyperparameters.

Here is an example that comes close - mlr3 PipeOps: create branches with different data transformations and benchmark different learners within and between branches

So we can (mis)use this nice example by misuse as a walkthrough:

#packages
library(paradox)
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)

#set up an rpart learner
learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
  cp = 0,
  maxdepth = 21,
  minbucket = 12,
  minsplit = 24
)
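
As a quick check (a sketch, not part of the original example), the parameter IDs of the plain learner can be listed; this becomes relevant later once the learner is wrapped in a pipeline:

#optional sketch: the un-wrapped learner exposes un-prefixed parameter ids
learner$param_set$ids()
#includes "cp", "maxdepth", "minbucket", "minsplit", among others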

#Create the tree graphs:

# graph 1, just imputehist
graph_nop <- po("imputehist") %>>%
  learner

# graph 2 : imputehist and undersample majority class (ratio relative to majority class)

graph_down <- po("imputehist") %>>%
  po("classbalancing", id = "undersample", adjust = "major", 
     reference = "major", shuffle = FALSE, ratio = 1/2) %>>%
  learner

# graph 3: impute hist and oversample minority class (ratio relative to minority class)

graph_up <- po("imputehist") %>>%
  po("classbalancing", id = "oversample", adjust = "minor", 
     reference = "minor", shuffle = FALSE, ratio = 2) %>>%
  learner
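
To see what the class balancing actually does, one can also train a balancing PipeOp on the task in isolation and compare the class counts (a small sketch, not part of the original example; PipeOp$train() takes and returns a list of tasks):

#optional sketch: apply the undersampling PipeOp on its own and inspect the class counts
po_down <- po("classbalancing", id = "undersample", adjust = "major",
              reference = "major", shuffle = FALSE, ratio = 1/2)
table(tsk("sonar")$truth())                             #class counts before balancing
table(po_down$train(list(tsk("sonar")))[[1]]$truth())   #class counts after undersampling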

#Convert graphs to learners and set predict_type

graph_nop <-  GraphLearner$new(graph_nop)
graph_nop$predict_type <- "prob"

graph_down <- GraphLearner$new(graph_down)
graph_down$predict_type <- "prob"

graph_up <- GraphLearner$new(graph_up)
graph_up$predict_type <- "prob"

#define re-sampling and instantiate it so always the same split will be used:

hld <- rsmp("holdout")

set.seed(123)
hld$instantiate(tsk("sonar"))

#Benchmark

bmr <- benchmark(design = benchmark_grid(task = tsk("sonar"),
                                         learner = list(graph_nop,
                                                        graph_up,
                                                        graph_down),
                                         hld),
                 store_models = TRUE) #only needed if you want to inspect the models

#check result using different measures:

bmr$aggregate(msr("classif.auc"))
bmr$aggregate(msr("classif.ce"))
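
Optionally, if the mlr3viz package is available, the three graph learners can also be compared visually (a sketch, not part of the original example):

#optional: boxplot comparison of the benchmark result (requires mlr3viz/ggplot2)
library(mlr3viz)
autoplot(bmr, measure = msr("classif.auc"))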

#This can also be performed within one pipeline with branching, but one would need to define the paramset and use a tuner:

  graph2 <- 
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "nop"),
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major'),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor') 
  )) %>>%
  po("unbranch") %>>%
  learner

graph2$plot()

#Note that the unbranch happens before the learner since one (always the same) learner is being used. Convert graph to learner and set predict_type

graph2 <- GraphLearner$new(graph2)
graph2$predict_type <- "prob"

#Define the param set. In this case just the different branch options.

ps <- ParamSet$new(
  list(
    ParamFct$new("branch.selection", levels = c("nop", "classbalancing_up", "classbalancing_down"))
  ))


#In general you would also want to add learner hyperparameters like cp and minsplit for rpart, as well as the ratio of over-/undersampling.

So how do we add learner hyperparameters like cp and minsplit at this point?

#perhaps by adding them to the param list?
ps = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("nop", "classbalancing_up", "classbalancing_down")),
  ParamDbl$new("cp", lower = 0.001, upper = 0.1),
  ParamInt$new("minsplit", lower = 1, upper = 10)
))

#Create a tuning instance and grid search with resolution 1 since no other parameters are tuned. The tuner will iterate through different pipeline branches as defined in the paramset.

instance <- TuningInstance$new(
  task = tsk("sonar"),
  learner = graph2,
  resampling = hld,
  measures = msr("classif.auc"),
  param_set = ps,
  terminator = term("none")
)


tuner <- tnr("grid_search", resolution = 1)
set.seed(321)
tuner$tune(instance)

But this results in:

Error in (function (xs)  : 
  Assertion on 'xs' failed: Parameter 'cp' not available..

I feel like I may be missing a branching layer for how to combine these two things (the rpart hyperparameters minsplit and cp, and the down-/up-sampling)? Thanks for any help.

Tags: r, mlr3

Solution


Once you build a pipeline learner, the IDs of the underlying parameters change because a prefix is added to them. You can always check the param_set of the learner, in your example graph2$param_set, and there you will find the parameters you are looking for.
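
For example, the following small sketch (not part of the original answer) lists the available parameter IDs so the prefixes become visible:

#inspect the parameter ids of the wrapped GraphLearner
graph2$param_set$ids()

#or filter for the rpart parameters specifically
grep("rpart", graph2$param_set$ids(), value = TRUE)

With those prefixed IDs, the param set from the question becomes: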

ps = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("nop", "classbalancing_up", "classbalancing_down")),
  ParamDbl$new("classif.rpart.cp", lower = 0.001, upper = 0.1),
  ParamInt$new("classif.rpart.minsplit", lower = 1, upper = 10)
))
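
The same prefixing applies to the PipeOp hyperparameters, so the over-/undersampling ratios mentioned in the question can be tuned jointly as well. Below is a sketch, assuming the PipeOp ids from the question's graph ("classbalancing_up", "classbalancing_down") and the same paradox/mlr3tuning API used above; the chosen ranges are only placeholders:

#extended search space: branch choice, rpart parameters and balancing ratios
ps_joint = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("nop", "classbalancing_up", "classbalancing_down")),
  ParamDbl$new("classif.rpart.cp", lower = 0.001, upper = 0.1),
  ParamInt$new("classif.rpart.minsplit", lower = 1, upper = 10),
  ParamDbl$new("classbalancing_up.ratio", lower = 1, upper = 4),
  ParamDbl$new("classbalancing_down.ratio", lower = 1, upper = 2)
))

#the ratio parameters are only meaningful on their own branch, so add dependencies
ps_joint$add_dep("classbalancing_up.ratio", "branch.selection", CondEqual$new("classbalancing_up"))
ps_joint$add_dep("classbalancing_down.ratio", "branch.selection", CondEqual$new("classbalancing_down"))

Re-create the TuningInstance with this param set as in the question; with more than one tunable parameter, a grid resolution of 1 is no longer sufficient, so a higher resolution (e.g. tnr("grid_search", resolution = 3)) would be needed.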
