imap and using the name of the input variable in the output

Question

(I used the tidyverse tag because my question is, somewhat broadly, asking for a "tidy" way to solve the problem.)

I am trying to set up a structure for training and evaluating a variety of models.

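In the past I have used the caret package's resamples() function, where you pass a list of models to evaluate and caret::resamples() outputs the name of each model along with its evaluation metrics.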

This time I am using the rsample package instead, iterating over tibbles of k folds.

I would like to create a function similar to resamples() that outputs the evaluation metrics for each model. Here is my code and what I have tried:

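library(tidyverse) # provides diamonds (ggplot2), %>%, mutate() and map()
library(rsample)
library(Metrics)
library(xgboost)
# ranger is called below as ranger::ranger(), so it must be installed as well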

# 5-fold split (no strata argument is passed, so the folds are not stratified)
train_cv <- vfold_cv(diamonds, 5) %>% 

  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))


# ranger random forest across each fold
mod.rf <- train_cv %>% 
  mutate(regression = map(train, ~ranger::ranger(formula = price ~ carat, data = .x))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate, ~predict(.x, .y)$predictions)) %>% # predictions
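  # NOTE: the model predicts price, but carat is extracted here; .x$price is probably what was intended
  mutate(validation_actuals = map(validate, ~.x$carat)) %>% # get the actuals for computing evaluation metrics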
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse


# xgb across each fold
mod.xgb <- train_cv %>%

  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>% 

  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
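  # NOTE: the model predicts price, but carat is extracted here; .x$price is probably what was intended
  mutate(validation_actuals = map(validate, ~.x$carat)) %>% # get the actuals for computing evaluation metrics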
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse

Random forest:

mod.rf

#  5-fold cross-validation 
# A tibble: 5 x 9
  splits                id    train                  validate               regression   predictions    validation_actuals   mae  rmse
* <named list>          <chr> <named list>           <named list>           <named list> <named list>   <named list>       <dbl> <dbl>
1 <split [43.2K/10.8K]> Fold1 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <ranger>     <dbl [10,788]> <dbl [10,788]>     3867. 5318.
2 <split [43.2K/10.8K]> Fold2 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <ranger>     <dbl [10,788]> <dbl [10,788]>     3916. 5414.
3 <split [43.2K/10.8K]> Fold3 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <ranger>     <dbl [10,788]> <dbl [10,788]>     3946. 5448.
4 <split [43.2K/10.8K]> Fold4 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <ranger>     <dbl [10,788]> <dbl [10,788]>     3996. 5514.
5 <split [43.2K/10.8K]> Fold5 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <ranger>     <dbl [10,788]> <dbl [10,788]>     3936. 5414.

XGBoost:

mod.xgb

#  5-fold cross-validation 
# A tibble: 5 x 11
  splits                id    train                  validate               train_dmatrix validate_dmatrix regression   predictions    validation_actuals   mae  rmse
* <named list>          <chr> <named list>           <named list>           <named list>  <named list>     <named list> <named list>   <named list>       <dbl> <dbl>
1 <split [43.2K/10.8K]> Fold1 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <xgb.DMtr>    <xgb.DMtr>       <xgb.Bstr>   <dbl [10,788]> <dbl [10,788]>     3868. 5319.
2 <split [43.2K/10.8K]> Fold2 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <xgb.DMtr>    <xgb.DMtr>       <xgb.Bstr>   <dbl [10,788]> <dbl [10,788]>     3916. 5414.
3 <split [43.2K/10.8K]> Fold3 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <xgb.DMtr>    <xgb.DMtr>       <xgb.Bstr>   <dbl [10,788]> <dbl [10,788]>     3945. 5447.
4 <split [43.2K/10.8K]> Fold4 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <xgb.DMtr>    <xgb.DMtr>       <xgb.Bstr>   <dbl [10,788]> <dbl [10,788]>     3995. 5511.
5 <split [43.2K/10.8K]> Fold5 <tibble [43,152 × 10]> <tibble [10,788 × 10]> <xgb.DMtr>    <xgb.DMtr>       <xgb.Bstr>   <dbl [10,788]> <dbl [10,788]>     3935. 5413.

Now, if I want to know the rmse or mae for each model, I can take the mean across folds:

> mod.rf$mae %>% mean()
[1] 3932.181
> mod.rf$rmse %>% mean()
[1] 5421.681
> mod.xgb$mae %>% mean()
[1] 3931.967
> mod.xgb$rmse %>% mean()
[1] 5421.148

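But suppose I have many models, and I pass in a list or vector of model names, where every model has the same structure as above. How can I return a data frame that shows, for example, each model's name along with its mean mae and rmse?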

What I have tried so far:

model_list <- list(
  mod.rf,
  mod.xgb
)

purrr::imap(model_list, ~mean(.x$mae))
purrr::imap(model_list, ~mean(.x$rmse))

This gives:

> purrr::imap(model_list, ~mean(.x$mae))
[[1]]
[1] 3932.181

[[2]]
[1] 3931.967

> purrr::imap(model_list, ~mean(.x$rmse))
[[1]]
[1] 5421.681

[[2]]
[1] 5421.148

But what I want is something in this format (imagine it as a table; I have used the bar | to separate columns):

model_name | mae | rmse
mod.rf | 3932.181 | 5421.681
mod.xgb | 3931.967 | 5421.148

I was looking at purrr::imap because I believe it exposes the name of the iterated component as .y. From a code snippet I saved a while ago:

imap(pr_curves_data, ~write.csv(x = .x, file = paste0(.y, ".csv"), row.names = FALSE))

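This writes out a number of csv files, where each file is named after the input element being iterated over. In my current working example, the equivalents would be "mod.rf" and "mod.xgb".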
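The unnamed list in the attempt above is probably why the model names never show up: as far as I understand purrr, imap() passes element names as .y only when the list is named, and falls back to integer positions otherwise. A minimal sketch with made-up numeric vectors (the `unnamed` and `named` lists are illustrative and not part of the original code):

```r
library(purrr)

# Two illustrative lists with identical contents; only the second has names.
unnamed <- list(c(1, 2, 3), c(4, 5, 6))
named   <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# In the formula, .x is the element and .y is its name (or its position if unnamed).
labels_unnamed <- imap_chr(unnamed, ~ paste0(.y, ": ", mean(.x)))
labels_named   <- imap_chr(named,   ~ paste0(.y, ": ", mean(.x)))

print(labels_unnamed) # labels start with the positions 1 and 2
print(labels_named)   # labels start with the names a and b
```

So list(mod.rf, mod.xgb) would give .y as 1 and 2, whereas a named list would give "mod.rf" and "mod.xgb".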

What is the "tidy" way to compare the outputs of multiple models side by side?

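Note that I am not training xgb and rf in the same map() block because, in my real code, there are many models, each with its own nuances (e.g. xgb and DMatrix, rf and mtry, and so on). The models simply share the same folds, train_cv.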

Tags: r, tidyverse, purrr

Answer


If you put your models in a named list, you can use imap with an anonymous function to get the expected output:

library(tidyverse)

model_list <- list(
  mod.rf = mod.rf,
  mod.xgb = mod.xgb
)

model_list %>% 
  imap(~tibble(
      model_name = .y,
      mae = mean(.x$mae),
      rmse = mean(.x$rmse)
  )) %>% 
  bind_rows()
## A tibble: 2 x 3
#  model_name   mae  rmse
#  <chr>      <dbl> <dbl>
#1 mod.rf     3931. 5420.
#2 mod.xgb    3931. 5420.
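As a possible variation, here is a sketch that avoids typing each name twice and summarises with dplyr verbs instead of building one tibble per model. It assumes tibble::lst(), which names each element after the variable passed in, and bind_rows(.id = ...), which turns the list names into a column. The two small tibbles below are hypothetical stand-ins for the real per-fold results, with made-up numbers:

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical stand-ins for the per-fold result tibbles (numbers are made up).
mod.rf  <- tibble(id = paste0("Fold", 1:3), mae = c(1, 2, 3), rmse = c(2, 4, 6))
mod.xgb <- tibble(id = paste0("Fold", 1:3), mae = c(2, 2, 2), rmse = c(3, 3, 3))

# lst() auto-names the elements "mod.rf" and "mod.xgb".
model_list <- lst(mod.rf, mod.xgb)

summary_tbl <- model_list %>%
  map(~ select(.x, mae, rmse)) %>%      # keep only the metric columns before stacking
  bind_rows(.id = "model_name") %>%     # list names become the model_name column
  group_by(model_name) %>%
  summarise(mae = mean(mae), rmse = mean(rmse))

print(summary_tbl)
```

Selecting the metric columns first is a deliberate choice here: the real tibbles carry list columns (splits, fitted models, DMatrix objects) that differ between models and are not needed for the summary.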
