r - How do I run caret properly on a supercomputer?
Problem description
Supercomputer setup (session info)
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-86 lattice_0.20-38 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0
[9] tibble_3.0.1 ggplot2_3.3.0 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 lubridate_1.7.8 class_7.3-15 assertthat_0.2.1 ipred_0.9-9 foreach_1.5.0
[7] R6_2.4.1 cellranger_1.1.0 plyr_1.8.6 backports_1.1.7 stats4_3.6.3 reprex_0.3.0
[13] httr_1.4.1 pillar_1.4.4 rlang_0.4.6 readxl_1.3.1 data.table_1.12.8 rstudioapi_0.11
[19] rpart_4.1-15 Matrix_1.2-18 splines_3.6.3 gower_0.2.1 munsell_0.5.0 broom_0.5.6
[25] compiler_3.6.3 modelr_0.1.8 pkgconfig_2.0.3 nnet_7.3-12 tidyselect_1.1.0 prodlim_2019.11.13
[31] codetools_0.2-16 fansi_0.4.1 crayon_1.3.4 dbplyr_1.4.3 withr_2.2.0 ModelMetrics_1.2.2.2
[37] MASS_7.3-51.5 recipes_0.1.12 grid_3.6.3 nlme_3.1-144 jsonlite_1.6.1 gtable_0.3.0
[43] lifecycle_0.2.0 DBI_1.1.0 magrittr_1.5 pROC_1.16.2 scales_1.1.1 cli_2.0.2
[49] stringi_1.4.6 reshape2_1.4.4 fs_1.4.1 timeDate_3043.102 xml2_1.3.2 ellipsis_0.3.1
[55] generics_0.0.2 vctrs_0.3.0 lava_1.6.7 iterators_1.0.12 tools_3.6.3 glue_1.4.1
[61] hms_0.5.3 survival_3.1-8 colorspace_1.4-1 rvest_0.3.5 haven_2.2.0
The instance has 0 GPUs, 64 CPUs, and 320 GB of memory.
Reproducible example
# packages
library(tidyverse)
library(caret)
library(parallel)

# number of observations and variables
n.obs <- 100000
n.vars <- 20

# generate data
class.data <- twoClassSim(
  n = n.obs,
  intercept = 0,
  linearVars = n.vars,
  noiseVars = n.vars,
  corrVars = n.vars,
  ordinal = FALSE
)

# generate folds
set.seed(1903)
myFolds <- createMultiFolds(
  y = class.data$Class,
  k = 10,
  times = 3
)

# models
algorithm <- c(
  # Regular Tree
  "rpart",
  # Random Forest
  "rf",
  # Gradient Boosted Machine
  "gbm"
)

# My control object
myControl <- trainControl(
  index = myFolds,
  method = "repeatedcv",
  allowParallel = TRUE,
  verboseIter = TRUE
)

# model formula
model.formula <- as.formula(
  Class ~ .
)

# one model; test ####
# Generate cluster
cl <- makeCluster(
  spec = 30
  # Was 10; 325
  # Was 20; 304
  # Was 30; 313
)

doParallel::registerDoParallel(
  cl = cl
)

system.time(
  baseline.model <- train(
    form = model.formula,
    data = class.data,
    method = algorithm[2],
    trControl = myControl,
    num.threads = 30
  )
)

stopCluster(
  cl = cl
)
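To put the run time in perspective, the resampling setup in the example can be turned into a rough count of model fits. This is only a back-of-the-envelope sketch: the fold, repeat, and row counts are copied from the example, while the tuning-grid size (3 candidate `mtry` values) and the forest size (500 trees) are assumed to be the caret and randomForest defaults, since the post does not set them.

```r
# Values copied from the example.
k_folds   <- 10      # k in createMultiFolds()
n_repeats <- 3       # times in createMultiFolds()
n_obs     <- 100000  # n.obs

# Assumptions: package defaults, not stated anywhere in the post.
tune_length <- 3     # assumed default tuneLength in train()
n_trees     <- 500   # assumed default ntree in randomForest()

n_resamples   <- k_folds * n_repeats               # training sets to fit on
n_fits        <- n_resamples * tune_length + 1     # +1 for the final refit on all rows
n_trees_total <- n_fits * n_trees                  # trees grown across the whole run
rows_per_fit  <- n_obs * (k_folds - 1) / k_folds   # rows in each resampled training set

cat("resamples:      ", n_resamples, "\n")
cat("forest fits:    ", n_fits, "\n")
cat("trees in total: ", n_trees_total, "\n")
cat("rows per fit:   ", rows_per_fit, "\n")
```

Under those assumptions the single `train()` call grows tens of thousands of trees, each on most of the data, so a long run time would not by itself indicate an error.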
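I never get any results
So far the algorithm has been running for 14 hours without producing any result. I also tried reducing the data to only the relevant features, roughly 21 of them. After 6 hours of running, I gave up on that as well.
Am I doing something wrong, or is it normal for an algorithm on a machine like this to run for more than 12 hours before producing any result?
Do you have any suggestions for handling this kind of problem, so that I don't wake up after hours of running to find an error?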
Solution
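One possible approach to the last question, offered as a sketch rather than a definitive fix, is to time the fit on small, growing subsamples before committing to the full run. Errors then surface within minutes, and the timings give a rough idea of how the cost scales. The helper `time_pilot` below is hypothetical (it is not part of caret), and the toy data with a `glm` fit is only a stand-in so that the example runs with base R alone.

```r
# Hypothetical helper (not part of caret): time `fit_fun` on growing random
# subsamples of `data`; a failed fit is reported and recorded as NA.
time_pilot <- function(fit_fun, data, sizes, seed = 1903) {
  set.seed(seed)
  results <- lapply(sizes, function(n) {
    rows <- sample(nrow(data), n)
    elapsed <- tryCatch(
      unname(system.time(fit_fun(data[rows, , drop = FALSE]))["elapsed"]),
      error = function(e) {
        message("n = ", n, " failed: ", conditionMessage(e))
        NA_real_
      }
    )
    data.frame(n = n, seconds = elapsed)
  })
  do.call(rbind, results)
}

# Stand-in data and model (assumptions for illustration only).
set.seed(1)
toy <- data.frame(x1 = rnorm(20000), x2 = rnorm(20000))
toy$Class <- factor(ifelse(toy$x1 + rnorm(20000) > 0, "Class1", "Class2"))

pilot <- time_pilot(
  fit_fun = function(d) glm(Class ~ ., data = d, family = binomial()),
  data    = toy,
  sizes   = c(1000, 5000, 20000)
)
print(pilot)
```

In the setting of the question, `fit_fun` would instead wrap the `train()` call, ideally with a cheaper `trainControl` for the pilot. It may also be worth checking whether `num.threads` has any effect here: as far as I know it is an argument of ranger, not of randomForest, which backs `method = "rf"`.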