r - How to parallelize xgboost fitting?
Question
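I am trying to fit many xgboost models with different parameters (e.g. for parameter tuning). They need to run in parallel to save time. However, when I run the %dopar% command, I get the following error: Error in unserialize(socklist[[n]]) : error reading from connection.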
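Below is a reproducible example. The problem is specific to xgboost, because any other computation involving global variables works inside the %dopar% loop. Can anyone point out what is missing or wrong with this approach?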
#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#### Data Sim
n = 1000
X = cbind(runif(n,10,20), runif(n,0,10))
y = 10 + 2*X[,1] + 3*X[,2] + rnorm(n,0,1)
#### Init XGB
train = xgb.DMatrix(data = X[-((n-10):n),], label = y[-((n-10):n)])
test = xgb.DMatrix(data = X[(n-10):n,], label = y[(n-10):n])
watchlist = list(train = train, test = test)
#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
clusterEvalQ(cl, {
  library(xgboost)
})
pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
  xgb.train(data = train, watchlist = watchlist, max_depth=i, nrounds = 1000, early_stopping_rounds = 10)$best_score
  # if xgb.train is replaced with anything else, e.g. 1+y, it works
}
stopCluster(cl)
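Solution
As HenrikB pointed out in the comments, xgb.DMatrix objects cannot be used in parallelization: an xgb.DMatrix is a handle to memory owned by the R process that created it, so it cannot be serialized and sent to the worker processes. To work around this, we can create the objects inside foreach: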
#### Load packages
library(xgboost)
library(parallel)
library(foreach)
library(doParallel)
#> Loading required package: iterators
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
#### Init parallel & run
numCores = detectCores()
cl = parallel::makeCluster(numCores, setup_strategy = "sequential")
doParallel::registerDoParallel(cl)
pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
  # BRING CREATION OF XGB MATRIX INSIDE OF foreach
  dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
  dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
  watchlist = list(dtrain = dtrain, dtest = dtest)
  param <- list(max_depth = i, eta = 0.01, verbose = 0,
                objective = "binary:logistic", eval_metric = "auc")
  bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
  bst$best_score
}
stopCluster(cl)
pred
#> [[1]]
#> dtest-auc
#> 0.892138
#>
#> [[2]]
#> dtest-auc
#> 0.987974
#>
#> [[3]]
#> dtest-auc
#> 0.986255
#>
#> [[4]]
#> dtest-auc
#> 1
#> ...
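The same fix can be applied to the simulated data from the question. The following is only a minimal sketch, assuming the same xgboost version as above (one where xgb.train accepts watchlist and returns best_score). The name test_idx, the seed, the two workers and nthread = 1 are illustrative choices and not part of the original post. Only plain R objects (X, y, test_idx) are sent to the workers, and each worker builds its own xgb.DMatrix:

```r
library(xgboost)
library(foreach)
library(doParallel)

# Simulated data, as in the question (seed added only for reproducibility)
set.seed(1)
n <- 1000
X <- cbind(runif(n, 10, 20), runif(n, 0, 10))
y <- 10 + 2 * X[, 1] + 3 * X[, 2] + rnorm(n, 0, 1)
test_idx <- (n - 10):n  # hypothetical helper name for the hold-out rows

cl <- parallel::makeCluster(2)  # 2 workers, chosen arbitrarily for this sketch
doParallel::registerDoParallel(cl)

pred <- foreach(i = 1:4, .packages = "xgboost") %dopar% {
  # X, y and test_idx are ordinary R objects, so foreach can export them;
  # the xgb.DMatrix objects are created on the worker that uses them
  dtrain <- xgb.DMatrix(data = X[-test_idx, ], label = y[-test_idx])
  dtest <- xgb.DMatrix(data = X[test_idx, ], label = y[test_idx])
  bst <- xgb.train(params = list(max_depth = i, nthread = 1),
                   data = dtrain, nrounds = 1000,
                   watchlist = list(train = dtrain, test = dtest),
                   early_stopping_rounds = 10, verbose = 0)
  bst$best_score
}

parallel::stopCluster(cl)
```

Here pred is a list with one best score per value of max_depth, analogous to the output above.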
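Benchmark:
Since xgb.train is already parallelized internally, it may be interesting to compare the speed of giving threads to xgboost itself with the speed of using them to run tuning rounds in parallel.
To do this, I wrapped the code in a function and benchmarked different combinations: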
tune_par <- function(xgbthread, doparthread) {
  data(agaricus.train, package='xgboost')
  data(agaricus.test, package='xgboost')
  #### Init parallel & run
  cl = parallel::makeCluster(doparthread, setup_strategy = "sequential")
  doParallel::registerDoParallel(cl)
  clusterEvalQ(cl, {
    data(agaricus.train, package='xgboost')
    data(agaricus.test, package='xgboost')
  })
  pred = foreach(i = 1:10, .packages = c("xgboost")) %dopar% {
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
    watchlist = list(dtrain = dtrain, dtest = dtest)
    param <- list(max_depth = i, eta = 0.01, verbose = 0, nthread = xgbthread,
                  objective = "binary:logistic", eval_metric = "auc")
    bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, early_stopping_rounds = 10)
    bst$best_score
  }
  stopCluster(cl)
  pred
}
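In my tests, evaluation was faster when more threads were given to xgboost and fewer were used to run tuning rounds in parallel. Which split is most efficient probably depends on the system specifications and the amount of data.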
# 16 logical cores split between xgb threads and threads in dopar cluster:
microbenchmark::microbenchmark(
xgb16par1 = tune_par(xgbthread = 16, doparthread = 1),
xgb8par2 = tune_par(xgbthread = 8, doparthread = 2),
xgb4par4 = tune_par(xgbthread = 4, doparthread = 4),
xgb2par8 = tune_par(xgbthread = 2, doparthread = 8),
xgb1par16 = tune_par(xgbthread = 1, doparthread = 16),
times = 5
)
#> Unit: seconds
#> expr min lq mean median uq max neval cld
#> xgb16par1 2.295529 2.431110 2.500170 2.519277 2.527914 2.727021 5 a
#> xgb8par2 2.301189 2.308377 2.407767 2.363422 2.465446 2.600402 5 a
#> xgb4par4 2.632711 2.778304 2.875816 2.825471 2.849003 3.293593 5 b
#> xgb2par8 4.508485 4.682284 4.752776 4.810461 4.822566 4.940085 5 c
#> xgb1par16 8.493378 8.550609 8.679931 8.768008 8.779718 8.807943 5 d
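The combinations benchmarked above are the ways of splitting 16 logical cores so that xgbthread times doparthread equals the core count. As a rough sketch, the helper below (thread_splits is a hypothetical name, not part of any package) lists those splits for any core count, which may help when repeating the benchmark on a machine with a different number of cores:

```r
# Enumerate every (xgbthread, doparthread) pair whose product equals total_cores
thread_splits <- function(total_cores) {
  # candidate worker counts are the divisors of total_cores
  workers <- which(total_cores %% seq_len(total_cores) == 0)
  data.frame(xgbthread = total_cores %/% workers, doparthread = workers)
}

thread_splits(16)
#>   xgbthread doparthread
#> 1        16           1
#> 2         8           2
#> 3         4           4
#> 4         2           8
#> 5         1          16
```

Each row can then be passed to tune_par, for example tune_par(xgbthread = 8, doparthread = 2).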