python - n_jobs for sklearn multioutput regressor with estimator=random forest regressor
Problem description
How should `n_jobs` be used when both the RandomForestRegressor passed as the estimator and the MultiOutputRegressor wrapping it accept the parameter? For example, is it better to leave `n_jobs` unset on the estimator and set it only on the MultiOutputRegressor? Several configurations are shown below:
# Imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
# (1) No parallelization
rf_no_jobs = RandomForestRegressor()
multioutput_no_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs)
# (2) RF w/ parallelization, multioutput w/o parallelization
rf_with_jobs = RandomForestRegressor(n_jobs=-1)
multioutput_no_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs)
# (3) RF w/o parallelization, multioutput w parallelization
multioutput_with_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs, n_jobs=-1)
# (4) Both parallelized
multioutput_with_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs, n_jobs=-1)
Solution
Since RandomForestRegressor has native multioutput support (no need for the MultiOutputRegressor wrapper), I instead looked at KNeighborsRegressor and LightGBM, which both take an inner `n_jobs` argument and raise the same question.
Running on a Ryzen 5950X (Linux) and an Intel 11800H (Windows), both with n_jobs = 8, I found consistent results:
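As an aside on the native-multioutput point: RandomForestRegressor accepts a 2-D `y` directly, so no wrapper is needed at all. A minimal sketch with synthetic data (the shapes and hyperparameters here are arbitrary, just for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: 200 samples, 5 features, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.normal(size=(200, 3))  # 2-D y: multioutput, no wrapper required

# The forest fits one multi-target tree ensemble; n_jobs parallelizes
# over trees rather than over targets.
rf = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
rf.fit(X, y)
pred = rf.predict(X)
print(pred.shape)  # one row per sample, one column per target
```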
- With low Y dimensionality (say, 1-10 targets) it doesn't matter much where `n_jobs` goes; it finishes quickly regardless. Initializing multiprocessing has roughly a 1-second overhead, but joblib reuses existing worker pools by default, which speeds up subsequent calls.
- With high dimensionality (say, more than 20 targets), placing `n_jobs` only on the MultiOutputRegressor, with KNN receiving n_jobs=1, is about 10x faster at 160 targets.
- Using `with joblib.parallel_backend("loky", n_jobs=your_n_jobs):` was equally fast and conveniently sets `n_jobs` for everything in sklearn inside the block. This is the easy option.
- RegressorChain is fast enough at low dimensionality but gets ridiculously slow (about 500x slower than MultiOutputRegressor) at 160 targets with KNeighbors. I would stick to LightGBM for use with RegressorChain, which performs better there.
- With LightGBM, setting `n_jobs` only on the MultiOutputRegressor was again faster than the inner `n_jobs`, but the difference was much smaller (about 3x on the 5950X under Linux, only 1.2x on the 11800H under Windows).
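For reference, a minimal RegressorChain sketch showing why its cost grows with the number of targets: each target's model receives the predictions for all earlier targets as extra features, so the chain is inherently sequential. Synthetic data and LinearRegression are used here just to keep the example dependency-free; the timings above were for KNeighborsRegressor and LightGBM.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

# Synthetic data: 100 samples, 4 features, 5 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=(100, 5))

# order=None chains targets in column order 0, 1, ..., n-1; target i's
# model sees X plus the predictions for targets 0..i-1.
chain = RegressorChain(LinearRegression(), order=None)
chain.fit(X, y)
print(chain.predict(X).shape)  # same shape as y
```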
Since the full code gets a bit long, here is a partial sample that covers most of it:
from timeit import default_timer as timer
import numpy as np
from joblib import parallel_backend
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.datasets import fetch_california_housing
# adjust n_jobs to the number of physical CPU cores on your machine or pass -1 for auto max
n_jobs = 8
knn_model_param_dict = {} # kwargs if desired
num_y_dims = 160
X, y_one_dim = fetch_california_housing(return_X_y=True)
y_one_dim = y_one_dim.reshape(-1, 1)
# extra multioutput dims generated randomly
dims = [y_one_dim]
for _ in range(num_y_dims - 1):
dims.append(np.random.gamma(y_one_dim.std(), size=y_one_dim.shape))
y = np.concatenate(dims, axis=1)
# Warm-up fit: spins up the joblib worker pool so the ~1 s startup overhead
# is excluded from the timed trials below
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict),
n_jobs=n_jobs,
).fit(X, y)
trial = "KNN with all n_jobs=1"
start = timer()
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
trial = "KNN inner model with n_jobs"
start = timer()
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict, n_jobs=n_jobs),
n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
trial = "KNN outer multioutput with n_jobs, inner with 1"
start = timer()
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
n_jobs=n_jobs,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
trial = "KNN inner and outer both -1"
start = timer()
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict, n_jobs=-1),
n_jobs=-1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
trial = "joblib backend chooses"
start = timer()
with parallel_backend("loky", n_jobs=n_jobs):
regr = MultiOutputRegressor(
KNeighborsRegressor(**knn_model_param_dict),
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")