
Question

How should the n_jobs parameter be used when both the random forest estimator inside a MultiOutputRegressor and the MultiOutputRegressor itself accept it? For example, is it better to leave n_jobs unset on the estimator and set it only on the MultiOutputRegressor? Several configurations are shown below:

# Imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# (1) No parallelization
rf_no_jobs = RandomForestRegressor()
multioutput_no_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs)

# (2) RF w/ parallelization, multioutput w/o parallelization
rf_with_jobs = RandomForestRegressor(n_jobs=-1)
multioutput_no_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs)

# (3) RF w/o parallelization, multioutput w parallelization
multioutput_with_jobs_alpha = MultiOutputRegressor(estimator=rf_no_jobs, n_jobs=-1)

# (4) Both parallelized
multioutput_with_jobs_beta = MultiOutputRegressor(estimator=rf_with_jobs, n_jobs=-1)

Tags: python, scikit-learn, parallel-processing, random-forest

Answer


Since RandomForestRegressor has native multioutput support (no need for the MultiOutputRegressor wrapper), I instead looked at KNeighborsRegressor and LightGBM, which take an inner n_jobs argument and raise the same question.
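To illustrate the point about native support, here is a minimal sketch (using synthetic data, not part of the original answer) showing that RandomForestRegressor accepts a 2-D y directly, so the wrapper is only needed for estimators that lack native multioutput support:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data: 100 samples, 5 features, 3 regression targets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=(100, 3))

# RandomForestRegressor handles the 2-D y natively -- no wrapper required
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
print(rf.predict(X).shape)  # one prediction column per target: (100, 3)
```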

Running on a Ryzen 5950X (Linux) and Intel 11800H (Windows), both with n_jobs = 8, I found consistent results:

  • With low Y dimensionality (say, 1-10 targets) it does not matter much where n_jobs goes; everything finishes quickly regardless. Initializing multiprocessing carries roughly a one-second overhead, but joblib reuses existing worker pools by default, which speeds up later calls.
  • With high dimensionality (say, more than 20 targets), setting n_jobs only on the MultiOutputRegressor while KNN receives n_jobs=1 is about 10x faster at 160 targets.
  • Using with joblib.parallel_backend("loky", n_jobs=your_n_jobs): was equally fast and conveniently sets n_jobs for every scikit-learn component inside the block. This is the easy option.
  • RegressorChain is fast enough at low dimensionality but becomes extremely slow (about 500x slower than MultiOutputRegressor) at 160 targets with KNeighbors; I would stick to LightGBM when using RegressorChain, which performs better there.
  • With LightGBM, setting n_jobs only on the MultiOutputRegressor was again faster than the inner n_jobs, but the gap was much smaller (3x on the 5950X/Linux, only 1.2x on the 11800H/Windows).
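RegressorChain, mentioned above, does not appear in the timing sample below; for reference, a minimal sketch (synthetic data, my own illustration) of how it differs from MultiOutputRegressor — each estimator in the chain sees the original features plus the predictions for the earlier targets, which is part of why chains get expensive as the number of targets grows:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.normal(size=(200, 3))

# targets are fit in the given order; each later estimator receives
# X augmented with the predictions for the preceding targets
chain = RegressorChain(KNeighborsRegressor(), order=[0, 1, 2]).fit(X, y)
print(chain.predict(X).shape)  # (200, 3)
```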

Since the full code gets a bit long, here is a partial sample that covers most of it:

from timeit import default_timer as timer
import numpy as np
from joblib import parallel_backend
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.datasets import fetch_california_housing

# adjust n_jobs to the number of physical CPU cores on your machine or pass -1 for auto max
n_jobs = 8
knn_model_param_dict = {}  # kwargs if desired
num_y_dims = 160

X, y_one_dim = fetch_california_housing(return_X_y=True)
y_one_dim = y_one_dim.reshape(-1, 1)
# extra multioutput dims generated randomly
dims = [y_one_dim]
for _ in range(num_y_dims - 1):
    dims.append(np.random.gamma(y_one_dim.std(), size=y_one_dim.shape))
y = np.concatenate(dims, axis=1)


# warm-up fit, so that worker-pool startup cost does not skew the first timed trial
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict),
    n_jobs=n_jobs,
).fit(X, y)

trial = "KNN with all n_jobs=1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner model with n_jobs"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=n_jobs),
    n_jobs=1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN outer multioutput with n_jobs, inner with 1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=1),
    n_jobs=n_jobs,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "KNN inner and outer both -1"
start = timer()
regr = MultiOutputRegressor(
    KNeighborsRegressor(**knn_model_param_dict, n_jobs=-1),
    n_jobs=-1,
)
regr.fit(X, y)
regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")

trial = "joblib backend chooses"
start = timer()
with parallel_backend("loky", n_jobs=n_jobs):
    regr = MultiOutputRegressor(
        KNeighborsRegressor(**knn_model_param_dict),
    )
    regr.fit(X, y)
    regr.predict(X)
end = timer()
print(f"trial: {trial} with runtime: {end - start}")
