Feature selection for multi-step regression with SelectFromModel and MultiOutputRegressor: how do I get the selected features and their feature importances?

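Problem description

I want to use sklearn.feature_selection.SelectFromModel to select features in a multi-step regression problem. The regression uses MultiOutputRegressor combined with RandomForestRegressor to predict multiple values. When I try to get the selected features with SelectFromModel.get_support(), I get an error saying that the estimator needs to expose feature_importances_ for the method to work. The feature_importances_ of the MultiOutputRegressor can be accessed as described in this question. However, I am not sure how to pass these feature_importances_ correctly to the SelectFromModel class.

This is what I have done so far: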

# make sample data
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)

# get important features for prediction problem:
from sklearn.multioutput import MultiOutputRegressor

regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
regr_multirf = regr_multirf.fit(X_train, y_train)
sel = SelectFromModel(regr_multirf, max_features=int(np.floor(X_train.shape[1] / 3.)))
sel.fit(X_train, y_train)
sel.get_support()  # raises the ValueError shown below

# to get feature_importances_ of one regressor inside the MultiOutputRegressor use:
regr_multirf.estimators_[1].feature_importances_

Output:

---------------------------------------------------------------------------
 
ValueError                                Traceback (most recent call last)
 
<ipython-input-72-a1d635ad4a34> in <module>()
      5 sel = SelectFromModel(regr_multirf, max_features= int(np.floor(X_train.shape[1] / 3.)))
      6 sel.fit(X_train, y_train)
----> 7 sel.get_support()
 
2 frames
 
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_from_model.py in _get_feature_importances(estimator, norm_order)
     30             "`feature_importances_` attribute. Either pass a fitted estimator"
     31             " to SelectFromModel or call fit before calling transform."
---> 32             % estimator.__class__.__name__)
     33 
     34     return importances
 
ValueError: The underlying estimator MultiOutputRegressor has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.
 

Any help and hints would be greatly appreciated.

Tags: python, scikit-learn

Solution


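In sklearn's MultiOutputRegressor, each target gets its own model, as the documentation states: "This strategy consists of fitting one regressor per target." This means you need to compute the feature importances for each random forest regressor inside the MultiOutputRegressor. The feature importances of the individual regressors are not stored directly on the MultiOutputRegressor. Instead, if regr_multirf is your fitted MultiOutputRegressor, you can extract each regressor (also called an estimator) from it via regr_multirf.estimators_[# of regressor you want].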
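To make the one-regressor-per-target point concrete, here is a minimal sketch that collects the importances of every fitted sub-estimator into a single array. The small dataset, the `random_state` values and the variable name `importances` are assumptions chosen for illustration, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Small synthetic problem (sizes and seeds are illustrative assumptions)
X, y = make_regression(n_samples=100, n_features=20, n_targets=5, random_state=0)

regr_multirf = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=20, random_state=0)
).fit(X, y)

# estimators_ holds one fitted RandomForestRegressor per target column of y
importances = np.column_stack(
    [est.feature_importances_ for est in regr_multirf.estimators_]
)

print(importances.shape)  # rows are features, columns are targets
for i in range(importances.shape[1]):
    # indices of the three most important features for target i
    top = np.argsort(importances[:, i])[::-1][:3]
    print(f"target {i}: top feature indices {top}")
```

Each column can differ a lot from the others, which is why there is no single feature_importances_ on the wrapper itself.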

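So you do not need SelectFromModel to retrieve the feature importances of a multi-output sklearn regression model; instead, use each estimator directly, as described in this question, on which this answer also relies heavily. Your approach only works with methods that can natively predict multivariate targets and do not train a separate model for each target.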
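For comparison, here is a hedged sketch of the case where the original approach does work: RandomForestRegressor handles a 2D y natively and exposes one feature_importances_ vector, so SelectFromModel can use it without MultiOutputRegressor. The dataset size, the seeds, the name `n_keep` and the choice `threshold=-np.inf` (so that only max_features limits the selection) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Small synthetic problem (sizes and seeds are illustrative assumptions)
X, y = make_regression(n_samples=100, n_features=20, n_targets=5, random_state=0)

n_keep = X.shape[1] // 3  # keep roughly a third of the features

# A single forest is fitted on all targets at once, so it has one
# feature_importances_ vector that SelectFromModel can read.
sel = SelectFromModel(
    RandomForestRegressor(n_estimators=20, random_state=0),
    max_features=n_keep,
    threshold=-np.inf,  # disable the default "mean" threshold
)
sel.fit(X, y)

support = sel.get_support()
print(np.flatnonzero(support))    # indices of the selected features
print(sel.transform(X).shape)     # reduced feature matrix
```

Note that this is a different model from the per-target forests in the question, so the selected features need not match what the individual estimators would rank highest.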

In your case, you can retrieve the feature importances directly from the regressors fitted inside regr_multirf:

# make sample data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.feature_selection import SelectFromModel
import numpy as np
import pandas as pd
 
X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)

regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators = 100))
regr_multirf = regr_multirf.fit(X_train, y_train)

# now extract the estimator from your regression model
# this estimator carries the feature importances
# you're interested in
# You can also loop the following code
# over all your targets

no_est = 0 # index of target you want feature importance for
# get the fitted estimator for that target
est = regr_multirf.estimators_[no_est]
# get feature importances
feature_importances = pd.DataFrame(est.feature_importances_,
                                   columns=['importance']).sort_values('importance')
print(feature_importances)
feature_importances.plot(kind = 'barh')

Output:

[Figure: horizontal bar plot of the feature importances; image not available]
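If you still want SelectFromModel.get_support() to work with per-target forests, one possible workaround is to expose an aggregated feature_importances_ on the wrapper. The sketch below is an assumption on my part, not something sklearn provides: the subclass name `MultiOutputWithImportances` is hypothetical, and averaging the importances across targets is only one of several reasonable aggregations (a maximum or a weighted mean would be others).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputRegressor


class MultiOutputWithImportances(MultiOutputRegressor):
    """Hypothetical helper: MultiOutputRegressor plus averaged importances."""

    @property
    def feature_importances_(self):
        # Mean over the per-target estimators (an assumed aggregation rule).
        # Raises AttributeError before fit, because estimators_ does not exist yet.
        return np.mean(
            [est.feature_importances_ for est in self.estimators_], axis=0
        )


# Small synthetic problem (sizes and seeds are illustrative assumptions)
X, y = make_regression(n_samples=100, n_features=20, n_targets=5, random_state=0)
n_keep = X.shape[1] // 3

sel = SelectFromModel(
    MultiOutputWithImportances(RandomForestRegressor(n_estimators=20, random_state=0)),
    max_features=n_keep,
    threshold=-np.inf,  # let max_features alone decide how many are kept
)
sel.fit(X, y)  # SelectFromModel clones and fits the wrapper itself

support = sel.get_support()
selected = np.flatnonzero(support)
print(selected)                                       # selected feature indices
print(sel.estimator_.feature_importances_[selected])  # their averaged importances
```

Averaging can hide a feature that matters a lot for one target only, so it is worth checking the per-target importances as well before dropping anything.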

