Sklearn SelectFromModel with L1-regularized logistic regression

Problem description

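As part of my pipeline, I want to use LogisticRegression(penalty='l1') in combination with SelectFromModel. To choose a suitable amount of regularization, I tune the regularization parameter C with GridSearchCV: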

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
import numpy as np

seed = 111
breast = load_breast_cancer()
X = breast.data
y = breast.target
LR_L1 = LogisticRegression(penalty='l1', random_state=seed, solver='saga', max_iter=100000)  # max_iter must be an int
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SelectFromModel', SelectFromModel(LR_L1)),
                     ('classifier', RandomForestClassifier(n_estimators=500, random_state=seed))])
Lambda = np.array([])
for i in [1e-1, 1, 1e-2, 1e-3]:
    Lambda = np.append(Lambda, i * np.arange(2, 11, 2))
param_grid = {'SelectFromModel__estimator__C': Lambda,
              'classifier_max_features': np.arange(10,100, 10)}
clf = GridSearchCV(pipeline, param_grid, scoring='roc_auc', n_jobs=7, cv=RepeatedStratifiedKFold(random_state=seed),
                   verbose=1)
clf.fit(X, y)

For some values of C I get the following warning:

UserWarning: No features were selected: either the data is too noisy or the selection test too strict.

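That is understandable. However, when I fit the same LogisticRegression as the classifier instead of using it for feature selection, I have no problem, even though the training set and the hyperparameters used to fit the algorithm are the same. Judging by the results, it should not be possible that 0 features have a coefficient different from 0.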

pipeline2 = Pipeline([('scale', StandardScaler()),
                     ('classifier', LR_L1)])
param_grid2 = {'classifier__C': Lambda}
clf2 = GridSearchCV(pipeline2, param_grid2, scoring='roc_auc', n_jobs=7, cv=RepeatedStratifiedKFold(random_state=seed),
                    verbose=1)
clf2.fit(X, y)

Is this a bug, or am I misunderstanding something?

Tags: python, machine-learning, scikit-learn, feature-selection

Solution


You get this warning because the regularization in LogisticRegression is too strong. There is also a typo in param_grid: classifier_max_features should be classifier__max_features (two underscores).
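To see why only the feature-selection pipeline complains, it may help to count how many features SelectFromModel keeps at different values of C. The sketch below is an illustration, not part of the original answer: the C values, the smaller max_iter, and the variable names are assumptions chosen to keep it quick. A LogisticRegression whose coefficients are all zero still fits and predicts from its intercept alone, so it raises no warning as a classifier. SelectFromModel, however, keeps only features whose absolute coefficient reaches its threshold (1e-5 by default for L1-penalized estimators), so an all-zero coefficient vector leaves nothing to pass on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Illustrative C values (assumed); smaller C means stronger L1 regularization.
n_selected = {}
for C in [1e-3, 1e-2, 1e-1, 1.0]:
    lr = LogisticRegression(penalty='l1', solver='saga', C=C,
                            max_iter=10000, random_state=111)
    selector = SelectFromModel(lr).fit(X_scaled, y)
    # get_support() is a boolean mask of the features the selector keeps.
    n_selected[C] = int(selector.get_support().sum())
    print(f"C={C:g}: {n_selected[C]} of {X.shape[1]} features kept")
```

When the count is 0, calling transform on the selector emits the "No features were selected" warning from the question, and the downstream RandomForestClassifier has no columns to train on.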

With regularization values C >= 1e-2, the code works. An example is available in a Google Colab notebook.
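A minimal sketch of the corrected search, under several assumptions that are not in the original answer: the C grid is cut down to three values at or above 1e-2, cross-validation uses a plain 3-fold StratifiedKFold, and the forest has only 50 trees so the example runs quickly. The integer max_features grid from the question is also swapped for 'sqrt' and None, because the dataset has only 30 features and fewer remain after selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

seed = 111
X, y = load_breast_cancer(return_X_y=True)

lr_l1 = LogisticRegression(penalty='l1', solver='saga',
                           max_iter=10000, random_state=seed)
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('SelectFromModel', SelectFromModel(lr_l1)),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=seed)),
])

param_grid = {
    # Only C >= 1e-2, so that some coefficients stay nonzero.
    'SelectFromModel__estimator__C': [1e-2, 1e-1, 1.0],
    # Note the double underscore between step name and parameter name.
    'classifier__max_features': ['sqrt', None],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
clf = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=cv)
clf.fit(X, y)
print(clf.best_params_, round(clf.best_score_, 3))
```

The original RepeatedStratifiedKFold and the 500-tree forest can be restored once the grid is known to run cleanly; they only change runtime and the variance of the score estimate.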

One more thing to note: the dataset is too small for such a complex procedure.

