Data preprocessing: building an efficient sklearn pipeline

Problem description

I am trying to build an efficient sklearn pipeline for a classification problem. The word efficient is the key here.

The raw data, a pandas DataFrame, contains multiple columns of multiple data types and requires a variety of transformations, so several encoders/estimators/transformers need to be applied depending on the case.

To achieve this, I ended up defining my own custom transformer as a class inheriting from sklearn's BaseEstimator and TransformerMixin objects.

It does work as expected, but I still have some doubts, mostly about efficiency and best practices:

  1. Whenever encoders/transformers are used, they output numpy arrays (StandardScaler, for example, but many others too), which means I have to work harder to make sure that whatever finally comes out of the transform method is still a DataFrame with the correct column names (see the set_output sketch after this list).
  2. While I am well aware that machine learning algorithms such as LogisticRegression or XGBoost do not care about feature names and only need a numeric array structure, the names matter to me because, later down the road, evaluating the model's performance may require splitting the results by one or more features.
  3. My other concern is speed. Whenever I perform feature selection or hyperparameter tuning, I need to refit my pipeline and re-apply it to every fold or holdout sample generated along the way, which makes the process noticeably slower. The alternative, of course, would be to process everything once up front and only then fold or hold out, but then... data leakage...
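
For reference on point 1: scikit-learn 1.2 added a set_output API that makes standard transformers return DataFrames instead of numpy arrays. A minimal sketch, assuming scikit-learn >= 1.2 is available:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# set_output(transform="pandas") asks the transformer to emit a DataFrame,
# so the column names survive without a custom wrapper class
scaler = StandardScaler().set_output(transform="pandas")
demo = pd.DataFrame({'Income': [40_000.0, 90_000.0], 'Height': [1.62, 1.80]})
print(scaler.fit_transform(demo).columns.tolist())  # ['Income', 'Height']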

So, my question is: is there a way to refactor this pipeline/code (an MWE is provided below; the actual pipeline is much larger) so that I keep the ability to identify features once the data has been processed, and make it all more efficient?
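
On the efficiency side, one mechanism that bears on the refitting concern, sketched under the assumption that the search only touches steps downstream of the preprocessing: Pipeline accepts a memory argument that caches fitted transformers on disk, so repeated fits reuse the cached preprocessing instead of recomputing it.

from tempfile import mkdtemp
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# memory= points at a cache directory; fitted transformer steps are stored
# there and reused instead of being refit when only later steps change
p_cached = Pipeline(
    steps=[('Scale', StandardScaler()),
           ('Model', LogisticRegression())],
    memory=mkdtemp()
)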

Minimal working example:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

# Custom scaler: wraps StandardScaler so that transform() returns a DataFrame
# with the original column names instead of a bare numpy array
class MyScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        self.enc = StandardScaler()

    def fit(self, X, y=None):
        self.enc.fit(X)
        return self

    def transform(self, X, y=None):
        # Rebuild a DataFrame so the column names and row index survive scaling
        return pd.DataFrame(self.enc.transform(X.copy()), columns=X.columns, index=X.index)

# Custom transformer: feature encoding, feature engineering and statistical
# imputation bundled into a single pipeline step
class MyTransformer(BaseEstimator, TransformerMixin):
    MODEL_FEATURES = [
        'Gender', 'Urban', 'DrugUse', 'Urban_DrugUse',
        'CappedIncome', 'CappedCoverage', 'Coverage2IncomeRatio', 'BMI',
        'Province1', 'Province2', 'Province3', 'Province4', 'Province5',
        'AverageHouseholdSize', 'UnemploymentRate'
    ]

    MODEL_RESPONSE = 'FalseDeclaration'
    MAX_INCOME = 150_000
    MAX_COVERAGE = 30_000

    PROVINCE_MAP = {
        'AB': 1, 'MB': 1, 'SK': 1, 'NT': 1, 'NU': 1,
        'BC': 2, 'YT': 2,
        'NB': 3, 'NL': 3, 'NS': 3, 'NF': 3, 'PE': 3,
        'ON': 4,
        'QC': 5
    }

    def __init__(self):
        super().__init__()

        # sparse_output replaces the old sparse= keyword (scikit-learn >= 1.2)
        self.gender_enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore', categories=[['M']])
        self.yes_no_enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore', categories=[['Y']])
        self.average_household_size_enc = SimpleImputer(missing_values=np.nan, strategy='mean')
        self.unemployment_rate_enc = SimpleImputer(missing_values=np.nan, strategy='mean')
        self.scale_enc = StandardScaler()

    def fit(self, X, y=None):
        self.gender_enc.fit(X=X['Gender'].values.reshape(-1, 1), y=y)
        self.yes_no_enc.fit(X=X['Urban'].values.reshape(-1, 1), y=y) # Could have been either 'Urban' or 'DrugUse'
        self.average_household_size_enc.fit(X=X['AverageHouseholdSize'].values.reshape(-1, 1), y=y)
        self.unemployment_rate_enc.fit(X=X['UnemploymentRate'].values.reshape(-1, 1), y=y)

        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        X_.reset_index(inplace=True, drop=True)

        # Feature encoding
        # Province has too many levels, some are grouped before one-hot encoding can be applied
        X_['Province'] = X_['Province'].replace(to_replace=MyTransformer.PROVINCE_MAP)

        for lvl in set(MyTransformer.PROVINCE_MAP.values()):
            X_['Province' + str(lvl)] = X_['Province'].eq(lvl) * 1

        X_['Gender'] = self.gender_enc.transform(X=X_['Gender'].values.reshape(-1, 1))
        X_['DrugUse'] = self.yes_no_enc.transform(X=X_['DrugUse'].values.reshape(-1, 1))
        X_['Urban'] = self.yes_no_enc.transform(X=X_['Urban'].values.reshape(-1, 1))

        # Feature engineering
        X_['CappedIncome'] = np.minimum(X_['Income'], MyTransformer.MAX_INCOME)
        X_['CappedCoverage'] = np.minimum(X_['CoverageAmount'], MyTransformer.MAX_COVERAGE)
        X_['Coverage2IncomeRatio'] = X_['CappedCoverage'] / X_['CappedIncome']
        X_['BMI'] = X_['Weight'] / X_['Height'] ** 2
        X_['Urban_DrugUse'] = X_['Urban'] * X_['DrugUse']

        # Statistical imputation
        X_['AverageHouseholdSize'] = self.average_household_size_enc.transform(X=X_['AverageHouseholdSize'].values.reshape(-1, 1))
        X_['UnemploymentRate'] = self.unemployment_rate_enc.transform(X=X_['UnemploymentRate'].values.reshape(-1, 1))

        # Slicing the columns in case there are some temporary columns I need to discard
        return X_[MyTransformer.MODEL_FEATURES].sort_index(axis=1)

# Creating some mock data
n = 100

df = pd.DataFrame({
    'FalseDeclaration': np.random.choice(['Y', 'N'], size=n),
    'Gender': np.random.choice(['M', 'F'], size=n),
    'Urban': np.random.choice(['Y', 'N'], size=n),
    'DrugUse': np.random.choice(['Y', 'N'], size=n),
    'Income': np.rint(np.random.uniform(30_000, 200_000, size=n)),
    'CoverageAmount': 1000 * np.random.choice([5, 10, 15, 20, 30], size=n),
    'Province': np.random.choice(['QC', 'ON', 'AB', 'NS', 'PE'], size=n),
    'Height': np.round(np.random.uniform(1.20, 2.00, size=n), 2),
    'Weight': np.rint(np.random.uniform(40, 145, size=n)),
    'AverageHouseholdSize': np.random.choice(np.append(np.round(np.random.uniform(0.5, 5.0, size=4), 1), np.nan), size=n),
    'UnemploymentRate': np.random.choice(np.append(np.round(np.random.uniform(0.01, 0.1, size=4), 2), np.nan), size=n)
})

# Splitting the data into feature matrix and response vector
X = df.drop('FalseDeclaration', axis=1)
y = df['FalseDeclaration'].replace({'Y': 1, 'N': 0})

# Initializing the pipeline
p = Pipeline(steps=[
    ('Transfo', MyTransformer()),
    ('Scale', MyScaler()),
    ('Model', LogisticRegression(class_weight='balanced', solver='lbfgs'))
])

# Fitting the model
p.fit(X, y)

# Evaluating predictive scores (not in the sense of model scoring, but in the sense of predicted values)
# (I know we would usually split the data into a training and test set, but this is only a toy example)
scores = p.predict_proba(X)[:, 1]

# Compute the recall rate assuming the top 20% of cases will be audited
# (I would also need to assess this with regard to some model features, such as "gender" or "urban")
mask = scores > np.quantile(scores, 1 - 0.2)
recall = np.sum(y[mask]) / np.sum(y)
print(recall)
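
Because X is still a DataFrame at this point, the per-feature breakdown mentioned in the comment above stays a few lines of pandas. A hedged continuation of the MWE (it reuses X, y and scores from above; 'Gender' is just one of the mock columns):

# Break the audit recall down by a feature of interest; the rows of
# scores, y and X share the same order in this toy example
eval_df = pd.DataFrame({'score': scores, 'y': y, 'Gender': X['Gender']})
audited = eval_df[eval_df['score'] > eval_df['score'].quantile(0.8)]
print(audited.groupby('Gender')['y'].sum() / eval_df.groupby('Gender')['y'].sum())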

I am very interested in other people's experience with this. Has anyone else faced, or previously faced, this kind of practical problem, and if so, how did they overcome it?

Edit: added more features to the toy dataset so that I can better expose the difficulties I am running into, difficulties I was unable to solve simply by (a) resorting to a custom estimator and (b) forcing it to output a DataFrame object.

Basically, (a) there are chained transformations (e.g. Income and CoverageAmount are first capped, and only then is Coverage2IncomeRatio computed from them). Since raw numpy output means the loss of feature names (CappedIncome, for instance), I am not sure how to refer to those two in the second stage of the transformation, the one computing Coverage2IncomeRatio, other than by index. I find that tedious and error-prone, since any further modification of the processing pipeline could shift such indices left or right. (b) I am aware of sklearn's ColumnTransformer object, but again, names are the issue (see the sketch after this paragraph). (c) Some features are created only as a temporary measure, in order to build other, more complex features. Also, some raw features do not make the final cut (e.g. due to feature selection), hence the MODEL_FEATURES constant.
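
For reference on point (b), a minimal sketch of how ColumnTransformer can expose output names through get_feature_names_out (assumes scikit-learn >= 1.0, where get_feature_names_out and verbose_feature_names_out exist; the columns below are illustrative picks from the MWE, not the real pipeline):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Gender', 'Urban']),
        ('scale', StandardScaler(), ['Income'])
    ],
    remainder='drop',
    verbose_feature_names_out=False  # bare output names, no 'onehot__' prefixes
)

demo = pd.DataFrame({'Gender': ['M', 'F'], 'Urban': ['Y', 'N'],
                     'Income': [50_000.0, 80_000.0]})
ct.fit(demo)
print(ct.get_feature_names_out())
# e.g. ['Gender_F' 'Gender_M' 'Urban_N' 'Urban_Y' 'Income']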

Tags: python, pandas, machine-learning, scikit-learn, pipeline
