Applying scaling and PCA to a subset of columns in a ColumnTransformer

Problem description

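I have a dataset and want to apply scaling and then PCA to a subset of the columns of a pandas DataFrame, returning only the components and the untransformed columns. Using the mpg dataset from seaborn, the training set for predicting mpg looks like this:

[Image of the training set, not reproduced]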

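Now suppose I want to leave cylinders and displacement alone, scale everything else, and reduce it to 2 components. I expect the result to have 4 columns: the 2 original columns plus the 2 components.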

How can I use ColumnTransformer to scale a subset of the columns, then apply PCA to them, and return only the components and the 2 passthrough columns?

MWE

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer

df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()

X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21) 


scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))

preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)

pd.DataFrame(trans)

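I strongly suspect that I misunderstand how this step works: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough'). I thought it would operate on the last 4 columns, scaling them first and then applying PCA, and finally return 2 components. Instead I get 8 columns: the first 4 are the scaled columns, the next 2 appear to be the components (possibly computed without scaling first), and the last 2 are the columns I passed through.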
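The 8 columns are consistent with how ColumnTransformer is documented to behave: each entry in transformers is fitted independently on the original input columns, and the outputs are concatenated side by side, so nothing is chained. A minimal sketch that checks this, assuming a synthetic stand-in for X_train (X_demo and its values are made up here so the example runs without downloading the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train: six numeric columns on different scales
rng = np.random.default_rng(0)
cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
X_demo = pd.DataFrame(rng.normal(size=(50, 6)) * [1, 100, 40, 800, 3, 4], columns=cols)

dtm_i = list(range(2, len(X_demo.columns)))  # positions of the last 4 columns

parallel = ColumnTransformer(
    transformers=[("scaler", StandardScaler(), dtm_i), ("pca", PCA(n_components=2), dtm_i)],
    remainder="passthrough",
)
out = parallel.fit_transform(X_demo)

# 4 scaled columns + 2 PCA columns + 2 passthrough columns
n_columns = out.shape[1]

# PCA fitted directly on the unscaled columns, for comparison
pca_on_raw = PCA(n_components=2).fit_transform(X_demo.iloc[:, dtm_i])
pca_saw_raw_data = np.allclose(out[:, 4:6], pca_on_raw)
```

If n_columns is 8 and pca_saw_raw_data is True, the PCA entry never received the scaler's output; it was fitted on the raw columns.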

Tags: python, pandas, scikit-learn

Solution


I think this works, but I don't know whether it is the idiomatic Python/scikit-learn way to solve it:

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer

df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()

X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21) 


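scaler = StandardScaler()
pca = PCA(n_components = 2)
# Positions of the last 4 columns in X_train (the ones to scale)
dtm_i = list(range(2, len(X_train.columns)))
# After the first step the scaled columns come first, so PCA targets positions 0-3
dtm_i2 = list(range(0, len(X_train.columns)-2))

# Step 1: scale the last 4 columns; the 2 passthrough columns are appended on the right
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
# Step 2: run PCA on the 4 scaled columns, again passing the other 2 through
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)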

pd.DataFrame(trans)
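
A commonly used alternative is to chain the two steps in a Pipeline and hand that pipeline to a single ColumnTransformer, so scaling and PCA are fitted together and only the components plus the passthrough columns come back. A minimal sketch, assuming a synthetic stand-in for X_train (X_demo, the step names, and the output labels pc1/pc2 are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train with the same six feature names
rng = np.random.default_rng(0)
cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
X_demo = pd.DataFrame(rng.normal(size=(50, 6)) * [1, 100, 40, 800, 3, 4], columns=cols)

to_reduce = cols[2:]  # columns to scale and then reduce with PCA

# Pipeline runs its steps in sequence: scale first, then PCA on the scaled values
scale_then_pca = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# One transformer entry for the chained steps; untouched columns are passed through
preprocess = ColumnTransformer(
    transformers=[("scale_pca", scale_then_pca, to_reduce)],
    remainder="passthrough",
)

trans = preprocess.fit_transform(X_demo)

# Transformer outputs come first, passthrough columns are appended on the right
result = pd.DataFrame(trans, columns=["pc1", "pc2", "cylinders", "displacement"])
```

Because the whole preprocessing lives in one object, the same fitted preprocess can later be applied to the test split with preprocess.transform, which the two-step version would need to do in two calls.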
