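python - Building a ColumnTransformer with numeric, categorical, and text pipelines
Question
I am trying to build a pipeline that handles numeric, categorical, and text variables. I want the transformed data written out to a new DataFrame before I run the classifier. I get the following error: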
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2499 and the array at index 2 has size 1
Note that 2499 is the size of my training data. If I remove the text_preprocessing part of the pipeline, my code works. Any ideas on how to get this working? Thanks!
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler()),
    ]
)
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer()),
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_features, numeric_preprocessing),
    (categorical_features, categorical_preprocessing),
    (text_features, text_preprocessing),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)
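Answer
I think you tried swapping the features and the pipelines in make_column_transformer but did not change them back when posting the question.
Assuming they are in the correct order, (estimator, column/s), this error occurs when the vectorizer is given a list of column names in the ColumnTransformer. A string selector passes the column to the transformer as 1D data, whereas a list selector passes 2D data. All vectorizers in sklearn accept only 1D input (an iterable of strings, such as a pd.Series), so they cannot be applied to several columns at once.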
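The difference between the two selector styles can be reproduced outside of ColumnTransformer. The following is a minimal sketch, assuming a small made-up DataFrame: it passes the same column to TfidfVectorizer once as a Series and once as a one-column DataFrame, and compares how many rows come back.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data only
df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better']})

# 1D selection: a Series, which iterates over the three documents
X_series = TfidfVectorizer().fit_transform(df['summary'])

# 2D selection: a DataFrame, which iterates over its column labels,
# so the vectorizer sees a single "document" (the string 'summary')
X_frame = TfidfVectorizer().fit_transform(df[['summary']])

print(X_series.shape[0])  # one row per sample
print(X_frame.shape[0])   # one row in total
```

The single row produced in the 2D case is what later fails to stack next to the numeric and categorical outputs, which have one row per sample.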
Example:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
x_train = pd.DataFrame({'fruit': ['apple', 'orange', np.nan],
                        'score': [np.nan, 12, 98],
                        'summary': ['Great performance',
                                    'fantastic performance',
                                    'Could have been better']}
                       )
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler()),
    ]
)
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer()),
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_preprocessing, ['score']),
    (categorical_preprocessing, ['fruit']),
    (text_preprocessing, 'summary'),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)
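If I change
(text_preprocessing, 'summary'),
to
(text_preprocessing, ['summary']),
it throws
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3 and the array at index 2 has size 1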
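If there are several text columns, one possible workaround (a sketch, not something from the original question) is to register a separate vectorizer for each column, each with a string selector. The second column, 'comment', and the list name text_columns are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data; 'comment' is a hypothetical second text column
df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better'],
                   'comment': ['would watch again',
                               'highly recommended',
                               'not my favourite']})

text_columns = ['summary', 'comment']

# One (vectorizer, column) pair per text column; each selector is a
# plain string, so every vectorizer receives 1D input
preprocessing = make_column_transformer(
    *[(TfidfVectorizer(), col) for col in text_columns]
)

X = preprocessing.fit_transform(df)
print(X.shape[0])  # one row per sample
```

Each column gets its own vocabulary this way, and the resulting blocks are stacked side by side.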