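python - Custom transformer mixin with FeatureUnion in scikit-learn
Question
I am writing custom transformers in scikit-learn to perform specific operations on an array. To do this, I inherit from the TransformerMixin class. It works fine when I use a single transformer. However, when I chain several of them with FeatureUnion (or make_union), the original array is replicated n times. What can I do to avoid this? Am I using scikit-learn the way it is meant to be used?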
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion

# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')

# A fake example that appends a column (could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X

# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')

# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1), transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated

# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2)])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')
Output:
base array:
[['foo' 'a']
['bar' 'b']
['baz' 'c']]
single transformer:
[['foo' 'a' '1']
['bar' 'b' '1']
['baz' 'c' '1']]
Given result of the Feature union pipeline:
[['foo' 'a' '1' 'foo' 'a' '2']
['bar' 'b' '1' 'bar' 'b' '2']
['baz' 'c' '1' 'baz' 'c' '2']]
Expected result of the Feature Union pipeline:
[['foo' 'a' '1' '2']
['bar' 'b' '1' '2']
['baz' 'c' '1' '2']]
Thanks a lot.
Solution
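FeatureUnion simply concatenates whatever it receives from its inner transformers. Right now, each of your inner transformers returns the original columns along with its new column, so the original columns appear once per transformer. It is up to each transformer to pass on only the data that should appear in the output.
I suggest returning only the new data from each inner transformer, and then concatenating the original columns separately, either outside or inside the FeatureUnion.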
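To make the concatenation behaviour concrete, here is a minimal sketch. The class name AppendConstant and the numeric array X_demo are made up for this illustration and are not part of the question; the class also inherits from BaseEstimator, which is an addition here. Under these assumptions, the output of the FeatureUnion should match a plain np.hstack of each transformer's own output, which is why the original columns show up once per transformer.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class AppendConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: returns X with one constant column appended."""

    def __init__(self, value=0):
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn in this toy example.
        return self

    def transform(self, X):
        extra = np.full((X.shape[0], 1), self.value)
        return np.hstack([X, extra])


# Illustrative numeric data (assumption, not the question's array).
X_demo = np.arange(6).reshape(3, 2)
t1, t2 = AppendConstant(value=1), AppendConstant(value=2)

# What FeatureUnion returns ...
union_out = FeatureUnion([("t1", t1), ("t2", t2)]).fit_transform(X_demo)
# ... compared with stacking each transformer's output side by side by hand.
manual_out = np.hstack([t1.fit_transform(X_demo), t2.fit_transform(X_demo)])

print(union_out.shape)
print(np.array_equal(union_out, manual_out))
```

Each transformer emits three columns (two original plus one new), so the union ends up with six, and the first two columns are repeated.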
If you haven't already, take a look at this example:
For example, you could do something like this:
# This doesn't do anything, it just passes the data through unchanged
class DataPasser(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    # Changed this to only return the new column after some operation on X
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        return s.reshape(-1, 1)
After that, change the rest of your code as follows:
stages = []
# Append our DataPasser here, so the original data is at the beginning
stages.append(('no_change', DataPasser()))
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1), transfo))
pipeunion = FeatureUnion(stages)
The result of running this new code is:
('Given result of the Feature union pipeline: \n',
array([['foo', 'a', '1', '2'],
['bar', 'b', '1', '2'],
['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n',
array([['foo', 'a', '1', '2'],
['bar', 'b', '1', '2'],
['baz', 'c', '1', '2']], dtype='|S21'), '\n')
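The other option mentioned above, concatenating the original columns outside the FeatureUnion, could be sketched roughly as follows. The class name ConstantColumn is hypothetical, it inherits from BaseEstimator as an addition, and the explicit astype(str) cast is an assumption made here so that the string and integer columns can be stacked without relying on implicit dtype promotion.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ConstantColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: returns ONLY the new column, not X itself."""

    def __init__(self, value=None):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One column with as many rows as X, filled with the constant.
        return np.full((X.shape[0], 1), self.value)


# Same data as in the question.
X = np.column_stack([np.array(['foo', 'bar', 'baz']),
                     np.array(['a', 'b', 'c'])])

# The union now produces only the new columns ...
new_only = FeatureUnion([('step1', ConstantColumn(value=1)),
                         ('step2', ConstantColumn(value=2))])
new_cols = new_only.fit_transform(X)

# ... and the original columns are attached once, outside the union.
result = np.hstack([X, new_cols.astype(str)])
print(result)
```

This avoids the extra pass-through transformer, at the cost of doing the final concatenation by hand rather than inside the pipeline object.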