Overwriting a dask dataframe column with an sklearn scaler

Problem description

I have a dask dataframe (shown as an image in the original post).

I want to apply an sklearn scaler to it, e.g. to the LotArea column:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(df[['LotArea']])

This returns a numpy array:

array([[ 0.82160041],
       [ 1.59216945],
       [ 1.46485804],
       [-0.11648362],
       [-1.10613315],
       [ 0.34906243],
       [-0.23942507],
       [-0.11648362],
       [ 0.40033659],
       [-0.11706628],
       [-0.85762828],
       [-2.07480689]])

But I cannot write the result back to the dataframe with:

df[column] = (scaler.fit_transform(df[[column]]))

It raises the following error:

TypeError: Column assignment doesn't support type numpy.ndarray

I tried converting the result to a dask array, but the assignment fails in the same way:

df['LotArea'] = da.from_array(scaler.fit_transform(df[[column]]))

TypeError: Column assignment doesn't support type dask.array.core.Array

How can I update the dataframe with the scaler's output?

Tags: python, arrays, scikit-learn, dask

Solution


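This boils down to "how do I add a column to a Dask DataFrame": wrap the dask array in a dask Series that shares the dataframe's index, then assign that Series.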

In [22]: df = pd.DataFrame({"A": [1, 2, 3, 4]})

In [23]: ddf = dd.from_pandas(df, 2)

In [24]: b = da.from_array(np.array([1, 2, 3, 4]), chunks=2)

In [25]: ddf['B'] = dd.from_dask_array(b, index=ddf.index)

In [26]: ddf.head()
/Users/taugspurger/sandbox/dask/dask/dataframe/core.py:5724: UserWarning: Insufficient elements for `head`. 5 elements requested, only
2 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[26]:
   A  B
0  1  1
1  2  2

This may become easier in Dask in the future. See https://github.com/dask/dask/issues/5118

