Error when scaling features for clustering in Python and Sklearn

Problem description

I want to build a clustering model for my dataset with Python and the Scikit-Learn library. The dataset contains both continuous and categorical values. I have already encoded the categorical values, but when I try to scale the features I get this error:

"Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

The error is raised on this line:

features = scaler.fit_transform(features)

What am I doing wrong?

Here is my code:

features = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity',
               'UnitPrice', 'CustomerID', 'Country', 'Total Price']]

columns_for_scaling = ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice', 'CustomerID', 'Country', 'Total Price']

transformerVectoriser = ColumnTransformer(transformers=[('Encoding Invoice number', OneHotEncoder(handle_unknown = "ignore"), ['InvoiceNo']),
                                                        ('Encoding StockCode', OneHotEncoder(handle_unknown = "ignore"), ['StockCode']),
                                                        ('Encoding Description', OneHotEncoder(handle_unknown = "ignore"), ['Description']),
                                                        ('Encoding Country', OneHotEncoder(handle_unknown = "ignore"), ['Country'])],
                                          remainder='passthrough') # Default is to drop untransformed columns

features = transformerVectoriser.fit_transform(features)
print(features.shape)

scaler = StandardScaler()
features = scaler.fit_transform(features)

sum_of_squared_distances = []
for k in range(1,16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)

Data shape before preprocessing: (401604, 8). Data shape after preprocessing: (401604, 29800).

Tags: python, machine-learning, scikit-learn, k-means

Solution


By default, OneHotEncoder returns a sparse matrix, which StandardScaler cannot center. If you set sparse=False when instantiating the OneHotEncoder, StandardScaler() will work as expected.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features (set "sparse=False") 
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
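
Note that the sparse parameter used above applies to scikit-learn releases before 1.2: the parameter was renamed to sparse_output in version 1.2, and the old name was removed in 1.4. On a recent version the encoders would be instantiated like this (same idea, newer spelling):

# scikit-learn >= 1.2: "sparse" was renamed to "sparse_output"
OneHotEncoder(sparse_output=False, handle_unknown='ignore')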

Alternatively, you can keep the default sparse output and convert it to a dense array with features = features.toarray():

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)
features = features.toarray() # convert sparse matrix to array

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
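
With the shapes from the question, (401604, 29800), a dense float64 array would take roughly 95 GB of memory, so both approaches above can become impractical at that scale. As the error message itself suggests, a third option is to pass with_mean=False to StandardScaler: it scales each column to unit variance without centering, so the matrix stays sparse, and KMeans accepts sparse input directly. A minimal sketch, starting from the sparse output of the ColumnTransformer (i.e., without calling toarray()):

# scale without centering so the matrix stays sparse
scaler = StandardScaler(with_mean=False)
features = scaler.fit_transform(features)  # still a sparse matrix

# run the cluster analysis on the sparse matrix
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k).fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)

The trade-off is that skipping the centering step means the data is not fully standardized (the column means are not removed); in exchange, the memory footprint stays small. As before, you can plot sum_of_squared_distances against k to find the elbow.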
