python - Error when scaling features for clustering in Python and Sklearn
Problem description
I want to build a clustering model for my dataset using Python and the Scikit-Learn library. The dataset contains both continuous and categorical values. I have encoded the categorical values, but when I try to scale the features I get this error:
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.
The error is raised on this line:
features = scaler.fit_transform(features)
What am I doing wrong?
Here is my code:
features = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity',
               'UnitPrice', 'CustomerID', 'Country', 'Total Price']]

columns_for_scaling = ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice', 'CustomerID', 'Country', 'Total Price']

transformerVectoriser = ColumnTransformer(transformers=[('Encoding Invoice number', OneHotEncoder(handle_unknown="ignore"), ['InvoiceNo']),
                                                        ('Encoding StockCode', OneHotEncoder(handle_unknown="ignore"), ['StockCode']),
                                                        ('Encoding Description', OneHotEncoder(handle_unknown="ignore"), ['Description']),
                                                        ('Encoding Country', OneHotEncoder(handle_unknown="ignore"), ['Country'])],
                                          remainder='passthrough')  # Default is to drop untransformed columns

features = transformerVectoriser.fit_transform(features)
print(features.shape)

scaler = StandardScaler()
features = scaler.fit_transform(features)

sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(features.inertia_)
Data shape before preprocessing: (401604, 8)
Data shape after preprocessing: (401604, 29800)
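The root cause is that OneHotEncoder returns a scipy sparse matrix by default, and StandardScaler refuses to center sparse input because subtracting the mean would fill in the zeros and destroy sparsity. A minimal sketch reproducing the error (toy data, not the original dataset):

```python
import numpy as np
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([['a'], ['b'], ['a']])

# default OneHotEncoder output is a scipy sparse matrix
encoded = OneHotEncoder(handle_unknown='ignore').fit_transform(X)
print(scipy.sparse.issparse(encoded))  # True

# centering (with_mean=True, the default) is rejected for sparse input
try:
    StandardScaler().fit_transform(encoded)
except ValueError as e:
    print(e)
```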
Solution
If you set sparse=False when instantiating OneHotEncoder, StandardScaler() will work as expected.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features (set "sparse=False")
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)
features = transformerVectoriser.fit_transform(features)

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
Alternatively, you can convert the sparse matrix to a dense array with features = features.toarray().
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)
features = transformerVectoriser.fit_transform(features)
features = features.toarray()  # convert sparse matrix to array

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
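A third option, suggested by the error message itself, is StandardScaler(with_mean=False): it scales each feature to unit variance without centering it, so the matrix stays sparse. Given the (401604, 29800) shape after encoding, this avoids allocating an enormous dense array. A minimal sketch on toy data:

```python
import numpy as np
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([['a'], ['b'], ['a'], ['c']])
features = OneHotEncoder(handle_unknown='ignore').fit_transform(X)

# scale to unit variance only; no centering, so sparsity is preserved
scaler = StandardScaler(with_mean=False)
scaled = scaler.fit_transform(features)
print(scipy.sparse.issparse(scaled))  # True: output stays sparse
```

The trade-off is that the features are no longer zero-centered, which changes the distances KMeans computes; whether that is acceptable depends on the data.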