python - Find the column closest to the centroid - Pandas
Problem description
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=["State", "Adult", "Senior","Children"])
df.loc[0] = ["California", 111, 2, 6 ]
df.loc[1] = ["Texas", 70, 2, 4 ]
df.loc[2] = ["Florida", 64, 4, 5 ]
df.loc[3] = ["Georgia", 25, 2, 3 ]
df.loc[4] = ["Alaska", 90, 1, 2 ]
df.loc[5] = ["Hawaii", 105, 2, 1 ]
df.loc[6] = ["Washington", 27, 3, 2 ]
df.loc[7] = ["Pennsylvania", 90, 2, 1 ]
df.loc[8] = ["Virginia", 63, 2, 3 ]
df.loc[9] = ["Arizona", 34, 2, 4 ]
df.loc[10] = ["Michigan", 22, 5, 2 ]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Adult', 'Senior', 'Children']])
df['Cluster'] = y
centers = kmeans.cluster_centers_
plt.scatter(df.Adult, df.Senior, c=df.Cluster)
plt.scatter(centers[:,0],centers[:,1],color='black',marker='*',label='centroid')
plt.show()
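For the KMeans analysis of the states above, I want to extract/identify the elements of each cluster that are closest to that cluster's centroid.
Solution
Essentially, the KMeans implementation is based on Euclidean distance. To get the two points closest to each centroid, we can take the set of points belonging to each cluster, compute the 2-norm of the difference between each point and the corresponding centroid, and return the two nearest points: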
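import numpy as np
def get_2_closest(cluster_id, df, columns, centers):
    # Rows assigned to this cluster, restricted to the feature columns
    current = df[df["Cluster"] == cluster_id][columns]
    # Row positions ordered by Euclidean distance to this cluster's centroid
    closest = np.argsort(
        np.linalg.norm(current.to_numpy(dtype=np.float64) - centers[cluster_id], axis=1)
    )
    return current.iloc[closest[:2]]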
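To see what the norm-and-argsort step does in isolation, here is a minimal sketch on a hand-made cluster. The three points and the variable names (`points`, `centroid`, `order`) are assumptions for illustration only; the centroid is taken as the mean of the points, which is how k-means defines it.

```python
import numpy as np

# Hypothetical points (one per row) that all belong to a single cluster.
points = np.array([
    [90.0, 1.0, 2.0],
    [105.0, 2.0, 1.0],
    [90.0, 2.0, 1.0],
])

# Assumed centroid: the mean of the cluster's points.
centroid = points.mean(axis=0)

# Broadcasting subtracts the centroid from every row;
# axis=1 yields one Euclidean (2-norm) distance per row.
distances = np.linalg.norm(points - centroid, axis=1)

# argsort gives row positions ordered from nearest to farthest.
order = np.argsort(distances)
two_closest = points[order[:2]]

print(distances)
print(two_closest)
```

The same pattern scales to any number of feature columns, since the norm is always taken across `axis=1`.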
Full example in context:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=["State", "Adult", "Senior","Children"])
df.loc[0] = ["California", 111, 2, 6 ]
df.loc[1] = ["Texas", 70, 2, 4 ]
df.loc[2] = ["Florida", 64, 4, 5 ]
df.loc[3] = ["Georgia", 25, 2, 3 ]
df.loc[4] = ["Alaska", 90, 1, 2 ]
df.loc[5] = ["Hawaii", 105, 2, 1 ]
df.loc[6] = ["Washington", 27, 3, 2 ]
df.loc[7] = ["Pennsylvania", 90, 2, 1 ]
df.loc[8] = ["Virginia", 63, 2, 3 ]
df.loc[9] = ["Arizona", 34, 2, 4 ]
df.loc[10] = ["Michigan", 22, 5, 2 ]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[["Adult", "Senior", "Children"]])
df["Cluster"] = y
centers = kmeans.cluster_centers_
def get_2_closest(cluster_id, df, columns, centers):
    current = df[df["Cluster"] == cluster_id][columns]
    closest = np.argsort(
        np.linalg.norm(current.to_numpy(dtype=np.float64) - centers[cluster_id], axis=1)
    )
    return current.iloc[closest[:2]]
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
_closest = pd.concat(
    [get_2_closest(i, df, ["Adult", "Senior", "Children"], centers) for i in range(len(centers))]
)
plt.scatter(df.Adult, df.Senior, label="Original")
plt.scatter(_closest.Adult, _closest.Senior, label="2 Closest to Centroid")
plt.scatter(centers[:, 0], centers[:, 1], color="black", marker="*", label="centroid")
plt.legend()
plt.show()
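Expected output (figure not reproduced here): a scatter plot of Adult against Senior showing the original points, the two points closest to each centroid, and the centroids as black stars.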
Regarding the question raised in the comments: you can get the State column back by merging the two data frames on their index:
print(
_closest.merge(df, left_index=True, right_index=True)['State']
)
Output:
4 Alaska
7 Pennsylvania
6 Washington
3 Georgia
2 Florida
8 Virginia
0 California
5 Hawaii
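As a possible alternative to merging afterwards, the distance can be stored as a column so that State never leaves the frame. The sketch below is only an illustration under stated assumptions: it uses a subset of the rows, the Cluster labels are hand-assigned rather than produced by KMeans, and the per-cluster mean stands in for `kmeans.cluster_centers_`. The `Distance` column name is also made up here.

```python
import numpy as np
import pandas as pd

features = ["Adult", "Senior", "Children"]

# Subset of the example rows. The Cluster labels are hand-assigned
# for illustration; they are NOT the result of a KMeans run.
df = pd.DataFrame(
    {
        "State": ["California", "Hawaii", "Alaska", "Georgia", "Washington", "Michigan"],
        "Adult": [111, 105, 90, 25, 27, 22],
        "Senior": [2, 2, 1, 2, 3, 5],
        "Children": [6, 1, 2, 3, 2, 2],
        "Cluster": [0, 0, 0, 1, 1, 1],
    }
)

# Stand-in for kmeans.cluster_centers_: per-cluster mean of the features.
# Row k of `centers` corresponds to cluster label k (labels are 0..k-1).
centers = df.groupby("Cluster")[features].mean().to_numpy()

# Index `centers` by each row's label so every row is compared
# with the centroid of its own cluster.
diffs = df[features].to_numpy(dtype=np.float64) - centers[df["Cluster"].to_numpy()]
df["Distance"] = np.linalg.norm(diffs, axis=1)

# Sort by distance, then keep the first two rows of every cluster.
closest = df.sort_values("Distance").groupby("Cluster").head(2)
print(closest[["State", "Cluster", "Distance"]])
```

With a real model, `centers` would be replaced by `kmeans.cluster_centers_` and the `Cluster` column by the labels from `fit_predict`; the rest should carry over unchanged.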