How to retrieve clustering computations inside a class?

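Problem description

I am experimenting with a KM-based algorithm, the so-called ODKM, which uses the KMeans clustering algorithm.

I would like to elegantly retrieve the clustering information cluster_centers_, labels_, cluster_score, effect, and dist from class ODKM: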

import math
from math import pow
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


class ODKM:
    
    def __init__(self,n_clusters=15,effectiveness=500,max_iter=2):
        self.n_clusters=n_clusters
        self.effectiveness=effectiveness
        self.max_iter=max_iter
        self.kmeans = {}
        self.cluster_score = {}
        
    def fit(self, data):
        length = len(data)
        for column in data.columns:
            kmeans = KMeans(n_clusters=self.n_clusters,max_iter=self.max_iter)
            self.kmeans[column]=kmeans
            kmeans.fit(data[column].values.reshape(-1,1))
            assign = pd.DataFrame(kmeans.predict(data[column].values.reshape(-1,1)),columns=['cluster'])
            cluster_score=assign.groupby('cluster').apply(len).apply(lambda x:x/length)
            ratio=cluster_score.copy()
        
            sorted_centers = sorted(kmeans.cluster_centers_)
            max_distance = ( sorted_centers[-1] - sorted_centers[0] )[ 0 ]
        
            for i in range(self.n_clusters):
                for k in range(self.n_clusters):
                    if i != k:
                        # index the single feature explicitly so dist is a scalar, not a 1-element array
                        dist = abs(kmeans.cluster_centers_[i, 0] - kmeans.cluster_centers_[k, 0])/max_distance
                        effect = ratio[k]*(1/pow(self.effectiveness,dist))
                        cluster_score[i] = cluster_score[i]+effect
                        
            self.cluster_score[column] = cluster_score
                    
    def predict(self, data):
        length = len(data)
        score_array = np.zeros(length)
        for column in data.columns:
            kmeans = self.kmeans[ column ]
            cluster_score = self.cluster_score[ column ]
            
            assign = kmeans.predict( data[ column ].values.reshape(-1,1) )
            #print(assign)
            
            for i in range(length):
                score_array[i] = score_array[i] + math.log10( cluster_score[assign[i]] )
            
        return score_array
    
    def fit_predict(self,data):
        self.fit(data)
        return self.predict(data)

Test results:

import pandas as pd

df = pd.DataFrame(data={'attr1':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                         'attr2':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15]})


odkm_model = ODKM(n_clusters=3, max_iter=1)
result = odkm_model.fit_predict(df)

df['ODKM_Score']= result 
df

#for i in result:
#    print(round(i,2))

#results
#-0.51, -0.51 , -0.51 , -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51
#-0.78, -0.78, -0.78, -0.78, -0.78, -0.78, -0.78
#-0.99, -0.99, -0.99, -0.99
#-1.99

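So the question is: is there an elegant way to include the clustering information in class ODKM and return it when I run the class, so that the results are reflected in the main dataframe? For example, alongside df['ODKM_Score']= result, how could we have df['Cluster_labels']= result, df['Cluster_centers']= result, and df['cluster_score']= result, so that the main dataframe df holds all the information and paves the way for visualizing the clustering results?

Usually, in a script without a class, I would do this using km.labels_ and km.cluster_centers_: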

import matplotlib.pyplot as plt

n_clusters=3

km = KMeans(init='k-means++', n_clusters=n_clusters).fit(df[['Score']])

counts = np.bincount(km.labels_)

for center, count, label in zip(km.cluster_centers_, counts, range(n_clusters)):
    print(center, count)
    plt.bar(center, count, width=0.2, label=label)

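But I wonder whether I could collect this clustering information after fitting and transforming the model, perhaps by defining a function at the end of the class named KM_summary.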

Any help would be greatly appreciated.

Tags: python, pandas, class, scikit-learn, k-means

Solution
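One possible approach, offered as a minimal sketch rather than a definitive answer: since the class already keeps one fitted KMeans per column in a dict, the cluster information can be derived from that dict after fitting. The helper names cluster_info and km_summary, the generated column names such as attr1_label, and the size_fraction stand-in scores are all assumptions made for illustration; they are not part of ODKM or scikit-learn. The demo fits plain KMeans models directly so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_info(kmeans_by_column, data, score_by_column=None):
    """Return one row per sample with the label and center for each column.

    kmeans_by_column: dict of column name -> fitted KMeans (shaped like ODKM.kmeans)
    score_by_column:  optional dict of column name -> per-cluster scores indexed
                      by cluster id (shaped like ODKM.cluster_score)
    """
    info = pd.DataFrame(index=data.index)
    for column, km in kmeans_by_column.items():
        labels = km.predict(data[column].to_numpy().reshape(-1, 1))
        info[f"{column}_label"] = labels
        # look up each sample's center through its label
        info[f"{column}_center"] = km.cluster_centers_[labels, 0]
        if score_by_column is not None:
            scores = pd.Series(score_by_column[column])
            # map every label to the score of its cluster
            info[f"{column}_cluster_score"] = pd.Series(labels).map(scores).to_numpy()
    return info


def km_summary(kmeans_by_column):
    """Return one row per (column, cluster) with the center and member count."""
    rows = []
    for column, km in kmeans_by_column.items():
        counts = np.bincount(km.labels_, minlength=km.n_clusters)
        for cluster_id in range(km.n_clusters):
            rows.append({
                "column": column,
                "cluster": cluster_id,
                "center": km.cluster_centers_[cluster_id, 0],
                "count": int(counts[cluster_id]),
            })
    return pd.DataFrame(rows)


# Demo on the data from the question, using plain KMeans models per column.
df = pd.DataFrame(data={
    'attr1': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
    'attr2': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15],
})
models = {
    col: KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[[col]].to_numpy())
    for col in df.columns
}
# Stand-in scores (assumption): the fraction of samples in each cluster,
# used here only to exercise the score_by_column argument.
size_fraction = {
    col: pd.Series(np.bincount(km.labels_, minlength=km.n_clusters) / len(df))
    for col, km in models.items()
}

info = cluster_info(models, df, size_fraction)
summary = km_summary(models)
result = df.join(info)  # all per-row information in one dataframe
print(summary)
```

With the ODKM class from the question, the same helpers could presumably be called as cluster_info(odkm_model.kmeans, df[['attr1', 'attr2']], odkm_model.cluster_score) after fit_predict, selecting only the original columns because df has gained ODKM_Score by then. The body of km_summary could equally be moved into the class as a KM_summary method that reads self.kmeans. Intermediate values such as effect and dist are local to fit, so they would need to be stored on self during fitting if they are to be retrieved later.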

