Running the K-Means clustering algorithm and getting strange results

Problem description

I am running the code below.

import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt

pd.set_option('display.max_columns', 500)
df = pd.read_csv('C:\\in_path\\raw_data.csv')
print('done!')

df = df[:100000]
df = df.fillna(0)

dataset = df[['AcctNo', 'PriceBin', 'CouponBin', 'RatingScore', 'Term', 'Score', 'Spread']].copy() # 'Rating' # 'HQLACategoryOne'

# format the two clustering features as a 2-column numpy array for K-Means
data = dataset[['Spread', 'Score']].to_numpy()

centroids,_ = kmeans(data,500)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# plot the samples colored by cluster ID, with the centroids overlaid in red
plt.scatter(data[:, 0], data[:, 1], c=idx, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.5)
plt.show()

# pair each account number with its cluster ID
details = [(name, cluster) for name, cluster in zip(dataset.AcctNo, idx)]
for detail in details:
    print(detail)

details_df = pd.DataFrame(details, columns=['AcctNo', 'ClusterID'])
# join the cluster IDs back onto the features by account number
finalDF = pd.merge(dataset, details_df, on='AcctNo', how='inner')

finalDF.to_csv('C:\\out_path\\test.csv')

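I am looking at 'Spread' relative to 'Score'. These two features are only slightly positively correlated, at about 14%. I don't know whether that is the problem, but I expected the clustering to be based on 'Spread' and 'Score', and yet my ClusterID does not seem to be related to either variable. Am I doing something wrong? What am I missing?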

In the end, I get these results (just a small sample).

AcctNo      PriceBin    CouponBin   RatingScore Term    Score   Spread  ClusterID
A85771075   0           5           0           30      9.75    0.13    3
A16898795   0           7           0           30      9.75    0.13    3
A87632163   0           5           0           30      9.75    0.06    7
A32073695   0           5           0           30      9.75    0.06    7
A05966021   -1          2           0           29      9.75    0.12    2
A38865245   0           4           0           30      9.75    0.07    10
A17800838   0           3           0           30      9.75    0.06    7
A19974047   0           6           0           15      9.75    0.16    3
A93145719   0           6           0           15      9.75    0.16    3
A32581133   0           6           0           15      9.75    0.16    3
A56322331   0           6           0           15      9.75    0.16    3
A67851213   0           6           0           15      9.75    0.16    3
A51232438   0           6           0           15      9.75    0.16    3
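
One way to check whether the assignments track the features is to summarize each feature per cluster rather than compare the raw ClusterID numbers, which are arbitrary labels with no ordering. The sketch below re-enters the sample rows above by hand (Spread, Score and ClusterID only). It is an illustration, not part of the original script, and the names `sample` and `summary` are assumptions.

```python
import pandas as pd

# Sample rows copied from the table above (Spread, Score, ClusterID only).
sample = pd.DataFrame({
    'Spread': [0.13, 0.13, 0.06, 0.06, 0.12, 0.07, 0.06,
               0.16, 0.16, 0.16, 0.16, 0.16, 0.16],
    'Score': [9.75] * 13,
    'ClusterID': [3, 3, 7, 7, 2, 10, 7, 3, 3, 3, 3, 3, 3],
})

# Minimum and maximum of each feature within each cluster. Narrow ranges that
# do not overlap between clusters mean the IDs follow the features, even
# though the ID numbers themselves carry no order.
summary = sample.groupby('ClusterID')[['Spread', 'Score']].agg(['min', 'max'])
print(summary)
```

In these 13 rows each cluster covers its own Spread range while Score is constant, which suggests the assignments do depend on Spread. Whether that holds for the full data set cannot be told from the sample alone.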

Tags: python, python-3.x, machine-learning, scikit-learn

Solution
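
No answer is recorded here, so what follows is only a possible direction, not a confirmed diagnosis. Two things in the posted script could make the output hard to read: the features are on very different scales (Spread is around 0.1, Score around 10), and the cluster IDs are joined back through a merge instead of being attached by row position. The sketch below assumes synthetic data in place of the real CSV and an arbitrary `k = 5`. It rescales the features with `scipy.cluster.vq.whiten`, which the SciPy documentation recommends before calling `kmeans`, and assigns `idx` directly as a new column.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic stand-in for the real CSV (hypothetical values, illustration only).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    'AcctNo': [f'A{i:08d}' for i in range(300)],
    'Spread': rng.uniform(0.0, 0.5, size=300),
    'Score': rng.uniform(0.0, 10.0, size=300),
})

# Rescale each feature to unit variance so the feature with the larger
# numeric range (Score here) does not dominate the Euclidean distance.
features = whiten(dataset[['Spread', 'Score']].to_numpy())

# k = 5 is an arbitrary choice for this sketch, not a recommendation.
centroids, _ = kmeans(features, 5)
idx, _ = vq(features, centroids)

# idx is in the same row order as dataset, so it can be attached directly;
# no merge on the account number is needed.
dataset['ClusterID'] = idx

# Per-cluster feature means show what each (arbitrarily numbered) cluster covers.
print(dataset.groupby('ClusterID')[['Spread', 'Score']].mean())
```

Attaching the labels by position avoids any row duplication or misalignment that a merge on a non-unique key could introduce. The ClusterID values are still just labels, so a higher ID does not imply a higher Spread or Score.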

