Creating an elbow plot with k-means clustering

Problem description

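I am clustering a large number of texts with K-means. I am now trying to determine the optimal number of clusters by creating an elbow plot, but so far without success.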

My code looks like this, starting with the corpus:

corpus = df['content'].tolist()
language = 'dutch'
corpus = processCorpus(corpus, language)

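This is followed by several functions that remove stop words, drop words shorter than 2 letters, apply stemming, and so on. After that comes: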


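import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix
x_norm = normalize(X)                 # still sparse after normalization
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_idf = pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out())

final_df = tf_idf
final_df.to_excel("test_defensie.xlsx")

print("{} rows".format(final_df.shape[0]))
final_df.T.nlargest(5, 0)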

from sklearn import cluster

def run_KMeans(max_k, data):
    """Fit one KMeans model for each k in 2..max_k and return them keyed by k."""
    kmeans_results = dict()
    for k in range(2, max_k + 1):
        # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
        kmeans = cluster.KMeans(n_clusters=k,
                                init='k-means++',
                                n_init=10,
                                tol=0.0001,
                                random_state=1,
                                algorithm='elkan')
        kmeans_results[k] = kmeans.fit(data)
    return kmeans_results

def printAvg(avg_dict):
    for avg in sorted(avg_dict.keys(), reverse=True):
        print("Avg: {}\tK:{}".format(avg.round(4), avg_dict[avg]))

# Running Kmeans
k = 12
kmeans_results = run_KMeans(k, final_df)
best_result = 8
kmeans = kmeans_results.get(best_result)

final_df_array = final_df.to_numpy()
prediction = kmeans.predict(final_df)
n_feats = 20
# get_top_features_cluster is a user-defined helper that is not shown here
dfs = get_top_features_cluster(final_df_array, prediction, n_feats)

I tried adding the following code, but it raises this error: "PCA does not support sparse input. See TruncatedSVD for a possible alternative."

number_clusters = range(1, 10)
sklearn_pca = PCA(n_components = 2)
Y_sklearn = sklearn_pca.fit_transform(x_norm_array)
kmeans = [KMeans(n_clusters=i, max_iter = 600) for i in number_clusters]
kmeans
score = [kmeans[i].fit(Y_sklearn).score(Y_sklearn) for i in range(len(kmeans))]
score
plt.plot(number_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Method')
plt.show()

Tags: python, nlp, cluster-analysis, k-means

Solution
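
No answer was recorded on this page, so what follows is only a minimal sketch of one possible approach, not a confirmed fix. It follows the hint in the error message and swaps PCA for TruncatedSVD, which accepts sparse TF-IDF matrices. It plots `inertia_` (the within-cluster sum of squares) rather than `score`, because `KMeans.score` returns the negative of that value. The toy corpus and the helper names `elbow_inertias` and `plot_elbow` are hypothetical and only stand in for the preprocessed texts from the question.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def elbow_inertias(data, k_values, random_state=1):
    """Fit KMeans once per k and collect the inertia of each fit."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, max_iter=600,
                       random_state=random_state)
        model.fit(data)
        inertias.append(model.inertia_)
    return inertias


def plot_elbow(k_values, inertias):
    """Plot inertia against k; the elbow is where the curve starts to flatten."""
    fig, ax = plt.subplots()
    ax.plot(list(k_values), inertias, marker="o")
    ax.set_xlabel("Number of clusters")
    ax.set_ylabel("Inertia")
    ax.set_title("Elbow method")
    return fig


# Hypothetical toy corpus standing in for the preprocessed texts.
toy_corpus = [
    "leger marine luchtmacht missie",
    "marine schip haven missie",
    "luchtmacht vliegtuig piloot",
    "begroting minister kamer debat",
    "minister kamer wet debat",
    "begroting wet belasting",
    "piloot vliegtuig training",
    "schip haven training",
]

X_sparse = TfidfVectorizer().fit_transform(toy_corpus)  # scipy sparse matrix

# TruncatedSVD works directly on sparse input, unlike PCA.
svd = TruncatedSVD(n_components=2, random_state=1)
X_reduced = svd.fit_transform(X_sparse)

k_values = range(1, 6)
inertias = elbow_inertias(X_reduced, k_values)
fig = plot_elbow(k_values, inertias)
# plt.show()  # uncomment to display the plot interactively
```

The dimensionality reduction step is optional here: `KMeans` also accepts sparse matrices, so `elbow_inertias(X_sparse, k_values)` should work on the TF-IDF matrix directly. Reducing to 2 components mirrors the code in the question, but it discards most of the variance in a text corpus, so the elbow found on the reduced data may differ from the one found on the full matrix.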

