Loop or not loop?

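Problem description

I want to run kmeans() on a dataset, but I want the second argument of kmeans() to range over 1:10, i.e. I want to try K from 1 to 10.

The dataset is generated by the following code: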

library(tidyverse)

centers     <- data.frame( grps  = 1:5,
                           gsize = c(1000, 500, 750, 900, 800),
                           m1    = c(  -2,  -1,   0,   1,   2),
                           m2    = c(   0,   3,   1,   2,   4),
                           m3    = c(   1,   4,   2,   5,  -1),
                           m4    = c(   2,  -3,   4,  -1,   1) )

# training set generation
kd          <- centers        %>%
  group_by(grps) %>%
  do(data.frame( v1= rnorm(.$gsize[1], .$m1[1]),
                 v2= rnorm(.$gsize[1], .$m2[1]),
                 v3= rnorm(.$gsize[1], .$m3[1]),
                 v4= rnorm(.$gsize[1], .$m4[1])) ) 
minClusters <- 1
maxClusters <- 10

kclust  <- kd                                   %>%
  crossing(k= minClusters:maxClusters) %>%
  group_by(k)                          %>%
  do(clust= kmeans(select(., v1, v2, v3, v4), 
                   .$k[1], 
                   nstart=5) )

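So I'm confused about whether the kclust object is obtained from a loop. I think it is not, because the second argument to kmeans() is the fixed number "1". Maybe I have misunderstood something? Thanks!

Tags: r, tidyverse, k-means

Solution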


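Your code works fine and the output is as expected. The issue is that you generate the data from random numbers with means that lie close together, so the clusters blend into one big cluster and most points appear to be assigned to a single cluster. For example, if you inspect the 4th entry of kclust$clust, you will see 4 cluster centers, but in the clustering vector below them most of the printed points are assigned to cluster 1.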
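On the loop question itself: `crossing(k = ...)` repeats the data once per value of k, `group_by(k)` splits it back into one group per k, and inside `do()` the expression `.$k[1]` is that group's own k rather than a fixed 1. A rough base R equivalent is sketched below. It assumes a small made-up matrix `x` as stand-in data and is not the original pipeline.

```r
# Hypothetical stand-in data: 300 points around three well-separated means.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = -5), ncol = 2),
           matrix(rnorm(200, mean =  0), ncol = 2),
           matrix(rnorm(200, mean =  5), ncol = 2))

minClusters <- 1
maxClusters <- 10

# One kmeans() fit per k, which is what the grouped do() call amounts to.
fits <- lapply(minClusters:maxClusters,
               function(k) kmeans(x, centers = k, nstart = 5))

# Number of clusters in each fit: it should match the k that was requested.
n_clusters <- sapply(fits, function(fit) length(fit$size))
print(n_clusters)
```

Each element of `fits` plays the role of one row of `kclust$clust`.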

library(tidyverse)
centers = data.frame( grps  = 1:5,
        gsize = c(1000, 500, 750, 900, 800),
        m1    = c(  -2,  -1,   0,   1,   2),
        m2    = c(   0,   3,   1,   2,   4),
        m3    = c(   1,   4,   2,   5,  -1),
        m4    = c(   2,  -3,   4,  -1,   1) )

# training set generation
kd = centers %>%
  group_by(grps) %>%
  do(data.frame( v1= rnorm(.$gsize[1], .$m1[1]),
                 v2= rnorm(.$gsize[1], .$m2[1]),
                 v3= rnorm(.$gsize[1], .$m3[1]),
                 v4= rnorm(.$gsize[1], .$m4[1])) ) 

minClusters = 1
maxClusters = 10

kclust  <- kd %>%
  crossing(k = minClusters:maxClusters) %>%
  group_by(k) %>%
  do(clust = kmeans(select(., v1, v2, v3, v4), .$k[1], nstart=5))

> kclust$clust[4]
[[1]]
K-means clustering with 4 clusters of sizes 982, 771, 1399, 798

Cluster means:
          v1         v2         v3         v4
1 -2.0413678 0.01394798  0.9787409  1.9646186
2 -0.0179719 1.05571578  2.0228387  4.0233226
3  0.2979344 2.34438159  4.6230947 -1.6822656
4  1.9941418 4.03159297 -1.0617125  0.9776119

Clustering vector:
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [69] 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [137] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
 [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [341] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [409] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2
 [477] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
 [545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [613] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1
 [681] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [749] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
 [817] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
 [885] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [953] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 3 1 1 1
 [ reached getOption("max.print") -- omitted 2950 entries ]
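Because the printed clustering vector is cut off at `max.print`, it only shows the first rows. The cluster sizes give a more reliable picture than the visible labels. The sketch below regenerates comparable data in base R (the seed and the helper code are my own assumptions, not part of the original) and compares the overall sizes with the labels of the first 1000 rows.

```r
set.seed(42)  # assumed seed, only for reproducibility of this sketch

# Same group sizes and means as in the question, one row of means per group.
gsize <- c(1000, 500, 750, 900, 800)
means <- rbind(c(-2,  0,  1,  2),
               c(-1,  3,  4, -3),
               c( 0,  1,  2,  4),
               c( 1,  2,  5, -1),
               c( 2,  4, -1,  1))

# Rows are generated group by group, so the data are ordered by group.
x <- do.call(rbind, lapply(seq_along(gsize), function(g) {
  sapply(means[g, ], function(m) rnorm(gsize[g], mean = m))
}))

fit <- kmeans(x, centers = 4, nstart = 5)

# Sizes over all 3950 points.
print(fit$size)

# Labels of the first 1000 rows only, i.e. roughly what the console shows.
print(table(head(fit$cluster, 1000)))
```

If the sizes are spread across clusters while the first 1000 labels are dominated by one value, the skew comes from the row order rather than from the clustering.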

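Try spreading the means over a wider range (* see the update below) and the result should look better. Another issue is that the data are sorted by group, so when you look only at the first samples the view is misleading: those samples all come from the same group and should be assigned to the same cluster. Try shuffling the data points and the result will look more reasonable.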

centers = data.frame( grps  = 1:5,
                      gsize = c(1000, 500, 750, 900, 800),
                      m1    = c(-10, -5, 0, 5, 10),
                      m2    = c(-10, -5, 0, 5, 10),
                      m3    = c(-10, -5, 0, 5, 10),
                      m4    = c(-10, -5, 0, 5, 10) )

# training set generation
kd = centers %>%
  group_by(grps) %>%
  do(data.frame( v1= rnorm(.$gsize[1], .$m1[1]),
                 v2= rnorm(.$gsize[1], .$m2[1]),
                 v3= rnorm(.$gsize[1], .$m3[1]),
                 v4= rnorm(.$gsize[1], .$m4[1])) ) %>%
  ungroup()

minClusters = 1
maxClusters = 10

kclust  <- kd %>%
  sample_frac(size=1) %>%
  crossing(k = minClusters:maxClusters) %>%
  group_by(k) %>%
  do(clust = kmeans(select(., v1, v2, v3, v4), .$k[1], nstart=5))


> kclust$clust[5]
[[1]]
K-means clustering with 5 clusters of sizes 900, 1000, 800, 750, 500

Cluster means:
             v1           v2          v3          v4
1   4.947859954  4.990537346  4.96409669  5.02513562
2 -10.014275191 -9.990181395 -9.96969088 -9.96127717
3  10.054780835  9.942738199 10.03617191 10.01820661
4   0.005084275 -0.003034476 -0.03353889 -0.01343343
5  -5.056184108 -5.004413465 -5.00059546 -5.06765925

Clustering vector:
   [1] 2 4 4 2 4 2 4 1 4 4 5 4 1 3 1 3 1 1 2 2 2 2 4 2 4 5 2 4 2 2 4 3 2 1 3 1 2 2 3 4 1 4 1 3 3 5 2 3 3 1 1 4 5 2 4 2 4 2 2 2 2 5 3 2 5 2 1 5
  [69] 2 3 1 1 1 2 2 1 4 2 1 1 2 2 4 4 2 2 5 5 4 2 3 4 5 2 5 3 5 5 4 3 3 3 1 3 5 3 4 1 3 4 1 2 2 3 4 1 1 3 3 1 2 3 5 2 3 1 2 3 5 3 2 2 2 2 1 4
 [137] 4 3 1 4 5 2 3 1 4 1 3 3 1 2 3 3 1 3 1 4 4 2 1 1 5 3 3 2 4 3 5 3 1 3 1 4 4 2 1 3 1 2 3 5 2 1 3 1 3 2 1 2 1 1 2 1 1 3 2 2 5 2 3 3 1 5 3 2
 [205] 5 3 2 2 3 5 2 4 5 4 1 1 2 2 3 1 4 2 4 1 5 3 3 3 5 4 1 2 3 3 5 1 1 5 1 3 3 2 2 5 3 2 2 1 2 4 2 5 4 5 5 2 4 5 3 4 5 3 1 1 2 1 1 2 4 2 4 1
 [273] 2 3 4 5 2 3 1 5 3 3 3 3 5 4 3 4 2 3 2 4 1 4 4 1 1 2 2 3 2 2 3 5 4 5 2 1 4 2 1 3 2 3 4 4 2 2 2 4 3 3 1 4 1 3 5 1 3 3 1 3 3 2 4 1 2 4 2 3
 [341] 2 3 3 2 3 1 5 3 2 5 2 3 4 4 1 2 5 3 1 3 1 1 1 1 3 3 4 1 2 3 1 2 4 1 2 3 5 4 1 4 3 4 3 4 1 4 5 5 5 4 2 4 3 1 2 3 2 3 1 3 2 4 5 4 2 1 1 1
 [409] 4 1 5 2 4 2 1 2 3 4 3 4 5 2 3 5 2 3 1 4 1 1 4 3 3 1 1 1 3 2 2 5 1 3 2 2 2 5 4 4 5 4 1 3 2 2 3 2 2 3 5 1 4 4 4 4 4 2 2 3 1 1 1 2 3 2 3 4
 [477] 2 3 4 2 5 2 4 2 3 1 5 4 3 4 3 3 2 4 3 3 5 2 5 4 4 1 1 2 3 1 5 4 3 1 2 5 2 2 4 2 3 3 2 4 1 3 5 2 1 3 5 5 2 1 3 1 1 5 5 2 1 5 3 2 2 3 3 2
 [545] 2 1 4 4 4 1 2 1 5 5 2 1 4 3 3 2 5 5 2 4 3 2 4 1 3 3 1 3 4 3 2 4 2 5 4 4 3 1 4 5 4 2 1 4 4 2 3 1 2 2 2 4 1 2 2 1 5 5 2 2 4 4 1 5 4 4 4 4
 [613] 2 3 3 1 1 3 3 1 4 4 5 1 2 1 1 4 3 2 5 4 2 5 3 3 3 1 4 1 2 3 5 2 2 4 4 5 2 3 3 1 3 4 3 5 2 2 2 2 4 3 3 2 3 2 4 3 2 2 1 3 3 3 4 3 2 3 3 1
 [681] 3 2 2 5 4 2 4 4 5 2 1 3 1 2 4 1 3 3 4 1 4 4 3 2 2 4 4 3 5 4 1 1 5 2 2 3 5 4 1 1 4 2 5 3 3 1 2 1 2 4 4 2 1 3 2 2 2 3 1 4 1 1 1 4 3 2 3 5
 [749] 1 4 4 3 4 4 4 2 4 2 3 3 1 1 1 4 2 3 1 4 1 4 3 2 3 2 2 4 1 5 1 4 2 4 2 2 1 4 3 4 5 2 3 4 4 2 2 1 5 1 2 1 2 1 1 5 1 5 2 4 1 2 1 2 2 3 1 4
 [817] 5 1 4 2 4 4 4 5 3 2 1 4 1 3 4 2 1 5 2 1 2 5 1 1 1 2 1 4 1 4 5 1 2 5 3 5 4 1 4 1 4 1 3 2 4 3 1 3 5 4 3 1 5 4 3 2 4 3 3 4 4 3 5 4 2 4 2 1
 [885] 1 1 4 2 4 5 1 2 5 4 2 2 3 3 3 3 2 4 1 5 4 2 2 2 4 1 4 3 4 1 2 4 2 1 4 3 5 1 1 5 5 4 1 2 2 2 2 2 3 4 5 1 2 3 2 1 1 1 1 3 2 4 1 1 4 2 5 2
 [953] 3 4 5 2 5 4 1 3 5 2 4 3 4 4 2 4 2 4 1 1 1 1 2 2 4 1 1 3 2 4 3 1 5 1 2 5 4 2 3 2 3 1 2 1 2 1 5 4
 [ reached getOption("max.print") -- omitted 2950 entries ]

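Update:

I have also verified that the real problem is the second one (i.e. your data are ordered). You don't need to change the means. You can keep the same approach as in your original code; as long as you shuffle the data, you should see points assigned to multiple clusters.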
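A minimal base R sketch of that check, keeping the original means and only shuffling the rows. Here `x[sample(nrow(x)), ]` stands in for `sample_frac(size = 1)`; the seed and the data-generation helper are assumptions made for illustration.

```r
set.seed(123)  # assumed seed, only for reproducibility of this sketch

# Original group sizes and means from the question.
gsize <- c(1000, 500, 750, 900, 800)
means <- rbind(c(-2,  0,  1,  2),
               c(-1,  3,  4, -3),
               c( 0,  1,  2,  4),
               c( 1,  2,  5, -1),
               c( 2,  4, -1,  1))

# Ordered data: all rows of group 1 first, then group 2, and so on.
x <- do.call(rbind, lapply(seq_along(gsize), function(g) {
  sapply(means[g, ], function(m) rnorm(gsize[g], mean = m))
}))

# Shuffle the rows: a random permutation of the row indices.
x_shuffled <- x[sample(nrow(x)), ]

fit <- kmeans(x_shuffled, centers = 5, nstart = 5)

# The first 1000 labels now come from a mix of groups.
print(table(head(fit$cluster, 1000)))
```

The clustering itself does not depend on the row order; shuffling only changes which points land in the part of the vector that gets printed.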

