Creating lists of items by nearest centre

Question

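I'm currently working on a project that uses k-means on a large dataset. I wanted to stretch my brain a bit and do it without any external libraries, using only functions I write myself. I've come a long way but have hit a problem: my code is not creating the lists based on where the cluster centres are.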

For convenience, I've created a small subset of the data below to work with, rather than using my whole dataset:

dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), 
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]

cluster_1 = (0, 1)
cluster_2 = (1, 2)
clusters = [cluster_1, cluster_2] # although clusters not near data, it is to practise my model

Below are the 3 functions involved in creating the centre points for the clusters:

  1. Calculate the distance between the data and the cluster centres, comparing each point in dataset with each point in cluster_list
def calculate_distance(point1, point2):
    distance = 0
    for i in range(len(point1)):
        # Euclidean distance formula
        distance += (point1[i] - point2[i])**2
    # result then square rooted for distance
    return distance**0.5
    # end of function
  2. Determine which cluster centre a point is closest to
def find_nearest_centre(dataset1, clusters):
    nearest_point = []
    min_distance = 100000
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
        nearest_point.append(min_distance)
        
    return nearest_point
  3. Create two lists, one per cluster, containing the coordinates of the data that belong to that cluster.
def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_centre = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_centre == clusters[0]:
            list_1.append(d)
        elif nearest_centre == clusters[1]:
            list_2.append(d)
        
    return list_1, list_2

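Now to my problem. When I run the function create_list, it only creates two empty lists instead of appending each coordinate as expected. Although unrealistic, if the first 3 values were in the first cluster and the last 3 values were closest to the second cluster, the desired output would be: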

create_list(dataset1, clusters) # this is only function needed to operate ideally

list_1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323)] # list of tuples output
list_2 = [(4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)] # list of tuples output

I'd appreciate any help I can get, obviously sticking to the theme of not using external packages. Thanks!

Tags: python, k-means

Answer

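You're getting empty lists because find_nearest_centre returns a list of distances, and create_list then compares that list with a cluster tuple, so there are no possible matches.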
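To see the mismatch concretely, the sketch below reproduces what the question's find_nearest_centre logic returns for a single point. It assumes the call inside the loop was meant to use the function's first parameter (the question's code refers to an undefined name, dataset, there). The name find_nearest_centre_original is introduced here only for illustration.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def find_nearest_centre_original(point, clusters):
    # Mirrors the question's logic: appends the running minimum
    # distance once per cluster, so a list of floats is returned
    nearest_point = []
    min_distance = 100000
    for c in clusters:
        distance = calculate_distance(c, point)
        if distance < min_distance:
            min_distance = distance
        nearest_point.append(min_distance)
    return nearest_point


clusters = [(0, 1), (1, 2)]
result = find_nearest_centre_original((6.08804, 3.457729), clusters)

print(type(result))           # a list with one float per cluster
print(result == clusters[0])  # False: a list never equals a tuple
print(result == clusters[1])  # False
```

Because neither comparison can be true, neither branch in create_list ever appends anything.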

Return the nearest cluster from find_nearest_centre instead of a list of distances:

def find_nearest_centre(dataset, clusters):
    min_distance = float("inf")
    nearest_cluster = None  # stays None if clusters is empty
    # obtaining sample from cluster list
    for c in clusters:
        # using distance formula above to calculate distance between points
        distance = calculate_distance(c, dataset)
        if distance < min_distance:
            min_distance = distance
            nearest_cluster = c

    return nearest_cluster

Then compare cluster with cluster:

def create_list(dataset1, clusters):
    # new lists created for 2 clusters
    list_1 = []
    list_2 = []
    for d in dataset1:
        # using nearest_centre formula to determine which points are closest to centres
        nearest_cluster = find_nearest_centre(d, clusters)
        # adding closest coordinates to list_1 for cluster 1 and list_2 for cluster 2
        if nearest_cluster == clusters[0]:
            list_1.append(d)
        elif nearest_cluster == clusters[1]:
            list_2.append(d)
        else:
            print("No match")

    return list_1, list_2

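The output is not what you expected, but from looking at the data, cluster_2 should always be the closer one here: every point has x > 1 and y > 2, so it is nearer to (1, 2) than to (0, 1) in both coordinates.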

list_1 = []
list_2 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323), (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
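If you later want more than two clusters, one possible generalisation is to return the index of the nearest centre and keep one list per centre. This is only a sketch that stays within built-ins; the names nearest_centre_index and group_by_nearest_centre are hypothetical and not part of the code above. On a tie, min picks the first centre in the list.

```python
def calculate_distance(point1, point2):
    # Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(point1, point2)) ** 0.5


def nearest_centre_index(point, centres):
    # Index of the centre with the smallest distance to `point`
    return min(range(len(centres)),
               key=lambda i: calculate_distance(point, centres[i]))


def group_by_nearest_centre(points, centres):
    # One list per centre, in the same order as `centres`
    groups = [[] for _ in centres]
    for p in points:
        groups[nearest_centre_index(p, centres)].append(p)
    return groups


dataset1 = [(6.08804, 3.457729), (4.147974, 5.275341), (6.538759, 3.670323),
            (4.579573, 4.03559), (4.756026, 4.184762), (5.221742, 2.872705)]
clusters = [(0, 1), (1, 2)]

list_1, list_2 = group_by_nearest_centre(dataset1, clusters)
print(list_1)  # [] because no point is nearer to (0, 1)
print(list_2)  # all six points, in their original order
```

Comparing indices rather than the centre tuples themselves also avoids ambiguity if two centres ever happen to have identical coordinates.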
