Clusterization algorithm

Problem description

I have a problem with clustering clients.

I have a dataset with columns such as name, address, email, phone, etc. (A, B, C in the example). Each row has a unique identifier (ID). I need to assign a CLUSTER_ID (X) to each row. Within a cluster, every row shares at least one attribute value with some other row of that cluster. So if clients with ID=1,2,3 have the same A attribute and clients with ID=3,10 have the same B attribute, then ID=1,2,3,10 should be in the same cluster.

How can I solve this problem using SQL? If that is not possible, how should the algorithm be written (pseudocode)? Performance is very important, because the dataset contains millions of rows.

Sample Input:

ID  A   B   C
1   A1  B3  C1
2   A1  B2  C5
3   A1  B10 C10
4   A2  B1  C5
5   A2  B8  C1
6   A3  B1  C4
7   A4  B6  C3
8   A4  B3  C5
9   A5  B7  C2
10  A6  B10 C3
11  A8  B5  C4

Sample Output:

ID  A   B   C   X
1   A1  B3  C1  1
2   A1  B2  C5  1
3   A1  B10 C10 1
4   A2  B1  C5  1
5   A2  B8  C1  1
6   A3  B1  C4  1
7   A4  B6  C3  1
8   A4  B3  C5  1
9   A5  B7  C2  2
10  A6  B10 C3  1
11  A8  B5  C4  1

Thanks for any help.

Tags: algorithm, sas, cluster-analysis

Solution


One possible approach is to repeatedly update the rows whose X is still empty (NULL).

Start with cluster ID 1, for example by using a variable.

DECLARE @CurrentClusterID INT = 1;

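Take the top 1 record and update its X to 1.

Now loop an update of all records that still have an empty X and that can be linked to a record with X = 1 by having the same A or B or C.

Disclaimer:
The statements will vary depending on the RDBMS.
This is only meant as pseudocode.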

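-- T-SQL flavour; adapt the syntax for other RDBMSs
DECLARE @RowsUpdated INT = 1;  -- rows changed by the previous pass

WHILE @RowsUpdated > 0
BEGIN
  -- Pull in every unassigned row that shares A, B or C with the current cluster
  UPDATE t
  SET t.X = @CurrentClusterID
  FROM yourtable AS t
  WHERE t.X IS NULL
    AND EXISTS (
      SELECT 1 FROM yourtable AS d
      WHERE d.X = @CurrentClusterID
        AND (d.A = t.A OR d.B = t.B OR d.C = t.C)
    );

  SET @RowsUpdated = @@ROWCOUNT;  -- 0 means the cluster is complete
END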

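Loop until it updates 0 records.

Now repeat the method for the other clusters, until there are no more records with an empty X in the table.

1) Increase @CurrentClusterID by 1
2) Update the next top 1 record that has an empty X to the new @CurrentClusterID
3) Loop the update until no more records are updated.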
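
To see the whole procedure run end to end, here is a minimal sketch that drives the same UPDATE from Python against an in-memory SQLite database. SQLite, the function name assign_clusters, and the use of the smallest unassigned ID as the "top 1" seed are assumptions made for illustration; only the table name yourtable and the columns come from the text above.

```python
import sqlite3

# Sample input from the question: (ID, A, B, C)
ROWS = [
    (1, "A1", "B3", "C1"), (2, "A1", "B2", "C5"), (3, "A1", "B10", "C10"),
    (4, "A2", "B1", "C5"), (5, "A2", "B8", "C1"), (6, "A3", "B1", "C4"),
    (7, "A4", "B6", "C3"), (8, "A4", "B3", "C5"), (9, "A5", "B7", "C2"),
    (10, "A6", "B10", "C3"), (11, "A8", "B5", "C4"),
]

# Same predicate as the UPDATE above, written in SQLite syntax.
EXPAND_SQL = """
    UPDATE yourtable
    SET X = ?
    WHERE X IS NULL
      AND EXISTS (
        SELECT 1 FROM yourtable AS d
        WHERE d.X = ?
          AND (d.A = yourtable.A OR d.B = yourtable.B OR d.C = yourtable.C)
      )
"""


def assign_clusters(rows):
    """Hypothetical helper: return {ID: X} using the seed-and-expand loop."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE yourtable "
        "(ID INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT, X INTEGER)"
    )
    con.executemany(
        "INSERT INTO yourtable (ID, A, B, C) VALUES (?, ?, ?, ?)", rows
    )

    cluster_id = 0
    while True:
        # Steps 1 and 2: seed a new cluster with the next unassigned row
        # (assumption: "top 1" means the smallest ID still without an X).
        seed = con.execute(
            "SELECT MIN(ID) FROM yourtable WHERE X IS NULL"
        ).fetchone()[0]
        if seed is None:
            break  # every row has a cluster
        cluster_id += 1
        con.execute("UPDATE yourtable SET X = ? WHERE ID = ?", (cluster_id, seed))

        # Step 3: keep expanding until a pass updates 0 records.
        while con.execute(EXPAND_SQL, (cluster_id, cluster_id)).rowcount > 0:
            pass

    result = dict(con.execute("SELECT ID, X FROM yourtable"))
    con.close()
    return result


if __name__ == "__main__":
    for row_id, x in sorted(assign_clusters(ROWS).items()):
        print(row_id, x)
```

On the sample input this puts ID 9 in cluster 2 and every other row in cluster 1, as in the expected output. Each pass scans the table again, so on millions of rows indexes on A, B, C and X would probably matter a lot.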

Tested on MS SQL Server in a db<>fiddle example.
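
The question also asks how the algorithm could be written outside SQL. The clusters are the connected components of a graph in which two rows are linked when they share a value in the same column, so one option is a union-find (disjoint-set) structure. This is a different technique from the UPDATE loop above, not something the answer itself proposes, and the function name cluster_rows is hypothetical. The sketch skips None values, mirroring how NULL = NULL does not match in SQL.

```python
def cluster_rows(rows):
    """Hypothetical helper. rows: iterable of (id, a, b, c) -> {id: cluster}."""
    parent = {}  # union-find forest: row ID -> parent row ID

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    first_seen = {}  # (column index, value) -> first row ID with that value
    for row_id, *attrs in rows:
        parent[row_id] = row_id
        for col, value in enumerate(attrs):
            if value is None:
                continue  # a missing value links nothing
            key = (col, value)
            if key in first_seen:
                union(first_seen[key], row_id)
            else:
                first_seen[key] = row_id

    # Number the components 1, 2, ... in order of first appearance.
    labels = {}
    result = {}
    for row_id in parent:
        root = find(row_id)
        if root not in labels:
            labels[root] = len(labels) + 1
        result[row_id] = labels[root]
    return result


if __name__ == "__main__":
    sample = [
        (1, "A1", "B3", "C1"), (2, "A1", "B2", "C5"), (3, "A1", "B10", "C10"),
        (4, "A2", "B1", "C5"), (5, "A2", "B8", "C1"), (6, "A3", "B1", "C4"),
        (7, "A4", "B6", "C3"), (8, "A4", "B3", "C5"), (9, "A5", "B7", "C2"),
        (10, "A6", "B10", "C3"), (11, "A8", "B5", "C4"),
    ]
    for row_id, x in cluster_rows(sample).items():
        print(row_id, x)
```

This reads the data once and keeps one dictionary entry per row and per distinct (column, value) pair, so it assumes the IDs and attribute values fit in memory.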

