algorithm - Clusterization algorithm
Problem description
I have a problem with clustering clients.
I have a dataset with columns such as name, address, email, phone, etc. (A, B, C in the example). Each row has a unique identifier (ID). I need to assign a CLUSTER_ID (X) to each row. Within a cluster, every row shares at least one attribute value with some other row of that cluster. So if clients with ID=1,2,3 have the same A attribute and clients with ID=3,10 have the same B attribute, then ID=1,2,3,10 should be in the same cluster.
How can I solve this problem using SQL? If that is not possible, how should I write the algorithm (pseudocode)? Performance is very important because the dataset contains millions of rows.
Sample Input:
ID A B C
1 A1 B3 C1
2 A1 B2 C5
3 A1 B10 C10
4 A2 B1 C5
5 A2 B8 C1
6 A3 B1 C4
7 A4 B6 C3
8 A4 B3 C5
9 A5 B7 C2
10 A6 B10 C3
11 A8 B5 C4
Sample Output:
ID A B C X
1 A1 B3 C1 1
2 A1 B2 C5 1
3 A1 B10 C10 1
4 A2 B1 C5 1
5 A2 B8 C1 1
6 A3 B1 C4 1
7 A4 B6 C3 1
8 A4 B3 C5 1
9 A5 B7 C2 2
10 A6 B10 C3 1
11 A8 B5 C4 1
Thanks for any help.
Solution
One possible approach is to repeatedly update the rows whose X is still null.
Start with cluster ID 1, for example by using a variable.
SET @CurrentClusterID = 1
Take the top 1 record and set its X to 1.
Then, in a loop, update every record that has a null X and shares the same A, B, or C with a record that already has X = 1.
Disclaimer:
The statement will vary by RDBMS.
It is meant only as pseudocode.
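-- T-SQL flavored; @RowsUpdated records whether the last pass changed anything
DECLARE @RowsUpdated INT = 1;
WHILE (@RowsUpdated > 0)
BEGIN
    -- Pull in unassigned rows that share A, B, or C with the current cluster
    UPDATE t
    SET t.X = @CurrentClusterID
    FROM yourtable t
    WHERE t.X IS NULL
      AND EXISTS (
          SELECT 1 FROM yourtable d
          WHERE d.X = @CurrentClusterID
            AND (d.A = t.A OR d.B = t.B OR d.C = t.C)
      );
    SET @RowsUpdated = @@ROWCOUNT;
END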
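Loop until it updates 0 records.
Now repeat the approach for the other clusters until no null X remains in the table.
1) Increase @CurrentClusterID by 1
2) Update the next top 1 record with a null X to the new @CurrentClusterID
3) Loop the update until no more rows are updated.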
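The whole procedure can be sketched end to end as follows. This is only an illustration under some assumptions: it uses Python's built-in sqlite3 module with an in-memory database rather than MS SQL Server, the function name assign_clusters is made up for the example, and "top 1 record" is interpreted as the unassigned row with the lowest ID.

```python
import sqlite3

# Sample input from the question: (ID, A, B, C)
ROWS = [
    (1, "A1", "B3", "C1"), (2, "A1", "B2", "C5"), (3, "A1", "B10", "C10"),
    (4, "A2", "B1", "C5"), (5, "A2", "B8", "C1"), (6, "A3", "B1", "C4"),
    (7, "A4", "B6", "C3"), (8, "A4", "B3", "C5"), (9, "A5", "B7", "C2"),
    (10, "A6", "B10", "C3"), (11, "A8", "B5", "C4"),
]


def assign_clusters(rows):
    """Return {ID: X} using the repeated-update procedure (hypothetical helper)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE yourtable "
        "(ID INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT, X INTEGER)"
    )
    con.executemany(
        "INSERT INTO yourtable (ID, A, B, C) VALUES (?, ?, ?, ?)", rows
    )

    cluster_id = 0
    while True:
        # Seed: assumption that "top 1" means the lowest unassigned ID.
        seed = con.execute(
            "SELECT MIN(ID) FROM yourtable WHERE X IS NULL"
        ).fetchone()[0]
        if seed is None:
            break  # every row has a cluster
        cluster_id += 1
        con.execute(
            "UPDATE yourtable SET X = ? WHERE ID = ?", (cluster_id, seed)
        )

        # Grow the cluster until a pass updates 0 rows.
        while True:
            cur = con.execute(
                """
                UPDATE yourtable
                SET X = :cid
                WHERE X IS NULL
                  AND EXISTS (
                      SELECT 1 FROM yourtable d
                      WHERE d.X = :cid
                        AND (d.A = yourtable.A
                             OR d.B = yourtable.B
                             OR d.C = yourtable.C)
                  )
                """,
                {"cid": cluster_id},
            )
            if cur.rowcount == 0:
                break

    result = dict(con.execute("SELECT ID, X FROM yourtable ORDER BY ID"))
    con.close()
    return result


if __name__ == "__main__":
    for row_id, x in assign_clusters(ROWS).items():
        print(row_id, x)
```

On the sample input this puts ID 9 alone in cluster 2 and every other row in cluster 1, which matches the sample output in the question. On a real table, indexes on A, B, C, and X would probably matter a lot for the EXISTS lookup, but that has not been measured here.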
Tested in a db<>fiddle example for MS SQL Server.
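Since the question also asks for an algorithm outside SQL, here is a different technique from the repeated update above: a union-find (disjoint-set) pass that treats rows as nodes and links any two rows sharing a value in the same column. This is a minimal sketch, not part of the tested answer. The function name cluster_rows, the tuple layout (ID first, then the attributes), and the choice to ignore None values are all assumptions made for the example.

```python
def cluster_rows(rows):
    """Return {ID: cluster number} for rows shaped like (ID, attr1, attr2, ...).

    Hypothetical helper: rows that share a value in the same column end up
    in the same cluster, directly or through a chain of other rows.
    """
    parent = {}

    def find(x):
        # Path halving: point nodes at their grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_owner = {}  # (column index, value) -> first ID seen with that value
    ids = []
    for row_id, *attrs in rows:
        parent[row_id] = row_id
        ids.append(row_id)
        for col, value in enumerate(attrs):
            if value is None:
                continue  # assumption: missing values never link rows
            key = (col, value)
            if key in first_owner:
                union(first_owner[key], row_id)
            else:
                first_owner[key] = row_id

    # Number clusters 1, 2, ... in order of first appearance.
    labels = {}
    result = {}
    for row_id in ids:
        root = find(row_id)
        result[row_id] = labels.setdefault(root, len(labels) + 1)
    return result


SAMPLE = [
    (1, "A1", "B3", "C1"), (2, "A1", "B2", "C5"), (3, "A1", "B10", "C10"),
    (4, "A2", "B1", "C5"), (5, "A2", "B8", "C1"), (6, "A3", "B1", "C4"),
    (7, "A4", "B6", "C3"), (8, "A4", "B3", "C5"), (9, "A5", "B7", "C2"),
    (10, "A6", "B10", "C3"), (11, "A8", "B5", "C4"),
]

if __name__ == "__main__":
    print(cluster_rows(SAMPLE))
```

The sketch reads each row once and keeps one dictionary entry per distinct (column, value) pair, so it avoids rescanning the table for every growth step. Whether that is faster in practice than the set-based update for millions of rows depends on the data and the environment, so it would need to be measured.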