sql - SQL Solution to Grab Largest Group with Overlap
Problem description
I have an input of names and associated groups. I want the output to be a de-duplicated version of the input, except that when groups overlap I want them merged into a single group containing every name from any of the overlapping groups.
Name Group
Jeff 1
John 1
Frank 1
Jeff 2
John 2
Frank 2
Fred 2
Steve 3
Sam 3
Jim 3
So for this example there are three groups, group 1, 2, and 3. There is overlap between group 1 and 2, (Jeff, John, and Frank all belong to both 1 and 2, and Fred is only in group 2). Given that there's overlap in group 1 and 2, I want to combine them into one single group, with all of the names in either group. I also want to keep groups without overlap separate. This is my desired output:
Name Group
Jeff 1
John 1
Frank 1
Fred 1
Steve 2
Sam 2
Jim 2
Is this possible in SQL?
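Solution
I have a solution, though it could be really expensive for large data sets.
It's also hard to explain, but I'll try...
First, look at how we can combine groups based on a single "person" alone: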
PersonA is a member of Groups {1, 2, 3}
    Group2 can therefore be combined into Group1
    AND
    Group3 can therefore be combined into Group1
PersonB is a member of Groups {2, 3, 4}
    Group3 can therefore be combined into Group2
    AND
    Group4 can therefore be combined into Group2
PersonC is a member of Groups {3, 4, 5}
    Group4 can therefore be combined into Group3
    AND
    Group5 can therefore be combined into Group3
This gives us...
From PersonA : (G2 → G1) and (G3 → G1)
From PersonB : (G3 → G2) and (G4 → G2)
From PersonC : (G4 → G3) and (G5 → G3)
Then look at it "vertically": if a group can be combined into more than one other group, pick the lowest option.
Overall : (G2 → G1) and (G3 → G1) and (G4 → G2) and (G5 → G3)
If we apply these "group combinations" to the original data, we get this...
| person | original groups | revised groups |
+--------+-----------------+------------------------------+
| A | {1, 2, 3} | {1, 1, 1} => {1} |
| B | {2, 3, 4} | {1, 1, 2} => {1, 2} |
| C | {3, 4, 5} | {1, 2, 3} => {1, 2, 3} |
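As a rough illustration of one round of this process, here is a minimal Python sketch (not the SQL from this answer). The function name `collapse_once` and the tuple layout `(name, group_id)` are assumptions made for the example. It first finds each person's lowest group, then maps every group to the lowest of those values among its members, which is the "vertical" step described above.

```python
def collapse_once(links):
    """Apply one round of group merging to (name, group_id) pairs.

    Returns the group -> target group mapping and the revised links.
    """
    # Lowest group each person belongs to.
    min_group_by_name = {}
    for name, group in links:
        min_group_by_name[name] = min(group, min_group_by_name.get(name, group))

    # For each group, the lowest of those per-person values across its members.
    target = {}
    for name, group in links:
        lowest = min_group_by_name[name]
        target[group] = min(lowest, target.get(group, lowest))

    # Rewrite the links; using a set drops duplicates, like SELECT DISTINCT.
    revised = {(name, target[group]) for name, group in links}
    return target, revised


links = {
    ("A", 1), ("A", 2), ("A", 3),
    ("B", 2), ("B", 3), ("B", 4),
    ("C", 3), ("C", 4), ("C", 5),
}
target, revised = collapse_once(links)
print(sorted(target.items()))  # [(1, 1), (2, 1), (3, 1), (4, 2), (5, 3)]
print(sorted(revised))
# [('A', 1), ('B', 1), ('B', 2), ('C', 1), ('C', 2), ('C', 3)]
```

The mapping and the revised memberships agree with the table above: person A ends up in {1}, B in {1, 2}, and C in {1, 2, 3}.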
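So we have gone from 5 distinct groups to three distinct groups.
If we repeat the process, starting from the revised groups, all three people end up as members of just Group1.
How many times you need to repeat this depends on how many groups overlap with each other, and in what way.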
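The repetition can be sketched in Python as a loop that stops once the number of (name, group) links equals the number of distinct names, meaning nobody belongs to more than one group any more. This is an illustrative sketch under that assumption, with hypothetical function names (`collapse_once`, `collapse_until_stable`); it is not the SQL from this answer.

```python
def collapse_once(links):
    """One round: map every group to the lowest 'lowest group' among its members."""
    min_group_by_name = {}
    for name, group in links:
        min_group_by_name[name] = min(group, min_group_by_name.get(name, group))
    target = {}
    for name, group in links:
        lowest = min_group_by_name[name]
        target[group] = min(lowest, target.get(group, lowest))
    return {(name, target[group]) for name, group in links}


def collapse_until_stable(links):
    """Repeat collapse_once until every name has exactly one group."""
    links = set(links)  # de-duplicate the input rows
    unique_names = len({name for name, _ in links})
    depth = 0
    # More links than names means at least one person is still in several groups.
    while len(links) > unique_names:
        links = collapse_once(links)
        depth += 1
    return links, depth


example = [
    ("A", 1), ("A", 2), ("A", 3),
    ("B", 2), ("B", 3), ("B", 4),
    ("C", 3), ("C", 4), ("C", 5),
]
final_links, depth = collapse_until_stable(example)
print(depth)                # 2
print(sorted(final_links))  # [('A', 1), ('B', 1), ('C', 1)]
```

For the PersonA/PersonB/PersonC example this takes two rounds, matching the walkthrough above.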
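The following query should be able to apply the process as many times as needed. It keeps going until every person is a member of just one group. There may still be more than one group at the end; those are the groups that don't overlap with each other.
This may take several iterations, each one touching every person, which can be slow, especially on larger data sets.
It does, I think, work in all cases, however. (Even if you probably shouldn't be doing this in SQL, at least now you can.)
Demo with test data: dbfiddle.uk demo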
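WITH
  RECURSIVE "membership"
AS
(
  -- Anchor: the raw links, plus the number of distinct names
  SELECT
    "name",
    "group_id",
    (SELECT COUNT(DISTINCT "name") FROM "name_group_links") AS "unique_names",
    0 AS "current_depth",
    COUNT(*) OVER () AS "current_links"
  FROM
    "name_group_links"

  UNION ALL

  -- Recursive step: one round of collapsing, with the link count recomputed
  SELECT
    *,
    COUNT(*) OVER () AS "current_links"
  FROM
  (
    -- Map each group to the lowest value found among its members
    SELECT DISTINCT
      "name",
      MIN("min_group_by_name") OVER (PARTITION BY "group_id") AS "group_id",
      "unique_names",
      "current_depth" + 1 AS "current_depth"
    FROM
    (
      -- Lowest group each name belongs to
      SELECT
        *,
        MIN("group_id") OVER (PARTITION BY "name") AS "min_group_by_name"
      FROM
        "membership"
      WHERE
        -- Keep going only while some name is still linked to several groups
        "current_links" > "unique_names"
    )
      AS "collapse_groups_by_name"
  )
    AS "collapse_groups_by_group"
)
-- Return only the final iteration, where every name has exactly one group
SELECT
  "current_depth",
  "name",
  "group_id"
FROM
  "membership"
WHERE
  "current_links" = "unique_names"
ORDER BY
  "current_depth",
  "name",
  "group_id"
;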
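Note that this approach keeps the lowest surviving group id rather than renumbering, so for the question's data the merged groups should come out as 1 and 3, not 1 and 2. The Python sketch below (again an illustration with hypothetical names `merge_overlapping` and `renumber`, not the query itself) runs the same procedure on the question's sample data and then renumbers the groups consecutively to match the desired output.

```python
def merge_overlapping(links):
    """Collapse overlapping groups until every name has exactly one group."""
    links = set(links)
    names = {name for name, _ in links}
    while len(links) > len(names):
        # Lowest group per name, then lowest of those per group.
        lowest_by_name = {}
        for name, group in links:
            lowest_by_name[name] = min(group, lowest_by_name.get(name, group))
        target = {}
        for name, group in links:
            lowest = lowest_by_name[name]
            target[group] = min(lowest, target.get(group, lowest))
        links = {(name, target[group]) for name, group in links}
    return links


def renumber(links):
    """Replace the surviving group ids with consecutive numbers starting at 1."""
    order = sorted({group for _, group in links})
    rank = {group: i for i, group in enumerate(order, start=1)}
    return {name: rank[group] for name, group in links}


rows = [
    ("Jeff", 1), ("John", 1), ("Frank", 1),
    ("Jeff", 2), ("John", 2), ("Frank", 2), ("Fred", 2),
    ("Steve", 3), ("Sam", 3), ("Jim", 3),
]
merged = merge_overlapping(rows)
print(sorted({group for _, group in merged}))  # [1, 3]
result = renumber(merged)
for name in ("Jeff", "John", "Frank", "Fred", "Steve", "Sam", "Jim"):
    print(name, result[name])
```

In SQL, a similar renumbering could probably be done by wrapping the final `"group_id"` in `DENSE_RANK() OVER (ORDER BY "group_id")`, though that is a suggestion rather than part of the tested query.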