SQL Solution to Grab Largest Group with Overlap

Problem description

I have an input of names and associated groups. I want the output to be a de-duplicated version of the input, except that when groups overlap, I want the overlapping groups combined into one.

Name    Group
Jeff    1
John    1
Frank   1
Jeff    2
John    2
Frank   2
Fred    2
Steve   3
Sam     3
Jim     3

So for this example there are three groups: 1, 2, and 3. Groups 1 and 2 overlap (Jeff, John, and Frank all belong to both 1 and 2, and Fred is only in group 2). Because groups 1 and 2 overlap, I want to combine them into one single group containing all of the names in either group. I also want to keep groups without overlap separate. This is my desired output:

Name    Group
Jeff    1
John    1
Frank   1
Fred    1
Steve   2
Sam     2
Jim     2

Is this possible in SQL?

Tags: sql, group-by, snowflake-cloud-data-platform

Solution


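I have a solution, although it could be really expensive for large data sets.

It's also hard to explain, but I'll try...


First, look at how we can combine groups based on the "person" alone: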

PersonA is a member of Groups {1, 2, 3}
   Group2 can therefore be combined in to Group1
   AND
   Group3 can therefore be combined in to Group1

PersonB is a member of Groups    {2, 3, 4}
   Group3 can therefore be combined in to Group2
   AND
   Group4 can therefore be combined in to Group2

PersonC is a member of Groups       {3, 4, 5}
   Group4 can therefore be combined in to Group3
   AND
   Group5 can therefore be combined in to Group3

This gives us...
   From PersonA : (G2 → G1) and (G3 → G1)
   From PersonB :               (G3 → G2) and (G4 → G2)
   From PersonC :                             (G4 → G3) and (G5 → G3)

Then look at it "vertically": if a group can be combined into more than one other group, pick the lowest option.

        Overall : (G2 → G1) and (G3 → G1) and (G4 → G2) and (G5 → G3)
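To make the two steps above concrete, here is a minimal sketch outside SQL, in Python, purely as an illustration. The function name `merge_targets` and the person-to-groups dictionary are hypothetical, not part of the original answer. It collects each person's candidate combinations and keeps the lowest target per group.

```python
def merge_targets(memberships):
    """Map each group to the lowest group it can be combined into.

    `memberships` is a hypothetical dict of person -> set of group ids.
    Groups that cannot move to a lower id are left out of the result.
    """
    targets = {}
    for groups in memberships.values():
        lowest = min(groups)  # every other group of this person can fold into this one
        for g in groups:
            if g != lowest:
                # "vertical" step: keep the lowest target seen for g so far
                targets[g] = min(targets.get(g, lowest), lowest)
    return targets


memberships = {
    "PersonA": {1, 2, 3},
    "PersonB": {2, 3, 4},
    "PersonC": {3, 4, 5},
}
targets = merge_targets(memberships)
# Same combinations as the "Overall" line: G2 -> G1, G3 -> G1, G4 -> G2, G5 -> G3
assert targets == {2: 1, 3: 1, 4: 2, 5: 3}
```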

If we apply these "group combinations" to the original data, we get this...

| person | original groups | revised groups               |
+--------+-----------------+------------------------------+
|   A    | {1, 2, 3}       | {1, 1, 1}       => {1}       |
|   B    |    {2, 3, 4}    |    {1, 1, 2}    => {1, 2}    |
|   C    |       {3, 4, 5} |       {1, 2, 3} => {1, 2, 3} |

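So, we've gone from five distinct groups to three distinct groups.

If we repeat the process again, starting from the revised groups, all three people end up as members of just Group1.

The number of times you need to repeat this depends on how many groups overlap each other, and in what ways.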
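The repetition can be sketched the same way. This is an illustrative Python simulation of the process, not the SQL itself. The names `collapse_once` and `collapse` are hypothetical. It assumes the input is a collection of (name, group_id) pairs, and it stops when the number of links equals the number of distinct names, meaning everyone is in exactly one group.

```python
def collapse_once(links):
    """One pass: fold every group into the lowest group reachable via any member."""
    # Lowest group each person belongs to
    min_by_name = {}
    for name, group in links:
        min_by_name[name] = min(min_by_name.get(name, group), group)
    # Lowest of those values across each group's members
    target = {}
    for name, group in links:
        target[group] = min(target.get(group, group), min_by_name[name])
    # Building a set removes the duplicate links created by the merge
    return {(name, target[group]) for name, group in links}


def collapse(links):
    """Repeat passes until every person is a member of exactly one group."""
    links = set(links)
    passes = 0
    while len(links) > len({name for name, _ in links}):
        links = collapse_once(links)
        passes += 1
    return links, passes


links = [("A", 1), ("A", 2), ("A", 3),
         ("B", 2), ("B", 3), ("B", 4),
         ("C", 3), ("C", 4), ("C", 5)]
final, passes = collapse(links)
# Two passes are needed here, and all three people end up in group 1
assert passes == 2
assert final == {("A", 1), ("B", 1), ("C", 1)}
```

Each pass can only move a group to an equal or lower id, and at least one group moves while someone still has two groups, so the loop always ends.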


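The following code should apply the process as many times as needed. It continues until every person is a member of just one group. There may still be more than one group at the end; those are the groups that don't overlap with each other.

This can take many iterations, each of which processes every person, so it can be slow, especially on larger data sets.

However, I believe it works in all cases. (Even if you probably shouldn't do this in SQL, at least now you can.)

Demo with test data: dbfiddle.uk demo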

WITH
  RECURSIVE "membership"
AS
(
  -- Anchor: the original links, plus the counts used to decide when to stop
  SELECT
    "name",
    "group_id",
    (SELECT COUNT(DISTINCT "name") FROM "name_group_links")   AS "unique_names",
    0                                                         AS "current_depth",
    COUNT(*) OVER ()                                          AS "current_links"
  FROM
    "name_group_links"

  UNION ALL

  SELECT
    *,
    COUNT(*) OVER ()   AS "current_links"
  FROM
  (
    -- Fold each group into the lowest group reachable through any of its members
    SELECT DISTINCT
      "name",
      MIN("min_group_by_name") OVER (PARTITION BY "group_id")   AS "group_id",
      "unique_names",
      "current_depth" + 1                                       AS "current_depth"
    FROM
    (
      SELECT
        *,
        MIN("group_id") OVER (PARTITION BY "name")  AS "min_group_by_name"
      FROM
        "membership"
      WHERE
        "current_links" > "unique_names"
    )
      AS "collapse_groups_by_name"
  )
    AS "collapse_groups_by_group"
)
SELECT
  "current_depth",
  "name",
  "group_id"
FROM
  "membership"
WHERE
  "current_links" = "unique_names"
ORDER BY
  "current_depth",
  "name",
  "group_id"
;
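For anyone taking the "probably shouldn't do this in SQL" route, here is a hedged sketch of the same result computed outside the database. It uses union-find, which is a different technique from the recursive query above, and the function name `merge_overlapping` is hypothetical. It also renumbers the surviving groups consecutively. For the question's data, the passes described above leave group ids 1 and 3, while the desired output uses 1 and 2. The sketch assumes group ids are comparable values such as integers.

```python
def merge_overlapping(rows):
    """rows: iterable of (name, group_id). Returns {name: new_group}, ids 1..k."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps the trees shallow
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # the lowest id becomes the root

    first_group = {}
    for name, group in rows:
        find(group)  # register the group even if it never overlaps another
        if name in first_group:
            union(first_group[name], group)  # a shared person links two groups
        else:
            first_group[name] = group

    # Renumber the surviving roots 1..k in ascending order of their original id
    roots = sorted({find(g) for g in list(parent)})
    new_id = {root: i for i, root in enumerate(roots, start=1)}
    return {name: new_id[find(g)] for name, g in first_group.items()}


rows = [("Jeff", 1), ("John", 1), ("Frank", 1),
        ("Jeff", 2), ("John", 2), ("Frank", 2), ("Fred", 2),
        ("Steve", 3), ("Sam", 3), ("Jim", 3)]
result = merge_overlapping(rows)
# Matches the desired output in the question
assert result == {"Jeff": 1, "John": 1, "Frank": 1, "Fred": 1,
                  "Steve": 2, "Sam": 2, "Jim": 2}
```

In SQL, the same renumbering could presumably be done by wrapping the final SELECT in a DENSE_RANK() over the group id, but that variant was not part of the original demo.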
