Flag duplicates based on either of two columns

Problem description

Suppose my dataset looks like this:

email name
  a    f
  b    g
  a    g
  o    k

The output I want is:

email name  group 
  a    f      1
  b    g      1
  a    g      1
  o    k      2

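The first three rows belong to the same person because each shares either an email or a name with another row in the group. I am struggling to work out how to write a query that produces the group column.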

Tags: sql, snowflake-cloud-data-platform

Solution

This requires a recursive CTE. You can assign groups by creating edges between emails (or names) and then traversing the graph:

with recursive edges as (
      -- two emails are linked whenever they appear with the same name
      select t1.email as email1, t2.email as email2
      from t t1 join
           t t2
           on t1.name = t2.name
     ),
     cte (email1, email2, min_email, visited) as (
      select email1, email2, least(email1, email2) as min_email,
             array_construct(email1, email2) as visited
      from edges e
      union all
      -- follow one more edge, skipping emails already on this path
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(e.email2::variant, cte.visited)
     )
select email1, min(min_email) as group_email,
       dense_rank() over (order by min(min_email)) as grp
from cte
group by email1;

Adapting this query assigns grp to each row of the original data:

with recursive edges as (
      -- two emails are linked whenever they appear with the same name
      select t1.email as email1, t2.email as email2
      from t t1 join
           t t2
           on t1.name = t2.name
     ),
     cte (email1, email2, min_email, visited) as (
      select email1, email2, least(email1, email2) as min_email,
             array_construct(email1, email2) as visited
      from edges e
      union all
      -- follow one more edge, skipping emails already on this path
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(e.email2::variant, cte.visited)
     )
select t.*, e.grp
from t join
     (select email1, min(min_email) as group_email,
             dense_rank() over (order by min(min_email)) as grp
      from cte
      group by email1
     ) e
     on t.email = e.email1;
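
To try either query against the sample data from the question, a minimal setup might look like the sketch below. The table name t comes from the queries above; the varchar column types and the use of a temporary table are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question shows only the column names, not their types.
create or replace temporary table t (
    email varchar,
    name  varchar
);

-- The four sample rows from the question.
insert into t (email, name) values
    ('a', 'f'),
    ('b', 'g'),
    ('a', 'g'),
    ('o', 'k');
```

With this data, emails a and b are linked through the shared name g, so they should land in one group, while o has no link to any other email and should land in a group of its own.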
