Flag duplicates based on either of two columns



email name
  a    f
  b    g
  a    g
  o    k


email name  group 
  a    f      1
  b    g      1
  a    g      1
  o    k      2


Tags: sql, snowflake-cloud-data-platform


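This requires a recursive CTE. You can assign the groups by building edges between emails (or names), where two emails are connected when they share a name, and then traversing the graph: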

with edges as (
      -- two emails are connected when they appear with the same name
      select t1.email as email1, t2.email as email2
      from t t1 join
           t t2
           on t1.name = t2.name
     ),
     cte as (
      -- anchor: every direct edge, tracking the smallest email seen so far
      select email1, email2, least(email1, email2) as min_email,
             array_construct(email1, email2) as visited
      from edges e
      union all
      -- recursive step: follow edges to emails not yet visited
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(e.email2::variant, cte.visited)
     )
select email1, min(min_email) as min_email,
       dense_rank() over (order by min(min_email)) as grp
from cte
group by email1;
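
To try the query above, here is a minimal setup sketch. It assumes the table is named t, as in the query, and loads the sample rows from the question. The temporary table and the varchar column types are assumptions, not something the question specifies.

```sql
-- Hypothetical setup: the question's sample rows in a temporary table t
-- (column types are assumed; adjust to match the real table)
create or replace temporary table t (email varchar, name varchar);

insert into t (email, name) values
    ('a', 'f'),
    ('b', 'g'),
    ('a', 'g'),
    ('o', 'k');
```

With these rows, emails a and b share the name g, so they belong to one group, while o stands alone. That matches the desired output shown in the question.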

Adapting this to assign grp to the original rows:

with edges as (
      -- two emails are connected when they appear with the same name
      select t1.email as email1, t2.email as email2
      from t t1 join
           t t2
           on t1.name = t2.name
     ),
     cte as (
      -- anchor: every direct edge, tracking the smallest email seen so far
      select email1, email2, least(email1, email2) as min_email,
             array_construct(email1, email2) as visited
      from edges e
      union all
      -- recursive step: follow edges to emails not yet visited
      select cte.email1, e.email2, least(cte.min_email, e.email2),
             array_append(cte.visited, e.email2)
      from cte join
           edges e
           on cte.email2 = e.email1
      where not array_contains(e.email2::variant, cte.visited)
     )
select t.*, e.grp
from t join
     (select email1, min(min_email) as min_email,
             dense_rank() over (order by min(min_email)) as grp
      from cte
      group by email1
     ) e
     on t.email = e.email1;
