首页 > 解决方案 > How to count duplicates based on group by as well as see if values are present in the same field values?

问题描述

I have a table that contains ~1mil rows that looks like so

GROUP  |  ID  

A        AED12
A        6D04K
A        AED12 
B        VFR54
B        VFR54
B        9KMN2
B        AED12 
C        9KMN2
C        9KMN2
C        VFR54

I would like to have an output that counts the number of duplicates that is grouped by 'GROUP' WHILE also seeing which 'IDs' are present across multiple groups (reference 'AED12' that is in both GROUP A and B).

I am using this query for first question of duplicates across each group

select GROUP, ID, count(ID)
from TABLE1
group by GROUP, ID
having count(ID) > 1
;

output of query above

GROUP  |  ID   | COUNT

A        AED12     2
B        VFR54     2
C        9KMN2     2

I would like to sum the number of duplicates per group (how many dups are in Group A, B, etc.) which would like so (made up #s for example purposes)

GROUP | Sum of Dups IDs
A         2134
B         23321
C         11235

While also identifying which IDs are present in more than one group and count how many are present across the groups. Output would be something like a case when field that says is ID present in more than 1 group, if so 'true', if not - 'false'. Still thinking about how to count/group what the output would look like since there are million plus rows..

Any help would be much appreciated - new to Snowflake.

标签: sqlgroup-bycountsnowflake-cloud-data-platform

解决方案


您可以使用窗口函数计算 id 出现在不同组中的数量,然后进行聚合。这是一个计算单例和对的示例:

select group,
       count(*) as num_elements,
       count_if(num_groups = 1) as num_onesies,
       count_if(num_groups = 2) as num_twosies
from (select t.*,
             count(*) over (partition by id) as num_groups
      from table1 t
     ) t
group by group;

编辑:

如果您有重复项,最好在进行上述处理之前将其删除:

select group,
       count(*) as num_elements,
       count_if(num_groups = 1) as num_onesies,
       count_if(num_groups = 2) as num_twosies
from (select t.*,
             count(*) over (partition by id) as num_groups
      from (select distinct group, id
            from table1 t
           ) t
     ) t
group by group

推荐阅读