首页 > 解决方案 > 如果它们有多个元组,则选择具有两列相同但另一列不同的行

问题描述

我有一个包含这些列的数据库表:

local  domain  email_sha256  password  password_sha256
a      b       ...           C         ...
a      bb      ...           C         ...
a      bb      ...           CC        ...
a      bbb     ...           C         ...
aa     bb      ...           CCC       ...
aa     bb      ...           CC        ...

localdomain部分本质上是在@ 字符处拆分的电子邮件。

test@gmail.com

本地 = 测试

域 = gmail.com

我想找到所有具有相同localpassword对但具有不同的行domainlocal如果我只使用,domainpassword列,这将返回类似的东西

local  domain  password
a      b       C
a      bb      C
a      bbb     C

我一直在尝试首先识别所有具有重复项的 对localpassword

SELECT local, password 
FROM tablename
GROUP BY local, password
HAVING count(*) > 1

现在要获得更多的列,而不仅仅是GROUP BYJOIN在桌子上做的

SELECT local, domain, password 
FROM tablename
JOIN (SELECT local, domain FROM tablename GROUP BY local, password HAVING count(*) > 1)
USING (local, password)

现在为了确保域不同,我再次加入表本身并添加一个WHERE子句。为了避免重复,我使用了GROUP BY. 这是我的最后一个查询。

SELECT A.local, A.domain, A.password
FROM tablename as A
JOIN 
    (SELECT  local, domain, password 
    FROM tablename
    JOIN 
        (SELECT local, password 
        FROM tablename 
        GROUP BY local, password 
        HAVING count(*) > 1) 
    USING (local, password)) as B
USING (local, password)
WHERE A.password = B.password AND A.domain != B.domain AND A.local = B.local
GROUP BY local, domain, password
ORDER BY local, password

我是否使用此查询删除了潜在的有效结果?此外,是否有更快/更好的查询来运行并获得相同的结果?

谢谢。

注意:此表没有唯一的 id,但我可能没有重复email_sha256的 ,password_sha256对,因此它们可以用作 id。

标签: sqlgoogle-bigquery

解决方案


以下是 BigQuery 标准 SQL

#standardSQL
WITH remove_dup_domains AS (
  SELECT rec.* FROM (
    SELECT local, domain, password, ANY_VALUE(t) rec
    FROM `project.dataset.table` t
    GROUP BY local, domain, password
  )
)
SELECT y.* FROM (
  SELECT ARRAY_AGG(t) bin 
  FROM remove_dup_domains t
  GROUP BY local, password
  HAVING COUNT(1) > 1
)x, x.bin y

您可以使用您问题中的示例数据进行测试,使用上面的示例数据,如下例所示

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' local, 'b' domain, 'C' password, 'whatever else1' other_cols UNION ALL
  SELECT 'a', 'bb', 'C', 'whatever else2' UNION ALL
  SELECT 'a', 'bb', 'CC', 'whatever else3' UNION ALL
  SELECT 'a', 'bbb', 'C', 'whatever else4' UNION ALL
  SELECT 'a', 'bbbb', 'D', 'whatever else5' UNION ALL
  SELECT 'a', 'bbbbb', 'E', 'whatever else6' UNION ALL
  SELECT 'aa', 'bb', 'CCC', 'whatever else7' UNION ALL
  SELECT 'aa', 'bb', 'CC', 'whatever else8' UNION ALL
  SELECT 'aaa', 'com', 'H', 'whatever else9' UNION ALL
  SELECT 'aaa', 'com', 'H', 'whatever else10' 
), remove_dup_domains AS (
  SELECT rec.* FROM (
    SELECT local, domain, password, ANY_VALUE(t) rec
    FROM `project.dataset.table` t
    GROUP BY local, domain, password
  )
)
SELECT y.* FROM (
  SELECT ARRAY_AGG(t) bin 
  FROM remove_dup_domains t
  GROUP BY local, password
  HAVING COUNT(1) > 1
)x, x.bin y   

结果

Row local   domain  password    other_cols   
1   a       b       C           whatever else1   
2   a       bb      C           whatever else2   
3   a       bbb     C           whatever else4    

推荐阅读