首页 > 解决方案 > 使用表中对象的计数(*)来限制 SQL 查询的最有效方法?

问题描述

我有下表:

CREATE TABLE example (
    accountNumber,
    z, --long string that contains accountID within
    status, -- red green or blue
    partition_0, 
   
);
INSERT INTO example
Values(123, 'abcde: xyz, accountID: 10434', 'green', 20211010),
(124, 'abcde: xhf, accountID: 25949', 'green', 20211010),--accountNumber 124 would be excluded because it has >= different accountID (25949 and 34322)
(124, 'abcde: xhf, accountID: 34322', 'green', 20211010),
(124, 'abcde: xhf, accountID: 25949', 'red', 20211011),
(154, 'abcde: dasd, accountID: 43341', 'green', 20211010);

该表可以有 100,000,000->10,000,000,000 行。有时填充此表的系统无法正常工作,我得到具有多个 accountID 的 accountNumbers。它们应该是 1:1。那些不是工件,我想将它们过滤掉。我有以下设置:

with cte as (
select accountNumber, 
upper(substr(z,strpos(z,'accountID')+11,5)) as accountID, partition_0 from example),

cte_limiter_raw as (
select accountNumber, count(distinct accountID) as countAccountID from cte
group by accountNumber), --I chose to put the limiter here because I could call accountID instead of the substr function

cte_limiter as (
select accountNumber from cte_limiter_raw
where countAccountID = 1)

select accountNumber, accountID, partition_0,
sum(case when status like 'Green' then 1 end) as countGreen,
sum(case when status like 'Red' then 1 end) as countRed
from cte
where accountNumber in (select accountNumber from cte_limiter)
group by accountNumber, accountID, partition_0

这似乎有点笨拙。有没有更有效的方法来做到这一点?我在亚马逊雅典娜上查询。

标签: sqlperformanceamazon-s3amazon-athena

解决方案


如果我理解正确,你想这样做:

SELECT accountNumber
FROM example
GROUP BY accountNumber
HAVING COUNT(DISTINCT UPPER(SUBSTR(z, STRPOS(z, 'accountID') + 11, 5))) > 1

即找到所有accountNumberz.

以上是表达查询的一种更简单的方式,但它是全表扫描。如果您有某种方式仅对可能限制影响的新数据运行查询。

您的查询将执行两次全表扫描,因为它会执行SELECT … FROM cte两次。Athena 不会重用中间结果(这是 Presto 的限制,它比寻找分布式引擎来做到这一点要困难得多)。您可以重写查询以避免相关子查询,但我认为使用HAVING更容易。

HAVING类似于WHERE,但在不同的阶段进行评估,因此您可以表达组的属性。另一种编写相同内容(并获得相同查询计划)的方式如下:

WITH counts AS (
  SELECT
    accountNumber,
    COUNT(DISTINCT UPPER(SUBSTR(z, STRPOS(z, 'accountID') + 11, 5))) AS count
  FROM example
  GROUP BY accountNumber
)
SELECT accountNumber
FROM counts
WHERE "count" > 1

推荐阅读