sql - 使用表中对象的计数(*)来限制 SQL 查询的最有效方法?
问题描述
我有下表:
CREATE TABLE example (
accountNumber,
z, --long string that contains accountID within
status, -- red green or blue
partition_0,
);
INSERT INTO example
Values(123, 'abcde: xyz, accountID: 10434', 'green', 20211010),
(124, 'abcde: xhf, accountID: 25949', 'green', 20211010),--accountNumber 124 would be excluded because it has >= different accountID (25949 and 34322)
(124, 'abcde: xhf, accountID: 34322', 'green', 20211010),
(124, 'abcde: xhf, accountID: 25949', 'red', 20211011),
(154, 'abcde: dasd, accountID: 43341', 'green', 20211010);
该表可以有 100,000,000->10,000,000,000 行。有时填充此表的系统无法正常工作,我得到具有多个 accountID 的 accountNumbers。它们应该是 1:1。那些不是工件,我想将它们过滤掉。我有以下设置:
with cte as (
select accountNumber,
upper(substr(z,strpos(z,'accountID')+11,5)) as accountID, partition_0 from example),
cte_limiter_raw as (
select accountNumber, count(distinct accountID) as countAccountID from cte
group by accountNumber), --I chose to put the limiter here because I could call accountID instead of the substr function
cte_limiter as (
select accountNumber from cte_limiter_raw
where countAccountID = 1)
select accountNumber, accountID, partition_0,
sum(case when status like 'Green' then 1 end) as countGreen,
sum(case when status like 'Red' then 1 end) as countRed
from cte
where accountNumber in (select accountNumber from cte_limiter)
group by accountNumber, accountID, partition_0
这似乎有点笨拙。有没有更有效的方法来做到这一点?我在亚马逊雅典娜上查询。
解决方案
如果我理解正确,你想这样做:
SELECT accountNumber
FROM example
GROUP BY accountNumber
HAVING COUNT(DISTINCT UPPER(SUBSTR(z, STRPOS(z, 'accountID') + 11, 5))) > 1
即找到所有accountNumber
在z
.
以上是表达查询的一种更简单的方式,但它是全表扫描。如果您有某种方式仅对可能限制影响的新数据运行查询。
您的查询将执行两次全表扫描,因为它会执行SELECT … FROM cte
两次。Athena 不会重用中间结果(这是 Presto 的限制,它比寻找分布式引擎来做到这一点要困难得多)。您可以重写查询以避免相关子查询,但我认为使用HAVING
更容易。
HAVING
类似于WHERE
,但在不同的阶段进行评估,因此您可以表达组的属性。另一种编写相同内容(并获得相同查询计划)的方式如下:
WITH counts AS (
SELECT
accountNumber,
COUNT(DISTINCT UPPER(SUBSTR(z, STRPOS(z, 'accountID') + 11, 5))) AS count
FROM example
GROUP BY accountNumber
)
SELECT accountNumber
FROM counts
WHERE "count" > 1
推荐阅读
- android - 使用 SharedPreferences 存储数据
- javascript - 如何在苗条框架的树枝视图中包含外部 css 和 javascript 文件
- ruby-on-rails - 基于模型属性的 StateMachine 转换
- python - 使用 python GDAL API 创建栅格的空副本
- corda - Gradle 任务执行失败 org.gradle.api.GradleException:无效距离太远
- html - 如何根据窗口高度使元素滚动或居中
- twitter-bootstrap - Bootstrap 4中包含两行的表单
- neo4j - Neo4J - Cypher:通过多个标签的多个节点的 allShortestPaths
- vba - 使用复制函数 VBA 时应用程序定义或对象定义的错误
- c# - 最小化或背景时的 UWP/C# 后台下载器