首页 > 解决方案 > 使用 CTE 与 SubQuery 重构 SQL 查询

问题描述

我正在从 S3 存储桶创建数据集,目前我正在尝试提高查询的性能,因为我目前有两种方法可以工作,但我希望看到更好的查询并学习如何提高我的 sql 技能。抱歉,没有可使用的示例数据集,因为在从 S3 中的 .json 文件中提取数据时,我还没有找到一种实用的方法来提供模拟数据。

查询 #1

 WITH block_1 AS
    (
    SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4
    from '@S3_BUCKET/', 
     lateral flatten( input => $1:value)), block_2 as 

(
SELECT 
VALUE:COL1 AS COL1, 
max(VALUE:COL4) AS MaxCOL4
from '@S3_BUCKET/', 
lateral flatten( input => $1:value)
group by COL1
 )

select b.COL1 as COL1B, b.COLB as COL1B, 
 a.COL3, a.COL4 from block_1 as A
join block_2 b 
on a.COL1 = b.COL1  and a.COL4 = b.MaxCOL4
 ;

QUERY #2 ,我觉得这是一个改进,特别是因为您不需要在最终SELECT语句中指定您想要的列(就像我在上面所做的那样)

select a.* from 
(
SELECT 
VALUE:COL1 AS COL1, 
VALUE:COL2 AS COL2, 
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4
from '@S3_BUCKET/', 
lateral flatten( input => $1:value))a 
join 
(
select COL1, MAX(COL4) COL4
from 
(
SELECT 
VALUE:COL1 AS COL1, 
VALUE:COL2 AS COL2, 
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4
from '@S3_BUCKET/', 
 lateral flatten( input => $1:value))
group by COL1) b
on a.COL1 = b.COL1 and a.COL4 = b.Col4;

以上两个是我目前的尝试,想知道是否有办法让这个查询更好?我想的另一条路线可能是使用 "where in" 和 COL1 的列表,但基本上我仍然必须点击 s3 2x ,如上面的查询。

标签: sqlrefactoringsnowflake-cloud-data-platform

解决方案


您应该能够使用window functions,特别RANK()是简化此查询:

WITH block_1 AS (
    SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
    FROM '@S3_BUCKET/', 
     lateral flatten( input => $1:value)
)
SELECT COL1, COL2, COL3, COL4
FROM block_1
WHERE rk = 1

这可以通过 Snowflake 的QUALIFY子句来简化,它允许您在有效的HAVING子句中使用窗口函数的别名:

SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
FROM '@S3_BUCKET/', 
     lateral flatten( input => $1:value)
QUALIFY rk = 1

推荐阅读