How to ensure deterministic results for queries that use mode() in Snowflake

Question

I am using Snowflake and want to use mode() on several expressions in a single SELECT statement. It looks like this:

SELECT
x,
y,
mode(col1),
mode(col2),
...
mode(col15)
FROM table
GROUP BY x, y

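My problem is that when there are ties, the output is nondeterministic. The documentation does not explain exactly how ties are resolved. It only says:

If there is a tie for most frequent value (two or more values occur as frequently as each other, and more frequently than any other value), MODE returns one of those values.

https://docs.snowflake.net/manuals/sql-reference/functions/mode.html

I need a workaround that is equivalent to mode() but always produces deterministic output. Something like: use mode(), but if some column has a tie, pick the first value.

I have not provided an example that reproduces the nondeterministic results, because it only seems to happen with larger datasets or more complex queries.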

Tags: sql, snowflake-cloud-data-platform, mode

Answer


In a small example like this one, mode() appears to prefer the first of the tied values it sees:

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,mode(col1)
    ,mode(col2)
    ,mode(col3)
from data 
group by 1
order by 1;

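Swapping which value of the tied 2/20 or 3/30 pairs appears first changes the result, which shows this.

Trying to build a pattern that solves this in a single query, start by counting how often each value occurs within its group: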

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,col1
    ,col2
    ,col3
    ,count(col1)over(partition by x,col1) c_col1
    ,count(col2)over(partition by x,col2) c_col2
    ,count(col3)over(partition by x,col3) c_col3
from data ;

This lends itself to ranking the values by count, with the value itself as the tie-breaker:

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
)
select x
    ,col1
    ,col2
    ,col3 
    ,row_number() over (partition by x order by c_col1 desc, col1) as r1
    ,row_number() over (partition by x order by c_col2 desc, col2) as r2
    ,row_number() over (partition by x order by c_col3 desc, col3) as r3
from (
  select x
      ,col1
      ,col2
      ,col3
      ,count(col1)over(partition by x,col1) c_col1
      ,count(col2)over(partition by x,col2) c_col2
      ,count(col3)over(partition by x,col3) c_col3
  from data 
)
order by 1;

That gives these results, though:

X   COL1  COL2  COL3  R1  R2  R3
1   1     2     3     2   1   1
1   2     2     3     3   2   2
1   1     1     3     1   3   3
4   1     2     3     2   1   1
4   2     20    3     4   4   2
4   2     2     30    3   2   3
4   1     20    30    1   3   4

So you cannot use logic like this

QUALIFY row_number() over (partition by x order by c_col1 desc, col1) = 1
  AND row_number() over (partition by x order by c_col2 desc, col2) = 1
  AND row_number() over (partition by x order by c_col3 desc, col3) = 1

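to pick the best values, because the best row for each column is not the same row: in the results above, no single row has R1, R2 and R3 all equal to 1.

This leads to one CTE (or subquery) per column, very similar to the pattern Gordon showed.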

with data as (
select x, col1, col2, col3 from values (1, 1, 1, 3), (1, 1, 2,3), (1, 2, 2,3)
    ,(4, 1, 20, 30), (4, 1, 2, 3), (4, 2, 2, 30), (4,2,20,3) v(x,col1,col2,col3)
),col1_m as (
    select x, col1, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col1) = 1
),col2_m as (
    select x, col2, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col2) = 1
),col3_m as (
    select x, col3, count(*) as c 
    from data 
    group by 1,2
    QUALIFY row_number() over (partition by x order by c desc, col3) = 1
), base as (
select distinct x from data
)
select b.x
    ,c1.col1
    ,c2.col2
    ,c3.col3
from base as b
left join col1_m as c1 on b.x = c1.x
left join col2_m as c2 on b.x = c2.x
left join col3_m as c3 on b.x = c3.x
order by 1;

This gives the results you would expect:

X   COL1  COL2  COL3
1   1     2     3
4   1     2     3

But you will need to extend X to the full set of grouping columns you care about (x, y, ...).
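As a rough sketch of that extension, the query below applies the same one-CTE-per-column pattern with two grouping columns. The y column and the sample rows are made up for illustration and are not from the question. Every partition by, group by and join condition has to carry the full key (x, y).

```sql
-- Sketch only: y and the sample rows are hypothetical, added for illustration.
with data as (
    select x, y, col1, col2
    from values
        (1, 'a', 1, 10), (1, 'a', 2, 10), (1, 'a', 2, 20), (1, 'a', 1, 20),
        (1, 'b', 5, 7),  (1, 'b', 5, 8),  (1, 'b', 6, 8)
        v(x, y, col1, col2)
), col1_m as (
    -- one row per (x, y): the most frequent col1, ties broken by the smallest col1
    select x, y, col1, count(*) as c
    from data
    group by 1, 2, 3
    qualify row_number() over (partition by x, y order by c desc, col1) = 1
), col2_m as (
    -- same pattern for col2
    select x, y, col2, count(*) as c
    from data
    group by 1, 2, 3
    qualify row_number() over (partition by x, y order by c desc, col2) = 1
), base as (
    -- every distinct grouping key, so that no group is lost in the joins
    select distinct x, y from data
)
select b.x
    ,b.y
    ,c1.col1
    ,c2.col2
from base as b
left join col1_m as c1 on b.x = c1.x and b.y = c1.y
left join col2_m as c2 on b.x = c2.x and b.y = c2.y
order by 1, 2;
```

Working through the counts by hand, group (1, a) has ties on both columns (1 vs 2, and 10 vs 20), so the tie-breaker picks 1 and 10. Group (1, b) has clear winners, 5 and 8. One caveat if your columns are nullable: as far as I know mode() ignores NULLs, while count(*) with group by treats NULL as a value of its own, so you may want a "where col1 is not null" filter in each CTE. Likewise, a nullable grouping column would need a null-safe comparison in the joins instead of plain "=".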

