Spark SQL - Get the row count of each window using Spark SQL window functions

Problem Description

I want to use Spark SQL window functions to do some aggregation and windowing.

Suppose I am using the sample table provided here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html


I want to run a query that gives me, for each category, the top 2 revenues together with the number of products in that category.

After I run this query:

SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2

I get a table like this:

product  category  revenue  count
pro2     tablet    6500     1
mini     tablet    5500     2

instead of

product  category  revenue  count
pro2     tablet    6500     5
mini     tablet    5500     5

which is what I expected.

How should I write my code to get the correct count for each category (rather than using a separate GROUP BY statement)?

Tags: apache-spark, apache-spark-sql

Solution


In Spark, when a window specification includes an ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so an aggregate like count(*) only sees the rows from the start of the partition up to the current row. That is why your query returns the running counts 1 and 2.
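To make that default visible, the following query (a sketch against the same productRevenue table, not part of the original answer) spells out the implicit frame and reproduces the running counts:

SELECT
  product,
  category,
  revenue,
  -- Explicit form of the frame Spark uses implicitly when ORDER BY is present
  count(*) OVER (PARTITION BY category ORDER BY revenue DESC
                 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_count
FROM productRevenue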

For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the count(*) window clause so that its frame covers the entire partition.

Try this:

SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
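
As an aside (an alternative not given in the original answer): since the per-category count does not depend on the ordering, you can instead drop the ORDER BY from the count window entirely; a window with only PARTITION BY defaults to a frame covering the whole partition:

SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    -- No ORDER BY here, so the default frame is the whole partition
    count(*) OVER (PARTITION BY category) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2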
