How do I get the median price?

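Problem description

In my dataset, each shop sells some books, and each shop sets its own price for each book. The data contains price information for every book. Using a query in Amazon Athena, I want to calculate the median price for each shop and each product over a specific time period.

Honestly, though, I have no idea how to do this. Here is my query so far: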

SELECT product_id,
       shop_id,
       XXX AS median_price
FROM data_f
    WHERE site_id = 10
            AND year || month || day || hour >= '2020022500'
            AND year || month || day || hour < '2020022600'
GROUP BY product_id, shop_id

Thanks!

Tags: sql, amazon-athena

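Solution

Unfortunately, Athena does not support a median() aggregate function or an exact percentile() function (its approx_percentile() returns only an approximation). Perhaps the simplest approach is to use ntile(2) in a subquery and then take the maximum of the first tile (or the minimum of the second tile):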

SELECT product_id, shop_id,
       MAX(CASE WHEN tile2 = 1 THEN price END) as median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) as tile2
      FROM data_f d
      WHERE site_id = 10 AND
            action NOT IN ('base', 'delete') AND
            year || month || day || hour >= '2020022500' AND
            year || month || day || hour < '2020022600'
     ) d
GROUP BY product_id, shop_id;
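
To see why the maximum of the first tile lands on (or next to) the median, here is a minimal Python sketch that mimics how NTILE(2) splits the sorted rows. It assumes the standard behavior where, for an odd row count, the first tile receives the extra row. The helper names `ntile2` and `approx_median` and the sample prices are hypothetical and exist only for illustration.

```python
def ntile2(prices):
    """Mimic NTILE(2) OVER (ORDER BY price): return (tile1, tile2)."""
    ordered = sorted(prices)
    # With an odd row count, the first tile receives the extra row.
    size_first = (len(ordered) + 1) // 2
    return ordered[:size_first], ordered[size_first:]


def approx_median(prices):
    """MAX(CASE WHEN tile2 = 1 THEN price END): the largest price in tile 1."""
    tile1, _ = ntile2(prices)
    return max(tile1)


odd = [9.99, 12.50, 7.25]           # three hypothetical prices
even = [9.99, 12.50, 7.25, 15.00]   # four hypothetical prices

print(approx_median(odd))   # the middle value of the sorted prices
print(approx_median(even))  # the lower of the two middle values
```

With an odd number of rows the result is the true median; with an even number it is the lower of the two middle values, which is the gap the note below addresses.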

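Note: This is almost certainly good enough for any practical purpose. Strictly speaking, however, when the number of rows is even, the median is usually defined as the average of the two middle values. If you want to be pedantic: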

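SELECT product_id, shop_id,
       (CASE WHEN COUNT(*) % 2 = 0
             THEN (MAX(CASE WHEN tile2 = 1 THEN price END) +
                   MIN(CASE WHEN tile2 = 2 THEN price END)
                  ) / 2.0
             ELSE MAX(CASE WHEN tile2 = 1 THEN price END)
        END) as median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) as tile2
      FROM data_f d
      WHERE site_id = 10 AND action NOT IN ('base', 'delete') AND
            year || month || day || hour >= '2020022500' AND
            year || month || day || hour < '2020022600'
     ) d
GROUP BY product_id, shop_id;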
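
One way to check the odd/even logic without an Athena account is to run the same CASE expression against a small in-memory SQLite table. This is only a sketch: SQLite stands in for Athena, window functions need SQLite 3.25 or later, the sample rows are made up, and the site and time filters are left out to keep it short.

```python
import sqlite3

# In-memory stand-in for the Athena table; only the columns the median needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_f (product_id INTEGER, shop_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO data_f VALUES (?, ?, ?)",
    [
        (1, 1, 10.0), (1, 1, 20.0), (1, 1, 40.0),                 # odd row count
        (2, 1, 10.0), (2, 1, 20.0), (2, 1, 30.0), (2, 1, 40.0),  # even row count
    ],
)

QUERY = """
SELECT product_id, shop_id,
       (CASE WHEN COUNT(*) % 2 = 0
             THEN (MAX(CASE WHEN tile2 = 1 THEN price END) +
                   MIN(CASE WHEN tile2 = 2 THEN price END)
                  ) / 2.0
             ELSE MAX(CASE WHEN tile2 = 1 THEN price END)
        END) AS median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) AS tile2
      FROM data_f d
     ) d
GROUP BY product_id, shop_id
ORDER BY product_id, shop_id
"""

medians = conn.execute(QUERY).fetchall()
print(medians)  # [(1, 1, 20.0), (2, 1, 25.0)]
```

Product 1 has three prices, so the query returns the middle one (20.0); product 2 has four, so it returns the average of the two middle prices (25.0).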
