首页 > 解决方案 > SQL:用条件填充缺失的记录

问题描述

我需要按日期计算库存中存在的产品数量。然而,在数据库中,产品仅在消费者查看时才被记录。

例如考虑这个基本的表结构:

date      |  productId |   views
July 1    |  A         |   8
July 2    |  A         |   6
July 2    |  B         |   4
July 3    |  A         |   2
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  C         |   2
July 10   |  A         |   17

使用以下查询,我尝试确定给定日期库存中的产品数量。

select date, count(distinct productId) as Inventory, sum(views) as views
from (
   select date, productId, count(*) as views
   from SomeTable
   group by date, productID
   order by date asc, productID asc
)
group by date

这是输出

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  1         |   2
July 4    |  3         |   18
July 5    |  1         |   2
July 10   |  1         |   17

由于缺少行,我的输出不能准确反映库存中有多少产品。

对库存的正确理解如下:
- 产品 A 于 7 月 1 日至 7 月 10 日存在库存。
- 产品 B 存在于 7 月 2 日至 7
月 4 日的库存。 - 产品 C 存在于 7 月 4 日至 7 月 5 日的库存。

正确的 SQL 输出应该是:

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  2         |   2
July 4    |  3         |   18
July 5    |  2         |   2
July 6    |  1         |   0
July 7    |  1         |   0
July 8    |  1         |   0
July 9    |  1         |   0
July 10   |  1         |   17

如果您继续跟进,请让我确认我很乐意将“库存中”定义为第一个视图和最后一个视图之间的日期差。

我遵循了以下错误过程:

首先,我创建了一个表格,它是每个 productID 和每个日期的笛卡尔积。''' 日期为(从 SomeTable 中选择日期,按日期分组),产品为(从 SomeTable 中按 productId 选择 productId)选择 Dates.date,Products.productId 从 Dates 中交叉加入产品 '''

然后我尝试进行右外连接以将其减少为仅丢失的记录:

with Records as (
select date, productId, count(*) as views
from SomeTable
group by date, productId
),
Cartesian as (
{See query above}
)
Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union
from Cartesian right outer join Records 
on Cartesian.date = Records.date 
where Records.productId is null

然后,将缺少的行合并到记录中。这样做,我创建了一个新问题:额外的行。

date      |  productId |   views
July 1    |  A         |   8
July 1    |  B         |   0
July 1    |  C         |   0
July 2    |  A         |   6
July 2    |  B         |   4
July 2    |  C         |   0
July 3    |  A         |   2
July 3    |  B         |   0
July 3    |  C         |   0
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  A         |   2
July 5    |  B         |   0
July 5    |  C         |   0
July 6    |  A         |   0
July 6    |  B         |   0
July 6    |  C         |   0
July 7    |  A         |   0
July 7    |  B         |   0
July 7    |  C         |   0
July 8    |  A         |   0
July 8    |  B         |   0
July 8    |  C         |   0
July 9    |  A         |   0
July 9    |  B         |   0
July 9    |  C         |   0
July 10   |  A         |   17
July 10   |  B         |   0
July 10   |  C         |   0

select date, count(distinct productId) as Inventory, sum(views) as views 当我在该表上运行我的简单查询 时,我再次得到错误的输出:

date      |  Inventory |   views
July 1    |  3         |   8
July 2    |  3         |   10
July 3    |  3         |   2
July 4    |  3         |   18
July 5    |  3         |   2
July 6    |  3         |   0
July 7    |  3         |   0
July 8    |  3         |   0
July 9    |  3         |   0
July 10   |  3         |   17

我的下一个想法是遍历每个 productId,确定它的第一个和最后一个日期,然后将它与笛卡尔表联合,条件是 Cartesian.date 介于每个特定产品的第一个和最后一个日期之间。
必须有一种更简单的方法来做到这一点。谢谢。

标签: sqlgoogle-bigquery

解决方案


以下是 BigQuery 标准 SQL

#standardSQL
WITH dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day

如果适用于您的问题中的示例数据,如下例所示

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL
  SELECT '2019-07-02', 'A', 6 UNION ALL
  SELECT '2019-07-02', 'B', 4 UNION ALL
  SELECT '2019-07-03', 'A', 2 UNION ALL
  SELECT '2019-07-04', 'A', 8 UNION ALL
  SELECT '2019-07-04', 'B', 6 UNION ALL
  SELECT '2019-07-04', 'C', 4 UNION ALL
  SELECT '2019-07-05', 'C', 2 UNION ALL
  SELECT '2019-07-10', 'A', 17 
), dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day
-- ORDER BY day

结果是

Row day         Inventory   views    
1   2019-07-01  1           8    
2   2019-07-02  2           10   
3   2019-07-03  2           2    
4   2019-07-04  3           18   
5   2019-07-05  2           2    
6   2019-07-06  1           0    
7   2019-07-07  1           0    
8   2019-07-08  1           0    
9   2019-07-09  1           0    
10  2019-07-10  1           17   

推荐阅读