首页 > 解决方案 > 加入而不创建重复记录

问题描述

我有 2 个表产品表和类别表。以下是类别表的外观 -

Product_ID     Category             sub-category     
   123         Furniture             Bookcases             
   123         Furniture             Chairs                
   456         Office supplies       Labels                
   456         Office supplies       Binders               
   456         Office supplies       Binders

产品表如下所示 -

Product_ID   Quantities_sold
   123           300
   456           400

当我内部加入产品和类别表时,我看到许多重复项,因为如下所示的不同子类别 -

Product_ID     Category             sub-category     Quantities_sold
   123         Furniture             Bookcases             300
   123         Furniture             Chairs                300
   456         Office supplies       Labels                400
   456         Office supplies       Binders               400
   456         Office supplies       Binders               400

预期的结果是这样的——

Product_ID     Category             sub-category     Quantities_sold
   123         Furniture             Bookcases             300
                                     Chairs                
   456         Office supplies       Labels                400
                                     Binders               

有没有办法拥有所有子类别但没有重复记录?

注意——我有一个庞大的数据集,包含数百万条记录、26 个类别和 135 个子类别。

标签: sqlgoogle-bigquery

解决方案


以下是 BigQuery 标准 SQL

#standardSQL
SELECT Product_ID, Category, SubCategories, Quantities_sold
FROM `project.dataset.product` p
LEFT JOIN (
  SELECT Product_ID, Category, STRING_AGG(DISTINCT SubCategory ) SubCategories
  FROM `project.dataset.category`
  GROUP BY Product_ID, Category
) c
USING (Product_ID)  

如果适用于您的问题的样本数据 - 结果是

Row Product_ID  Category        SubCategories       Quantities_sold  
1   123         Furniture       Bookcases,Chairs    300  
2   456         Office supplies Labels,Binders      400   

或使用 ARRAY_AGG 而不是 STRING_AGG 将子类别作为数组获取,如下例所示

#standardSQL
SELECT Product_ID, Category, SubCategories, Quantities_sold
FROM `project.dataset.product` p
LEFT JOIN (
  SELECT Product_ID, Category, ARRAY_AGG(DISTINCT SubCategory ) SubCategories
  FROM `project.dataset.category`
  GROUP BY Product_ID, Category
) c
USING (Product_ID)   

在这种情况下,结果如下

Row Product_ID  Category        SubCategories   Quantities_sold  
1   123         Furniture       Bookcases       300  
                                Chairs       
2   456         Office supplies Labels          400  
                                Binders      

推荐阅读