首页 > 解决方案 > 基于两列删除除大查询标准sql中最新出现的所有重复项

问题描述

如果我有一个包含以下内容的大查询表

fruit   color   quantity   age   other_field
apple   red      3         1      foo
grapes  green    5         1      young
apple   green    1         3      word
apple   red      4         5      bar

我将如何删除除包含相同水果和颜色列的最后一个实例之外的所有行,以便我的表看起来像这样

fruit   color   quantity   age   other_field
grapes  green    5         1      young
apple   green    1         3      word
apple   red      4         5      bar

基本上只为大查询标准sql中的每对独特的水果和颜色保留一行?

标签: google-bigqueryduplicates

解决方案


以下是 BigQuery 标准 SQL

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'apple' fruit, 'red' color, 3 quantity, 1 age, 'foo' other_field UNION ALL
  SELECT 'grapes', 'green', 5, 1, 'young' UNION ALL
  SELECT 'apple', 'green', 1, 3, 'word' UNION ALL
  SELECT 'apple', 'red', 4, 5, 'bar' 
)
SELECT fruit, color, 
  ARRAY_AGG(STRUCT(quantity, age, other_field) ORDER BY age DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY fruit, color   

结果

Row fruit   color   quantity    age other_field  
1   apple   red     4           5   bar  
2   grapes  green   5           1   young    
3   apple   green   1           3   word     

另一个版本是:

#standardSQL
SELECT AS VALUE
  ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY fruit, color

结果相同......但显然我更喜欢这个版本:o)


推荐阅读