BigQuery - Collapse rows by group

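Problem description

In BigQuery, I have a table containing items from different data sources. Record linkage is performed across the columns, and the result is stored in the table shown below. I now need to collapse the results so that, in each column, every value appears in only one row. For example, I have the following table: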

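(image not preserved)

with the following values:

(image not preserved; the CREATE TABLE statement in the answer below recreates these rows)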

I am looking for a result like this:

---------------------------------------
| Row | col_1 | col_2 | col_3 | col_4 |
---------------------------------------
|  1  |   4   | NULL  | NULL  |  NULL |
---------------------------------------
|  2  |   0   |   1   |  2    |  NULL |
---------------------------------------
|  3  |   5   |  NULL |  6    |   7   |
---------------------------------------

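In the result above, rows 2 and 3 of the input are collapsed because they both have the value 1 in col_2. Likewise, rows 4 and 5 are collapsed because they both have a 6 in col_3. I tried the following, but I could not get it to work. Does anyone have any suggestions?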

SELECT
  FIRST_VALUE(c1.col_1) OVER (PARTITION BY c1.col_1 ORDER BY 1 ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_1,
  FIRST_VALUE(c2.col_2 IGNORE NULLS) OVER (PARTITION BY c1.col_1 ORDER BY 1 ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_2,
  FIRST_VALUE(c3.col_3 IGNORE NULLS) OVER (PARTITION BY c2.col_2 ORDER BY 1 ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_3,
  FIRST_VALUE(c4.col_4 IGNORE NULLS) OVER (PARTITION BY c3.col_3 ORDER BY 1 ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col_4,
FROM
  dataset.collapse_test  c1
LEFT JOIN
  dataset.collapse_test c2
ON
  c1.col_2 = c2.col_2
LEFT JOIN
  dataset.collapse_test c3
ON
  c2.col_2 = c3.col_2
LEFT JOIN
  dataset.collapse_test c4
ON
  c3.col_3 = c4.col_3

Tags: sql, google-bigquery

Solution


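This query has some issues because, as @Gordon Linoff said in the comments, certain rows can cause problems when multiple values match.

So here I try to generate the values you want while also keeping the old ones, because, as mentioned, in some cases it is hard to tell when they should be kept.

First, we create the table: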

CREATE TABLE `project.dataset.collapse` (
   col_1 INT64,
   col_2 INT64,
   col_3 INT64,
   col_4 INT64
) AS SELECT 4 AS col_1, NULL AS col_2, NULL AS col_3, NULL AS col_4 UNION ALL
     SELECT 0 AS col_1, 1 AS col_2, NULL AS col_3, NULL AS col_4 UNION ALL
     SELECT NULL AS col_1, 1 AS col_2, 2 AS col_3, NULL AS col_4 UNION ALL
     SELECT NULL AS col_1, NULL AS col_2, 6 AS col_3, 7 AS col_4 UNION ALL
     SELECT 5 AS col_1, NULL AS col_2, 6 AS col_3, NULL AS col_4

Then we run our solution for this case. I use the IFNULL function to keep only the non-null values, and an INNER JOIN with equality conditions:

SELECT
   -- For each column, take t1's value when present, otherwise t2's.
   IFNULL(t1.col_1, t2.col_1) AS c1,
   IFNULL(t1.col_2, t2.col_2) AS c2,
   IFNULL(t1.col_3, t2.col_3) AS c3,
   IFNULL(t1.col_4, t2.col_4) AS c4
FROM
  `project.dataset.collapse` AS t1
INNER JOIN
  `project.dataset.collapse` AS t2
ON
  t1.col_2 = t2.col_2
  OR t1.col_1 = t2.col_1
  OR t1.col_3 = t2.col_3
  OR t1.col_4 = t2.col_4
GROUP BY
  c1,
  c2,
  c3,
  c4

The result we get looks like this:

(image not preserved)
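Since the result image is missing, the output of the self-join can be checked locally. The following is a minimal sketch, not part of the original answer, that runs the same join logic in an in-memory SQLite database from Python, assuming that COALESCE and SELECT DISTINCT behave like the IFNULL and GROUP BY used in the BigQuery query. The table name collapse and the helper merged_rows are hypothetical names chosen for this illustration.

```python
import sqlite3

# Sample rows copied from the CREATE TABLE statement in the answer.
ROWS = [
    (4, None, None, None),
    (0, 1, None, None),
    (None, 1, 2, None),
    (None, None, 6, 7),
    (5, None, 6, None),
]

# Same join logic as the BigQuery query, written so that it runs in SQLite:
# COALESCE fills t1's NULLs from t2, and DISTINCT removes duplicate rows.
QUERY = """
SELECT DISTINCT
  COALESCE(t1.col_1, t2.col_1) AS c1,
  COALESCE(t1.col_2, t2.col_2) AS c2,
  COALESCE(t1.col_3, t2.col_3) AS c3,
  COALESCE(t1.col_4, t2.col_4) AS c4
FROM collapse AS t1
INNER JOIN collapse AS t2
  ON t1.col_1 = t2.col_1
  OR t1.col_2 = t2.col_2
  OR t1.col_3 = t2.col_3
  OR t1.col_4 = t2.col_4
"""


def merged_rows():
    """Load the sample rows into SQLite and return the join result as a set."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collapse "
        "(col_1 INTEGER, col_2 INTEGER, col_3 INTEGER, col_4 INTEGER)"
    )
    conn.executemany("INSERT INTO collapse VALUES (?, ?, ?, ?)", ROWS)
    result = set(conn.execute(QUERY).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    for row in sorted(merged_rows(), key=repr):
        print(row)
```

On the sample data this returns seven rows: the five original rows (each row joins with itself) plus the two merged rows (0, 1, 2, NULL) and (5, NULL, 6, 7). That is why a second step is needed to remove the old values.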

To remove the old values, the solution would be something like:

SELECT
   *
FROM (previous_query)  -- placeholder for the query above
   WHERE
     -- Build one key per row; -1 stands in for NULL so the key is never NULL
     -- (this assumes -1 is not a real value in the data).
     CONCAT(IFNULL(c1, -1), IFNULL(c2, -1), IFNULL(c3, -1), IFNULL(c4, -1))
     NOT IN (
        -- Keys of the original rows, which are the "old" values to drop.
        SELECT
            CONCAT(IFNULL(col_1, -1), IFNULL(col_2, -1), IFNULL(col_3, -1), IFNULL(col_4, -1))
        FROM
           `project.dataset.collapse` )

Keep in mind that this solution will also remove the [4 null null null] row.

Result:

(image not preserved)
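The self-join above merges rows pairwise, so a chain of three or more linked rows would not collapse into a single row in one pass, and the filter step drops rows that had no match. As an illustration only, and not part of the original answer, the sketch below treats the task as finding connected groups of rows with a union-find structure, here in Python rather than BigQuery SQL. The function name collapse_rows is hypothetical, and taking the first non-null value per column is an assumption: when a group holds conflicting values in one column, which is the case raised in the comments, this sketch simply keeps the first one.

```python
from collections import defaultdict


def collapse_rows(rows):
    """Merge rows that share a non-null value in the same column, transitively."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so groups stay in input order.
            parent[max(ri, rj)] = min(ri, rj)

    # Link every row to the first row that had the same (column, value) pair.
    first_seen = {}
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is None:
                continue
            key = (col, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[find(i)].append(row)

    collapsed = []
    for root in sorted(groups):
        members = groups[root]
        merged = []
        for col in range(len(members[0])):
            values = [r[col] for r in members if r[col] is not None]
            # Assumption: the first non-null value wins; conflicts are ignored.
            merged.append(values[0] if values else None)
        collapsed.append(tuple(merged))
    return collapsed


if __name__ == "__main__":
    sample = [
        (4, None, None, None),
        (0, 1, None, None),
        (None, 1, 2, None),
        (None, None, 6, 7),
        (5, None, 6, None),
    ]
    for row in collapse_rows(sample):
        print(row)
```

On the sample rows this yields (4, None, None, None), (0, 1, 2, None) and (5, None, 6, 7), which matches the table requested in the question, including the unmatched first row.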

