首页 > 解决方案 > BigQuery - 从按日期分区的表中删除行

问题描述

我有一个按日期(100 个分区)分区的 dataset.table,如下所示:

table_name_(100) 这意味着:table_name_20200101, table_name_20200102, table_name_20200103, ...

table_name_20200101 的示例:

|       id      |       col_1       |       col_2       |       col_3       |
-----------------------------------------------------------------------------
|       xxx     |       2           |       6           |       10          |
|       yyy     |       1           |       60          |       29          |
|       zzz     |       12          |       61          |       78          |
|       aaa     |       18          |       56          |       80          |

我想删除所有表(分区)中的行 ID = yyy:

DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'

我收到了这个错误:

元表 project_id:dataset_id.table_name_* 上的非法操作(写入)

有没有办法删除所有表(分区)中的行'yyy'?

谢谢

标签: python-3.xgoogle-cloud-platformgoogle-bigquery

解决方案


好的,这里需要注意一些不同的事情,以确保我们使用一致的术语。

  1. 您说的是分片表,而不是分区表。在分区表中,表中的数据是根据分区规范组织的。在这里,您只有一系列使用公共前缀和基于日期的后缀命名的表。

  2. 语法的使用table_prefix*称为通配符表,并且通过通配符表明确不允许 DML:https ://cloud.google.com/bigquery/docs/querying-wildcard-tables

  3. table_name_(100)是 BigQuery UI 如何折叠一系列同名表以节省导航窗格空间的一个方面。这根本不是服务本身引用表的方式。

实现此目的的方法是利用 BigQuery 的其他方面:INFORMATION_SCHEMA 表和脚本功能。

有关数据集中哪些表的信息可通过TABLES视图获得:https ://cloud.google.com/bigquery/docs/information-schema-tables

可以在此处找到有关脚本的信息:https ://cloud.google.com/bigquery/docs/reference/standard-sql/scripting

现在,这是一个结合了这些概念的示例:

DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;

# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT
    ARRAY_AGG(t)
  FROM (
    SELECT
      TABLE_NAME as t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES 
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);

# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables)
    THEN LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  # 
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table.  For safety sake, you may want to start with 
  # something like a SELECT query to validate your assumptions and project the 
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;

推荐阅读