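google-bigquery - How to export null fields in BigQuery

Problem description

When I export JSON objects from BigQuery, any field whose value is null disappears from the downloaded result.

Example export query: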

EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS
SELECT NULL AS field1

The actual result is: {}

The expected result is: {field1: null}

How can I force the export to include null values, as shown in the expected result?

Tags: google-bigquery, bigquery-udf

For this case, you can use either of the following:

Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1
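
One caveat worth checking before relying on these two queries: as far as I can tell, both yield the string 'null' rather than a true JSON null, so the exported line would read {"field1":"null"}. The sketch below uses only Python's standard json module (no BigQuery call) to show how a consumer would see the difference. The sample lines and the is_real_null helper are illustrative assumptions, not captured export output.

```python
import json

# Illustrative lines, not real export output:
string_null = '{"field1": "null"}'  # a STRING column holding the text 'null'
json_null = '{"field1": null}'      # the result the question asks for


def is_real_null(line, field):
    """Return True only if the field is present and holds a JSON null."""
    record = json.loads(line)
    return field in record and record[field] is None


print(is_real_null(string_null, "field1"))  # False
print(is_real_null(json_null, "field1"))    # True
print(is_real_null("{}", "field1"))         # False
```

If downstream code needs to tell "missing" apart from "null", the string form may not be enough.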

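The EXPORT DATA documentation does not mention any option for including null values in the output, so I think you can go to the feature request page and file a request for it. Similar behavior has been observed in other projects, where it is also not yet supported; see the linked details.

There are many workarounds. Here are two options:

Option 1: Query directly from Python with the BigQuery client library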

from google.cloud import bigquery
import json

client = bigquery.Client()

query = "SELECT NULL AS field1, NULL AS field2"
query_job = client.query(query)

# Write one JSON object per row (newline-delimited JSON).
# json.dumps serializes Python None as JSON null, so null fields are kept.
with open('test.json', 'w') as file:
    for row in query_job:
        file.write(json.dumps(dict(row.items())) + '\n')
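
If the query returns types that json.dumps cannot encode by default (dates, timestamps, decimals), the script above would raise a TypeError. Here is a minimal sketch of one way around that, assuming it is acceptable to write such values as strings. rows_to_ndjson is a hypothetical helper, and plain dicts stand in for BigQuery rows.

```python
import datetime
import json


def rows_to_ndjson(rows):
    """Serialize an iterable of dict-like rows to newline-delimited JSON.

    None becomes JSON null; values that json cannot encode (e.g. dates)
    fall back to str().
    """
    return "".join(json.dumps(dict(row), default=str) + "\n" for row in rows)


# Sample data standing in for query results (an assumption for illustration).
sample_rows = [
    {"field1": None, "field2": None},
    {"field1": datetime.date(2021, 1, 1), "field2": None},
]
print(rows_to_ndjson(sample_rows), end="")
```

With the client library, each row could be passed in as dict(row.items()).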

Option 2: Use an Apache Beam pipeline on Dataflow with Python and BigQuery to produce the desired output

import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def add_null_field(row, field):
    """Return a copy of the row in which `field` is present (None if missing).

    Passing field='skip' returns the row unchanged.
    """
    row = dict(row)  # copy: Beam transforms should not mutate their inputs
    if field != 'skip':
        row[field] = row.get(field, None)
    return row


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         # ReadFromBigQuery replaces the deprecated BigQuerySource.
         | beam.io.ReadFromBigQuery(
             query='SELECT NULL AS field1, NULL AS field2',
             use_standard_sql=True)
         # Pass a column name instead of 'skip' to force that key into each row.
         | beam.Map(add_null_field, field='skip')
         # json.dumps writes Python None as JSON null.
         | beam.Map(json.dumps)
         | beam.io.WriteToText(known_args.output, file_name_suffix='.json'))


if __name__ == '__main__':
    run()

To run it (the command assumes the script is saved as export.py):

python -m export --output gs://my_bucket_id/output/ \
                 --runner DataflowRunner \
                 --project my_project_id \
                 --region my_region \
                 --temp_location gs://my_bucket_id/tmp/

Note: Replace my_project_id, my_bucket_id, and my_region with the appropriate values. Look in your Cloud Storage bucket for the output file.
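
The add_null_field step can be tried locally on plain dictionaries, without Beam or BigQuery. The sketch below is a multi-field variant of the same idea: make sure every expected key is present before the row is serialized. fill_missing_fields and the sample rows are hypothetical and exist only for illustration.

```python
import json


def fill_missing_fields(row, fields):
    """Return a copy of row where every name in fields is present (None if absent)."""
    filled = dict(row)
    for field in fields:
        filled.setdefault(field, None)
    return filled


# Sample rows: one with no keys at all, one with a single non-null value.
rows = [{}, {"field1": 1}]
for row in rows:
    print(json.dumps(fill_missing_fields(row, ["field1", "field2"])))
```

Existing values are left untouched; only absent keys are filled with None, which json.dumps writes as null.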

Both options produce the output you are looking for:

{"field1": null, "field2": null}

Please let me know whether this helps and gives you the result you want.
