首页 > 解决方案 > Pubsub 到 BQ Dataflow 模板未解析 RECORD 类型数据

问题描述

我正在使用此 Dataflow 模板“Pub/Sub Topic to BigQuery”来解析具有 RECORD 类型数据结构的 json 架构。示例示例:

{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body":
    {
    "session_duration":"61",
    "app_key":"eyrttyuyyu78jkjk",
    "device_id":"h1bh41yptik1vtwr8",
    "sdk_name":"javascript_native_web",
    "sdk_version":"18.04",
    "timestamp":"1597057884636",
    "hour":"10",
    "dow":"1"
    },
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}

BigQuery 中定义的架构如下:

[

   {
      "name":"url",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"body",
      "type":"RECORD",
      "mode":"REPEATED",
      "fields":[
         {
            "name":"session_duration",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"app_key",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"device_id",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         {
            "name":"sdk_name",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"sdk_version",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         {
            "name":"timestamp",
            "type":"TIMESTAMP",
            "mode":"NULLABLE"
         }, 
         {
            "name":"hour",
            "type":"TIME",
            "mode":"NULLABLE"
         },      
         {
            "name":"dow",
            "type":"STRING",
            "mode":"NULLABLE"
         }

      ]
   },
   {
      "name":"app_key",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"timestamp",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"ip_address",
      "type":"STRING",
      "mode":"NULLABLE"
   }
]

错误信息:

{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}

如果我解析没有 RECORD 类型的数据,它会在适当的 bigquery 表中正确解析,但使用 RECORD 类型,它会被摄取到 bq 生成的 <error_records> 表中。

标签: google-bigquerygoogle-cloud-dataflowgoogle-cloud-pubsub

解决方案


通过应用一些修改,我成功地使用 DataflowPub/Sub to Bigquery模板将您的示例插入到 BigQuery 中:

  • 我通过将重复字段放在方括号内将其包含在数组中[...]
  • body.timestamp值无效。您可以在此处TIMESTAMP阅读 BigQuery数据类型和 UNIX 时间戳之间的区别。您可以选择如何处理此问题,具体取决于您要对此时间戳执行的操作。如果您不需要它进行分析,您可以轻松地将字段的数据类型更改为INT64STRING像您timestamp对表的列所做的那样。

所以消息应该是这样的:

{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body": [
    {
    "session_duration":"61",
    "app_key":"eyrttyuyyu78jkjk",
    "device_id":"h1bh41yptik1vtwr8",
    "sdk_name":"javascript_native_web",
    "sdk_version":"18.04",
    "timestamp":"1597057884636",
    "hour":"10",
    "dow":"1"
    }],
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}

以及body.timestamp字段的数据类型已更改的架构,如下所示:

[

   ...
   
   ,
   {
      "name":"body",
      "type":"RECORD",
      "mode":"REPEATED",
      "fields":[
         
         ...

         , 
         {
            "name":"timestamp",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         
         ...
         
      ]
   },
   
   ...
   
]

推荐阅读