google-bigquery - Pubsub 到 BQ Dataflow 模板未解析 RECORD 类型数据
问题描述
我正在使用此 Dataflow 模板“Pub/Sub Topic to BigQuery”来解析具有 RECORD 类型数据结构的 json 架构。示例示例:
{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body":
{
"session_duration":"61",
"app_key":"eyrttyuyyu78jkjk",
"device_id":"h1bh41yptik1vtwr8",
"sdk_name":"javascript_native_web",
"sdk_version":"18.04",
"timestamp":"1597057884636",
"hour":"10",
"dow":"1"
},
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}
BigQuery 中定义的架构如下:
[
{
"name":"url",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"body",
"type":"RECORD",
"mode":"REPEATED",
"fields":[
{
"name":"session_duration",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"app_key",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"device_id",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"sdk_name",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"sdk_version",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"timestamp",
"type":"TIMESTAMP",
"mode":"NULLABLE"
},
{
"name":"hour",
"type":"TIME",
"mode":"NULLABLE"
},
{
"name":"dow",
"type":"STRING",
"mode":"NULLABLE"
}
]
},
{
"name":"app_key",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"timestamp",
"type":"STRING",
"mode":"NULLABLE"
},
{
"name":"ip_address",
"type":"STRING",
"mode":"NULLABLE"
}
]
错误信息:
{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}
如果我解析没有 RECORD 类型的数据,它会在适当的 bigquery 表中正确解析,但使用 RECORD 类型,它会被摄取到 bq 生成的 <error_records> 表中。
解决方案
通过应用一些修改,我成功地使用 DataflowPub/Sub to Bigquery
模板将您的示例插入到 BigQuery 中:
- 我通过将重复字段放在方括号内将其包含在数组中
[...]
- 该
body.timestamp
值无效。您可以在此处TIMESTAMP
阅读 BigQuery数据类型和 UNIX 时间戳之间的区别。您可以选择如何处理此问题,具体取决于您要对此时间戳执行的操作。如果您不需要它进行分析,您可以轻松地将字段的数据类型更改为INT64
或STRING
像您timestamp
对表的列所做的那样。
所以消息应该是这样的:
{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body": [
{
"session_duration":"61",
"app_key":"eyrttyuyyu78jkjk",
"device_id":"h1bh41yptik1vtwr8",
"sdk_name":"javascript_native_web",
"sdk_version":"18.04",
"timestamp":"1597057884636",
"hour":"10",
"dow":"1"
}],
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}
以及body.timestamp
字段的数据类型已更改的架构,如下所示:
[
...
,
{
"name":"body",
"type":"RECORD",
"mode":"REPEATED",
"fields":[
...
,
{
"name":"timestamp",
"type":"STRING",
"mode":"NULLABLE"
},
...
]
},
...
]
推荐阅读
- sapui5 - includeScript 函数包含的外部 JS 库不可访问
- javascript - 如何在 react-apollo-hook 的 useMutation 中包含 ID 和变量
- ios - 如何使“自由形式”的 UIViewController 适合多种屏幕尺寸?
- thymeleaf - 如何百里香th:每个数字变量输入图像src?
- python - 如果满足条件,如何仅在列表中调用字典值?
- shared-memory - shm_open 函数的正确 Java 映射是什么?
- ember.js - 渲染表单时出现 ember-bootstrap 错误
- r - 如何修复“HTTP状态为'409 Conflict'文件中的错误(con,“r”)”
- python - Selenium PhantomJS 和 Chrome 驱动程序在特定 URL 上返回空白页
- python - 由于张量未初始化而导致的错误