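scala - How does changing the read in AWS Glue change a column's data type?
Question
I have an AWS Glue job that I modified slightly, changing only the read. The job runs fine, but the data types of my columns have changed: where I previously had BigInt, I now only have Int. This causes the EMR jobs that depend on these files to fail because of the schema mismatch. I'm not sure what is causing this, since the mapping did not change. In case anyone has insight, here are the old and new code: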
///OLD
val inputsourceDF = spark.read.format("json").load(inputFilePath)
val inputsource = DynamicFrame(inputsourceDF, glueContext)
///NEW
val inputsource = glueContext.getSourceWithFormat(connectionType = "s3", options = JsonOptions(Map("paths" -> Set(inputFilePath))), format = "json", transformationContext = "inputsource").getDynamicFrame()
///WRITE which did not change
val inputsink = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(s"""{"path": "$inputOutputFilePath"}"""), transformationContext = "inputdatasink", format = "parquet").writeDynamicFrame(inputdropnullfields.coalesce(inputPartitionCount))
These are the tables created when the files are crawled after the Glue job:
CREATE EXTERNAL TABLE `input_new`(`id` int)
CREATE EXTERNAL TABLE `input_old`(`id` bigint)
We made this change so that we could use job bookmarks. Any help would be appreciated.
Answer
Both Spark DataFrame and Glue DynamicFrame infer the schema when reading data from JSON, but evidently they do it differently: Spark treats all integer values as bigint, while Glue tries to be clever and (I guess) looks at the actual range of values on the fly.
More information about DynamicFrame schema inference can be found in the AWS Glue documentation.
If you are going to write Parquet in the end anyway and want the schema to be stable and consistent, I'd say the easiest way around this is to revert your change and go back to the Spark DataFrame.
You can also use applyMapping (apply_mapping in the Python API) to change the types explicitly after reading the data, but that seems to defeat the purpose of having the dynamic frame in the first place.
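If the bookmark-enabled read has to stay, the explicit mapping mentioned above could look roughly like the sketch below. It assumes Glue inferred id as int and that id is the only column that needs to survive; the helper name widenIdToLong and the transformation context "inputcast" are made up for illustration. Note that applyMapping keeps only the columns listed in the mapping, so every other column you want in the output needs its own entry.

```scala
import com.amazonaws.services.glue.DynamicFrame

object IdTyping {
  // Hypothetical helper (not part of the original job): widen `id` from int to long,
  // which the crawler reports as bigint.
  // Assumption: the source type really is "int"; adjust it if Glue inferred something else.
  // applyMapping drops every column that is not listed, so add one tuple per column to keep.
  def widenIdToLong(frame: DynamicFrame): DynamicFrame =
    frame.applyMapping(
      mappings = Seq(("id", "int", "id", "long")),
      caseSensitive = true,
      transformationContext = "inputcast" // illustrative name
    )
}

// Usage inside the job, right after the bookmark-enabled read from the question:
//   val inputsourceTyped = IdTyping.widenIdToLong(inputsource)
```

The rest of the job would then operate on inputsourceTyped instead of inputsource, so that the Parquet files are written with id as a 64-bit integer and the crawled table matches the old bigint schema.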