How does changing the read in AWS Glue change the data type of my columns?

Question

I have a slightly modified AWS Glue job in which only the read was changed. The job runs fine, but the data types of my columns have changed: where I previously had BigInt, I now get only Int. This causes an EMR job that depends on these files to fail with a schema mismatch. I'm not sure what is causing the issue, since the mapping hasn't changed. If anyone has insight, here is the old and the new code:

/// OLD read: Spark DataFrame wrapped in a DynamicFrame
val inputsourceDF = spark.read.format("json").load(inputFilePath)
val inputsource = DynamicFrame(inputsourceDF, glueContext)

/// NEW read: Glue-native source with a transformation context (for bookmarks)
val inputsource = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("paths" -> Set(inputFilePath))),
  format = "json",
  transformationContext = "inputsource"
).getDynamicFrame()

/// WRITE, which did not change
val inputsink = glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(s"""{"path": "$inputOutputFilePath"}"""),
  transformationContext = "inputdatasink",
  format = "parquet"
).writeDynamicFrame(inputdropnullfields.coalesce(inputPartitionCount))

These are the tables created when crawling the files after the Glue job:

CREATE EXTERNAL TABLE `input_new`(`id` int)

CREATE EXTERNAL TABLE `input_old`(`id` bigint)

We added this change so that we could use job bookmarks. Any help would be appreciated.

Tags: scala, aws-glue, aws-glue-spark

Solution


Both Spark's DataFrame and Glue's DynamicFrame infer the schema when reading JSON, but evidently they do it differently: Spark treats all integral values as bigint, while Glue tries to be clever and (I would guess) picks the narrowest type that fits the actual range of values it sees on the fly.
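
As a quick way to see the Spark half of this, here is a minimal sketch (the path is a placeholder, not from the original job) that prints the schema Spark infers for JSON input:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-check").getOrCreate()

// Spark's JSON schema inference maps every integral JSON number to LongType
// (bigint), no matter how small the actual values are.
val df = spark.read.format("json").load("s3://my-bucket/input/") // hypothetical path
df.printSchema()
// root
//  |-- id: long (nullable = true)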

Some more info about DynamicFrame schema inference can be found in the AWS Glue documentation.

If you are going to write Parquet in the end anyway and want the schema to stay stable and consistent, your easiest way around this is to revert the change and go back to the Spark DataFrame read. You can also use applyMapping (apply_mapping in Python) to set the types explicitly after reading the data, as in the sketch below, but that somewhat defeats the purpose of having a dynamic frame in the first place.
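
For completeness, a minimal sketch of that second option, keeping your bookmark-enabled read and pinning the type afterwards. The source type "int" assumes that is what Glue inferred, and the transformation context name for the mapping step is made up:

// Bookmark-enabled read, unchanged from the new version of the job.
val inputsource = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("paths" -> Set(inputFilePath))),
  format = "json",
  transformationContext = "inputsource"
).getDynamicFrame()

// applyMapping takes (sourceName, sourceType, targetName, targetType) tuples;
// "long" here surfaces as bigint in the crawled table.
val inputTyped = inputsource.applyMapping(
  mappings = Seq(("id", "int", "id", "long")),
  transformationContext = "inputTyped" // hypothetical name
)

Note that applyMapping drops any column not listed in the mappings, so in a real job you would enumerate every field you want to keep.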

