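amazon-web-services - AWS Glue: Column "column_name" not found in schema
Problem description
I am trying to create an ETL job in AWS Glue. The use case is as follows: when a column is added to one of the source tables after the ETL job has already run, rerunning the job fails with an error saying the column was not found (in the target table).
How can I make the ETL job create that column in the target table? The job already has permission to create the table when it does not exist.
Example:
Source tables: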
Table X: column_1, column_2
Table Y: column_1, column_3, column_4
The ETL job is configured to join the two, producing
Table_XY: column_1, column_2, column_3, column_4
Up to this point, it works perfectly.
Now, suppose Table Y is modified as follows:
Table Y: column_1, column_3, column_4, **column_5**
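I rerun the crawler (to detect the new source column),
then I rerun the ETL job, and it fails with the following error message:
Column "column_5" not found in schema
How can I solve this?
Update with the Glue script: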
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "db_source", table_name = "sourc_table_x", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_source", table_name = "sourc_table_x", transformation_ctx = "datasource0")
## @type: DataSource
## @args: [database = "db_source", table_name = "sourc_table_y", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasource1"]
## @return: datasource1
## @inputs: []
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db_source", table_name = "sourc_table_y", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasource1")
## @type: Join
## @args: [keys1 = ['column_1'], keys2 = ['column_1']]
## @return: join2
## @inputs: [frame1 = datasource0, frame2 = datasource1]
join2 = Join.apply(frame1 = datasource0, frame2 = datasource1, keys1 = ['column_1'], keys2 = ['column_1'], transformation_ctx = "join2")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = join2]
resolvechoice2 = ResolveChoice.apply(frame = join2, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "my-db-connection", connection_options = {"dbtable": "target_table_xy", "database": "db_target"}, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "my-db-connection", connection_options = {"dbtable": "target_table_xy", "database": "db_target"}, transformation_ctx = "datasink4")
job.commit()
Solution
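One possible approach, offered as a sketch rather than a confirmed fix: before the write step, compare the columns of the joined frame with the columns that already exist in the target table, and generate an `ALTER TABLE ... ADD COLUMN` statement for each missing one. The helper name `missing_column_ddl`, the SQL types, and the way the two column lists are obtained are all assumptions made for illustration; the question does not state the column types or the target database engine.

```python
from typing import Dict, List


def missing_column_ddl(
    frame_columns: Dict[str, str],
    target_columns: List[str],
    table: str,
) -> List[str]:
    """Build one ALTER TABLE statement per column that the frame has
    but the target table lacks. Hypothetical helper, not part of AWS Glue.

    Names are interpolated as-is, so pass only trusted identifiers.
    """
    # Compare case-insensitively, since many databases fold identifier case.
    existing = {name.lower() for name in target_columns}
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
        for name, sql_type in frame_columns.items()
        if name.lower() not in existing
    ]


# Columns of the joined frame after column_5 was added to Table Y.
# The SQL types are placeholders; the question does not give them.
joined_columns = {
    "column_1": "VARCHAR(256)",
    "column_2": "VARCHAR(256)",
    "column_3": "VARCHAR(256)",
    "column_4": "VARCHAR(256)",
    "column_5": "VARCHAR(256)",
}

# Columns currently present in target_table_xy.
target_columns = ["column_1", "column_2", "column_3", "column_4"]

statements = missing_column_ddl(joined_columns, target_columns, "target_table_xy")
print(statements)
```

The generated statements would have to run against the target database before `write_dynamic_frame.from_jdbc_conf` is called. If the target is Amazon Redshift (an assumption), they could be joined with semicolons and passed through the `preactions` key of `connection_options`; for other JDBC targets they would need to be executed by a separate database client.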