apache-spark - Issue with Apache Hudi Update and Delete Operation on Parquet S3 File
Problem description
I am trying to simulate updates and deletes on a Hudi dataset and want to see the resulting state reflected in an Athena table. We use the AWS services EMR, S3 and Athena.
- Attempting Record Update with a withdrawal object
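from pyspark.sql.functions import col, lit

withdrawalID_mutate = 10382495
# Take the existing record and change one field to simulate an update
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))
updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)
# Read the table back through Hudi; show() returns None, so call it separately
hudiDF = spark.read \
    .format("hudi") \
    .load(tablePath) \
    .filter(col("withdrawalID") == withdrawalID_mutate)
hudiDF.show()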
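Reading the table back through Hudi shows the updated record, but in the Athena table the record is appended as an additional row instead of replacing the old one. Could this have something to do with the Glue Data Catalog?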
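The contents of `hudi_options` are not shown above. One possible cause of the extra row, not confirmed here, is that the Athena table was registered as a plain Parquet table (for example by a Glue crawler), so Athena reads every Parquet file version under the path rather than only the latest Hudi file slices. A minimal sketch of options that turn on Hudi's Hive sync, so that Hudi registers the table in the catalog itself, might look like the following. The helper name `build_hudi_options`, the table and database names, and the `updatedAt` and `partitionDate` fields are assumptions for illustration; only `withdrawalID` appears in the question.

```python
def build_hudi_options(table_name, database, record_key, precombine_field, partition_field):
    """Return Hudi write options with Hive sync enabled (illustrative sketch)."""
    return {
        "hoodie.table.name": table_name,
        # Key that identifies a record; upserts match on this field
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two records share a key, the one with the larger value here wins
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        # Hive sync settings: let Hudi register and update the catalog table
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    }


# Hypothetical values; replace with the real table, database and column names
hudi_options = build_hudi_options(
    table_name="withdrawals",
    database="default",
    record_key="withdrawalID",
    precombine_field="updatedAt",
    partition_field="partitionDate",
)
```

With options along these lines, the write call itself stays unchanged. The point to check is that Athena queries the table Hudi synced, not a separate table defined over the same S3 path.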
- Attempting Record Delete
deleteDF = updateDF  # delete the record that was updated above
# The delete-specific options are set after hudi_options so that they
# take precedence if hudi_options already defines the same keys
deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .mode("append") \
    .save(tablePath)
The Athena table still shows the deleted record.
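Besides the `EmptyHoodieRecordPayload` approach, Hudi also documents `delete` as a value for `hoodie.datasource.write.operation`. Whether the Hudi version on a given EMR release supports it is an assumption to verify. A small sketch of deriving delete options from the base options follows; `with_delete_operation` is a hypothetical helper and the sample base options are placeholders. The override is applied to a copy, so the base options keep their original operation.

```python
def with_delete_operation(base_options):
    """Return a copy of base_options with the write operation set to 'delete'."""
    delete_options = dict(base_options)  # copy so the caller's dict is untouched
    delete_options["hoodie.datasource.write.operation"] = "delete"
    return delete_options


# Placeholder base options; in practice this would be the full hudi_options dict
base_options = {
    "hoodie.table.name": "withdrawals",
    "hoodie.datasource.write.operation": "upsert",
}
delete_options = with_delete_operation(base_options)

# Intended use on a cluster (not executed here, needs a Spark session with Hudi):
# deleteDF.write.format("hudi").options(**delete_options).mode("append").save(tablePath)
```

If the record is gone when read through Hudi but still visible in Athena, that again points at how the Athena table is defined, not at the write.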
I also tried mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest one.
Has anyone faced the same issue and can point me in the right direction?
Solution