Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

Problem Description

I am trying to simulate updates and deletes on a Hudi dataset and want to see the resulting state reflected in an Athena table. We use the EMR, S3, and Athena services of AWS.
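For context, the hudi_options dict is referenced throughout the snippets below but never shown in the question; a typical configuration for this kind of table might look like the following sketch. The field names (withdrawalID, eventTime, partitionDate), database, and table name are assumptions, not values from the original post.

hudi_options = {
    # Assumed values; only withdrawalID as the record key is implied by the question
    'hoodie.table.name': 'withdrawals',
    'hoodie.datasource.write.recordkey.field': 'withdrawalID',
    'hoodie.datasource.write.precombine.field': 'eventTime',
    'hoodie.datasource.write.partitionpath.field': 'partitionDate',
    'hoodie.datasource.write.operation': 'upsert',
    # Hive sync settings so the table is registered in the Glue Catalog for Athena
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'withdrawals',
    'hoodie.datasource.hive_sync.partition_fields': 'partitionDate',
}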

  1. Attempting Record Update with a withdrawal object
from pyspark.sql.functions import col, lit

withdrawalID_mutate = 10382495

# Select the record to update and change one column value
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))

# Upsert the changed record back into the Hudi table
updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)

# Read the table back through Spark to verify the update
spark.read \
    .format("hudi") \
    .load(tablePath) \
    .filter(col("withdrawalID") == withdrawalID_mutate) \
    .show()

This shows the updated record in Spark, but in the Athena table it actually appears as an appended (duplicate) row. Probably something to do with the Glue Catalog?
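As a sanity check not in the original post, one way to tell whether the duplicate rows exist in the Hudi dataset itself or only in the Athena/Glue view is to count rows per record key through a Spark snapshot read. If Spark sees one row per withdrawalID but Athena shows two, the problem is on the catalog side (for example, the table was crawled as plain Parquet instead of being registered through Hudi's hive sync).

from pyspark.sql.functions import col, count

# Snapshot read through Spark; 'snapshot' is the default query type, set here only to be explicit
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "snapshot") \
    .load(tablePath) \
    .groupBy("withdrawalID") \
    .agg(count("*").alias("copies")) \
    .filter(col("copies") > 1) \
    .show()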

  2. Attempting Record Delete
deleteDF = updateDF  # delete the record that was updated above

# Upsert with an empty record payload, which should issue a delete for the matching key
deleteDF.write.format("hudi") \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)

The Athena table still shows the record that should have been deleted.
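For reference, a sketch of an alternative not tried in the original post: Hudi also exposes a dedicated delete write operation, which avoids the payload-class route. The single .option() call is placed after .options(**hudi_options) so the operation value is not overridden by whatever is already in hudi_options.

# Sketch: explicit Hudi delete operation for the same records
# (reuses deleteDF and hudi_options from above; only the record keys matter for the delete)
deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'delete') \
    .mode("append") \
    .save(tablePath)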

I also tried mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest one.
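A related sketch, assuming a Hudi version that supports it: instead of Spark's mode("overwrite"), which replaces the whole base path, Hudi's insert_overwrite write operation overwrites only the partitions touched by the incoming records while keeping mode("append").

# Sketch: partition-level overwrite via Hudi's insert_overwrite operation
# (available in newer Hudi releases; reuses updateDF and hudi_options from above)
updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'insert_overwrite') \
    .mode("append") \
    .save(tablePath)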

Has anyone faced the same issue and can point me in the right direction?

Tags: apache-spark, spark-streaming, amazon-emr, apache-hudi, iceberg

Solution

