apache-spark - How to handle failure scenario in Spark write to orc file
Question
I have a use case where I push data from MongoDB to HDFS as an ORC file. The job runs at a one-day interval and appends the data to the existing ORC file in HDFS.
My concern is: if the job fails or is stopped while writing to the ORC file, how should I handle that scenario, given that some data may already have been written to the ORC file? I want to avoid duplicates in the ORC file.
Snippet for writing to the ORC file format:
import com.mongodb.spark.sql._
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SaveMode
import sparkSession.implicits._

val df = sparkSession
  .read
  .mongo(ReadConfig(Map("database" -> "dbname", "collection" -> "tableName")))
  .filter($"insertdatetime" >= fromDateTime && $"insertdatetime" <= toDateTime)

df.write
  .mode(SaveMode.Append)
  .format("orc")
  .save("/path_to_orc_file_on_hdfs")
I don't want to checkpoint the complete RDD, as that would be a very expensive operation. I also don't want to create multiple ORC files; the requirement is to maintain a single file only.
Is there any other solution or approach I should try?
Solution
Hi, one of the best approaches is to write the data to one folder per day under HDFS.
That way, if your ORC write job fails, you will be able to clean up that day's folder.
The cleanup should happen on the bash side of your job: if the return code != 0, delete the ORC folder, then retry.
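The same "delete the failed folder, then retry" cleanup can also be done from inside the job. A minimal sketch is below; note that `deleteFolder` here uses `java.nio.file` so it runs locally, whereas on HDFS you would instead call `org.apache.hadoop.fs.FileSystem.delete(path, true)` with the same recursive semantics. The function name and structure are illustrative, not from the original answer.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

// Recursively delete a day's output folder if it exists.
// On HDFS, replace this with FileSystem.delete(new Path(dir), true).
def deleteFolder(dir: Path): Unit =
  if (Files.exists(dir))
    Files.walk(dir)
      .sorted(Comparator.reverseOrder())     // delete children before parents
      .forEach((p: Path) => Files.delete(p))
```

A wrapper can then catch the write failure, call `deleteFolder` on the day's folder, and re-run the write, which gives the same idempotency as the bash-side cleanup.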
Edit: partitioning by the write date will also make later reads of the ORC data with Spark more robust.
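Partitioning by write date can be sketched as follows. The column name `write_date`, the helper `dailyFolder`, and the base path are hypothetical; the Spark calls (`partitionBy`, `SaveMode.Overwrite`, the `partitionOverwriteMode` option) are standard DataFrameWriter API.

```scala
import java.time.LocalDate

// Pure helper: the folder a given day's run writes to, which is also
// the folder you would delete before retrying a failed run.
def dailyFolder(basePath: String, runDate: LocalDate): String =
  s"$basePath/write_date=$runDate"

// With Spark, tagging each row with the run date and partitioning by it
// puts each day's output in its own folder, so a retry only rewrites
// that one partition instead of appending duplicates:
//
//   df.withColumn("write_date", lit(runDate.toString))
//     .write
//     .mode(SaveMode.Overwrite)                     // idempotent per day
//     .option("partitionOverwriteMode", "dynamic")  // only replace today's folder
//     .partitionBy("write_date")
//     .format("orc")
//     .save(basePath)
```

Dynamic partition overwrite (Spark 2.3+) means re-running a failed day replaces only that day's folder, leaving earlier days untouched.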