apache-spark - Aggregating scattered files in Spark
Problem Description
I have a job that ingests data into S3 on a daily basis, partitioned by a specific field, e.g.:
...
result_df.write.partitionBy("my_field").parquet("s3://my/location/")
This ingestion process writes into already existing partitions every day, adding files that contain only one or a few records. I want to emphasize that this happens every day: over time it will generate the many small files that everybody hates. You would probably tell me this is not the best field for partitioning, but it is the field the business needs.
So I was thinking of running another job on a daily basis that detects partitions containing too many files and coalesces them. Unfortunately I can't think of an efficient way to coalesce these files with Spark. The only solution that came to my mind is (sketched in code after this list):
- read the partition with too many small files
- repartition the data and write the result to a staging folder
- delete the source partition
- move the data generated in step 2 back to the original partition
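A minimal sketch of what I mean, assuming the Hadoop FileSystem API is reachable through the SparkContext's JVM gateway; the threshold `MAX_FILES`, the staging location, and the base path layout are all made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

BASE = "s3://my/location"    # same base path the ingestion job writes to
STAGING = "s3://my/staging"  # hypothetical staging area used in step 2
MAX_FILES = 16               # hypothetical "too many files" threshold

# Reach the Hadoop FileSystem API through the JVM gateway to count files per partition.
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
base_path = jvm.org.apache.hadoop.fs.Path(BASE)
fs = base_path.getFileSystem(conf)

for part_status in fs.listStatus(base_path):
    if not part_status.isDirectory():
        continue
    part_path = part_status.getPath()  # e.g. .../my_field=some_value
    n_files = sum(1 for f in fs.listStatus(part_path) if f.isFile())
    if n_files <= MAX_FILES:
        continue

    part_name = part_path.getName()    # "my_field=some_value"
    staging_path = f"{STAGING}/{part_name}"

    # Steps 1-2: read the crowded partition and rewrite it as fewer files in staging.
    (spark.read.parquet(part_path.toString())
          .coalesce(1)
          .write.mode("overwrite")
          .parquet(staging_path))

    # Step 3: delete the original partition directory.
    fs.delete(part_path, True)

    # Step 4: move the compacted data back to the original partition path.
    fs.rename(jvm.org.apache.hadoop.fs.Path(staging_path), part_path)
```

Note that on S3 the final "rename" is effectively a copy, so step 4 is itself another full data move, which is exactly the inefficiency I'm worried about.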
I really don't like the idea of moving the data so many times, and it strikes me as inefficient. Ideally I would group all the files in a partition into a smaller number of files in place, but with Spark that doesn't look feasible to me.
Are there any best practices regarding this use case? Or any improvement to the suggested process?
Solution