Aggregating scattered files in Spark

Problem description

I have a job that ingests data into S3 on a daily basis, partitioning by a specific field, e.g.:

...
result_df.write.partitionBy("my_field").parquet("s3://my/location/")
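
For this write to land in already existing partitions day after day, it presumably runs in append mode. A self-contained version might look like the sketch below; the SparkSession setup, the stand-in DataFrame and mode("append") are assumptions filled in for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_ingest").getOrCreate()

# Hypothetical stand-in for today's batch; the real job builds result_df upstream.
result_df = spark.createDataFrame([("A", 1), ("B", 2)], ["my_field", "value"])

# Appending leaves existing files untouched and adds new ones: each run writes
# at least one new file per partition value it touches, however few records it carries.
(result_df.write
    .mode("append")
    .partitionBy("my_field")
    .parquet("s3://my/location/"))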

This ingestion process writes to already existing partitions every day, adding files that contain one or just a few records. I want to emphasize that this happens every day: over time, it is going to generate many small files, which everybody hates. You would probably tell me that this is not the best field for partitioning, but it is the field the business needs.
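
To quantify the problem, a quick audit like the sketch below counts the objects under each partition prefix; boto3 and the bucket/prefix split are assumptions based on the example path above, not part of the actual job.

import boto3

# Hypothetical split of s3://my/location/ into bucket and prefix.
BUCKET, PREFIX = "my", "location/"

s3 = boto3.client("s3")
stats = {}  # partition directory -> (file count, total bytes)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        partition = [p for p in obj["Key"].split("/") if p.startswith("my_field=")]
        if not partition:
            continue  # e.g. _SUCCESS markers at the dataset root
        count, size = stats.get(partition[0], (0, 0))
        stats[partition[0]] = (count + 1, size + obj["Size"])

# Partitions with many tiny files are the candidates for compaction.
for partition, (count, size) in sorted(stats.items()):
    print(f"{partition}: {count} files, {size / (1024 * 1024):.1f} MB")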

So I was thinking of running another job, on a daily basis, that reviews partitions containing too many files and coalesces them. Unfortunately, I can't think of an efficient way to coalesce these files with Spark. The only solution that comes to mind is the following (a rough sketch is given after the list):

  1. reading the partition with too many small files
  2. repartitioning and writing the result to a support folder
  3. deleting the source partition
  4. moving the data generated in step 2 back to the original partition
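
A minimal sketch of those four steps for a single partition might look like this. It assumes PySpark plus boto3 for the S3 delete and copy, and the names (compact_partition, the support prefix, num_files) are hypothetical; it only illustrates the process described above.

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_partitions").getOrCreate()
s3 = boto3.client("s3")

BUCKET = "my"                  # hypothetical, from s3://my/location/
SOURCE = "location/"           # original dataset prefix
SUPPORT = "location_support/"  # temporary "support" prefix

def delete_prefix(prefix):
    """Remove every object under the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": keys})

def compact_partition(value, num_files=1):
    src = f"{SOURCE}my_field={value}/"
    tmp = f"{SUPPORT}my_field={value}/"

    # 1. read the partition with too many small files
    df = spark.read.parquet(f"s3://{BUCKET}/{src}")

    # 2. repartition and write the result to the support folder
    df.repartition(num_files).write.mode("overwrite").parquet(f"s3://{BUCKET}/{tmp}")

    # 3. delete the source partition
    delete_prefix(src)

    # 4. move the compacted files back to the original partition
    #    (copy_object is limited to 5 GB per object; larger files need a multipart copy)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=tmp):
        for obj in page.get("Contents", []):
            dest = src + obj["Key"][len(tmp):]
            s3.copy_object(Bucket=BUCKET, Key=dest,
                           CopySource={"Bucket": BUCKET, "Key": obj["Key"]})
    delete_prefix(tmp)

Running compact_partition for every partition flagged by the audit above would be the daily clean-up job; steps 3 and 4 are the extra data movement in question.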

I really don't like the idea of moving data around so many times, and I find it inefficient. Ideally I would group all the files in the same partition into a smaller number of files, but with Spark that doesn't look feasible to me.

Are there any best practices for this use case, or any improvements to the process suggested above?

Tags: apache-spark, amazon-s3, partitioning

Solution

