Performance issue with AWS EMR S3DistCp

Problem description

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node, m5.xlarge
3 core nodes, m5.xlarge
release label 5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128

Am I missing something? I have read that S3DistCp can transfer a lot of files in the blink of an eye, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.

Tags: amazon-web-services, performance, hadoop, amazon-emr, s3distcp

Solution

Here are some recommendations:

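  1. Use R-type instances. Compared with M-type instances, they provide more memory.
  2. Use coalesce to merge the files at the source, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the worse the performance.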
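
To make the many-small-files point concrete, here is a rough Python sketch (not part of S3DistCp) that previews how the --groupBy pattern from the question would bucket object keys, and works out the average file size from the numbers given. It assumes S3DistCp matches the pattern against the whole path and names each group by concatenating the capture groups, which is my reading of the EMR documentation. The sample keys are hypothetical, and Python's `re` only approximates the Java regex engine S3DistCp actually uses.

```python
import re
from collections import Counter

# Pattern copied verbatim from the --groupBy argument in the question.
GROUP_BY = r".*input/(entry).*(.json.gz)"


def group_key(path, pattern=GROUP_BY):
    """Return the concatenated capture groups, or None if the path does not match.

    Assumption: the pattern must match the whole path (full match).
    """
    match = re.fullmatch(pattern, path)
    if match is None:
        return None
    return "".join(g for g in match.groups() if g is not None)


def preview_groups(paths, pattern=GROUP_BY):
    """Count how many input paths fall into each output group."""
    keys = (group_key(p, pattern) for p in paths)
    return Counter(k for k in keys if k is not None)


def average_file_size_kb(total_gb, file_count):
    """Average object size in KB, using decimal units (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / file_count


if __name__ == "__main__":
    # Hypothetical object keys; the real names are not given in the question.
    sample = [
        "s3://my-bucket/input/entry-0001.json.gz",
        "s3://my-bucket/input/entry-0002.json.gz",
        "s3://my-bucket/input/entry/part-0003.json.gz",
        "s3://my-bucket/input/other-0001.json.gz",
    ]
    print(preview_groups(sample))
    print(round(average_file_size_kb(3.4, 200_000)), "KB per file on average")
```

With keys shaped like these, every matching file lands in the same group, because both capture groups are constants. If the real keys behave the same way, the concatenation work may end up concentrated in very few tasks no matter how many nodes the cluster has. Capturing a varying part of the key (a date or prefix, for example) would spread the work across more groups. At roughly 17 KB per file, per-object request overhead is likely to dominate over raw bandwidth.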
