amazon-web-services - Performance issue with AWS EMR S3DistCp
Problem description
I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4GB in total) from one path in an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600MB transferred after more than 20 minutes).
Here is my EMR configuration:
1 master node, m5.xlarge
3 core nodes, m5.xlarge
release label 5.29.0
The command:
s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
Am I missing something? I have read that S3DistCp can transfer a lot of files in the blink of an eye, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.
Thank you.
Solution
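Here are some recommendations:
- Use R-type instances. They provide more memory than M-type instances.
- Use coalesce to merge the files at the source, since you have many small files.
- Check the number of mapper tasks. The more tasks there are, the worse the performance.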
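
It may also help to check how many groups the --groupBy pattern in the question actually produces. The snippet below is only a rough sketch, not S3DistCp's own code: it assumes the group name is built by joining the regex capture groups (which is how the EMR documentation describes --groupBy), and the object names are hypothetical, since the question does not list any. The helper names group_key and group_uris are made up for the illustration.

```python
import re
from collections import defaultdict

# Pattern taken from the --groupBy option in the question.
GROUP_BY = r".*input/(entry).*(.json.gz)"


def group_key(uri, pattern=GROUP_BY):
    """Join the capture groups of a full match; return None if the URI does not match.

    Assumption: S3DistCp names a group after the concatenated capture groups.
    """
    match = re.fullmatch(pattern, uri)
    if match is None:
        return None
    return "".join(match.groups())


def group_uris(uris, pattern=GROUP_BY):
    """Bucket the matching URIs by their group key."""
    groups = defaultdict(list)
    for uri in uris:
        key = group_key(uri, pattern)
        if key is not None:
            groups[key].append(uri)
    return dict(groups)


# Hypothetical object names, for illustration only.
uris = [
    "s3://my-bucket/input/entry-0001.json.gz",
    "s3://my-bucket/input/entry-0002.json.gz",
    "s3://my-bucket/input/other-0003.json.gz",
]

groups = group_uris(uris)
for key, members in groups.items():
    print(key, len(members))  # prints: entry.json.gz 2
```

Under these assumptions every matching file lands in the same group, entry.json.gz, because both capture groups match constant text. If a single task handles each group (an assumption about S3DistCp's internals worth verifying in the step logs), extra nodes would not speed up the concatenation. Capturing a varying part of the name, such as a date or prefix, would yield several groups instead.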