php - Move millions of files to root of S3
Problem description
I have 20 million local files. Each file is represented by a numeric ID that is hashed.
File 1 is named 356a192b7913b04c54574d18c28d46e6395428ab
(sha1 of "1")
File 2 is named da4b9237bacccdf19c0760cab7aec4a8359010b0
(sha1 of "2")
etc. etc.
Not every number represents a file, but I have a list of all numbers that do.
The files are placed in folders named after the first two characters in the hash, followed by the next two, followed by the next two.
For file 2 (da4b9237bacccdf19c0760cab7aec4a8359010b0) the folder structure is
da/4b/92/
The file is placed in that folder and named with its full hash, so the full path of the file is
da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0
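For reference, the hash-to-path mapping can be reproduced in a couple of lines of shell (assuming `sha1sum` and `awk` are available):

```shell
# Reproduce the scheme: sha1 of the ID, split into 2+2+2 prefix directories
id="2"
hash=$(printf '%s' "$id" | sha1sum | awk '{print $1}')
path="${hash:0:2}/${hash:2:2}/${hash:4:2}/$hash"
echo "$path"   # da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0
```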
I now want to move all the files from the file system to a bucket at Amazon S3, and while doing so I want to move them out to the root of that bucket.
Since there are so many files, it would be good to log which files have been moved and which have failed for some reason; I need to be able to resume the operation if it fails.
My plan is to create a table in MySQL called moved_files and then run a PHP script that fetches X IDs from the files table and uses the AWS SDK for PHP to copy each file to S3; if the copy succeeds, it adds that ID to the moved_files table. However, I'm not sure this is the fastest way to do it; maybe I should look into writing a bash script using the AWS CLI.
Any suggestions would be appreciated!
Solution
I don't use AWS S3, but a little googling suggests you need a command like:
aws s3 cp test.txt s3://mybucket/test2.txt
So, if you want to run that for all your files, I would suggest using GNU Parallel to make full use of your connection and reduce latency.
Please create a test directory containing a few dozen files, then cd into it and try this command:
find . -type f -print0 | parallel -0 --dry-run aws s3 cp {} s3://rootbucket/{/}
Sample output
aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b0
aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b1 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b1
If you have 8 CPU cores, 8 parallel copies of aws will run at a time until all your files have been copied.
{} expands to "the current file" and {/} expands to "the current file without the directory". You can also add --bar to get a progress bar.
If that looks promising, we can add a small bash function for each file that updates the database or deletes the local file, conditional on the aws command succeeding. It looks like this - start reading from the bottom ;-)
#!/bin/bash

# bash function to upload a single file
upload() {
    local="$1"    # Local path, e.g. da/4b/92/da4b92...
    remote="$2"   # Target key: the bare filename, for the bucket root
    # Remove "echo" to actually perform the upload
    if echo aws s3 cp "$local" "s3://rootbucket/$remote"; then
        : # Delete locally or update database with success
    else
        : # Log error somewhere
    fi
}

# Export the upload function so processes started by GNU Parallel can find it
export -f upload

# Run GNU Parallel on all files
find . -type f -print0 | parallel -0 upload {} {/}