Move millions of files to root of S3

Problem description

I have 20 million local files. Each file is represented by a numeric ID that is hashed.

File 1 is named 356a192b7913b04c54574d18c28d46e6395428ab (sha1 of "1")

File 2 is named da4b9237bacccdf19c0760cab7aec4a8359010b0 (sha1 of "2")

etc. etc.

Not every number represents a file, but I have a list of all the numbers that do.

The files are placed in folders named after the first two characters in the hash, followed by the next two, followed by the next two.

For file 2 (da4b9237bacccdf19c0760cab7aec4a8359010b0) the folder structure is

da/4b/92/

The file is placed in that folder and named after its full hash, so the full path of the file is

da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0
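
For illustration, here is a minimal bash sketch of how that path can be derived from a numeric ID, assuming GNU coreutils' sha1sum is available (the ID 2 and the variable names are just examples):

id=2
hash=$(printf '%s' "$id" | sha1sum | cut -d' ' -f1)   # sha1 of the ID, no trailing newline hashed
path="${hash:0:2}/${hash:2:2}/${hash:4:2}/$hash"      # first three character pairs become the folders
echo "$path"                                          # da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0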

I now want to move all the files from the file system to a bucket at Amazon S3, and while doing so I want to move them out to the root of that bucket.

As there are so many files, it would be good to log which files have been moved and which might have failed for some reason, so that I can resume the operation if it fails.

My plan is to create a MySQL table called moved_files and then run a PHP script that fetches X IDs from the files table, uses the AWS SDK for PHP to copy each file to S3, and, if the copy succeeds, adds that ID to the moved_files table. However, I'm not sure this would be the fastest way to do it; maybe I should look into writing a bash script that uses the AWS CLI.

Any suggestions would be appreciated!

Tags: php, bash, amazon-web-services, amazon-s3, file-io

Solution


I don't use AWS S3 myself, but a little googling suggests you need a command like:

aws s3 cp test.txt s3://mybucket/test2.txt

So, if you want to run that for all of your files, I would suggest GNU Parallel to make the most of your connection and reduce latency.

Please make a test directory containing a few dozen files, cd into it, and try this command:

find . -type f -print0 | parallel -0 --dry-run aws s3 cp {} s3://rootbucket/{/}

Sample output

aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b0 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b0
aws s3 cp ./da/4b/92/da4b9237bacccdf19c0760cab7aec4a8359010b1 s3://rootbucket/da4b9237bacccdf19c0760cab7aec4a8359010b1

If you have 8 CPU cores, 8 parallel aws copies will run at a time until all of your files have been copied.

{} expands to "the current file" and {/} expands to "the current file without its directory". You can also add --bar to get a progress bar.
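
As a hedged illustration of those options (rootbucket, the job count and the log file name are placeholders), a real run with an explicit number of jobs, a progress bar and a job log might look like the line below; the --joblog file records each command's exit value, and GNU Parallel can re-run only the failed jobs from such a log with --retry-failed, which helps with your logging/resume requirement:

find . -type f -print0 | parallel -0 -j16 --bar --joblog upload.log aws s3 cp {} s3://rootbucket/{/}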

If this looks promising, we can add a small bash function that, for each file, updates the database or deletes the local copy, conditional on the aws command succeeding. It would look like this - read it from the bottom up ;-)

#!/bin/bash

# bash function to upload single file
upload() {
   local="$1"                                         # Relative path on disk, e.g. ./da/4b/92/<hash>
   remote="$2"                                        # Bare hash, i.e. the key at the root of the bucket
   if aws s3 cp "$local" "s3://rootbucket/$remote" ; then   # Upload to AWS and test its exit status
      : # Delete locally or update database with success
   else
      : # Log error somewhere
   fi
}

# Export the upload function so processes started by GNU Parallel can find it
export -f upload

# Run GNU Parallel on all files
find . -type f -print0 | parallel -0 upload {} {/}
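
If you would rather drive the upload from your list of known IDs than from find (so the log lines up with your database IDs), a hypothetical variant could look like the one below, assuming a file ids.txt with one numeric ID per line and that you run it from the top of the directory tree:

# Derive each file's relative path from its ID and feed it to the exported upload function
while read -r id ; do
   hash=$(printf '%s' "$id" | sha1sum | cut -d' ' -f1)
   echo "${hash:0:2}/${hash:2:2}/${hash:4:2}/$hash"
done < ids.txt | parallel --joblog upload.log upload {} {/}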
