AWS Batch: extracting a tar.gz file to /opt/ml/model fails with OSError: [Errno 30] Read-only file system


We have Python code that does the following inside a Docker container:

import boto3
import tarfile

s3 = boto3.client('s3')

s3.download_file("dev-bucket", "test/model.tar.gz", "/opt/ml/model/model.tar.gz")

tar = tarfile.open("/opt/ml/model/model.tar.gz", 'r:gz')

# This is the call that fails, as shown in the stack trace
tar.extractall(path="/opt/ml/model")

However, the job fails during extraction with "OSError: [Errno 30] Read-only file system". The full stack trace is:


>   File "inference.py", line 6
>     tar.extractall(path="/opt/ml/model")
>   File "/opt/conda/lib/python3.7/tarfile.py", line 2002, in extractall
>     numeric_owner=numeric_owner)
>   File "/opt/conda/lib/python3.7/tarfile.py", line 2044, in extract
>     numeric_owner=numeric_owner)
>   File "/opt/conda/lib/python3.7/tarfile.py", line 2114, in _extract_member
>     self.makefile(tarinfo, targetpath)
>   File "/opt/conda/lib/python3.7/tarfile.py", line 2163, in makefile
>     copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
>   File "/opt/conda/lib/python3.7/tarfile.py", line 250, in copyfileobj
>     dst.write(buf)
> OSError: [Errno 30] Read-only file system

Our Dockerfile:

FROM continuumio/miniconda3

# use python3.7
RUN /opt/conda/bin/conda install python=3.7

# Update conda
RUN /opt/conda/bin/conda update -n base conda

# Install build-essential
RUN apt-get update && apt-get install -y build-essential \
    wget \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN conda install -y pandas==0.25.1 scikit-learn==0.21.2 s3fs==0.4.2
RUN pip install pyarrow==1.0.0 mxnet joblib==0.13.2 boto3

CMD [ "/bin/bash" ]

ENV PATH="/opt/program:${PATH}"

RUN mkdir -p /opt/ml/model
RUN chmod -R +w /opt/ml/model
RUN mkdir -p /opt/ml/input/data
# Set up the program in the image
COPY helloworld /opt/program
WORKDIR /opt/program

Tags: docker, operating-system, aws-batch, tarfile


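The data was large enough to fill the default 10 GB volume, which caused the volume to become read-only. The solution is to use a launch template and attach an additional volume. That solved the problem for me.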
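To confirm that a full volume is the cause before changing the infrastructure, one option is to compare the uncompressed size of the archive with the free space on the target filesystem. The following is a minimal sketch, not part of the original job code: the helper names uncompressed_size and has_room_for are hypothetical, and it assumes the free space reported inside the container reflects the storage limit that applies to it.

```python
import shutil
import tarfile


def uncompressed_size(archive_path):
    """Sum the sizes of all regular files stored in a .tar.gz archive.

    Note: this scans the whole archive once, which can take a while
    for multi-gigabyte files.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(member.size for member in tar.getmembers() if member.isfile())


def has_room_for(archive_path, target_dir):
    """Return True if target_dir's filesystem reports enough free bytes
    to hold the extracted files."""
    needed = uncompressed_size(archive_path)
    free = shutil.disk_usage(target_dir).free
    return needed <= free
```

Calling has_room_for("/opt/ml/model/model.tar.gz", "/opt/ml/model") before tar.extractall would let the job fail early with a clear message instead of the less obvious Errno 30.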


Please note that, by default, Docker allocates 10 gibibytes (GiB) of storage for each volume it creates on an Amazon ECS container instance. If a volume reaches the 10-GiB limit, you can't write any more data to it without the container instance crashing or the filesystem switching to read-only mode. This applies only if you're using Amazon Linux 1 AMIs to launch container instances in your ECS cluster. Amazon Linux 2 AMIs use the Docker overlay2 storage driver, which gives you a base storage size equal to the space left on your disk. [Batch by default launches Amazon Linux 1 AMIs.]
To increase the default storage allocation for Docker volumes, set the dm.basesize storage option to a value higher than 10 GiB in the Docker daemon configuration file /etc/sysconfig/docker on the container instance. The dm.basesize value can be increased up to your EBS volume size, which allows the container or Batch job to use the full space during execution. After you set dm.basesize, any new images pulled by Docker use the new storage value. Containers or Batch jobs that were created or running before you changed the value still use the previous storage value.
To apply the dm.basesize option to all your containers, set its value before the Docker service starts. You can use a launch template to build a configuration template that applies to all Amazon Elastic Compute Cloud (Amazon EC2) instances launched by AWS Batch. The following example MIME multi-part file overrides the default Docker image settings for a compute resource:
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --storage-opt dm.basesize=20G"' >> /etc/sysconfig/docker

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Set any ECS agent configuration options
echo ECS_CLUSTER=default>>/etc/ecs/ecs.config
echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config

--==BOUNDARY==--
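If you would rather generate the launch template data from Python than write the MIME file by hand, something along these lines may work. It is a sketch under stated assumptions, not a tested deployment: the function names, the 100 GiB volume size, and the gp2 volume type are illustrative choices, and the device name /dev/xvdcz assumes the ECS-optimized Amazon Linux 1 AMI, where that device backs Docker storage. It only builds the data structure; no AWS call is made.

```python
import base64
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(basesize_gib=20):
    """Build the MIME multi-part user data and return it base64-encoded,
    which is the form launch template user data must be supplied in."""
    boothook = (
        "cloud-init-per once docker_options echo "
        "'OPTIONS=\"${OPTIONS} --storage-opt dm.basesize=%dG\"' "
        ">> /etc/sysconfig/docker\n" % basesize_gib
    )
    shell_script = (
        "#!/bin/bash\n"
        "echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config\n"
        "echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config\n"
    )
    message = MIMEMultipart("mixed", boundary="==BOUNDARY==")
    message.attach(MIMEText(boothook, "cloud-boothook", "us-ascii"))
    message.attach(MIMEText(shell_script, "x-shellscript", "us-ascii"))
    return base64.b64encode(message.as_bytes()).decode("ascii")


def build_launch_template_data(volume_gib=100, basesize_gib=20):
    """Return a dict shaped like the LaunchTemplateData parameter of the
    EC2 CreateLaunchTemplate API."""
    return {
        "BlockDeviceMappings": [
            {
                # Assumption: Docker storage device on the ECS-optimized
                # Amazon Linux 1 AMI; adjust for other AMIs.
                "DeviceName": "/dev/xvdcz",
                "Ebs": {
                    "VolumeSize": volume_gib,  # must be at least basesize_gib
                    "VolumeType": "gp2",
                    "DeleteOnTermination": True,
                },
            }
        ],
        "UserData": build_user_data(basesize_gib),
    }
```

The resulting dict could then be passed to boto3.client("ec2").create_launch_template(LaunchTemplateName=..., LaunchTemplateData=...), and the template referenced from the Batch compute environment. The template name is left for you to choose.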
