"No /opt/ml/input/config/resourceconfig.json" error when training locally with the SageMaker Python SDK on WSL

Problem description

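The goal is to run SageMaker local development and local training using the Docker images provided by AWS. This already works on an Ubuntu 20.04 VM, where the code and Docker run on the same VM, but it fails with a WSL Ubuntu 20.04 + Docker Desktop setup.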

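With the setup below, I was able to build a simple Docker image that runs a Python script which reads and writes data in both a WSL directory and a Windows auto-mounted directory, which shows that WSL Ubuntu + Docker Desktop itself is working.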
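A mount sanity check of this kind can be sketched roughly as follows. This is not the script from the original post: the helper name `roundtrip` and the probe file name are hypothetical, and the directory used in the example is only the system temp directory. In a real check, the directories would be the container paths where the WSL and Windows folders are bind-mounted.

```python
import tempfile
from pathlib import Path


def roundtrip(directory, payload="mount-check"):
    """Write a probe file into `directory`, read it back, then delete it.

    Returns True when the content read back matches what was written.
    """
    probe = Path(directory) / "mount_probe.txt"  # hypothetical file name
    probe.write_text(payload)
    ok = probe.read_text() == payload
    probe.unlink()
    return ok


if __name__ == "__main__":
    # Assumption: replace this tuple with the mount points to verify,
    # for example the container paths mapped to WSL and Windows folders.
    for directory in (tempfile.gettempdir(),):
        print(directory, "OK" if roundtrip(directory) else "FAILED")
```

If the probe round-trips in every mounted directory, basic bind mounts between WSL and Docker Desktop are working, which narrows the problem to how the SDK prepares its own mounts.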

Any help is appreciated. I have been struggling to get this working!

Environment

Setup

[automount]
enabled = true
root = /
options = "metadata,umask=22,fmask=11,case=off"

Problem

The following code throws an error at the fit() call.

    mnist_estimator = TensorFlow(entry_point='mnist_tf2.py',
                                 role=dummy_role,
                                 instance_count=1,
                                 instance_type='local',
                                 framework_version='2.2',
                                 source_dir='/home/a632940/dev/sagemaker-local',
                                 py_version='py37',
                                 sagemaker_session=local_session,
                                 distribution={'parameter_server': {'enabled': True}})

    mnist_estimator.fit({'train': training_dataset_path})

Error

Creating network "sagemaker-local" with the default driver
Creating 0j3k45995o-algo-1-prqxb ... done
Attaching to 0j3k45995o-algo-1-prqxb
0j3k45995o-algo-1-prqxb | Reporting training FAILURE
0j3k45995o-algo-1-prqxb | framework error: 
0j3k45995o-algo-1-prqxb | Traceback (most recent call last):
0j3k45995o-algo-1-prqxb |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/trainer.py", line 66, in train
0j3k45995o-algo-1-prqxb |     env = environment.Environment()
0j3k45995o-algo-1-prqxb |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 498, in __init__
0j3k45995o-algo-1-prqxb |     resource_config = resource_config or read_resource_config()
0j3k45995o-algo-1-prqxb |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 239, in read_resource_config
0j3k45995o-algo-1-prqxb |     return _read_json(resource_config_file_dir)
0j3k45995o-algo-1-prqxb |   File "/usr/local/lib/python3.7/site-packages/sagemaker_training/environment.py", line 191, in _read_json
0j3k45995o-algo-1-prqxb |     with open(path, "r") as f:
0j3k45995o-algo-1-prqxb | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
0j3k45995o-algo-1-prqxb | 
0j3k45995o-algo-1-prqxb | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
0j3k45995o-algo-1-prqxb exited with code 2
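The traceback shows the training toolkit failing to open resourceconfig.json under /opt/ml/input/config. As a diagnostic aid, the sketch below reports which of the three JSON files the toolkit reads from that directory (hyperparameters.json, inputdataconfig.json, resourceconfig.json) are actually present. The helper name `check_config_dir` is hypothetical, and running it inside the training container is an assumption about how one might debug this, not a step taken in the original post.

```python
import json
from pathlib import Path

# Configuration files the sagemaker-training toolkit reads at startup.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def check_config_dir(config_dir):
    """Map each expected file name to its parsed JSON, or None if it is missing."""
    config_dir = Path(config_dir)
    report = {}
    for name in EXPECTED_CONFIG_FILES:
        path = config_dir / name
        report[name] = json.loads(path.read_text()) if path.is_file() else None
    return report


if __name__ == "__main__":
    # Path taken from the traceback above.
    for name, content in check_config_dir("/opt/ml/input/config").items():
        print(name, "->", "MISSING" if content is None else content)
```

In local mode the SDK is expected to generate these files on the host and bind-mount them into the container, so a report where all three are missing would suggest that the mounted directory is empty. That reading is an assumption, not a confirmed diagnosis.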

Source code

import os

import boto3
import numpy as np
import sagemaker.session
from sagemaker.local import LocalSession
from sagemaker.tensorflow import TensorFlow

data_files_list = ('train_data.npy', 'train_labels.npy',
                   'eval_data.npy', 'eval_labels.npy')

def download_training_and_eval_data(aws_session):
    # Skip the download when every expected file is already in ./data.
    if all(os.path.isfile('./data/' + filename) for filename in data_files_list):
        print('Training and evaluation datasets exist. Skipping Download')
    else:
        print('Downloading training and evaluation dataset')
        # download_file does not create the target directory, so make sure it exists.
        os.makedirs('./data', exist_ok=True)
        s3 = aws_session.resource('s3')
        for filename in data_files_list:
            s3.meta.client.download_file('sagemaker-sample-data-us-east-1', 'tensorflow/mnist/' + filename,
                                         './data/' + filename)


def do_inference_on_local_endpoint(predictor):
    print('\nStarting Inference on endpoint.')
    correct_predictions = 0

    train_data = np.load('./data/train_data.npy')
    train_labels = np.load('./data/train_labels.npy')

    predictions = predictor.predict(train_data[:50])
    for i in range(0, 50):
        prediction = np.argmax(predictions['predictions'][i])
        label = train_labels[i]
        print('prediction is {}, label is {}, matched: {}'.format(
            prediction, label, prediction == label))
        if prediction == label:
            correct_predictions = correct_predictions + 1

    print('Calculated Accuracy from predictions: {}'.format(
        correct_predictions / 50))


def main():

    # AWS Setup
    aws_session = boto3.session.Session(profile_name='default')

    download_training_and_eval_data(aws_session)

    local_session = sagemaker.LocalSession()
    local_session.config = {'local': {'local_code': True}}
    dummy_role = 'arn:aws:iam::999999999999:role/Dummy-SageMaker--Role'
    
    training_dataset_path = "file://./data/"

    print('Starting model training.')

    mnist_estimator = TensorFlow(entry_point='mnist_tf2.py',
                                 role=dummy_role,
                                 instance_count=1,
                                 instance_type='local',
                                 framework_version='2.2',
                                 source_dir='/home/a632940/dev/sagemaker-local',
                                 py_version='py37',
                                 sagemaker_session=local_session,
                                 distribution={'parameter_server': {'enabled': True}})

    mnist_estimator.fit({'train': training_dataset_path})
    print('Completed model training')

if __name__ == "__main__":
    main()

Tags: windows-subsystem-for-linux, amazon-sagemaker, docker-desktop

Solution

