AWS SageMaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters

Problem description

When I try to use hyperparameter tuning on SageMaker, I get this error:

UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.

When I look up the logs in CloudWatch, all 5 failed training jobs end with the same error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train.py", line 117, in <module>
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
  File "/usr/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None

KeyError: 'SM_CHANNEL_TRAINING'

The problem occurs in step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb

Any hints on where to look next would be much appreciated.

Tags: python-3.x, deep-learning, lstm, amazon-sagemaker, hyperparameters

Solution


In your train.py file, change the environment variable from

parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])

to

parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

This should resolve the problem.

This is the case for Torch framework_version 1.3.1, but other versions may also be affected. Here is a link for your reference.
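
If you would rather not hard-code a single variable name, the sketch below shows one way to make train.py tolerate either name. As far as I understand the SageMaker training containers, each input channel is exposed as SM_CHANNEL_ followed by the channel name in uppercase, so the variable that exists depends on the channel key used when the job is started; treat that as an assumption to verify against your own logs. The helpers resolve_data_dir, available_channels, and build_parser are hypothetical names introduced for illustration and are not part of any SageMaker API.

```python
import argparse
import os


def resolve_data_dir(environ, candidates=("SM_CHANNEL_TRAINING", "SM_CHANNEL_TRAIN")):
    """Return the first candidate channel directory present in environ, else None."""
    for name in candidates:
        if name in environ:
            return environ[name]
    return None


def available_channels(environ):
    """List every SM_CHANNEL_* variable that is set, to help with debugging."""
    return sorted(key for key in environ if key.startswith("SM_CHANNEL_"))


def build_parser(environ=None):
    """Build the argument parser without raising KeyError on a missing variable."""
    if environ is None:
        environ = os.environ
    parser = argparse.ArgumentParser()
    # Falls back to None instead of crashing at import time; --data-dir can
    # still be passed explicitly on the command line.
    parser.add_argument("--data-dir", type=str, default=resolve_data_dir(environ))
    return parser
```

In train.py you could then call build_parser().parse_args() and, if args.data_dir is None, print available_channels(os.environ) before exiting, so the CloudWatch log shows which channel variables the container actually received.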

