AWS SageMaker RL with Ray: ray.tune.error.TuneError: No trainable specified!

Problem description

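I have a training script based on the AWS SageMaker RL example rl_network_compression_ray_custom, but I changed the environment to a basic gym environment, Asteroids-v0 (the dependencies are installed at the main entry point of the training script). When I run fit on the RLEstimator, I get the error ray.tune.error.TuneError: No trainable specified! even though the training config specifies run as DQN.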

Does anyone know about this issue and how to solve it?

Here is the longer log:

Running experiment with config {
  "training": {
    "env": "Asteroids-v0",
    "run": "DQN",
    "stop": {
      "training_iteration": 1
    },
    "local_dir": "/opt/ml/output/intermediate",
    "checkpoint_freq": 10,
    "config": {
      "double_q": false,
      "dueling": false,
      "num_atoms": 1,
      "noisy": false,
      "prioritized_replay": false,
      "n_step": 1,
      "target_network_update_freq": 8000,
      "lr": 6.25e-05,
      "adam_epsilon": 0.00015,
      "hiddens": [
        512
      ],
      "learning_starts": 20000,
      "buffer_size": 1000000,
      "sample_batch_size": 4,
      "train_batch_size": 32,
      "schedule_max_timesteps": 2000000,
      "exploration_final_eps": 0.01,
      "exploration_fraction": 0.1,
      "prioritized_replay_alpha": 0.5,
      "beta_annealing_fraction": 1.0,
      "final_prioritized_replay_beta": 1.0,
      "num_gpus": 0.2,
      "timesteps_per_iteration": 10000
    },
    "checkpoint_at_end": true
  },
  "trial_resources": {
    "cpu": 1,
    "extra_cpu": 3
  }
}
Important! Ray with version <=7.2 may report "Did not find checkpoint file" even if the experiment is actually restored successfully. If restoration is expected, please check "training_iteration" in the experiment info to confirm.
Traceback (most recent call last):
  File "train-ray.py", line 83, in <module>
    MyLauncher().train_main()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 332, in train_main
    launcher.launch()
  File "/opt/ml/code/sagemaker_rl/ray_launcher.py", line 313, in launch
    run_experiments(experiment_config)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 296, in run_experiments
    experiments = convert_to_experiment_list(experiments)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 199, in convert_to_experiment_list
    for name, spec in experiments.items()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 199, in <listcomp>
    for name, spec in experiments.items()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 122, in from_json
    raise TuneError("No trainable specified!")
ray.tune.error.TuneError: No trainable specified!
2020-04-22 13:21:15,784 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python train-ray.py --rl.training.checkpoint_freq 1 --rl.training.stop.training_iteration 1 --s3_bucket XXXXX

Tags: amazon-web-services, reinforcement-learning, amazon-sagemaker, ray, rllib

Solution


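The log indicates that the experiment config is not being passed correctly. Could you try the roboschool example, since its env is simpler, and post the error log if the error still occurs? Also make sure that all dependencies are included in the Dockerfile used to build the custom image.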
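
To see which part of the config Tune might be rejecting, here is a minimal sketch that can be run locally without Ray. It assumes, based only on the traceback above (convert_to_experiment_list iterating over experiments.items() and from_json raising), that every top-level entry of the dict is treated as its own experiment and must contain a "run" key. The helper name specs_missing_run is hypothetical and the check is a stand-in for what the traceback implies, not Ray's actual code.

```python
from typing import Any, Dict, List


def specs_missing_run(experiments: Dict[str, Any]) -> List[str]:
    """Return the names of top-level entries that have no "run" key.

    Assumption: each top-level entry is parsed as a separate experiment
    spec, which is what the traceback in the question suggests.
    """
    return [
        name
        for name, spec in experiments.items()
        if not isinstance(spec, dict) or "run" not in spec
    ]


# Abbreviated copy of the config printed at the top of the log.
logged_config = {
    "training": {
        "env": "Asteroids-v0",
        "run": "DQN",
        "stop": {"training_iteration": 1},
    },
    "trial_resources": {"cpu": 1, "extra_cpu": 3},
}

# List every top-level entry that would fail the assumed "run" check.
print(specs_missing_run(logged_config))
```

If that reading of the traceback is right, the entry flagged here is the top-level "trial_resources" block rather than "training", so it is worth checking where the training script adds that block to the experiment config. Whether this Ray version accepts the same key nested inside "training" is something to verify against the installed version.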

