How to prevent trials from executing on the head node

Problem description

I'm using ray.tune on an AWS autoscaling GPU cluster. Currently both my head and my workers have a GPU, and both are used to run trials. I'm trying to move to a setup where the head has no GPU, which is how Ray's documentation defines an "autoscaling GPU cluster". However, I keep hitting CUDA problems on the head, which makes sense, since it is being used to run trials. The fix seems obvious: I need to prevent trials from being executed on the head, but I can't find a way to do that. I've tried various resources_per_trial values, together with different ray.init() settings, but couldn't get it to work.
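For reference, here is a minimal sketch of the kind of call I was making. train_fn is just a stand-in for my real trainable (the TrainableAE class that appears in the trace below); the search space and sample count are illustrative, not my exact setup:

import ray
from ray import tune

def train_fn(config):
    # Placeholder loop; my real trainable builds a PyTorch autoencoder here.
    for step in range(10):
        tune.report(loss=1.0 / (step + 1))

ray.init(address="auto")  # attach to the running autoscaling cluster

tune.run(
    train_fn,
    num_samples=4,
    # One GPU per trial -- I expected this to keep trials off the GPU-less head,
    # but trials still landed there (see the accepted solution below).
    resources_per_trial={"cpu": 1, "gpu": 1},
)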

Additional details:

Everything runs on GPUs only, which is why I'm focused on preventing the head from executing trials.

As for errors and warnings, I get the following:

WARNING tune.py:318 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
WARNING ray_trial_executor.py:549 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().


WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff128bce290200 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:10.160.26.189: 1.000000}, {object_store_memory: 12.304688 GiB}, {CPU: 3.000000}, {memory: 41.650391 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

I still get these warnings even when I wait for a GPU worker to come up.
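For context, the first warning points at two ways to declare GPU needs: resources_per_trial on tune.run (which I was already setting), or overriding Trainable.default_resource_request when using the class-based API. Below is roughly what the latter would look like, assuming this Ray version exposes ray.tune.resources.Resources; the exact import path and method names (setup/step vs the older _setup/_train) may differ between releases, and the body here is a placeholder rather than my actual autoencoder:

from ray import tune
from ray.tune.resources import Resources  # assumed import path for this Ray version

class TrainableAE(tune.Trainable):
    @classmethod
    def default_resource_request(cls, config):
        # Declare one CPU and one GPU per trial instead of passing
        # resources_per_trial to tune.run().
        return Resources(cpu=1, gpu=1)

    def setup(self, config):
        pass  # build the model here

    def step(self):
        return {"loss": 0.0}  # one training iteration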

Finally, the error is:

ERROR trial_runner.py:520 -- Trial TrainableAE_a441f_00000: Error processing event.
Traceback (most recent call last):
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 468, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 1467, in get
    values = worker.get_objects(object_ids, timeout=timeout)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 306, in get_objects
    return self.deserialize_objects(data_metadata_pairs, object_ids)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 281, in deserialize_objects
    return context.deserialize_objects(data_metadata_pairs, object_ids)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 312, in deserialize_objects
    self._deserialize_object(data, metadata, object_id))
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 252, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 233, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 221, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/storage.py", line 136, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 773, in _legacy_load
    result = unpickler.load()
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 729, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 178, in default_restore_location
    result = fn(storage, location)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 138, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
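The RuntimeError itself suggests the CPU-side workaround (torch.load with map_location). Applied literally to the failing call, it would look something like the sketch below, though this only hides the symptom; the trial shouldn't be landing on a CUDA-less node in the first place:

import io
import torch

def load_checkpoint_cpu(raw_bytes: bytes):
    # Map tensors that were saved on a CUDA device onto the CPU,
    # as the error message recommends for CPU-only machines.
    return torch.load(io.BytesIO(raw_bytes), map_location=torch.device("cpu"))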

Tags: ray

Solution


Thanks to richliaw's comment. Once I stopped trying to prevent trials from executing on the head and instead focused on figuring out why they were being scheduled there in the first place, the solution became obvious: the AMI my cluster's head node runs on had the NVIDIA driver and CUDA installed. After I removed those, Ray no longer tried to execute trials on the head. So I guess that is what Ray uses to decide whether a node can satisfy resources_per_trial={'gpu': 1}.
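A quick way to verify the fix is to look at which resources Ray advertises per node; after rebuilding the head AMI without the NVIDIA driver and CUDA, the head's entry should no longer list a GPU. A small sketch, run from any node attached to the cluster:

import ray

ray.init(address="auto")  # attach to the running cluster

# Aggregate view of what Ray thinks the cluster has.
print(ray.cluster_resources())

# Per-node view: the head node's "Resources" should no longer contain a "GPU" key.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node.get("Resources", {}))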

