ray - How to prevent trials from executing on the head node
Question
I am running ray.tune on an AWS "autoscaling GPU cluster". Currently both my head and my workers have a GPU, and all of them are used to execute trials. I am trying to move to a setup where the head has no GPU, which is how Ray's documentation defines an "autoscaling GPU cluster". However, I keep running into CUDA problems on the head, which makes sense since it is being used to execute trials. The solution seems simple: I think I need to prevent trials from executing on the head, but I cannot find a way to do it. I have tried various `resources_per_trial` values together with different `ray.init()` arguments, but could not get it to work.
Additional details:
- I use Ray 0.8.6.
- I set `resources_per_trial={'gpu': 1}`.
- I set `torch.device("cuda:0")` everywhere.
- I use 1 head (CPU-only) and 1 worker (GPU-only), and I need at least 1 worker.

So everything should run only on the GPU worker, which is why I am focused on preventing execution on the head.
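For reference, a minimal sketch of the setup described above (the trainable name `train_fn` and its body are my own illustration, not the original code, and the exact Tune function API differs slightly in Ray 0.8.6):

```python
import torch
from ray import tune

def train_fn(config):
    # Pick the GPU Ray assigned to this trial; fall back to CPU.
    # On a CPU-only head node this is where CUDA errors surface.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(4, 1).to(device)
    # ... training loop would go here ...

tune.run(
    train_fn,
    # Require one GPU per trial, so CPU-only nodes (e.g. the head)
    # should not receive trial actors.
    resources_per_trial={"cpu": 1, "gpu": 1},
)
```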
As for errors and warnings, I get the following:
WARNING tune.py:318 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
WARNING ray_trial_executor.py:549 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff128bce290200 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:10.160.26.189: 1.000000}, {object_store_memory: 12.304688 GiB}, {CPU: 3.000000}, {memory: 41.650391 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
I still get the above even while waiting for the GPU worker to spin up.
Finally, the error is:
ERROR trial_runner.py:520 -- Trial TrainableAE_a441f_00000: Error processing event.
Traceback (most recent call last):
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 1467, in get
values = worker.get_objects(object_ids, timeout=timeout)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 306, in get_objects
return self.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 281, in deserialize_objects
return context.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 312, in deserialize_objects
self._deserialize_object(data, metadata, object_id))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 233, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 221, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/storage.py", line 136, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
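As the error message itself suggests, a checkpoint containing CUDA tensors can still be deserialized on a CPU-only machine by remapping its storages. A minimal sketch (the helper name `load_checkpoint_cpu` is my own):

```python
import torch

def load_checkpoint_cpu(f):
    # f: a file path or file-like object holding a torch.save() payload.
    # map_location remaps any CUDA storages to the CPU, so loading works
    # on machines where torch.cuda.is_available() is False.
    return torch.load(f, map_location=torch.device("cpu"))
```

This avoids the `RuntimeError` above, but it does not address the underlying scheduling problem of trials landing on the head node.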
Solution
Thanks to richliaw's comment. Once I stopped trying to prevent trial execution on the head and instead focused on figuring out why trials were being scheduled there in the first place, the solution became obvious. The AMI of my cluster's head node had the NVIDIA drivers and CUDA installed. After I removed those, Ray no longer tried to execute trials on the head. So I suppose that is how Ray decides which nodes are eligible for trials with `resources_per_trial={'gpu': 1}`.
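A related approach (my own assumption, not what I ended up doing) is to start the head without advertising any schedulable resources, so that Tune can never place trial actors on it regardless of what drivers the AMI ships with:

```shell
# Start the head with zero logical CPUs/GPUs, so no trial actors
# can be scheduled on it.
ray start --head --num-cpus=0 --num-gpus=0

# Workers join normally and advertise their own resources.
ray start --address=<head-address>:6379
```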