How to run a Ray task that depends on project code

Question

I have a large Python project with many folders:

-model
-utils
-compute

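My Ray remote code consists of some functions in the compute folder, and inside the remote tasks I need to run code from model and utils.

Currently I get a "No module named ..." error for the different project folders.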

import ray

from utils.osops import run_command
from model.model_desc import ModelInsance
from compute.ray_remote import

@ray.remote
def run_eval_remote(cmd_data, model_json):
    # Rebuild the model instance from its JSON description
    model_ins = ModelInsance.read_from_json(model_json)
    run_command(model_ins.bash_cmd)
    # do some more stuff
    return some_value

How do I do this correctly?

Here is the stack trace:

  "/Users/me/proj/compute/evaluator_ray.py", line 178, in <listcomp>
ray_res = [self.eval_instance(instance, eval_metric) for instance in mutations_for_search]
  File "/Users/me/proj/compute/evaluator_ray.py", line 175, in eval_instance
return run_eval_remote.remote(cmd_data, instance_json)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/remote_function.py", line 114, in _remote_proxy
return self._remote(args=args, kwargs=kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 292, in _invocation_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/remote_function.py", line 202, in _remote
return client_mode_convert_function(
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 133, in client_mode_convert_function
return client_func._remote(in_args, in_kwargs, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 98, in _remote
return self.options(**option_args).remote(*args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 296, in remote
 return return_refs(ray.call_remote(self, *args, **kwargs))
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/api.py", line 103, in call_remote
return self.worker.call_remote(instance, *args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 322, in call_remote
task = instance._prepare_client_task()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 302, in _prepare_client_task
task = self.remote_stub._prepare_client_task()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 119, in _prepare_client_task
self._ensure_ref()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 115, in _ensure_ref
self._ref = ray.put(
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/api.py", line 52, in put
return self.worker.put(*args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in put
out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in <listcomp>
out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 280, in _put
raise cloudpickle.loads(resp.error)
ModuleNotFoundError: No module named 'compute'

Tags: python, ray

Answer


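I ran into a similar problem and solved it as follows:

  • Distribute the code to every node (I just ran git clone on each node)
  • Make sure the version/branch/etc. of the code is the same on every node
  • Set up a virtual environment on each node and install ray (and the other project dependencies) in it
  • Start ray from within the virtualenv and join the cluster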

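Now, when you launch a job from the head node (or from outside the cluster), the project modules and dependencies are present on the workers and the job runs fine.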
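To confirm that a worker can actually see the project packages before launching real jobs, a small probe task can help. This is only a sketch: missing_modules and check_on_cluster are hypothetical helper names, the package names are taken from the question, and address="auto" assumes you run it on a node that is already part of the cluster. Note that it only probes whichever worker picks up the task, not every node.

```python
import importlib.util


def missing_modules(names):
    """Return the names that cannot be found on this interpreter's import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken parents or odd entries in sys.modules
            found = False
        if not found:
            missing.append(name)
    return missing


def check_on_cluster(names=("model", "utils", "compute")):
    """Run the probe as a Ray task (hypothetical helper, not called here)."""
    import ray  # imported lazily so the helper above works without ray

    ray.init(address="auto")  # assumption: an existing cluster is reachable
    probe = ray.remote(missing_modules)
    return ray.get(probe.remote(list(names)))
```

Define the probe in the driver script itself rather than inside the project packages, so that it gets pickled by value and does not depend on the very modules it is checking. An empty list from check_on_cluster() means the worker that ran the probe can locate all three packages.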

Of course, a cleaner way to distribute the code is via containers, but for my purposes this approach worked well.
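A different option, which is not what I did above, is to let Ray ship the project directory itself through a runtime environment, in Ray versions that support it. The sketch below assumes the driver is started from the project root (the folder that contains model, utils and compute); build_runtime_env and connect are hypothetical helper names, and the default address is an assumption.

```python
from pathlib import Path


def build_runtime_env(project_root):
    """Build a runtime_env dict that asks Ray to upload the project directory."""
    root = Path(project_root).resolve()
    # "working_dir" is uploaded to the cluster and workers start inside their copy,
    # so top-level packages in that directory should become importable there.
    return {"working_dir": str(root)}


def connect(project_root=".", address="auto"):
    """Connect to the cluster with the project attached (hypothetical helper, not called here)."""
    import ray  # imported lazily; requires ray to be installed

    return ray.init(address=address, runtime_env=build_runtime_env(project_root))
```

This only covers the project's own code. Third-party dependencies still need to be installed on the nodes, as in the steps above, or declared separately.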

