Bash, Conda, Docker, and Ray: What startup commands should be given to Ray so that the bash profile inside the Docker container is sourced correctly at runtime?

Problem description

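I am trying to use Ray and Docker to launch jobs programmatically on EC2, and I want to use conda for package management inside my Docker container. I have figured out how to build the container so that if I run docker run -i -t my_container:my_tag /bin/bash, I can launch my jobs inside a local container. The problem is that when I add Ray to the picture to launch jobs remotely, Ray fails with the following error: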

start: ray: command not found
Cluster: my-cluster

Checking AWS environment settings
AWS config
  IAM Profile: ray-head-v1
  EC2 Key pair (head & workers): [redacted]
  VPC Subnets (head & workers): [redacted]
  EC2 Security groups (head & workers): [redacted]
  EC2 AMI (head & workers): [redacted]

No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=[redacted]]
    Launched instance i-067e250cc8591da86 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/6] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Received: 3.21.104.163
    SSH still not available SSH command failed., retrying in 5 seconds.
    SSH still not available SSH command failed., retrying in 5 seconds.
    Success.
  Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
  New status: syncing-files
  [3/6] Processing file mounts
  [4/6] No worker file mounts to sync
  New status: setting-up
  [3/6] No initialization commands to run.
  [4/6] No setup commands to run.
  [6/6] Starting the Ray runtime
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

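At this point, I have hit the limit of my understanding of how Ray and Docker interact. I think the problem is that the head_start_ray_commands are somehow being passed to docker run. Since Docker uses the sh shell to run commands, the bash profile is not sourced, so conda and packages such as ray are not found. This would explain why the container has no issues when I launch a bash shell in interactive mode in a local container instance. I have tried adding /bin/bash --login at the start of head_start_ray_commands, but this only seems to make the whole program freeze.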
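The suspicion above can be checked locally without Ray or Docker. The following is a minimal sketch, assuming bash and mktemp are available; the temporary home directory and the PROFILE_LOADED variable are hypothetical names used only for this demonstration. It compares a plain bash -c with bash --login -c, which reads the profile, runs one command, and exits.

```shell
# Throwaway HOME whose profile sets a marker variable (hypothetical name).
demo_home="$(mktemp -d)"
echo 'export PROFILE_LOADED=yes' > "$demo_home/.bash_profile"

# Non-interactive, non-login shell: ~/.bash_profile is not read.
plain="$(env -i HOME="$demo_home" PATH="$PATH" \
  bash -c 'echo "${PROFILE_LOADED:-no}"' | tail -n 1)"

# Login shell with -c: the profile is read first, then the command runs.
login="$(env -i HOME="$demo_home" PATH="$PATH" \
  bash --login -c 'echo "${PROFILE_LOADED:-no}"' | tail -n 1)"

echo "plain=$plain login=$login"
rm -rf "$demo_home"
```

This may also explain the freeze: a bare /bin/bash --login without -c starts a shell that waits for input on stdin instead of returning. Note that conda init normally writes its hook to ~/.bashrc, which many default Ubuntu .bashrc files skip for non-interactive shells, so a login shell alone may still not activate conda.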

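What is the correct way to get Ray to source the bash profile before executing commands? If this is not possible, is there a better way to go about this? For reference, this is my current Ray config: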

init:
  address: null
remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Edit

The simplest workaround seems to be to avoid conda entirely in favor of venv.
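If conda has to stay, one option that might be worth trying is to activate the environment explicitly inside each start command rather than relying on profile files. This is an untested sketch: the install prefix /opt/conda and the environment name my_env are assumptions and must match the image, while the ray commands are the ones from the config above. The two keys would replace the existing ones under cluster:

```yaml
# Assumed: conda lives at /opt/conda and the environment is named my_env.
head_start_ray_commands:
  - bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate my_env && ray stop"
  - bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate my_env && ulimit -n 65536 && ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml"
worker_start_ray_commands:
  - bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate my_env && ray stop"
  - bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate my_env && ulimit -n 65536 && ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076"
```

Because each entry passes its command through -c, the shell exits as soon as the command finishes, which avoids the hang seen with a bare /bin/bash --login.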

Tags: amazon-web-services, docker, conda, python-venv, ray
