amazon-web-services - Bash, Conda, Docker, and Ray: which start commands should be given to Ray so that the bash profile inside the Docker container is sourced correctly at runtime?
Question
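I am trying to launch jobs programmatically on EC2 using Ray and Docker. I want to use conda for package management inside my Docker container. I have worked out how to build the container so that if I run
docker run -i -t my_container:my_tag /bin/bash
I can start my job in the local container. The problem is that when I add Ray to the picture to launch the job remotely, Ray fails with the following error: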
start: ray: command not found
Cluster: my-cluster
Checking AWS environment settings
AWS config
IAM Profile: ray-head-v1
EC2 Key pair (head & workers): [redacted]
VPC Subnets (head & workers): [redacted]
EC2 Security groups (head & workers): [redacted]
EC2 AMI (head & workers): [redacted]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=[redacted]]
Launched instance i-067e250cc8591da86 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/6] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Received: 3.21.104.163
SSH still not available SSH command failed., retrying in 5 seconds.
SSH still not available SSH command failed., retrying in 5 seconds.
Success.
Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
New status: syncing-files
[3/6] Processing file mounts
[4/6] No worker file mounts to sync
New status: setting-up
[3/6] No initialization commands to run.
[4/6] No setup commands to run.
[6/6] Starting the Ray runtime
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
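At this point I have reached the limit of my understanding of how Ray and Docker interact. I think the problem lies in how head_start_ray_commands is passed to docker run. Because Docker uses the sh shell to run commands, the bash profile is never sourced, so commands such as conda and ray are not found. That would explain why the container works without any problems when I start a bash shell interactively in a local container instance. I have tried adding /bin/bash --login at the start of head_start_ray_commands, but that only seems to make the whole program freeze.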
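The suspected cause can be reproduced without Ray or Docker. The following is a minimal sketch, assuming only that bash is installed; the temporary HOME directory and the MARKER variable are hypothetical names used for illustration. It shows that a non-login `bash -c` does not read ~/.bash_profile, while `bash -lc` does. It also suggests why a bare `/bin/bash --login` with no `-c` may appear to hang: with no command string, the shell waits for input on stdin.

```shell
# Hypothetical throwaway HOME so the real profile is left untouched.
tmp_home=$(mktemp -d)

# Stand-in for the line that conda init would add to a bash profile.
echo 'export MARKER=from_profile' > "$tmp_home/.bash_profile"

# Non-login, non-interactive shell: ~/.bash_profile is not read.
plain=$(HOME="$tmp_home" bash -c 'echo "${MARKER:-unset}"')

# Login shell running a single command: ~/.bash_profile is read first.
login=$(HOME="$tmp_home" bash -lc 'echo "${MARKER:-unset}"')

echo "non-login shell sees: $plain"
echo "login shell sees:     $login"
```

If the same behaviour applies inside the container, wrapping each start command as `bash -lc '<command>'` may be worth trying, though whether Ray passes the string through unchanged is an assumption that would need checking.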
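What is the correct way to have Ray source the bash profile before executing commands? If that is not possible, is there a better way to do this? For reference, here is my current Ray config: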
init:
  address: null
  remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Edit
The simplest workaround seems to be to avoid conda entirely in favor of venv.
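One likely reason venv sidesteps the problem is that a virtual environment needs no shell profile at all: prepending its bin directory to PATH is enough for any shell, including plain sh, to find its executables. The sketch below assumes python3 with the standard venv module is available; the temporary directory is a hypothetical location, and `--without-pip` is used only to keep the example fast.

```shell
# Hypothetical location for the environment; an image might use /opt/venv.
venv_dir="$(mktemp -d)/venv"

# Create the virtual environment (pip skipped to keep the sketch quick).
python3 -m venv --without-pip "$venv_dir"

# Plain sh, no profile sourced: the venv's python is found purely via PATH.
resolved=$(PATH="$venv_dir/bin:$PATH" sh -c 'command -v python')

echo "python resolves to: $resolved"
```

In a Dockerfile the equivalent would typically be an ENV line that prepends the venv's bin directory to PATH, so that commands Ray runs in the container can find ray without a login shell.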