Problems setting up Ray with Google Cloud

Problem description

I am trying to set up a Ray cluster with Kubernetes, following https://ray.readthedocs.io/en/latest/autoscaling.html#kubernetes. These are my steps:

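  1. Create a Kubernetes cluster on Google Cloud Platform
  2. Connect to the cluster through Cloud Shell
  3. Run the following commands: sudo pip install -U ray, sudo pip install kubernetes
  4. Run ray up (with the example config file)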

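I was then asked whether to create a cluster, and I answered yes. It keeps printing "Error from server (BadRequest): pod ray-head-242dd does not have a host assigned".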

Then I tried the approach at https://ray.readthedocs.io/en/latest/autoscaling.html#gcp. I changed the project name in the example-full yaml and ran ray up on that yaml. This is the output:

   WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-10-28 17:06:58,254 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,258 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/cloudresourcemanager/v1/rest
2019-10-28 17:06:58,397 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,398 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/iam/v1/rest
2019-10-28 17:06:58,448 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:06:58,448 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/compute/v1/rest
2019-10-28 17:06:58,609 INFO discovery.py:867 -- URL being requested: GET https://cloudresourcemanager.googleapis.com/v1/projects/project?alt=json
2019-10-28 17:06:58,700 INFO discovery.py:867 -- URL being requested: GET https://iam.googleapis.com/v1/projects/project/serviceAccounts/ray-autoscaler-sa-v1@project.iam.gserviceaccount.com?alt=json
2019-10-28 17:06:58,764 INFO config.py:165 -- _configure_iam_role: Creating new service account ray-autoscaler-sa-v1
2019-10-28 17:06:58,772 INFO discovery.py:867 -- URL being requested: POST https://iam.googleapis.com/v1/projects/project/serviceAccounts?alt=json
2019-10-28 17:06:59,449 INFO discovery.py:867 -- URL being requested: POST https://cloudresourcemanager.googleapis.com/v1/projects/project:getIamPolicy?alt=json
2019-10-28 17:06:59,591 INFO discovery.py:867 -- URL being requested: POST https://cloudresourcemanager.googleapis.com/v1/projects/project:setIamPolicy?alt=json
2019-10-28 17:07:00,095 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project?alt=json
2019-10-28 17:07:00,319 INFO config.py:238 -- _configure_key_pair: Creating new key pair ray-autoscaler_gcp_us-west1_project_ubuntu
2019-10-28 17:07:00,409 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/setCommonInstanceMetadata?alt=json
2019-10-28 17:07:01,025 INFO config.py:59 -- wait_for_compute_global_operation: Waiting for operation operation-1572296820417-595fee1766329-d528523f-5b1ebecc to finish...
2019-10-28 17:07:01,031 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:06,261 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:11,491 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/global/operations/operation-1572296820417-595fee1766329-d528523f-5b1ebecc?alt=json
2019-10-28 17:07:11,744 INFO config.py:70 -- wait_for_compute_global_operation: Operation done.
2019-10-28 17:07:11,745 INFO config.py:265 -- _configure_key_pair: Private key not specified in config, using/home/zh2408/.ssh/ray-autoscaler_gcp_us-west1_project_ubuntu.pem
2019-10-28 17:07:11,755 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/regions/us-west1/subnetworks?alt=json
2019-10-28 17:07:11,908 WARNING __init__.py:44 -- file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
2019-10-28 17:07:11,909 INFO discovery.py:271 -- URL being requested: GET https://www.googleapis.com/discovery/v1/apis/compute/v1/rest
2019-10-28 17:07:12,040 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-node-type+%3D+head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
This will create a new cluster [y/N]: y
2019-10-28 17:07:17,457 INFO commands.py:201 -- get_or_create_head_node: Launching new head node...
2019-10-28 17:07:17,472 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?alt=json
2019-10-28 17:07:19,474 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec to finish...
2019-10-28 17:07:19,476 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec?alt=json
2019-10-28 17:07:24,717 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec?alt=json
2019-10-28 17:07:25,039 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296837479-595fee27abde7-e9b428db-4d0e22ec finished.
2019-10-28 17:07:25,055 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-launch-config+%3D+07f3c1fd9b3e0be05984f720952adf2b99563d9d%29+AND+%28labels.ray-node-type+%3D+head%29+AND+%28labels.ray-node-name+%3D+ray-default-head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:07:25,802 INFO commands.py:214 -- get_or_create_head_node: Updating files on head node...
2019-10-28 17:07:25,806 INFO updater.py:356 -- NodeUpdater: ray-default-head-f3ed05cc: Updating to 2ae7e7f3db51902552832d843b3db964635184e5
2019-10-28 17:07:25,820 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:07:26,030 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:07:26,766 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296846037-595fee2fd53e7-f3e51edb-17229134 to finish...
2019-10-28 17:07:26,768 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296846037-595fee2fd53e7-f3e51edb-17229134?alt=json
2019-10-28 17:07:32,033 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296846037-595fee2fd53e7-f3e51edb-17229134?alt=json
2019-10-28 17:07:32,336 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296846037-595fee2fd53e7-f3e51edb-17229134 finished.
2019-10-28 17:07:32,337 INFO updater.py:398 -- NodeUpdater: ray-default-head-f3ed05cc: Waiting for remote shell...
2019-10-28 17:07:32,337 INFO updater.py:210 -- NodeUpdater: ray-default-head-f3ed05cc: Waiting for IP...
2019-10-28 17:07:32,337 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Got IP [LogTimer=0ms]
2019-10-28 17:07:32,354 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:38,502 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:43,602 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:48,686 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:53,792 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:07:58,878 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:03,965 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:09,053 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
ssh: connect to host 34.82.120.14 port 22: Connection refused
2019-10-28 17:08:14,143 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running uptime on 34.82.120.14...
Warning: Permanently added '34.82.120.14' (ECDSA) to the list of known hosts.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
 21:08:15 up 0 min,  0 users,  load average: 1.10, 0.32, 0.11
2019-10-28 17:08:15,103 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Got remote shell [LogTimer=42766ms]
2019-10-28 17:08:15,129 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:15,348 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:08:16,008 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296895356-595fee5edde25-16887d46-c522d063 to finish...
2019-10-28 17:08:16,011 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296895356-595fee5edde25-16887d46-c522d063?alt=json
2019-10-28 17:08:21,313 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296895356-595fee5edde25-16887d46-c522d063?alt=json
2019-10-28 17:08:21,581 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296895356-595fee5edde25-16887d46-c522d063 finished.
2019-10-28 17:08:21,582 INFO updater.py:262 -- NodeUpdater: ray-default-head-f3ed05cc: Running mkdir -p ~ on 34.82.120.14...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-10-28 17:08:21,741 INFO updater.py:460 -- NodeUpdater: ray-default-head-f3ed05cc: Syncing /tmp/ray-bootstrap-5XD_Sh to ~/ray_bootstrap_config.yaml...
2019-10-28 17:08:21,755 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Synced /tmp/ray-bootstrap-5XD_Sh to ~/ray_bootstrap_config.yaml [LogTimer=174ms]
2019-10-28 17:08:21,756 INFO log_timer.py:21 -- NodeUpdater: ray-default-head-f3ed05cc: Applied config 2ae7e7f3db51902552832d843b3db964635184e5 [LogTimer=55949ms]
2019-10-28 17:08:21,756 ERROR updater.py:367 -- NodeUpdater: ray-default-head-f3ed05cc: Error updating [Errno 2] No such file or directory
2019-10-28 17:08:21,770 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:22,006 INFO discovery.py:867 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances/ray-default-head-f3ed05cc/setLabels?alt=json
2019-10-28 17:08:22,649 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e to finish...
2019-10-28 17:08:22,651 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e?alt=json
2019-10-28 17:08:27,936 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/operations/operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e?alt=json
2019-10-28 17:08:28,180 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1572296902019-595fee65389b8-c0cc26c3-1813a77e finished.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/ray/autoscaler/updater.py", line 370, in run
    raise e
OSError: [Errno 2] No such file or directory
2019-10-28 17:08:28,214 INFO discovery.py:867 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/project/zones/us-west1-a/instances?filter=%28%28labels.ray-launch-config+%3D+07f3c1fd9b3e0be05984f720952adf2b99563d9d%29+AND+%28labels.ray-node-type+%3D+head%29+AND+%28labels.ray-node-name+%3D+ray-default-head%29%29+AND+%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+default%29&alt=json
2019-10-28 17:08:28,431 ERROR commands.py:277 -- get_or_create_head_node: Updating 34.82.120.14 failed

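The only thing I can see is that one Ray VM instance was created. I don't know what the errors mean or how to set up a Ray cluster on Google Cloud.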

Tags: kubernetes, ray

Solution


The host-related error message:

Error from server (BadRequest): pod ray-head-242dd does not have a host assigned

means that the pod has not yet been scheduled onto a node.

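According to the documentation linked in your question, this Ray example is meant to run on a 2-vCPU machine (n1-standard-2):

The provided ray/python/ray/autoscaler/gcp/example-full.yaml cluster config file will create a small cluster with an n1-standard-2 head node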

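The pod definition requests 1 vCPU. However, other processes, pods, and resources are already running on the same node, so the node cannot give a full vCPU to the new pod. The pod therefore needs a machine with more vCPUs.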
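To make the arithmetic concrete, here is a minimal sketch of the "does the request fit" check described above. The helper names and all CPU figures are illustrative assumptions, not values read from your cluster; the real numbers appear under Allocatable and Allocated resources in the output of kubectl describe node.

```python
from typing import List


def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("1", "0.5", "940m") to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def pod_fits(allocatable: str, already_requested: List[str], pod_request: str) -> bool:
    """Return True if pod_request fits into the CPU still free on the node."""
    free = parse_cpu(allocatable) - sum(parse_cpu(q) for q in already_requested)
    return parse_cpu(pod_request) <= free


# Hypothetical numbers: CPU requests of pods already running on the node.
system_pods = ["100m", "260m"]

# A node whose allocatable CPU is below 1 vCPU can never host a 1-vCPU request.
print(pod_fits("940m", system_pods, "1"))

# A larger node leaves enough room after the existing requests.
print(pod_fits("1930m", system_pods, "1"))
```

The point of the sketch is that the scheduler compares the request against what is left after existing requests, not against the machine's nominal vCPU count.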

You can try again with a different machine type for your node pool.

As a side note, you can check why the pod is failing by issuing the following command:

$ kubectl describe pod {YOUR-RAY-POD-NAME}

This will point you to the cause of the problem, for example whatever is preventing scheduling.
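If you would rather check this from a script, the same information is available in the pod's status conditions. The sketch below is one way to read it, assuming kubectl is installed and pointed at your cluster. The function names and the default namespace are hypothetical choices for illustration; pass whichever namespace your Ray pods actually run in.

```python
import json
import subprocess
from typing import Optional


def unschedulable_reason(pod: dict) -> Optional[str]:
    """Return the scheduler's message if the pod is not scheduled, else None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            # Prefer the human-readable message; fall back to the short reason.
            return cond.get("message") or cond.get("reason")
    return None


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Fetch the pod object as a dict by calling `kubectl get pod ... -o json`."""
    out = subprocess.check_output(
        ["kubectl", "get", "pod", name, "--namespace", namespace, "-o", "json"]
    )
    return json.loads(out)


# Example usage (requires access to the cluster):
#   print(unschedulable_reason(fetch_pod("ray-head-242dd")))
```

A pod that reports "does not have a host assigned" would typically show a PodScheduled condition with status False here, and the message names the resource that is short.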

