Disabling hyperthreading on nodes in an AWS EC2 cluster through the Ray config

Problem description

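I have a task running on an EC2 cluster that gets progressively slower as more of the virtual CPUs are used (regardless of the size of the EBS volume). To avoid this, I'd like to disable hyperthreading on all nodes, and I have tried to implement the advice given here: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/
I'm using Ray to launch the cluster on Ubuntu 18.04, and I assume the initialization_commands section of the config.yaml file is the right place for the bash commands (I don't understand the bootcmd: heading used in that post). I've tried many different formats but none seem to work; for example: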

# List of commands run before setup_commands.
initialization_commands:
    - for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done

produces this error:

bash: syntax error near unexpected token `sudo'
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Initialization commands completed [LogTimer=139ms]
2020-07-26 22:53:04,949 INFO log_timer.py:17 -- NodeUpdater: i-0eefc0511ce029fb3: Applied config 39910e8bc12541ca5e316063231a2493642efee4 [LogTimer=60603ms]
2020-07-26 22:53:04,950 ERROR updater.py:348 -- NodeUpdater: i-0eefc0511ce029fb3: Error updating (Exit Status 1) ssh -i /home/haines/.ssh/ray-key2_us-east-1.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C -o ControlPersist=10s -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 ubuntu@3.93.77.73 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr '"'"','"'"' '"'"'\n'"'"' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 351, in run
    raise e
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 341, in run
    self.do_update()
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 426, in do_update
    self.cmd_runner.run(cmd)
  File "/home/haines/Projects/VF83/Ray_Cloud/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 263, in run
    self.process_runner.check_call(final_cmd)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/haines/.ssh/ray-key2_us-east-1.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_98734ce2b6/5f5c61af53/%C', '-o', 'ControlPersist=10s', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', 'ubuntu@3.93.77.73', 'bash', '--login', '-c', '-i', '\'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr \'"\'"\',\'"\'"\' \'"\'"\'\\n\'"\'"\' | sort -un); sudo echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done\'']' returned non-zero exit status 1.

2020-07-26 22:53:05,018 INFO log_timer.py:17 -- AWSNodeProvider: Set tag ray-node-status=setting-up on ['i-0eefc0511ce029fb3'] [LogTimer=205ms]
2020-07-26 22:53:05,140 ERROR commands.py:285 -- get_or_create_head_node: Updating 3.93.77.73 failed

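I've tried splitting the command across separate lines and placing the commands in the setup_commands section, but neither worked. Is there an easier way?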

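Update: I suspected the syntax error was caused by some whitespace or stray character (although I tried many variants), but even without the loop, i.e. with just a single sudo echo command writing to one CPU, I get a permission error: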

bash: /sys/devices/system/cpu/cpu50/online: Permission denied

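Update 2: I found there is a simpler approach, "export OMP_NUM_THREADS=1", but it seems to have no effect when done through the bash commands in the setup. I'm using Ray 0.8.6, which I believe is supposed to set OMP_NUM_THREADS=1, but the variable is not defined on the head node once the cluster is up and running.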

Tags: linux, amazon-ec2, ray, hyperthreading

Solution


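Well, setting OMP_NUM_THREADS turned out to be no help. The solution is the first one described by AWS, but it also requires granting write permission on all the CPU online flags in the Ray config file: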

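setup_commands:
    # Make the CPU sysfs entries writable by the (non-root) login user.
    - sudo chmod -R 777 /sys/devices/system/cpu/*
    # Take every sibling hardware thread (second and later entries of each thread_siblings_list) offline.
    - for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do echo 0 > /sys/devices/system/cpu/cpu$cpunum/online; done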

This lets any number of tasks run simultaneously on all the real (physical) CPUs. Of course, it also means I have to run twice as many workers.
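For what it's worth, the permission error in the question comes from how the shell handles `sudo echo 0 > file`: the redirection is performed by the calling, non-root shell, so only `echo` runs as root. A possible alternative to opening up permissions with chmod is to write through `sudo tee`. The following is only a sketch, not something tested on a Ray cluster: the function names `list_sibling_cpus` and `disable_sibling_cpus` and the optional root-directory argument are made up for illustration, and it assumes the login user has passwordless sudo and that `thread_siblings_list` uses the comma-separated form (e.g. `0,24`) rather than a range.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the optional directory argument are hypothetical.

# Print the sibling hardware threads (second and later entries of each
# thread_siblings_list), one CPU number per line, sorted and de-duplicated.
list_sibling_cpus() {
    local cpu_root="${1:-/sys/devices/system/cpu}"
    # Assumes the comma-separated form, e.g. "0,24"; lines without a comma are skipped.
    cat "$cpu_root"/cpu*/topology/thread_siblings_list \
        | cut -s -d, -f2- | tr ',' '\n' | sort -un
}

# Take each sibling thread offline. `sudo tee` opens the file as root,
# which `sudo echo 0 > file` does not, because the redirection belongs
# to the calling shell.
disable_sibling_cpus() {
    local cpu_root="${1:-/sys/devices/system/cpu}"
    local cpunum
    for cpunum in $(list_sibling_cpus "$cpu_root"); do
        echo 0 | sudo tee "$cpu_root/cpu$cpunum/online" > /dev/null
    done
}
```

Calling `list_sibling_cpus` on its own is a harmless way to see which CPUs would be taken offline before running `disable_sibling_cpus`. Afterwards, `lscpu` should list those CPUs under "Off-line CPU(s) list".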

