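slurm - SLURM fails to start and the node is in "IDLE+DRAINED" state due to low RealMemory
Problem description
I run SLURM as a task manager in a single-node configuration (just one desktop). A few days ago I started running into problems: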
- Neither slurmd nor slurmctld starts at system boot. Here is the output of systemctl status slurmd slurmctld:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-08-31 11:39:02 CEST; 4min 44s ago
Docs: man:slurmd(8)
Process: 906 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 906 (code=exited, status=1/FAILURE)
aug 31 11:39:01 station systemd[1]: Started Slurm node daemon.
aug 31 11:39:02 station systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
aug 31 11:39:02 station systemd[1]: slurmd.service: Failed with result 'exit-code'.
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-08-31 11:39:02 CEST; 4min 44s ago
Docs: man:slurmctld(8)
Process: 905 ExecStart=/usr/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 905 (code=exited, status=1/FAILURE)
aug 31 11:39:01 station systemd[1]: Started Slurm controller daemon.
aug 31 11:39:02 station systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
aug 31 11:39:02 station systemd[1]: slurmctld.service: Failed with result 'exit-code'.
- After a manual restart with systemctl restart slurmctld slurmd both processes start, but no jobs can run because the node is drained with the reason "Low RealMemory". Here is the output of scontrol show node station:
NodeName=station Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUTot=4 CPULoad=0.05
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=station NodeHostName=station Version=20.11.4
OS=Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021
RealMemory=15903 AllocMem=0 FreeMem=14415 Sockets=1 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2021-08-31T11:38:52 SlurmdStartTime=2021-08-31T11:51:28
CfgTRES=cpu=4,mem=15903M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2021-08-31T11:37:07]
Comment=(null)
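To read the relevant fields out of output like the above without scanning it by eye, here is a minimal Python sketch. It only parses text; it does not call Slurm. The names parse_scontrol_node and is_drained are hypothetical helpers introduced for illustration, and the splitting rule (a value runs until the next " Key=" token) is a heuristic that happens to fit this output, not a documented format guarantee.

```python
import re

# Excerpt of the `scontrol show node station` output quoted above.
SAMPLE = """\
NodeName=station Arch=x86_64 CoresPerSocket=4
RealMemory=15903 AllocMem=0 FreeMem=14415 Sockets=1 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
AllocTRES=
Reason=Low RealMemory [slurm@2021-08-31T11:37:07]
"""

# Heuristic: a value runs until the next " Key=" token or the end of the line,
# so values that contain spaces (such as Reason) are kept whole.
_FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_scontrol_node(text):
    """Collect Key=Value pairs from scontrol-style output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        for key, value in _FIELD.findall(line.strip()):
            fields[key] = value
    return fields


def is_drained(fields):
    """Return True if the State field carries the DRAIN flag (e.g. IDLE+DRAIN)."""
    return "DRAIN" in fields.get("State", "").split("+")


info = parse_scontrol_node(SAMPLE)
print(info["State"], "|", info["Reason"])
print("drained:", is_drained(info))
```

In practice the text would come from running scontrol show node station and capturing its output; the embedded SAMPLE just keeps the sketch self-contained.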
But this makes no sense at all, because in /etc/slurm-llnl/slurm.conf the node parameters are specified as:
NodeName=station CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=14000
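since it is recommended to set RealMemory lower than the amount of memory actually available. I also tried lowering this parameter to 10000 or 12000 in the config file, with no effect.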
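One detail worth making explicit: the slurm.conf line sets RealMemory=14000, while the scontrol show node output quoted earlier in the question reports RealMemory=15903. The Python sketch below simply parses the NodeName line and compares the two numbers. The helper name parse_node_line is hypothetical, and the sketch does not identify a cause. A mismatch may suggest that the running daemons are not using the values in this file, but that is an assumption, not something the question confirms.

```python
def parse_node_line(line):
    """Split a slurm.conf NodeName=... line into a dict of its Key=Value tokens."""
    return dict(token.split("=", 1) for token in line.split())


# NodeName line from /etc/slurm-llnl/slurm.conf, as quoted in the question.
CONF_LINE = (
    "NodeName=station CPUs=4 Boards=1 SocketsPerBoard=1 "
    "CoresPerSocket=4 ThreadsPerCore=1 RealMemory=14000"
)

# RealMemory (in MB) as reported by `scontrol show node station` in the question.
REPORTED_MB = 15903

conf = parse_node_line(CONF_LINE)
configured_mb = int(conf["RealMemory"])
delta_mb = REPORTED_MB - configured_mb

if delta_mb != 0:
    print(f"slurm.conf says {configured_mb} MB, scontrol reports {REPORTED_MB} MB "
          f"(difference: {delta_mb} MB)")
else:
    print("configured and reported RealMemory agree")
```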
Attempts to fix it with scontrol update nodename=station state=resume, or manually with
scontrol update nodename=station state=down reason="undraining"
scontrol update nodename=station state=resume
were also unsuccessful.
Please help me resolve this.
- OS: Ubuntu x64 21.04 with all current updates
- SLURM version: 20.11.4
Solution