SLURM fails to start and the node is stuck in "IDLE+DRAINED" state because of low RealMemory

Problem description

I run SLURM as a task manager in a single-node configuration (just one desktop machine). A few days ago I started running into problems:

  1. Neither slurmd nor slurmctld starts at system boot. This is the output of systemctl status slurmd slurmctld:
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2021-08-31 11:39:02 CEST; 4min 44s ago
       Docs: man:slurmd(8)
    Process: 906 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 906 (code=exited, status=1/FAILURE)

aug 31 11:39:01 station systemd[1]: Started Slurm node daemon.
aug 31 11:39:02 station systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
aug 31 11:39:02 station systemd[1]: slurmd.service: Failed with result 'exit-code'.

● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2021-08-31 11:39:02 CEST; 4min 44s ago
       Docs: man:slurmctld(8)
    Process: 905 ExecStart=/usr/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 905 (code=exited, status=1/FAILURE)

aug 31 11:39:01 station systemd[1]: Started Slurm controller daemon.
aug 31 11:39:02 station systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
aug 31 11:39:02 station systemd[1]: slurmctld.service: Failed with result 'exit-code'.
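  2. After a manual restart with systemctl restart slurmctld slurmd, both processes start, but nothing can be run because the node is drained with the reason "Low RealMemory". This is the output of scontrol show node station: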
NodeName=station Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.05
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=station NodeHostName=station Version=20.11.4
   OS=Linux 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021
   RealMemory=15903 AllocMem=0 FreeMem=14415 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2021-08-31T11:38:52 SlurmdStartTime=2021-08-31T11:51:28
   CfgTRES=cpu=4,mem=15903M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2021-08-31T11:37:07]
   Comment=(null)
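
The two fields that matter in the output above are State and Reason. As a minimal sketch, they can be pulled out with standard text tools. The sample lines are copied from the output above, and the variable names are illustrative only; on a live system the output of scontrol show node station would be piped in instead.

```shell
# Sample lines copied from the scontrol output quoted in the question.
node_info='   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Reason=Low RealMemory [slurm@2021-08-31T11:37:07]'

# State is a single token: split on spaces and strip the "State=" prefix.
state=$(printf '%s\n' "$node_info" | tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1)

# Reason may contain spaces: take everything up to the trailing " [user@time]" stamp.
reason=$(printf '%s\n' "$node_info" | sed -n 's/^ *Reason=\(.*\) \[.*$/\1/p' | head -n 1)

echo "state=$state reason=$reason"
```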

But this makes no sense at all, because the node is specified in /etc/slurm-llnl/slurm.conf as:

NodeName=station CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=14000

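I set RealMemory below the physical amount because that is what is recommended. I also tried lowering this parameter in the configuration file to 10000 or 12000, with no effect.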
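
One detail worth noting is that the RealMemory value reported by scontrol (15903) differs from the one in the quoted slurm.conf line (14000). The following is a minimal sketch of how that comparison could be scripted. It uses the two lines from the question as sample input, and real_memory is a hypothetical helper name. It only detects the mismatch and does not establish its cause.

```shell
# Lines copied from the question: the node definition in slurm.conf and the
# memory line printed by scontrol show node station.
conf_line='NodeName=station CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=14000'
node_line='   RealMemory=15903 AllocMem=0 FreeMem=14415 Sockets=1 Boards=1'

# real_memory (hypothetical helper): print the first RealMemory value read from stdin.
real_memory() {
    tr ' ' '\n' | sed -n 's/^RealMemory=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

configured=$(printf '%s\n' "$conf_line" | real_memory)
reported=$(printf '%s\n' "$node_line" | real_memory)

# RealMemory is expressed in megabytes in both places.
if [ "$configured" -ne "$reported" ]; then
    echo "mismatch: slurm.conf has $configured MB, controller reports $reported MB"
else
    echo "match: $configured MB"
fi
```

With the values from the question this prints the mismatch branch, which may suggest that the running controller was not using the edited value, although the question does not confirm this.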

Trying to fix it with scontrol update nodename=station state=resume, or manually with

scontrol update nodename=station state=down reason="undraining"
scontrol update nodename=station state=resume

did not work either.

Any help would be appreciated.

Tags: slurm

Solution
