slurmctld purges files for "defunct batch JobId" on restart

Problem description

My slurmctld does not preserve the jobs that are in the queue when it exits (via Ctrl+C).

I give it about 1000 jobs, exit (Ctrl+C), and then on restart it declares every job (754 in this example) defunct and purges its files:

slurmctld: Purged files for defunct batch JobId=754

This is the standard output on exit:

slurmctld: _job_complete: JobId=22 WEXITSTATUS 0
slurmctld: _job_complete: JobId=22 done
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: Saving all slurm state
slurmctld: layouts: all layouts are now unloaded.

This is the standard output when the service is restarted:

jonathan@jonathan-ubuntudesktop:~$ sudo slurmctld -Dcv
slurmctld: slurmctld version 18.08.3 started on cluster jonathan-inspiron-13-7378
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Purged files for defunct batch JobId=1183
slurmctld: Purged files for defunct batch JobId=1023
...
slurmctld: Purged files for defunct batch JobId=1384
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: job_complete: invalid JobId=986
slurmctld: job_complete: invalid JobId=988
slurmctld: job_complete: invalid JobId=989
slurmctld: job_complete: invalid JobId=987

slurm.conf:

ControlAddr=192.168.1.2
AuthType=auth/munge
CryptoType=crypto/munge
MaxJobCount=1000000
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state/slurmd
SlurmUser=jonathan
SlurmdUser=jonathan
StateSaveLocation=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state
SwitchType=switch/none
TaskPlugin=task/none
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SchedulerPort=7321
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=jonathan-Inspiron-13-7378
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
NodeName=jonathan-Inspiron-13-7378 NodeAddr=192.168.1.4 CPUs=4 State=UNKNOWN
PartitionName=Grid0 Nodes=jonathan-Inspiron-13-7378 Default=YES MaxTime=INFINITE State=UP

"/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state" is owned by jonathan:jonathan and has permissions 750.

The Slurm-18.08.3 install was just the basic ./configure, make, and make install.

What exactly am I doing wrong? Any help is much appreciated!

Tags: ubuntu-18.04, slurm

Solution


I'm an idiot. I blindly followed the commands from a tutorial instead of reading what each flag does.

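The problem was caused by the -c flag, which tells slurmctld to clear all previously saved state (including queued and running jobs) at startup. So I need to run "slurmctld -Dv" instead of "slurmctld -Dcv", in case anyone else runs into this...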
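To guard against repeating this mistake, here is a minimal sketch of a wrapper that refuses to start the controller when -c is present. The function names has_clear_flag and start_ctld are hypothetical and not part of Slurm. The check is deliberately crude: it only looks for the letter c in any short-option cluster, so an option value attached directly to its flag (for example a config path glued to -f) could trigger a false positive.

```shell
# Hypothetical helper (not part of Slurm): succeeds if any short-option
# cluster contains 'c' (e.g. -c or -Dcv), which asks slurmctld to
# discard its saved state at startup.
has_clear_flag() {
    for arg in "$@"; do
        case "$arg" in
            --*) ;;             # long options are ignored in this sketch
            -*c*) return 0 ;;   # short-option cluster containing 'c'
        esac
    done
    return 1
}

# Hypothetical wrapper: only start the controller if saved state is kept.
start_ctld() {
    if has_clear_flag "$@"; then
        echo "refusing to start: -c would clear saved job state" >&2
        return 1
    fi
    sudo slurmctld "$@"
}
```

With this in place, "start_ctld -Dv" would launch the controller in the foreground as before, while "start_ctld -Dcv" would stop with an error before slurmctld runs.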

Cheers!

