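ubuntu-18.04 - Slurmctld purges files for "defunct batch JobId" on restart
Problem description
My slurmctld does not preserve the jobs in its queue when I exit it (with Ctrl+C).
I submit about 1000 jobs, exit (Ctrl+C), and on restart it declares every job (754 in this example) defunct and purges it: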
slurmctld: Purged files for defunct batch JobId=754
This is the stdout on exit:
slurmctld: _job_complete: JobId=22 WEXITSTATUS 0
slurmctld: _job_complete: JobId=22 done
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: Saving all slurm state
slurmctld: layouts: all layouts are now unloaded.
This is the stdout when restarting the service:
jonathan@jonathan-ubuntudesktop:~$ sudo slurmctld -Dcv
slurmctld: slurmctld version 18.08.3 started on cluster jonathan-inspiron-13-7378
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: preempt/none loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Purged files for defunct batch JobId=1183
slurmctld: Purged files for defunct batch JobId=1023
...
slurmctld: Purged files for defunct batch JobId=1384
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
slurmctld: mcs: MCSParameters = (null). ondemand set.
slurmctld: job_complete: invalid JobId=986
slurmctld: job_complete: invalid JobId=988
slurmctld: job_complete: invalid JobId=989
slurmctld: job_complete: invalid JobId=987
slurm.conf:
ControlAddr=192.168.1.2
AuthType=auth/munge
CryptoType=crypto/munge
MaxJobCount=1000000
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/home/jonathan/Documents/COMPANYNAME/slurmctl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state/slurmd
SlurmUser=jonathan
SlurmdUser=jonathan
StateSaveLocation=/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state
SwitchType=switch/none
TaskPlugin=task/none
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SchedulerPort=7321
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=jonathan-Inspiron-13-7378
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
NodeName=jonathan-Inspiron-13-7378 NodeAddr=192.168.1.4 CPUs=4 State=UNKNOWN
PartitionName=Grid0 Nodes=jonathan-Inspiron-13-7378 Default=YES MaxTime=INFINITE State=UP
"/home/jonathan/Documents/COMPANYNAME/slurmctl/save_state" is owned by jonathan:jonathan with permissions 750.
The Slurm 18.08.3 install was just a basic ./configure, make, and make install.
What exactly am I doing wrong? Any help is much appreciated!
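Solution
I'm an idiot. I blindly followed the commands in a tutorial instead of reading what each flag does.
The problem was caused by the -c flag, which tells slurmctld to clear all previous state from its last checkpoint. So I needed to run "slurmctld -Dv" instead of "slurmctld -Dcv", in case anyone else runs into this...
Cheers!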
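To make the mistake harder to repeat, here is a minimal sketch of a guard around the start command. The function names `has_clear_flag` and `start_slurmctld` are hypothetical helpers, not part of Slurm; the only Slurm behavior assumed is that `-c` clears saved state, as described above. The check is deliberately naive: it treats any short-option cluster containing `c` as the clear flag, so an option value attached directly to its flag (for example a path glued to `-f`) could trigger a false positive.

```shell
#!/bin/sh

# Hypothetical helper (not part of Slurm): succeeds if any short-option
# cluster in the arguments contains "c", e.g. "-c" or "-Dcv".
has_clear_flag() {
    for arg in "$@"; do
        case "$arg" in
            --*) ;;              # skip long options
            -*c*) return 0 ;;    # short-option cluster containing "c"
        esac
    done
    return 1
}

# Hypothetical wrapper: refuses to launch slurmctld when -c is present,
# otherwise passes all arguments through unchanged.
start_slurmctld() {
    if has_clear_flag "$@"; then
        echo "refusing to start: -c would clear the saved job state" >&2
        return 1
    fi
    slurmctld "$@"
}
```

With these functions sourced, `start_slurmctld -Dv` runs the controller in the foreground as before, while `start_slurmctld -Dcv` stops with an error before slurmctld is launched.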