centos7 - 当我要在节点中开始 slurm 时出错
问题描述
我正在准备一个带有个人机器的集群,我在服务器上安装了 Centos 7,我正在尝试启动 slurm 客户端,但是当我输入以下命令时:
pdsh -w n[00-09] systemctl start slurmd
我有这个错误:
n07: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.
pdsh@localhost: n07: ssh exited with exit code 1
我有所有节点的消息。
[root@localhost ~]# systemctl status slurmd.service -l
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-12-22 18:27:30 CST; 27min ago
Process: 1589 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=203/EXEC)
Dec 22 18:27:30 localhost.localdomain systemd[1]: Starting Slurm node daemon...
Dec 22 18:27:30 localhost.localdomain systemd[1]: slurmd.service: control process exited, code=exited status=203
Dec 22 18:27:30 localhost.localdomain systemd[1]: Failed to start Slurm node daemon.
Dec 22 18:27:30 localhost.localdomain systemd[1]: Unit slurmd.service entered failed state.
Dec 22 18:27:30 localhost.localdomain systemd[1]: slurmd.service failed.
这是 slurm.conf 文件:
ClusterName=linux
ControlMachine=localhost
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
#TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
#SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
#LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
#ACCOUNTING #JobAcctGatherType=jobacct_gather/linux #JobAcctGatherFrequency=30 # #AccountingStorageType=accounting_storage/slurmdbd #AccountingStorageHost= #AccountingStorageLoc= #AccountingStoragePass= #AccountingStorageUser= # #COMPUTE NODES # OpenHPC default configuration TaskPlugin=task/affinity PropagateResourceLimitsExcept=MEMLOCK AccountingStorageType=accounting_storage/filetxt Epilog=/etc/slurm/slurm.epilog.clean NodeName=n[00-09] Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=normal Nodes=n[00-09] Default=YES MaxTime=24:00:00 State=UP
ReturnToService=1
控制机器由主机名 -s 设置
解决方案
推荐阅读
- python - KV 语言矩形 self.size 不起作用
- javascript - 如何解决运行反应项目的问题?
- python - Ubuntu:selenium.common.exceptions:未创建会话:此版本的 ChromeDriver 仅支持 Chrome 版本 79
- asynchronous - 如何在 Dart 中一次处理列表 2 中的项目
- python - 选择不包括某些特定单词的单词 - 正则表达式
- web-hosting - 将外部邮件程序与 IMSCP/I-MSCP 邮件服务器一起使用
- mysql - 如何总结表中每个唯一的最后一个值
- javascript - 脚本中没有定义Javascript函数?
- html - 在 Bootstrap 网格中对齐大小不等的图像
- python - Tkinter 滚动框架:扩展框架以适应一个方向的内容画布