Slurm Cluster Deployment

liu-shaobo 2020-07-11 22:21

Slurm is an open-source job scheduler for Linux and Unix, used by many of the world's supercomputers. Its main functions are:
1. Allocating compute-node resources to users for executing jobs;
2. Providing a framework for starting, executing, and monitoring jobs (typically parallel jobs) on a set of allocated nodes;
3. Arbitrating resource contention by managing a queue of pending jobs.

Slurm architecture (diagram from https://slurm.schedmd.com/quickstart.html)

PBS/Torque cluster deployment: https://www.cnblogs.com/liu-shaobo/p/13526084.html

 

I. Basic Environment

1. Hostnames and IP addresses
Control node: 192.168.1.11  m1
Compute node: 192.168.1.12  c1
Compute node: 192.168.1.13  c2

Set the hostname on each of the three nodes (run the matching command on the corresponding node):

# hostnamectl set-hostname m1
# hostnamectl set-hostname c1
# hostnamectl set-hostname c2

 

2. Host configuration

OS: CentOS 7.6 x86_64
CPU: 2 cores
Memory: 4 GB


3. Disable the firewall

# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables

 

4. Adjust resource limits

# cat /etc/security/limits.conf 
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited

 

5. Configure the time zone
Set the time zone to CST (Asia/Shanghai):

# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Install NTP and synchronize with an NTP server:

# yum install ntp -y
# ntpdate 210.72.145.44
# systemctl start ntpd
# systemctl enable ntpd

Install the EPEL repository:

# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm

 

6. Install NFS (control node)

# yum -y install nfs-utils rpcbind

Edit the /etc/exports file:

# cat /etc/exports
/software/ *(rw,async,insecure,no_root_squash)

Start the NFS services:

# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind

Mount the NFS share on the compute nodes:

# yum -y install nfs-utils
# mkdir /software
# mount 192.168.1.11:/software /software
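To make the mount persist across reboots, an entry like the following can be added to /etc/fstab on each compute node (a minimal sketch; adjust mount options to your environment):

# vim /etc/fstab
192.168.1.11:/software  /software  nfs  defaults,_netdev  0 0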

 

7. Configure passwordless SSH (on the control node)

# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2
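A quick check that passwordless login works from the control node (assuming the keys above were copied successfully):

# ssh c1 hostname
# ssh c2 hostname

Each command should print the remote hostname without prompting for a password.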

 

 

II. Configuring Munge

1. Create the munge user
The munge user must have the same UID and GID on the master node and on every compute node, and Munge must be installed on all nodes.

# groupadd -g 1108 munge
# useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
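A minimal check (a sketch, assuming the passwordless SSH configured earlier) that the UID and GID really match once the user exists on every node:

# id munge
# ssh c1 id munge
# ssh c2 id munge

All three commands should report uid=1108 and gid=1108.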


2. Seed the entropy pool

# yum install -y rng-tools

Use /dev/urandom as the entropy source:

# rngd -r /dev/urandom
# vim /usr/lib/systemd/system/rngd.service
Modify the service section as follows:
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom

# systemctl daemon-reload
# systemctl start rngd
# systemctl enable rngd

 

3. Deploy Munge

Munge is an authentication service that verifies the UID and GID of processes on local or remote hosts.

# yum install munge munge-libs munge-devel -y

 

Create the global key
Create the shared key on the master node (either command below generates /etc/munge/munge.key; the dd command simply overwrites the key produced by create-munge-key):

# /usr/sbin/create-munge-key -r
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

 

Copy the key to all compute nodes; the ownership and permissions set below must also be applied on every compute node (see the sketch after this block):

# scp -p /etc/munge/munge.key root@c1:/etc/munge
# scp -p /etc/munge/munge.key root@c2:/etc/munge
# chown munge: /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key
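A sketch that applies and verifies the same ownership and permissions on the compute nodes remotely (assuming the passwordless SSH configured earlier):

# for h in c1 c2; do ssh $h "chown munge: /etc/munge/munge.key; chmod 400 /etc/munge/munge.key"; done
# md5sum /etc/munge/munge.key
# ssh c1 md5sum /etc/munge/munge.key
# ssh c2 md5sum /etc/munge/munge.key

The checksums must be identical on every node.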

 

Start Munge on all nodes:

# systemctl start munge
# systemctl enable munge

 

4. Test the Munge service
Verify the connection between each compute node and the control node.

Generate a credential locally:

# munge -n

Decode it locally:

# munge -n | unmunge

Verify a compute node by decoding the credential remotely:

# munge -n | ssh c1 unmunge

Benchmark Munge credential throughput:

# remunge

 

 

III. Configuring Slurm

1. Create the slurm user

# groupadd -g 1109 slurm
# useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

 

2. Install the Slurm build dependencies

# yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel -y

 

Build the Slurm RPMs:

# wget https://download.schedmd.com/slurm/slurm-19.05.7.tar.bz2 
# rpmbuild -ta slurm-19.05.7.tar.bz2 
# cd /root/rpmbuild/RPMS/x86_64/

 

Install Slurm on all nodes:

# yum localinstall slurm-*
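The compute nodes need the same RPMs. One way to distribute them is through the /software NFS export created earlier (a sketch; the paths are those used above):

# cp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm /software/
# ssh c1 "yum -y localinstall /software/slurm-*.rpm"
# ssh c2 "yum -y localinstall /software/slurm-*.rpm"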

 

3. Configure Slurm on the control node

# cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
# vim /etc/slurm/slurm.conf
## Modify the following entries
ControlMachine=m1
ControlAddr=192.168.1.11
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=192.168.1.11
AccountingStoragePort=6819
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=c[1-2] RealMemory=3400 Sockets=1 CoresPerSocket=2 State=IDLE
PartitionName=all Nodes=c[1-2] Default=YES State=UP
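If you are unsure which hardware values to put on the NodeName line, slurmd can report what it detects; run it on a compute node and paste the output into slurm.conf (RealMemory above is deliberately set a little below the 4 GB of the test VMs):

# slurmd -C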

 

Copy the configuration files from the control node to the compute nodes:

# scp /etc/slurm/*.conf c1:/etc/slurm/
# scp /etc/slurm/*.conf c2:/etc/slurm/

 

4. Create the spool and log directories on the control and compute nodes and set their ownership:

# mkdir /var/spool/slurm
# chown slurm: /var/spool/slurm
# mkdir /var/log/slurm
# chown slurm: /var/log/slurm

 

5. Configure Slurm accounting on the control node
Accounting records capture information about jobs and job steps. They can be written to a plain text file or to a database; since a text file grows without bound, the simplest practical option is to store the records in MySQL.

Create the Slurm database user (install MySQL yourself beforehand):

mysql> grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'slurm*456' with grant option;
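A quick sketch to confirm the database account works before slurmdbd is started (the password matches the GRANT statement above; adjust the host to where MySQL actually listens):

# mysql -h 192.168.1.11 -u slurm -p'slurm*456' -e "show grants;"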

 

Configure the slurmdbd.conf file:

# cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=192.168.1.11
DbdHost=m1
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=mysql_ip
StorageUser=slurm
StoragePass=slurm*456
StorageLoc=slurm_acct_db

 

6. Start the services

Start the slurmdbd service on the control node:

# systemctl start slurmdbd
# systemctl status slurmdbd
# systemctl enable slurmdbd
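With slurmdbd running, the cluster normally has to be registered in the accounting database before job records are stored. A minimal sketch, where the name must match the ClusterName value in slurm.conf (shown here with the example default "cluster"; substitute your own):

# sacctmgr -i add cluster cluster
# sacctmgr show cluster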

 

Start the slurmctld service on the control node:

# systemctl start slurmctld
# systemctl status slurmctld
# systemctl enable slurmctld

 

Start the slurmd service on the compute nodes:

# systemctl start slurmd
# systemctl status slurmd
# systemctl enable slurmd

 

 

IV. Checking the Slurm Cluster

View the cluster:

# sinfo
# scontrol show partition
# scontrol show node

Submit a job:

# srun -N2 hostname
# scontrol show jobs

View the job queue:

# squeue -a
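Besides srun, jobs are usually submitted with a batch script. A minimal sketch using the "all" partition defined above (job name and output file are arbitrary):

# cat test.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=all
#SBATCH --nodes=2
#SBATCH --output=test_%j.out
srun hostname
# sbatch test.sh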

 
