Slurm Installation Guide


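Slurm is a cluster resource management and job scheduling system, and it is well suited to managing and scheduling deep learning clusters.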

Clean up existing software

Uninstall any previously installed packages

yum remove -y 'munge*'
yum remove -y 'slurm*'

Remove manually created files and directories

  • Log files
  • Directories and files under /var/spool/
    Check file ownership carefully.

Install the MariaDB database on the master node

yum install mariadb-server mariadb-devel -y

I. Install MUNGE

1. Create the munge and slurm users and groups, with fixed UIDs and GIDs, on the master node

export MUNGEUSER=1050
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SlurmUSER=1051
groupadd -g $SlurmUSER slurm
useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SlurmUSER -g slurm -s /bin/bash slurm

2. Create the same users and groups, with identical UIDs and GIDs, on all compute nodes
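
Mismatched IDs between nodes tend to show up later as permission or authentication errors, so it can help to check each account right after creating it. The following is a minimal sketch, assuming the UIDs 1050 and 1051 chosen in step 1; the helper name `uid_matches` is hypothetical and not part of MUNGE or Slurm.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if USER exists and has the expected numeric UID.
uid_matches() {
    user=$1
    expected=$2
    actual=$(id -u "$user" 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}

# Run on each node after creating the accounts (IDs taken from step 1).
for pair in munge:1050 slurm:1051; do
    user=${pair%%:*}
    uid=${pair##*:}
    if uid_matches "$user" "$uid"; then
        echo "$user: UID $uid OK"
    else
        echo "$user: missing or UID differs from $uid"
    fi
done
```

The same check could be extended to GIDs with `id -g`.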

Install the MUNGE packages

1. First install the latest epel-release RPM

yum install epel-release

2. Install the MUNGE RPM packages

yum install munge munge-libs munge-devel -y

3. Check the available encryption methods

munge -C    # list the supported ciphers
munge -M    # list the supported MAC algorithms

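4. On the master node, create the secret key needed by all nodes.

yum install rng-tools -y
/usr/sbin/create-munge-key -r

Alternatively, generate the key from /dev/urandom and set its ownership and permissions by hand (this overwrites any existing key):

dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key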
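
Before distributing the key, it may be worth confirming that it is non-empty and not readable by other users. A minimal sketch, assuming GNU `stat` (as on CentOS) and that it is run as root; the helper name `key_is_private` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: check that a key file is non-empty and accessible only by its owner.
# Assumes GNU coreutils `stat` (the -c option is not available on BSD/macOS).
key_is_private() {
    key=$1
    [ -s "$key" ] || return 1                 # must exist and be non-empty
    mode=$(stat -c '%a' "$key") || return 1   # octal permission bits, e.g. 400
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

if key_is_private /etc/munge/munge.key; then
    echo "munge.key permissions look fine"
else
    echo "munge.key is missing, empty, or accessible by other users"
fi
```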

5. Copy /etc/munge/munge.key to the other nodes

export NODE=172.16.10.18   # example node address; not used by the commands below
scp /etc/munge/munge.key root@login:/etc/munge
scp /etc/munge/munge.key root@node1:/etc/munge
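
With more than a couple of nodes, the copy is easier to manage as a loop. The sketch below assumes root ssh access to each node; the node list is a placeholder taken from the host names above, and the `DRY_RUN` switch and `copy_key` helper are hypothetical conveniences, not part of MUNGE.

```shell
#!/bin/sh
# Node names are placeholders taken from the scp example; adjust to your cluster.
NODES="login node1"

# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

copy_key() {
    node=$1
    if [ "$DRY_RUN" = "1" ]; then
        echo "scp -p /etc/munge/munge.key root@$node:/etc/munge/"
    else
        # -p preserves the file mode and timestamps on the copy
        scp -p /etc/munge/munge.key "root@$node:/etc/munge/"
    fi
}

for node in $NODES; do
    copy_key "$node"
done
```

Ownership is not carried over by `scp`, so the ownership step that follows is still needed on every node.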

6. Set ownership and permissions on all nodes:

chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/

7. Enable and start munge on all nodes:

systemctl enable munge
systemctl start munge

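8. Test the installation

munge -n                          # generate a credential
munge -n | unmunge                # decode a credential locally and display its details
munge -n | ssh somehost unmunge   # check that a credential can be decoded on another node
remunge                           # run a quick benchmark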


II. Install Slurm

1. First install the prerequisite packages:

yum install rpm-build gcc openssl openssl-devel pam-devel numactl numactl-devel \
hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel \
gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker

2. Download the latest Slurm release into the NFS-shared directory on the storage node

cd /gensoft/slurm-rpms
export VER=17.02.0
wget http://www.schedmd.com/download/latest/slurm-17.02.0.tar.bz2

3. Build the RPMs and install them on all nodes

rpmbuild -ta slurm-$VER.tar.bz2
cd /root/rpmbuild/RPMS/x86_64
yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-munge-$VER*rpm \
slurm-perlapi-$VER*rpm slurm-plugins-$VER*rpm slurm-torque-$VER*rpm \
slurm-seff-$VER*rpm
# OR install every Slurm RPM that was built
yum install slurm*rpm

4. Configure Slurm

Open http://slurm.schedmd.com/configurator.html, fill in the form, and download the generated file.

Copy the configuration file to /etc/slurm

cd /etc/slurm
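
For orientation, the generated file might reduce to something like the sketch below. This is an illustrative skeleton under stated assumptions, not the configurator's actual output: the cluster name and the host names `master` and `node1` are placeholders, `ControlMachine` is the parameter name used by the 17.02 series (later releases prefer `SlurmctldHost`), and the paths and ports are the ones used elsewhere in this guide. A real configuration will contain more settings.

```shell
#!/bin/sh
# Write a minimal slurm.conf skeleton to the given path.
# "cluster", "master" and "node1" are placeholders, not values from a real cluster.
write_slurm_conf() {
    cat > "$1" <<'EOF'
ClusterName=cluster
ControlMachine=master
SlurmUser=slurm
AuthType=auth/munge
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=node1 State=UNKNOWN
PartitionName=debug Nodes=node1 Default=YES MaxTime=INFINITE State=UP
EOF
}

# Write to a scratch file and review it before copying it to /etc/slurm/slurm.conf.
write_slurm_conf "${TMPDIR:-/tmp}/slurm.conf.example"
```

The `NodeName` line should be filled in with the hardware values that `slurmd -C` reports on each compute node.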

5. Copy the configuration file to the other nodes

scp slurm.conf root@node1.com:/etc/slurm/slurm.conf
scp slurm.conf root@logo.com:/etc/slurm/slurm.conf

6. Configure the master node

chown slurm:slurm /var/spool
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log

7. Configure the other nodes

mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log

8. Check that the configuration on the master node is correct

slurmd -C

9. Output like the following indicates that the configuration is correct

ClusterName=(null) NodeName=buhpc3 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7822 TmpDisk=45753
UpTime=13-14:27:52
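
The first line of this output is close to what a node definition in slurm.conf looks like. As a rough sketch, the hypothetical helper `node_line` below drops the leading `ClusterName=` field and the `UpTime` line and appends `State=UNKNOWN`; review the result before pasting it into the configuration.

```shell
#!/bin/sh
# Hypothetical helper: read `slurmd -C` output on stdin and print a node
# definition line, dropping any leading "ClusterName=..." field.
node_line() {
    grep -o 'NodeName=.*' | sed 's/$/ State=UNKNOWN/'
}

# Example using the sample output shown above.
printf '%s\n' \
  'ClusterName=(null) NodeName=buhpc3 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7822 TmpDisk=45753' \
  'UpTime=13-14:27:52' | node_line
```

On a real node the equivalent call would be `slurmd -C | node_line`.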

III. Start Slurm

1. Disable the firewall on all compute nodes

systemctl stop firewalld
systemctl disable firewalld

2. On the master node, open the default ports used by Slurm

firewall-cmd --permanent --zone=public --add-port=6817/udp
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --permanent --zone=public --add-port=6818/tcp
firewall-cmd --permanent --zone=public --add-port=7321/tcp
firewall-cmd --reload

3. If the nodes still cannot communicate after the firewall changes, check that the clocks are synchronized on all nodes

yum install ntp -y
chkconfig ntpd on
ntpdate pool.ntp.org
systemctl start ntpd
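
MUNGE credentials carry a timestamp, so a large clock difference between nodes can make authentication fail even when everything else is set up correctly. The sketch below compares the local clock with each remote clock; it assumes passwordless ssh, and the `CHECK_NODES` variable and `abs_diff` helper are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: absolute difference between two epoch timestamps, in seconds.
abs_diff() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# CHECK_NODES is empty by default, so the loop does nothing until it is set,
# e.g. CHECK_NODES="login node1". Assumes passwordless ssh from this host.
for node in ${CHECK_NODES:-}; do
    remote=$(ssh "$node" date +%s) || continue
    echo "$node: clock differs by $(abs_diff "$(date +%s)" "$remote")s"
done
```

The reported value includes the ssh round-trip time, so differences of a second or two are not meaningful.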

4. Once the clocks are synchronized, start slurmd on all compute nodes:

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service

5. Start slurmctld on the master node

systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service

6. Check the cluster status

sinfo

7. If something goes wrong, check the logs

tail /var/log/slurmd.log      # problems on a compute node
tail /var/log/slurmctld.log   # problems on the master (server) node

8. If a node is DOWN, return it to the IDLE state

scontrol update NodeName=node1 State=DOWN Reason="undraining"
scontrol update NodeName=node1 State=RESUME
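
When several nodes are down, the node names can be pulled out of `sinfo` instead of being typed by hand. The following is a minimal sketch: the helper name `down_nodes` is hypothetical, and the commented pipeline assumes a running cluster and uses the `sinfo` options `-h` (no header), `-N` (one line per node) and `-o '%N %T'` (node name and state).

```shell
#!/bin/sh
# Hypothetical helper: read "node state" pairs on stdin and print the nodes whose
# state starts with "down" (sinfo may append flags such as "*" to the state).
down_nodes() {
    awk '$2 ~ /^down/ { print $1 }'
}

# On a running cluster (not executed here):
#   sinfo -h -N -o '%N %T' | down_nodes | while read -r n; do
#       scontrol update NodeName="$n" State=RESUME
#   done

# Example with made-up input:
printf 'node1 idle\nnode2 down*\nnode3 down\n' | down_nodes
```

A node that belongs to several partitions appears once per partition in `sinfo -N` output, so the list may need `sort -u` before use.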

References

http://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/
http://blog.csdn.net/datuqiqi/article/details/50827040
http://blog.csdn.net/kongxx/article/details/48173829
http://wildflower.diablonet.net/~scaron/slurmsetup.html
https://slurm.schedmd.com/slurm_ug_2011/Basic_Configuration_Usage.pdf