[TOC]
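SLURM is cluster resource management and job scheduling software, and it is well suited to managing and scheduling deep learning clusters.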
Clean up existing software
Uninstall any previously installed packages

    yum remove -y 'munge*'
    yum remove -y 'slurm*'
Remove files and directories you created yourself
- log files
- directories and files under /var/spool/
Be sure to check the ownership of the files
Install the MariaDB database on the master node

    yum install mariadb-server mariadb-devel -y
Install MUNGE
1. Create the UIDs and GIDs on the master node

    export MUNGEUSER=1050
    groupadd -g $MUNGEUSER munge
    useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
    export SlurmUSER=1051
    groupadd -g $SlurmUSER slurm
    useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SlurmUSER -g slurm -s /bin/bash slurm
2. Create the same UIDs and GIDs on all compute nodes
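The account creation from step 1 can be repeated on every node with a small loop. This is a minimal sketch, not a tested deployment script: the host list `NODES` and the `DRY_RUN` switch are assumptions introduced here for illustration, and root SSH access to each node is assumed.

```shell
# Sketch: create the munge and slurm accounts with fixed IDs on each node.
# NODES is a hypothetical host list; the IDs match the master-node step.
NODES="${NODES:-login node1}"
MUNGEUSER=1050
SlurmUSER=1051

# Print the account-creation commands to run on one host.
account_cmds() {
  cat <<EOF
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
groupadd -g $SlurmUSER slurm
useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SlurmUSER -g slurm -s /bin/bash slurm
EOF
}

# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
# With DRY_RUN=0 they are piped to a root shell on each node over ssh.
create_accounts() {
  local node
  for node in $NODES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "# on $node:"
      account_cmds
    else
      account_cmds | ssh "root@$node" bash -s
    fi
  done
}
```

Running `create_accounts` first shows what would be executed; `DRY_RUN=0 create_accounts` applies it.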
Proceed with the MUNGE installation
1. First install the latest epel-release RPM

    yum install epel-release
2. Install the MUNGE RPM packages

    yum install munge munge-libs munge-devel -y
3. Check the encryption method
4. On the master node, create the key that all nodes will need.

    yum install rng-tools -y
    /usr/sbin/create-munge-key -r

    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
    chown munge: /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
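Before distributing the key, it can be worth confirming that the file looks the way the commands above intend. The helper below is a hypothetical sketch (the function name `check_munge_key` is not part of MUNGE); it assumes GNU `stat`, as shipped on CentOS, and only checks that the file exists, is non-empty, and has mode 400.

```shell
# check_munge_key FILE
# Succeeds only if FILE exists, is non-empty, and has mode 400.
# The owner is reported but not enforced in this sketch.
check_munge_key() {
  local key="$1" mode size owner
  [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
  mode=$(stat -c '%a' "$key")
  size=$(stat -c '%s' "$key")
  owner=$(stat -c '%U' "$key")
  [ "$size" -gt 0 ] || { echo "empty key file: $key" >&2; return 1; }
  [ "$mode" = "400" ] || { echo "unexpected mode $mode on $key" >&2; return 1; }
  echo "ok: $key ($size bytes, mode $mode, owner $owner)"
}
```

For example, `check_munge_key /etc/munge/munge.key` on the master node should report 1024 bytes if the key was created with the `dd` command above.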
5. Copy /etc/munge/munge.key to the other nodes

    export NODE=172.16.10.18
    scp /etc/munge/munge.key root@login:/etc/munge
    scp /etc/munge/munge.key root@node1:/etc/munge
6. Set the permissions and ownership on all nodes:

    chown -R munge: /etc/munge/ /var/log/munge/
    chmod 0700 /etc/munge/ /var/log/munge/
7. Enable and start munge on all nodes:

    systemctl enable munge
    systemctl start munge
8. Test

    munge -n
    munge -n | unmunge          # Displays information about the MUNGE key
    munge -n | ssh somehost unmunge
    remunge
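The cross-node test can be repeated for every host in one pass. The following is a rough sketch under assumptions: `NODES` is a hypothetical host list and root SSH access is assumed. It encodes a credential locally and asks each remote node to decode it, which only succeeds when both sides share the same key.

```shell
# Sketch: verify that each node can decode a credential created locally.
# NODES is a hypothetical host list; adjust it to your cluster.
NODES="${NODES:-login node1}"

check_munge_nodes() {
  local node failed=0
  for node in $NODES; do
    # The pipeline's status is that of the remote unmunge.
    if munge -n | ssh "root@$node" unmunge >/dev/null 2>&1; then
      echo "$node: OK"
    else
      echo "$node: FAILED"
      failed=1
    fi
  done
  return "$failed"
}
```

`check_munge_nodes` returns non-zero if any node failed, so it can be used as a gate before installing Slurm.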
II. Install Slurm
1. First install the prerequisite packages:

    yum install rpm-build gcc openssl openssl-devel pam-devel numactl numactl-devel \
        hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel \
        gtk2-devel man2html libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
2. Download the latest Slurm release into the NFS directory on the storage node

    cd /gensoft/slurm-rpms
    export VER=17.02.0
    wget http://www.schedmd.com/download/latest/slurm-$VER.tar.bz2
3. Build and install on all nodes

    rpmbuild -ta slurm-$VER.tar.bz2
    cd /root/rpmbuild/RPMS/x86_64
    yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-munge-$VER*rpm \
        slurm-perlapi-$VER*rpm slurm-plugins-$VER*rpm slurm-torque-$VER*rpm \
        slurm-seff-$VER*rpm
    # OR
    yum install slurm*rpm
4. Configure Slurm
Go to http://slurm.schedmd.com/configurator.html, fill in the configuration form, and download the generated file when you are done
Copy the configuration file to /etc/slurm
5. Copy the configuration file to the other nodes

    scp slurm.conf root@node1.com:/etc/slurm/slurm.conf
    scp slurm.conf root@logo.com:/etc/slurm/slurm.conf
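Since every node is expected to run with the same slurm.conf, it can help to verify the copy right after pushing it. This is a sketch only: the function name `push_conf` and the `NODES` and `CONF` variables are assumptions made for illustration, with the host names taken from the commands above.

```shell
# Sketch: push slurm.conf to each node and confirm the checksum matches.
# NODES and CONF are illustrative defaults; adjust them to your cluster.
NODES="${NODES:-node1.com logo.com}"
CONF="${CONF:-/etc/slurm/slurm.conf}"

push_conf() {
  local node want got
  want=$(md5sum < "$CONF" | cut -d' ' -f1)
  for node in $NODES; do
    if ! scp "$CONF" "root@$node:/etc/slurm/slurm.conf"; then
      echo "$node: copy failed"
      continue
    fi
    # Compare the remote checksum with the local one.
    got=$(ssh "root@$node" md5sum /etc/slurm/slurm.conf | cut -d' ' -f1)
    if [ "$got" = "$want" ]; then
      echo "$node: in sync"
    else
      echo "$node: MISMATCH"
    fi
  done
}
```

Calling `push_conf` on the master node prints one status line per host.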
6. Configure the master node

    chown slurm:slurm /var/spool
    mkdir /var/spool/slurmctld
    chown slurm: /var/spool/slurmctld
    chmod 755 /var/spool/slurmctld
    touch /var/log/slurmctld.log
    chown slurm: /var/log/slurmctld.log
7. Configure the other nodes

    mkdir /var/spool/slurmd
    chown slurm: /var/spool/slurmd
    chmod 755 /var/spool/slurmd
    touch /var/log/slurmd.log
    chown slurm: /var/log/slurmd.log
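8. Verify that the master node is configured correctly, for example by running `slurmd -C`
9. Output like the following indicates that the configuration is correct

    ClusterName=(null) NodeName=buhpc3 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7822 TmpDisk=45753
    UpTime=13-14:27:52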
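Output in this format already contains the fields of a node definition, so it can be trimmed into a line for slurm.conf. The helper below is a hypothetical sketch (`node_definition` is a name introduced here); it assumes the two-line output layout shown above and simply keeps everything from `NodeName=` to the end of that line.

```shell
# Sketch: reduce "slurmd -C"-style output to the node definition line.
# Drops the ClusterName prefix and the UpTime line.
node_definition() {
  sed -n 's/^.*\(NodeName=.*\)$/\1/p'
}
```

A possible use is `slurmd -C | node_definition`, reviewing the result before pasting it into /etc/slurm/slurm.conf.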
III. Start Slurm
1. Stop and disable the firewall on all compute nodes

    systemctl stop firewalld
    systemctl disable firewalld
2. On the master node, open the default ports used by Slurm

    firewall-cmd --permanent --zone=public --add-port=6817/udp
    firewall-cmd --permanent --zone=public --add-port=6817/tcp
    firewall-cmd --permanent --zone=public --add-port=6818/tcp
    firewall-cmd --permanent --zone=public --add-port=7321/tcp
    firewall-cmd --reload
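If the ports were changed in the configurator, the values can be read back from slurm.conf instead of being hard-coded. This is a sketch under assumptions: it expects plain `SlurmctldPort=` and `SlurmdPort=` lines in the file, opens only TCP for them, and does not cover the other ports listed above. The function names are introduced here for illustration.

```shell
# Print the port numbers set by SlurmctldPort and SlurmdPort in a slurm.conf.
# Commented-out lines are ignored; anything after the number is stripped.
slurm_ports() {
  awk -F= '/^(SlurmctldPort|SlurmdPort)=/ { sub(/[^0-9].*$/, "", $2); print $2 }' "$1"
}

# Open each of those ports for TCP in the public zone, then reload.
open_slurm_ports() {
  local port
  for port in $(slurm_ports "${1:-/etc/slurm/slurm.conf}"); do
    firewall-cmd --permanent --zone=public --add-port="$port/tcp"
  done
  firewall-cmd --reload
}
```

`slurm_ports /etc/slurm/slurm.conf` can be run on its own first to see which ports would be opened.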
3. If the firewall configuration does not work, check on all nodes whether the clocks are synchronized

    yum install ntp -y
    chkconfig ntpd on
    ntpdate pool.ntp.org
    systemctl start ntpd
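A quick way to see whether the clocks actually agree is to compare epoch seconds between the master and each node. MUNGE credentials are time-limited, so a large offset can cause authentication failures. The sketch below assumes a hypothetical `NODES` list and root SSH access; the measured value also includes SSH latency, so treat it as approximate.

```shell
# Sketch: report the approximate clock offset of each node in seconds.
# NODES is a hypothetical host list; adjust it to your cluster.
NODES="${NODES:-login node1}"

# Absolute difference of two integers.
abs_diff() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"
}

check_skew() {
  local node remote
  for node in $NODES; do
    remote=$(ssh "root@$node" date +%s) || { echo "$node: unreachable"; continue; }
    echo "$node: $(abs_diff "$(date +%s)" "$remote")s"
  done
}
```

Offsets of a second or two are usually just connection delay; anything in the range of minutes suggests that ntpd is not running on that node.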
4. Once the clocks are synchronized, start Slurm on all compute nodes:

    systemctl enable slurmd.service
    systemctl start slurmd.service
    systemctl status slurmd.service
5. Start Slurm on the master node

    systemctl enable slurmctld.service
    systemctl start slurmctld.service
    systemctl status slurmctld.service
6. Check that everything is running
7. If there are problems, check the logs

    tail /var/log/slurmd.log        # compute node problems
    tail /var/log/slurmctld.log     # server node problems
8. If a node is DOWN, change its state to IDLE

    scontrol: update NodeName=node1 State=DOWN Reason="undraining"
    scontrol: update NodeName=node1 State=RESUME
    scontrol:
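The same recovery can be scripted without the interactive prompt. The following is a minimal sketch, not a recommended policy: it resumes every node that `sinfo` reports in a state beginning with `down`, regardless of why it went down, and the function names are introduced here for illustration.

```shell
# Read "name state" pairs on stdin and print the unique names of nodes
# whose state starts with "down" (for example "down" or "down*").
down_nodes() {
  awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Ask sinfo for one "name state" line per node and resume the DOWN ones.
resume_down_nodes() {
  local node
  for node in $(sinfo -h -N -o '%N %T' | down_nodes); do
    scontrol update NodeName="$node" State=RESUME
  done
}
```

Running `sinfo -h -N -o '%N %T' | down_nodes` alone lists the affected nodes, which is a safer first step than resuming them blindly.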
References
http://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/
http://blog.csdn.net/datuqiqi/article/details/50827040
http://blog.csdn.net/kongxx/article/details/48173829
http://wildflower.diablonet.net/~scaron/slurmsetup.html
https://slurm.schedmd.com/slurm_ug_2011/Basic_Configuration_Usage.pdf