Horovod

[TOC]

Official introduction

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

Official benchmarks

Training

Running Horovod

The example commands below show how to run distributed training. See the Running Horovod page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

1. Single machine, 4 GPUs:

```bash
# docker
nvidia-docker run -it 172.16.10.10:5000/horovod:0.12.1-tf1.8.0-py3.5
mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py

# singularity
singularity shell --nv /scratch/containers/ubuntu.simg
mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py
```
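
For reference, the sketch below shows the pattern a Horovod-enabled Keras script such as keras_mnist_advanced.py follows: initialize Horovod, pin each process to one GPU by its local rank, scale the learning rate, wrap the optimizer, and broadcast rank 0's initial weights. The tiny model and random data are placeholders of mine, not the actual example code.

```python
# minimal_hvd_keras.py -- a sketch of the pattern keras_mnist_advanced.py follows;
# the model and data below are placeholders, not the real example.
import numpy as np
import tensorflow as tf
import keras
import horovod.keras as hvd

hvd.init()  # one process per GPU, launched via mpirun

# pin each process to a single GPU based on its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

model = keras.models.Sequential([
    keras.layers.Dense(10, activation='softmax', input_shape=(784,)),
])

# scale the learning rate by the number of workers, then wrap the optimizer
opt = keras.optimizers.SGD(lr=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# broadcast rank 0's initial weights so all workers start identically
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

x = np.random.rand(256, 784).astype('float32')  # placeholder data
y = np.random.randint(0, 10, 256)
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

It launches the same way as the examples above: `mpirun -np 4 -H localhost:4 python minimal_hvd_keras.py`.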

2. Multiple machines, multiple GPUs:

```bash
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    ... \
    python train.py
```
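
Here `-np 16` asks for 16 processes in total and `-H` places four on each of the four hosts, one per GPU. A tiny helper like the one below (rank_check.py is a name I made up, not a Horovod file) can confirm the placement before a long training run; it uses only the core Horovod API:

```python
# rank_check.py -- hypothetical helper to confirm process placement.
import socket
import horovod.tensorflow as hvd

hvd.init()
# each of the 16 processes prints its host, global rank, and local rank
print("host %s: global rank %d of %d, local rank %d"
      % (socket.gethostname(), hvd.rank(), hvd.size(), hvd.local_rank()))
```

Run it with the same command shape: `mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 python rank_check.py`.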

3. Running Horovod entirely in Docker

4. Installing Horovod with full GPU support

1. Install NCCL 2.

Understanding NCCL

```
# software requirements:
glibc 2.19 or higher
CUDA 8.0 or higher
CUDA devices with a compute capability of 3.0 and higher
```
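
As a quick sanity check of the glibc requirement (the script name and approach are mine, not NVIDIA's), Python's standard library can report the C library the interpreter is linked against:

```python
# glibc_check.py -- hypothetical helper; NCCL 2 requires glibc 2.19 or higher.
import platform

lib, version = platform.libc_ver()
print(lib, version)  # e.g. "glibc 2.23" on Ubuntu 16.04
```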

Installing NCCL 2 on Ubuntu:

```bash
dpkg -i nccl-repo-ubuntu1604-2.1.15-ga-cuda9.1_1-1_amd64.deb  # downloading the .deb requires an NVIDIA developer account
apt update
apt install libnccl2 libnccl-dev
```
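
A minimal way to confirm the install worked, assuming the library landed on the default loader path (this check is my own, not an NVIDIA tool): ctypes raises OSError if libnccl.so.2 cannot be found.

```python
# nccl_check.py -- hypothetical sanity check: raises OSError if NCCL 2 is missing.
import ctypes

ctypes.CDLL("libnccl.so.2")
print("NCCL 2 shared library loaded successfully")
```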

2. Install Open MPI or another MPI implementation.

```bash
# build Open MPI 3.1.1 from source with CUDA support
tar xf openmpi-3.1.1.tar.bz2
cd openmpi-3.1.1/
./configure --with-cuda
make -j 12
make install
ldconfig  # refresh the shared-library cache so mpirun finds the new libraries
mpirun --version

# alternatively, install the (older) distro package instead of building from source:
# apt install libopenmpi1.10
```

3. Install the horovod pip package.

```bash
$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod  # assumes NCCL 2 is already installed (see above)
$ HOROVOD_GPU_ALLREDUCE=MPI pip install --no-cache-dir horovod   # use MPI instead of NCCL 2 for allreduce
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod  # use MPI instead of NCCL 2 for all GPU collectives
```
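
To verify the resulting build actually performs a collective operation, a small smoke test like the one below (my sketch, written against the TF 1.x API matching the 0.12.x-era image above) can be run under mpirun. hvd.allreduce averages across ranks by default, so with 4 processes every rank should print 1.5, i.e. (0+1+2+3)/4:

```python
# allreduce_test.py -- hypothetical smoke test for a freshly installed Horovod.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# pin each process to one GPU by local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# each rank contributes its own rank number; allreduce averages them
value = tf.constant(float(hvd.rank()))
averaged = hvd.allreduce(value)

with tf.Session(config=config) as sess:
    print("rank %d sees average %.2f" % (hvd.rank(), sess.run(averaged)))
```

Run it as `mpirun -np 4 python allreduce_test.py`; identical output on every rank means the allreduce path (NCCL or MPI, depending on the build flags above) is working.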