Official introduction
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. The goal of Horovod is to make distributed Deep Learning fast and easy to use.
Official benchmark results
Running Horovod
The example commands below show how to run distributed training. See the Running Horovod page for more instructions, including RoCE/InfiniBand tweaks and tips for dealing with hangs.
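As a rough aid for reading these commands: Horovod runs one process per GPU, so the value passed to mpirun -np is the total number of GPUs across all machines. The sketch below assumes Open MPI's -H host:slots syntax and uses hypothetical host names (server1 to server4). The helper name horovod_mpirun_args is illustrative only and is not part of Horovod or MPI. It prints the arguments and does not launch anything.

```shell
#!/bin/sh
# Sketch: derive the -np and -H arguments for mpirun from "host:gpus" pairs.
# Host names used here are hypothetical placeholders.

# Usage: horovod_mpirun_args host1:4 host2:4 ...
horovod_mpirun_args() {
    total=0
    hosts=""
    for entry in "$@"; do
        slots=${entry##*:}      # text after the last ':' is the GPU count
        total=$((total + slots))
        if [ -z "$hosts" ]; then
            hosts=$entry
        else
            hosts="$hosts,$entry"
        fi
    done
    # One process per GPU, so -np is the sum of all slots.
    printf '%s\n' "-np $total -H $hosts"
}

horovod_mpirun_args server1:4 server2:4 server3:4 server4:4
# prints: -np 16 -H server1:4,server2:4,server3:4,server4:4
```

For a single machine with 4 GPUs, horovod_mpirun_args localhost:4 yields -np 4 -H localhost:4.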
1. Single machine, 4 GPUs:
# docker
2. Multiple machines, multiple GPUs:
$ mpirun -np 16 \
3. Complete guide to using Horovod with Docker
4. Complete guide to using Horovod with GPUs
1. Install NCCL 2.
# software requirements:
Install NCCL 2 on Ubuntu:
dpkg -i nccl-repo-ubuntu1604-2.1.15-ga-cuda9.1_1-1_amd64.deb # requires logging in to NVIDIA and requesting the download
2. Install Open MPI or another MPI implementation.
install openmpi
3. Install the horovod pip package.
$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod # assumes NCCL 2 is already installed on Ubuntu
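Before running the pip install step, it can help to confirm that the tools from the earlier steps are actually on PATH. The following is a minimal sketch, not an official Horovod check. The helper names have_tool and report_tools are hypothetical, and the tool list (mpirun, pip, nvidia-smi) is an assumption about what a typical GPU setup provides.

```shell
#!/bin/sh
# Sketch: report whether the tools the install steps rely on are on PATH.

# Returns 0 if the named command is found on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Prints one status line per tool; it never aborts, so every result is shown.
report_tools() {
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# The tool list is an assumption; adjust it to the local setup.
report_tools mpirun pip nvidia-smi
```

A "missing" line for mpirun suggests the Open MPI step has not completed. This check does not verify that NCCL itself is installed, since NCCL is a library rather than a command.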