Recent Improvements of MPI Communication for DDLS

Hyejin Kim
May 3, 2021 · 10 min read

With the recent advances in big data and computational power, deep learning is widely used for real-world problems. One of the most significant challenges of deep learning is its highly time-consuming training process. To accelerate training, research on distributed deep learning systems (DDLS) has been actively conducted. Distributed deep learning systems train deep neural network models by utilizing the distributed resources of a cluster. Using DDLS, one can spread the computational workload over multiple resources and reduce training time.

However, simply increasing the number of resources is not a universal solution. We should also consider the communication overhead between resources when designing a DDLS. For example, in data-parallel synchronous distributed deep learning, once each processing unit (worker) finishes computing the gradients for its subset of the data, the results must be aggregated to compute the mean of the gradients and update the model. Each worker therefore needs to communicate its results with every other worker, and the latency of exchanging gradients across GPUs can become a severe bottleneck.

Several efficient algorithms for collective communication operations, known as Message Passing Interface (MPI) reductions, have been proposed to overcome this challenge. First, let's take a look at some popular reduction algorithms.

AllReduce

AllReduce enables every process or worker to share data with all other processes. The Allreduce algorithm has two phases: aggregation and broadcast. It first reduces the target arrays from all processes into a single array (aggregation) and then returns the resulting array to all processes (broadcast). The reduction operation applied to the elements of the target arrays can be concatenation, summation, or any other binary operation. Below is a pictorial description of the Allreduce algorithm using summation with P = 4 (number of processes) and N = 4 (length of each array).

Fig 1. copyright by Yuichiro Ueno
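To make the semantics concrete, here is a minimal Python sketch (not tied to any particular MPI implementation; the process "ranks" are simply list indices) that simulates a sum-AllReduce over P = 4 arrays of length N = 4:

```python
import numpy as np

# Hypothetical per-process data: P = 4 processes, each holding an array of length N = 4.
P, N = 4, 4
rng = np.random.default_rng(0)
local_arrays = [rng.integers(0, 10, size=N) for _ in range(P)]

def allreduce_sum(arrays):
    """Simulate AllReduce with summation: aggregate, then broadcast to every process."""
    reduced = np.sum(arrays, axis=0)          # aggregation phase
    return [reduced.copy() for _ in arrays]   # broadcast phase: every process gets the result

results = allreduce_sum(local_arrays)
assert all(np.array_equal(r, results[0]) for r in results)  # all processes now hold the same array
print(results[0])
```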

This raises a question: how do we implement the network topology for this reduction operation? Several approaches to performing reduction operations over processes have been proposed, and Zhao and Canny [1] summarize some of the common communication schemes.

TreeAllReduce connects the processes in a tree-like structure, as in Fig 2(a). Since both the aggregation and broadcast stages form a binary tree, each phase takes O(log K) communication rounds, for a total of O(2 log K) rounds, where K is the number of processes (learners). If D is the size of the data (e.g., the total amount of gradients), D amount of data is passed at each communication round, so the total communication cost of TreeAllReduce is O(D log K). TreeAllReduce minimizes network bandwidth but has high latency, since the delay is set by the slowest path in the tree. It is therefore a feasible solution for small, dense data communication, not for large-scale parallelism.

In round-robin Allreduce, each processor communicates with every other processor in a circular order, as shown in Fig 2(b). Its number of communication rounds is therefore O(K − 1). Since D amount of data is passed at each communication round, the total communication cost of round-robin Allreduce is O(D(K − 1)). Even though round-robin Allreduce achieves asymptotically optimal bandwidth and latency, network packet sizes are limited in practice, so too many inter-node transmissions taking place at once may lead to communication breakdown. The very large number of messages also makes this network more prone to failures due to packet corruption.

In butterfly AllReduce, every node computes operations on the values from its neighboring nodes and sends the results to its outgoing neighbors. In the binary case, the neighbors at layer d lie on the edges of the hypercube in dimension d. There are O(log K) communication rounds in a K-learner binary butterfly network, and the total communication cost is O(D log K). Binary butterfly Allreduce gives the lowest latency among these three Allreduce operations when messages have fixed costs.

Fig 2. Huasha Zhao, John Canny, Sparse Allreduce: Efficient Scalable Communication for Power-Law Data
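As a rough back-of-the-envelope comparison of the round counts quoted above (a sketch only; constant factors, message sizes, and network details are ignored, and K is assumed to be a power of two):

```python
import math

def tree_allreduce_rounds(K):
    # aggregation tree + broadcast tree, each of depth log2(K)
    return 2 * int(math.log2(K))

def round_robin_rounds(K):
    # every process exchanges with every other process in circular order
    return K - 1

def butterfly_rounds(K):
    # one pairwise exchange per hypercube dimension
    return int(math.log2(K))

for K in (4, 16, 64, 256, 1024):
    print(f"K={K:5d}  tree={tree_allreduce_rounds(K):3d}  "
          f"round-robin={round_robin_rounds(K):5d}  butterfly={butterfly_rounds(K):3d}")
```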

Ring-Allreduce

Ring-Allreduce divides the gradient on each node into consecutive blocks; each node updates one block using data received from the previous node while simultaneously sending a block to the next node, forming a ring pattern. Ring-Allreduce also has two phases: scatter-reduce and allgather. In the scatter-reduce phase, GPUs exchange data so that every GPU ends up with a chunk of the final result. In the allgather phase, GPUs exchange the chunks obtained from the scatter-reduce phase. Fig 3 illustrates how scatter-reduce works and Fig 4 illustrates how the allgather phase works.

Fig 3. copyright by Andrew Gibiansky
Fig 4. copyright by Andrew Gibiansky

As we can see in the figures above, there are K − 1 iterations in each phase, and at each iteration D/K amount of data is passed. The total communication cost is therefore 2D(K − 1)/K, which beats all of the other approaches. (As K grows very large, the communication cost approaches 2D, i.e., it becomes proportional to the data size D and independent of the number of processes!) Ring-Allreduce has greatly contributed to the improvement of communication efficiency in distributed deep learning systems, and extensive research has been conducted to improve it further. I would like to introduce some recent studies that improve Ring-Allreduce.
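Below is a minimal single-machine simulation of the two phases (a sketch for intuition only, not production MPI or NCCL code); each "process" is just an entry in a Python list, and its data is split into K chunks:

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulate Ring-Allreduce (sum) over K 'processes' held in a Python list."""
    K = len(arrays)
    # Each process splits its local array into K chunks (copies, so inputs are untouched).
    chunks = [list(np.array_split(np.array(a, dtype=float), K)) for a in arrays]

    # Scatter-reduce: K-1 steps; at each step every process sends one chunk to its
    # right neighbor, which adds it to its own copy of that chunk.
    for step in range(K - 1):
        sends = [(p, (p - step) % K, chunks[p][(p - step) % K].copy()) for p in range(K)]
        for p, idx, data in sends:
            chunks[(p + 1) % K][idx] += data
    # Now process p owns the fully reduced chunk with index (p + 1) % K.

    # Allgather: K-1 steps; each process forwards the reduced chunk it most recently received.
    for step in range(K - 1):
        sends = [(p, (p + 1 - step) % K, chunks[p][(p + 1 - step) % K].copy()) for p in range(K)]
        for p, idx, data in sends:
            chunks[(p + 1) % K][idx] = data

    return [np.concatenate(c) for c in chunks]

# Usage: 4 simulated processes, each holding D = 8 gradient values.
rng = np.random.default_rng(1)
data = [rng.standard_normal(8) for _ in range(4)]
expected = np.sum(data, axis=0)
assert all(np.allclose(out, expected) for out in ring_allreduce(data))
```

Each phase takes K − 1 steps and each step moves D/K elements per process, which matches the 2D(K − 1)/K total above.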

Further improving Ring-AllReduce

Ueno et al. [2] studied reducing the latency of Ring-Allreduce for large-message communication. They noticed that even though Ring-AllReduce is the most popular AllReduce method and has the lowest communication cost, it is not free from latency issues when thousands of GPUs are used. The paper suggests a hierarchical communication method that combines a pure Ring-AllReduce with Rabenseifner's algorithm to reduce the latency.

Rabenseifner's algorithm [3] is a hierarchical version of the Ring-AllReduce algorithm, where the Ring-AllReduce is performed recursively on pairs of processes. It performs the AllReduce by communicating with processes at distances of 1, 2, 4, ..., 2^(k−1), and therefore requires only O(log K) steps.

In the first step, each process exchanges data with its neighboring process: the exchange splits the array into two halves and performs a scatter-reduce step of the Ring-Allreduce. In the next step, pairs are formed among the processes that are two apart, the array that was split in the previous step is split again, and the two halves are exchanged in the same way; this continues recursively and is called "recursive halving" in the paper. Similarly, the recursive steps corresponding to the allgather in Ring-Allreduce are called "recursive doubling". The difference is that while recursive halving is done with pairs of processes at distances of 1, 2, 4, ..., 2^(k−1), recursive doubling is done with pairs at distances of 2^(k−1), ..., 4, 2, 1. The total communication cost of Rabenseifner's algorithm is the same as that of the Ring-Allreduce. Fig 5 illustrates how this algorithm works.

Fig 5. copyright by Rolf Rabenseifner
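The schedule can be summarized with a small sketch (assuming the number of processes K is a power of two; D is the per-process data size, and the distances follow the description above):

```python
import math

def rabenseifner_schedule(K, D):
    """Print partner distance and per-process data volume for each step of
    recursive halving (reduce-scatter) and recursive doubling (allgather)."""
    steps = int(math.log2(K))
    for s in range(steps):                       # recursive halving: distances 1, 2, 4, ...
        print(f"halving  step {s}: partner distance {2 ** s:3d}, data exchanged {D / 2 ** (s + 1):g}")
    for s in reversed(range(steps)):             # recursive doubling: distances ..., 4, 2, 1
        print(f"doubling step {steps - 1 - s}: partner distance {2 ** s:3d}, data exchanged {D / 2 ** (s + 1):g}")

rabenseifner_schedule(K=8, D=1024)
# Each phase moves D/2 + D/4 + ... + D/K = D*(K-1)/K per process in log2(K) steps,
# matching the Ring-Allreduce volume but with fewer communication rounds.
```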

The paper suggests a hierarchical Ring-AllReduce scheme which combines Ring-AllReduce and Rabenseifner's algorithm. Fig 6 illustrates this hierarchical Ring-AllReduce scheme on 128 processes with a 4 × 8 × 4 hierarchical configuration and 5 steps of communication; each node is assumed to have four GPUs. Step 1 performs a scatter-reduce along the x-axis and Step 2 performs a scatter-reduce along the y-axis, where each GPU in a node participates in its scatter-reduce process independently. Step 3 performs an AllReduce along the z-axis between 4 processes.

The data size communicated in Step 2 is 1/4 of that in Step 1, because Step 2 uses the result of the scatter-reduce from Step 1. The data size communicated in Step 3 is 1/4 × 1/8 = 1/32 of Step 1. Steps 4 and 5 perform allgathers that mirror the scatter-reduce processes of Steps 2 and 1, respectively.

Fig 6. copyright by Yuichiro Ueno, Rio Yokota
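As a quick sanity check of these fractions (a sketch that only tracks relative data volumes, not actual messages), for the 4 × 8 × 4 configuration:

```python
# Relative data volume per step for a 4 x 8 x 4 hierarchical configuration,
# measured as a fraction of the volume communicated in Step 1.
axes = [4, 8, 4]          # process counts along the x-, y-, and z-axes
fraction = 1.0
for i, k in enumerate(axes[:-1], start=1):
    print(f"Step {i} (scatter-reduce over {k} processes): {fraction:.5f}")
    fraction /= k         # a scatter-reduce over k processes leaves each with 1/k of the data
print(f"Step {len(axes)} (all-reduce over {axes[-1]} processes): {fraction:.5f}")  # 1/32
# Steps 4 and 5 (allgather) mirror Steps 2 and 1 with the same fractions.
```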

This hierarchical approach compensates for the drawbacks of both Rabenseifner's algorithm and the Ring-AllReduce algorithm.

The paper states that this approach can alleviate the contention that would occur with the original Rabenseifner's algorithm on network topologies that do not have full bisection bandwidth. The authors also claim that the hierarchical approach reduces the number of communications compared to Ring-AllReduce, and therefore reduces latency: since the hierarchy keeps the ring at any given layer from becoming too long, the number of communications within each ring stays small.
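To see the latency argument in numbers, here is a rough count of communication rounds for the 4 × 8 × 4 configuration versus a single flat ring over 128 processes (a sketch under the simplifying assumption that each step runs a ring over its own axis, with axis_size − 1 rounds per ring phase; actual per-round message sizes differ between the two schemes):

```python
def hierarchical_rounds(axes):
    """Rounds for scatter-reduce down the hierarchy, a full ring-allreduce on the
    last axis, and the mirrored allgather steps back up."""
    scatter = sum(k - 1 for k in axes[:-1])
    middle = 2 * (axes[-1] - 1)
    gather = sum(k - 1 for k in axes[:-1])
    return scatter + middle + gather

def flat_ring_rounds(K):
    return 2 * (K - 1)

axes = [4, 8, 4]
print("hierarchical 4x8x4:", hierarchical_rounds(axes))      # 26 rounds
print("flat ring over 128:", flat_ring_rounds(4 * 8 * 4))    # 254 rounds
```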

Fig 7. copyright by Yuichiro Ueno, Rio Yokota

Fig 7 above shows the experimental results comparing the existing Ring-AllReduce algorithm with the new hierarchical Ring-AllReduce approach. The communication time of the hierarchical Ring-AllReduce (orange and green) was about 0.01 seconds lower than that of the existing Ring-AllReduce algorithm (blue).

The authors conducted multiple experiments using various hierarchical Ring-AllReduce topologies. Fig 8 shows the best topology for each number of processes based on these experiments. Interestingly, they empirically found that the optimal topology for a larger number of processes is likely to be a simple extension of a smaller-scale topology, obtained either by:

  1. Adding another layer of hierarchy.
  2. Increasing the number of final layer processes.

For example, for 128 processes the best topology was 4 × 8 × 4, and for 256 processes the best topology was 4 × 8 × 4 × 2; this follows rule 1. The best topology for 512 processes was 4 × 8 × 4 × 4, which follows rule 2. Using Fig 8 and these rules, one may be able to infer the optimal topology for a given number of processes.

Fig 8. copyright by Yuichiro Ueno, Rio Yokota

Another study also attempted to reduce communication latency on top of the AllReduce algorithm. Mikami et al. [4] propose a two-level hierarchical AllReduce that improves communication time. They focus on two major challenges in large-scale distributed deep learning systems:

  1. Convergence accuracy degradation when using large mini-batch training.
  2. Communication overhead of gradient synchronization among GPUs.

They propose a new approach, 2D-Torus all-reduce, to resolve these problems. Fig 9 shows how the 2D-Torus topology works. There are three steps in the 2D-Torus scheme: reduce-scatter, all-reduce, and all-gather. (These also look similar to the hierarchical model of Ueno et al. above!) In the first step, reduce-scatter is performed horizontally. In the second step, all-reduce is performed vertically. Finally, all-gather is performed horizontally.

The paper also claims that the communication overhead of 2D-Torus all-reduce is smaller than that of Ring all-reduce. Let K denote the number of GPUs in the cluster, X the number of GPUs in the horizontal direction, and Y the number of GPUs in the vertical direction. The number of communication rounds of 2D-Torus all-reduce is 2(X − 1), whereas the number of communication rounds the Ring-AllReduce scheme executes is 2(K − 1). While the hierarchical Allreduce performs the same number of communication operations as the 2D-Torus all-reduce, the data size of the second step (the vertical all-reduce) of the 2D-Torus scheme is X times smaller than that of the hierarchical Allreduce.
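For a concrete (hypothetical) cluster size, the round counts quoted above compare as follows; the exact data size moved per round differs between the schemes, so this only illustrates the latency term:

```python
def ring_allreduce_rounds(K):
    return 2 * (K - 1)        # round count quoted above for Ring-AllReduce

def torus_2d_rounds(X):
    return 2 * (X - 1)        # round count quoted above for 2D-Torus all-reduce

K = 1024                      # hypothetical number of GPUs in the cluster
X, Y = 32, 32                 # hypothetical 32 x 32 torus arrangement (X * Y == K)
print("Ring-AllReduce rounds:", ring_allreduce_rounds(K))  # 2046
print("2D-Torus rounds:      ", torus_2d_rounds(X))        # 62
```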

Fig 9. copyright by Mikami et al.

Additionally, Mikami et al. apply batch size control, which adaptively changes the mini-batch size during training, to address the accuracy degradation problem of DDLS under large mini-batch training.
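The batch-size control rule itself is described in their paper; purely for illustration, a schedule of this kind could look like the following hypothetical sketch (the values and the switch point here are made up, not the ones used by Mikami et al.):

```python
def batch_size_at(epoch, base_batch=32 * 1024, boost=2, boost_after=30):
    """Hypothetical batch-size schedule: keep a base global mini-batch size early
    in training, then enlarge it once training has stabilized."""
    return base_batch * boost if epoch >= boost_after else base_batch

for epoch in (0, 10, 30, 60):
    print(f"epoch {epoch:2d}: global batch size {batch_size_at(epoch)}")
```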

Using these two approaches, the 2D-Torus topology and batch size control, they benchmark their training performance on ImageNet with ResNet-50, one of the most popular dataset/model combinations, and compare the training time and accuracy with recent work. Table 1 in Fig 10 shows the training time and top-1 accuracy of their approach compared to existing methods, and Table 2 in Fig 10 shows their GPU scaling efficiency. They achieved a training time of 224 seconds for ImageNet/ResNet-50 using 2176 Tesla V100 GPUs, a great improvement compared to other methods, without degradation of validation accuracy (75.03%). They also improved GPU scaling efficiency to 91.62% using 1088 Tesla V100 GPUs without significant accuracy loss. These results support that their approach is effective for resolving the two target problems: the instability of large mini-batch training and the communication overhead of gradient synchronization.

Fig 10. copyright by Mikami et al.

Summary

In this post, several existing MPI Allreduce approaches were introduced and compared in terms of their communication costs. We then looked at how the most popular algorithm, Ring-Allreduce, works, and reviewed two recent studies conducted to reduce the latency and improve the communication time of Ring-AllReduce.

Developing more efficient communication operations remains one of the most important topics for DDLS, and research in this area is still being actively conducted. It would be helpful to explore more approaches for reducing latency and to study how they solve the challenges in DDLS!

References

[1] Huasha Zhao, John Canny, Sparse Allreduce: Efficient Scalable Communication for Power-Law Data (https://arxiv.org/abs/1312.3020)

[2] Yuichiro Ueno, Rio Yokota, Exhaustive Study of Hierarchical AllReduce Patterns for Large Messages Between GPUs

[3] Rolf Rabenseifner, Optimization of Collective Reduction Operations

[4] Hiroaki Mikami, Hisahiro Suganuma, et al., ImageNet/ResNet-50 Training in 224 Seconds
