Communication-efficient distributed training of machine learning models
In this thesis, we explore techniques for addressing the communication bottleneck in data-parallel distributed training of deep learning models. We investigate algorithms that either reduce the size of the messages exchanged between workers or reduce the number of messages sent and received.
To reduce the size of messages, we propose an algorithm for lossy compression of gradients. This algorithm is compatible with existing high-performance training pipelines based on the all-reduce primitive and leverages the natural approximate low-rank structure in gradients of neural network layers to obtain high compression rates.
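To make the idea concrete, the following is a minimal sketch (not the thesis code) of rank-r gradient compression by a single power-iteration step: a 2D gradient matrix is factored into two thin matrices whose product approximates it, so only the factors need to be communicated. The function names `compress` and `decompress` and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def compress(grad, rank, rng):
    """Approximate a 2D gradient by low-rank factors P and Q (one power-iteration step)."""
    m, n = grad.shape
    q = rng.standard_normal((n, rank))   # random sketch matrix
    p = grad @ q                         # project gradient onto the sketch
    p, _ = np.linalg.qr(p)               # orthonormalize the columns of P
    q = grad.T @ p                       # recompute Q against the orthonormal basis
    return p, q                          # (m + n) * rank numbers instead of m * n

def decompress(p, q):
    """Reconstruct the low-rank approximation of the gradient."""
    return p @ q.T

# Usage: compress a 256 x 128 gradient to rank 4 and check the approximation error.
rng = np.random.default_rng(0)
grad = rng.standard_normal((256, 128))
p, q = compress(grad, rank=4, rng=rng)
approx = decompress(p, q)
print("relative error:", np.linalg.norm(grad - approx) / np.linalg.norm(grad))
```

Because the factors are small dense matrices, they can be summed across workers with a standard all-reduce, which is what makes this style of compression compatible with existing training pipelines.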
To reduce the number of messages, we study the decentralized learning paradigm, in which workers do not average their model updates all-to-all in each step of Stochastic Gradient Descent, but only communicate with a small subset of their peers. We extend the aforementioned compression algorithm to operate in this setting. We also study the influence of the communication topology on the performance of decentralized learning, highlighting shortcomings of the typical 'spectral gap' metric for measuring the quality of communication topologies and proposing a new framework for evaluating them. Finally, we propose an alternative communication paradigm for distributed learning over sparse topologies. This paradigm, based on the concept of 'relaying' updates over spanning trees of the communication topology, shows benefits over the typical gossip-based approach, especially when the workers have very heterogeneous data distributions; a simulation sketch of the gossip baseline follows below.
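For illustration, here is a minimal simulation sketch, under assumed uniform mixing weights on a ring topology, of one decentralized SGD step with gossip averaging: each worker takes a local gradient step and then averages its parameters with its immediate neighbours according to a mixing matrix. This is a toy in-process simulation, not the thesis implementation.

```python
import numpy as np

def ring_mixing_matrix(n_workers):
    """Doubly stochastic mixing matrix for a ring: each worker averages with its two neighbours."""
    W = np.zeros((n_workers, n_workers))
    for i in range(n_workers):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n_workers] = 1 / 3
        W[i, (i + 1) % n_workers] = 1 / 3
    return W

def gossip_sgd_step(params, grads, W, lr):
    """One step of decentralized SGD: local update, then gossip averaging with neighbours.

    params, grads: arrays of shape (n_workers, dim), one row per worker.
    """
    local = params - lr * grads   # each worker applies its own gradient
    return W @ local              # each worker mixes with its neighbours' parameters

# Usage: 8 workers, 10-dimensional model, random gradients.
rng = np.random.default_rng(0)
params = rng.standard_normal((8, 10))
grads = rng.standard_normal((8, 10))
params = gossip_sgd_step(params, grads, ring_mixing_matrix(8), lr=0.1)
```

In this gossip scheme, information from a worker reaches distant peers only after many rounds of mixing, which is part of what the relaying paradigm over spanning trees aims to improve.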