doctoral thesis

Communication-efficient distributed training of machine learning models

Vogels, Thijs  
2023

In this thesis, we explore techniques for addressing the communication bottleneck in data-parallel distributed training of deep learning models. We investigate algorithms that either reduce the size of the messages exchanged between workers or reduce the number of messages sent and received.

To reduce the size of messages, we propose an algorithm for lossy compression of gradients. This algorithm is compatible with existing high-performance training pipelines based on the all-reduce primitive and leverages the natural approximate low-rank structure in gradients of neural network layers to obtain high compression rates.
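
To make the low-rank idea concrete, the sketch below compresses a layer's 2-D gradient into two thin factors using a single power-iteration step against a warm-started right factor. The function names, the chosen rank, and the warm start are illustrative assumptions for this sketch, not the exact method of the thesis.

```python
import torch

def compress_low_rank(grad, q):
    """Compress a 2-D gradient (m x n) into thin factors p (m x r) and q (n x r).

    One power-iteration step against a warm-started right factor `q` is used;
    because the factors are tiny compared to the full gradient, they can be
    averaged across workers with a standard all-reduce.
    """
    p = grad @ q                      # left factor, shape (m, r)
    p, _ = torch.linalg.qr(p)         # orthonormalise its columns
    q = grad.T @ p                    # right factor, shape (n, r)
    return p, q

def decompress_low_rank(p, q):
    """Reconstruct a rank-r approximation of the gradient."""
    return p @ q.T

# Illustrative usage on a single 512 x 256 gradient with rank 2.
m, n, r = 512, 256, 2
grad = torch.randn(m, n)
q = torch.randn(n, r)                 # warm start, reused across SGD steps
p, q = compress_low_rank(grad, q)
approx = decompress_low_rank(p, q)
compression_ratio = (m * n) / (r * (m + n))   # roughly 85x fewer numbers sent
```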

To reduce the number of messages, we study the decentralized learning paradigm, in which workers do not average their model updates all-to-all in each step of Stochastic Gradient Descent but only communicate with a small subset of their peers. We extend the aforementioned compression algorithm to operate in this setting. We also study the influence of the communication topology on the performance of decentralized learning, highlighting shortcomings of the typical 'spectral gap' metric as a measure of the quality of communication topologies and proposing a new framework for evaluating them. Finally, we propose an alternative communication paradigm for distributed learning over sparse topologies. This paradigm, based on the concept of 'relaying' updates over spanning trees of the communication topology, shows benefits over the typical gossip-based approach, especially when the workers have very heterogeneous data distributions.
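
As a minimal illustration of the decentralized setting, the sketch below runs plain gossip averaging on a ring topology, where each worker mixes its parameters only with its two neighbours. The uniform mixing weights, the ring, and the scalar "models" are illustrative assumptions; the relay-based paradigm mentioned above instead forwards exact updates along spanning trees rather than repeatedly averaging them.

```python
import torch

def gossip_step(params, neighbors):
    """One decentralized averaging round: each worker replaces its parameters
    with the uniform average of its own and its neighbors' parameters."""
    mixed = []
    for i, _ in enumerate(params):
        group = [i] + neighbors[i]
        mixed.append(sum(params[j] for j in group) / len(group))
    return mixed

# Illustrative: 4 workers on a ring, each holding a scalar "model".
params = [torch.tensor(float(i)) for i in range(4)]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

for _ in range(10):
    params = gossip_step(params, ring)

# After a few rounds all workers approach the global average (1.5); how fast
# this happens depends on the topology, which is what the spectral-gap
# analysis and the proposed alternative framework try to capture.
print([round(p.item(), 3) for p in params])
```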

Type
doctoral thesis
DOI
10.5075/epfl-thesis-9926
Author(s)
Vogels, Thijs  
Advisors
Jaggi, Martin  
Jury

Prof. Anne-Marie Kermarrec (president); Prof. Martin Jaggi (thesis director); Prof. Patrick Thiran, Prof. Mike Rabbat, Prof. Dan Alistarh (examiners)

Date Issued
2023
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2023-04-11
Thesis number
9926
Number of pages
146

Subjects
Deep learning • machine learning • distributed training • decentralized learning • gradient compression • stochastic gradient descent

EPFL units
MLO  
Faculty
IC  
School
IINFCOM  
Doctoral School
EDIC  
Available on Infoscience
April 12, 2023
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/196913