Optimization Algorithms for Decentralized, Distributed and Collaborative Machine Learning
Distributed learning is the key for enabling training of modern large-scale machine learning models, through parallelising the learning process. Collaborative learning is essential for learning from privacy-sensitive data that is distributed across various agents, each having distinct data distributions. Both tasks are distributed in nature, which brings them under a common umbrella.
In this thesis, we examine algorithms for distributed and collaborative learning through the perspective of optimization theory. Specifically, we delve into the theoretical convergence properties of prevalent algorithms (e.g., decentralized SGD, local SGD, asynchronous SGD, clipped SGD, among others), and we address ways to enhance their efficiency.
A significant portion of this thesis centers on decentralized optimization methods for both distributed and collaborative learning applications. These are optimization techniques where agents interact directly with one another, bypassing the need for a central coordinator.
First, we address the challenge of communication efficiency in decentralized learning, introducing Choco-SGD---a first decentralized optimization algorithm which allows arbitrary high communication compression and at the same time comes with provable convergence guarantees.
Subsequently, we analyze the convergence properties of a large class of distributed and decentralized algorithms including decentralized SGD---the predominant algorithm for decentralized optimization---under a unified framework. Our approach imposes milder assumptions on functions and communication topologies than existing works. Our theory recovers and improves the previously best-known convergence for all of the algorithms covered in the framework.
Our theory reveals that the performance of decentralized SGD-based algorithms deteriorates when agents have differing data---a scenario commonly encountered in practical collaborative learning applications.
We therefore study and improve convergence rates of Gradient Tracking---the algorithm whose convergence behavior is known to be unaffected by the data heterogeniety.
Additionally, we explore asynchronous optimization techniques as an alternative approach to enhance communication efficiency. In this thesis, we explore a more straightforward centralized communication scheme. We improve existing theoretical convergence bounds of the asynchronous SGD algorithm. In particular, one of our main contributions is to show that asynchronous SGD in distributed learning scenarios is always faster than the synchronous minibatch SGD.
Lastly, we turn our attention to privacy aspects, which are especially important for collaborative learning algorithms, as the local agents' data are typically privacy sensitive. We study the convergence behavior of the clipped SGD algorithm---an integral part of any differentially private learning algorithm. We also study the convergence properties of MF-DP-FTRL algorithms which inject correlated noise over the iterations to achieve privacy and we propose a new correlated noise schedule based on our enhanced theory.
EPFL_TH9927.pdf
n/a
openaccess
copyright
7.26 MB
Adobe PDF
7c15007f19293d84455358b479ceddbd