Abstract

The success of deep learning may be attributed in large part to the remarkable growth in the size and complexity of deep neural networks. However, present learning systems raise significant efficiency and privacy concerns: (1) training systems are lagging behind the rapid growth of deep neural architectures, so the training efficiency of deep learning algorithms cannot be guaranteed; (2) most learning is performed in a centralized manner, yet massive amounts of data are created on decentralized edge devices and may contain sensitive information about users. All of these considerations motivate the migration to distributed deep learning. In this thesis, we study efficiency and robustness, two fundamental problems that have emerged in distributed deep learning. We first propose strategies to improve communication efficiency, a bottleneck to scaling distributed learning systems out and up, from several angles: the study starts by understanding the trade-off between communication frequency and generalization performance, and then extends to decentralized and sparse communication topologies with compressed communication. Next, we investigate the computational efficiency of deep learning, yet another crucial factor that determines learning and deployment efficiency; the proposed solutions generalize to a variety of scenarios. Finally, learning with edge devices introduces various kinds of heterogeneity (e.g., data heterogeneity and system heterogeneity) in practice. As the last key contribution of this thesis, we develop robust decentralized/federated algorithms that are resistant to real-world challenges such as client data distribution shifts and heterogeneous computing systems.
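To make the communication-frequency trade-off mentioned above concrete, below is a minimal, self-contained sketch (not the thesis's actual method) of local SGD with periodic model averaging on synthetic data. The worker count K, local-step count H, and the linear-regression setup are illustrative assumptions; H controls how often the workers communicate, trading communication cost against how far the local models drift apart between averaging rounds.

```python
import numpy as np

# Sketch of local SGD with periodic averaging (hypothetical setup):
# each of K workers runs H local gradient steps on its own data shard,
# then all models are averaged in one communication round.

rng = np.random.default_rng(0)
K, H, rounds, lr, d = 4, 8, 50, 0.1, 10

# Synthetic linear-regression shards, one per worker (illustrative data only).
true_w = rng.normal(size=d)
shards = []
for _ in range(K):
    X = rng.normal(size=(100, d))
    y = X @ true_w + 0.01 * rng.normal(size=100)
    shards.append((X, y))

w = np.zeros(d)  # shared model after each communication round
for _ in range(rounds):
    local_models = []
    for X, y in shards:
        w_local = w.copy()
        for _ in range(H):  # H local steps without any communication
            grad = 2 * X.T @ (X @ w_local - y) / len(y)
            w_local -= lr * grad
        local_models.append(w_local)
    w = np.mean(local_models, axis=0)  # one averaging (communication) step

print("distance to target:", np.linalg.norm(w - true_w))
```

Increasing H reduces the number of communication rounds needed for a given number of gradient steps, but with heterogeneous shards the local models drift further before each averaging step, which is one face of the frequency-versus-generalization trade-off studied in the thesis.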
