Abstract

Machine learning (ML) applications are ubiquitous. They run in diverse environments such as datacenters, the cloud, and even edge devices. Regardless of where they run, distributing ML training appears to be the only way to attain scalable, high-quality learning. Yet, distributing ML is challenging, essentially due to the unique nature of ML applications. First, ML training must be robust against arbitrary (i.e., Byzantine) failures because it is used in mission-critical applications. Second, training applications in datacenters run on shared clusters of computing resources, which calls for resource allocation solutions that meet the high computational demands of these applications while fully utilizing the available resources. Third, distributed training in the cloud faces a network bottleneck, exacerbated by the fast-growing pace of computing power; hence, we need solutions that reduce the communication load without impacting training accuracy. Fourth, despite the scalability and privacy guarantees of training on edge devices via federated learning, the heterogeneity of the devices' capabilities and data distributions calls for solutions that are robust to both.

To achieve robustness, we introduce Garfield, a library that helps practitioners make their ML applications Byzantine-resilient. Besides addressing the vulnerability of the shared-graph architecture followed by classical ML frameworks, Garfield supports various communication patterns, robust aggregation rules, and compute hardware (i.e., CPUs and GPUs). We show how to use Garfield with different architectures, network settings, and data distributions.

We explore elastic training (i.e., changing the training parameters mid-execution) to efficiently solve the resource allocation problem in datacenters' shared clusters. We present ERA, which provides elasticity in two dimensions: (1) it scales jobs horizontally, by adding resources to or removing them from running jobs, and (2) it dynamically changes, at will, the per-GPU batch size to control the utilization of each GPU. We demonstrate that scaling simultaneously in both dimensions reduces training time without impacting training accuracy.

We show how to use cloud object stores (COS) to alleviate the network bottleneck of training transfer learning (TL) applications in the cloud. We propose HAPI, a processing system for TL that spans the compute and COS tiers, delivering significant improvements while remaining transparent to the user. HAPI mitigates the network bottleneck by carefully splitting the TL application such that feature extraction runs, partially or entirely, next to storage.

Finally, we show how to train generative adversarial networks (GANs) efficiently and robustly in the federated learning paradigm with FeGAN. Essentially, we co-locate both components of a GAN (i.e., the generator and the discriminator) on each device (addressing the scaling problem) and have a server aggregate the devices' models using balanced sampling and Kullback-Leibler (KL) weighting, mitigating training issues and boosting convergence.
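To give a flavor of the robust aggregation rules mentioned above, the following minimal sketch shows one classical Byzantine-resilient rule, the coordinate-wise median. It is only an illustration of the general technique, not necessarily one of the rules shipped with Garfield, and it does not reflect Garfield's API.

```python
# Minimal sketch of a Byzantine-resilient aggregation rule (coordinate-wise
# median), a classical robust alternative to plain averaging. Illustrative
# only; this is not the Garfield library's interface.
import numpy as np

def coordinate_wise_median(gradients):
    """Aggregate worker gradients by taking the median of each coordinate.

    Unlike plain averaging, the per-coordinate median tolerates a minority of
    arbitrarily corrupted (Byzantine) inputs.
    """
    stacked = np.stack(gradients, axis=0)   # shape: (n_workers, n_params)
    return np.median(stacked, axis=0)

# Toy usage: two honest workers and one Byzantine worker sending garbage.
honest_1 = np.array([0.10, -0.20, 0.05])
honest_2 = np.array([0.12, -0.18, 0.04])
byzantine = np.array([1e6, -1e6, 1e6])     # arbitrary (malicious) values
print(coordinate_wise_median([honest_1, honest_2, byzantine]))
# -> close to the honest gradients, despite the corrupted input
```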
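ERA's second elasticity dimension, changing the per-GPU batch size mid-training, can be sketched in PyTorch as below. The helper function and the linear learning-rate scaling are assumptions made for illustration, not ERA's actual interface.

```python
# Hedged sketch of changing the per-GPU batch size mid-training by rebuilding
# the data loader and rescaling the learning rate. Names are illustrative,
# not ERA's API.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def set_batch_size(new_bs, base_bs=32, base_lr=0.1):
    """Rebuild the loader for a new per-GPU batch size and scale the LR linearly."""
    for group in optimizer.param_groups:
        group["lr"] = base_lr * new_bs / base_bs
    return DataLoader(dataset, batch_size=new_bs, shuffle=True)

loader = set_batch_size(32)    # initial per-GPU batch size
# ... later, a scheduler decides the GPU is under-utilized and enlarges the batches:
loader = set_batch_size(128)   # 4x larger batches, learning rate scaled accordingly
```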
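The kind of split that HAPI exploits, running the feature-extraction prefix of a pretrained model next to storage so that only compact features cross the network to the compute tier, can be sketched as follows with PyTorch and torchvision. The model choice, split point, and shapes are illustrative assumptions, not HAPI's implementation.

```python
# Hedged sketch of splitting a transfer-learning model: the frozen feature
# extractor would run next to the object store, and only the small features
# and classifier head live on the compute tier. Illustrative assumptions only.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18()   # randomly initialized here; TL would load a pretrained checkpoint
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # everything but the final fc
feature_extractor.eval()       # frozen during transfer learning

head = nn.Linear(512, 10)      # small task-specific classifier, trained on the compute tier

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)              # a batch read from the object store
    features = feature_extractor(images).flatten(1)   # (4, 512): far smaller than the raw images

logits = head(features)        # forward pass on the compute tier
print(logits.shape)            # torch.Size([4, 10])
```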
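Finally, a hedged sketch of what Kullback-Leibler weighting could look like on the server side: devices whose local label distribution is closer to the global one receive a larger aggregation weight. The exact weighting scheme used by FeGAN may differ; the functions below are illustrative.

```python
# Hedged sketch of KL-based aggregation weights in the spirit of FeGAN.
# The exact weighting used by FeGAN may differ.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_weights(local_label_dists, global_label_dist):
    """Map per-device KL divergences to normalized aggregation weights."""
    divs = np.array([kl_divergence(d, global_label_dist) for d in local_label_dists])
    scores = np.exp(-divs)            # smaller divergence -> larger weight
    return scores / scores.sum()

# Toy usage: three devices with label distributions over four classes.
global_dist = [0.25, 0.25, 0.25, 0.25]
devices = [
    [0.24, 0.26, 0.25, 0.25],   # nearly balanced
    [0.70, 0.10, 0.10, 0.10],   # heavily skewed
    [0.05, 0.05, 0.45, 0.45],   # heavily skewed
]
weights = kl_weights(devices, global_dist)
print(weights)                  # the balanced device gets the largest weight
# The server would then average the device models as: sum_i weights[i] * model_i
```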
