Files

Abstract

Whether it occurs in artificial or biological substrates, {\it learning} is a {distributed} phenomenon in at least two aspects. First, meaningful data and experiences are rarely found in one location, hence {\it learners} have a strong incentive to work together. Second, a learner is itself a distributed system, made of more basic processes; the change in the connections between these basic processes is what allows learning. This generic view encompasses a large set of learning situations, from brains, to metabolic networks in the organism to the data centers where several machines are collaborating to recommend personalized content for a billion-users social media. In both aforementioned aspects, a learning system's ability to cope with the failure of some of its components is crucial. This thesis explores the robustness of learning systems from these two aspects. The first aspect is {\it coarse-grained}, as the unit of failure is a whole learner. The second aspect is {\it fine-grained}, as the unit of failure is the basic component of the learner (e.g. a neuron or a synapse). The first and larger part of this thesis focuses on the coarse-grained aspect. Specifically, we study the robustness of distributed Stochastic Gradient Descent (SGD is the work-horse algorithm behind today's most celebrated success in machine learning). We begin by proving that the standard deployment of SGD today is brittle, as this deployment typically consists of {\it averaging} the inputs from each learner. This leads to harmful consequences as the data that is used in machine learning comes from different and potentially unreliable sources. To account for the various types of failures (data poisoning from hackers, software bugs, communication delays etc.), we adopt the general abstraction of arbitrary failures in distributed systems, namely, the {\it Byzantine failures} abstraction. We provide a sufficient condition for SGD to be Byzantine resilient and present three algorithms that satisfy our condition under different configurations. The key algorithms that are introduced by this thesis are (1)~Krum, a gradient aggregation rule (GAR) that we prove to be a robust alternative to averaging in synchronous settings; (2)~Bulyan, a meta-algorithm that we prove to strengthen any given GAR in very high dimensional situations and (3)~Kardam, a gradient filtering scheme that we prove to be Byzantine resilient in the more challenging asynchronous setting. For each of our algorithms, we also provide a few variants as well as a discussion of their practical limitations. The second part of this thesis goes down to the fine-grained aspect. We focus on the special case of (artificial) neural networks. We view these networks as a weighted directed graph and prove upper bounds on the {\it forward propagated error} when the basic components (neurons and synapses) are failing. We also discuss the limitation of these bounds, how they could apply to future neuromorphic hardware and how they could inform on other systems such as biological (metabolic) networks.

Details

Actions

Preview