AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation

We present AGGREGATHOR, a framework that implements state-of-the-art robust (Byzantine-resilient) distributed stochastic gradient descent. Following the standard parameter server model, we assume that a minority of worker machines may be controlled by an adversary and behave arbitrarily. This setting has been studied theoretically, with several existing approaches relying on a robust aggregation of the workers’ gradient estimates. The open question is whether a Byzantine-resilient aggregation can leverage more workers to speed up learning. We answer this theoretical question and implement these state-of-the-art approaches in AGGREGATHOR to assess their practical costs. We built AGGREGATHOR around TensorFlow and introduce modifications to vanilla TensorFlow to make it usable in an actual Byzantine setting. AGGREGATHOR also permits unreliable gradient transfer over UDP, providing a further speed-up (without losing accuracy) over TensorFlow’s native TCP-based communication in saturated networks. We quantify the overhead of Byzantine resilience in AGGREGATHOR at 19% and 43% (to ensure weak and strong Byzantine resilience, respectively) compared to vanilla TensorFlow.
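To illustrate the core idea of robust gradient aggregation, here is a minimal sketch of one classic Byzantine-resilient rule, the coordinate-wise median: with n workers of which fewer than n/2 are Byzantine, the median of each gradient coordinate remains bounded by honest values. This is an illustrative example only, not necessarily the aggregation rule AGGREGATHOR itself ships with; the function name and toy gradients are assumptions for the sketch.

```python
from statistics import median

def coordinate_wise_median(gradients):
    """Aggregate worker gradients coordinate by coordinate.

    gradients: list of equal-length lists, one gradient estimate per worker.
    Returns the per-coordinate median, which a minority of arbitrary
    (Byzantine) workers cannot drag away from the honest values.
    """
    dim = len(gradients[0])
    return [median(g[i] for g in gradients) for i in range(dim)]

# Toy example: 4 honest workers near the true gradient, 1 Byzantine outlier.
honest = [[1.0 + 0.1 * i, -2.0 + 0.1 * i] for i in range(4)]
byzantine = [[1e6, -1e6]]  # arbitrary adversarial gradient
agg = coordinate_wise_median(honest + byzantine)
# The outlier is ignored; agg stays within the honest range.
```

A naive averaging rule would instead be pulled arbitrarily far by a single Byzantine worker, which is why robust aggregation is the key building block in this setting.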

Presented at:
The Conference on Systems and Machine Learning (SysML), 2019, Stanford, CA, USA, March 31 - April 2, 2019
