Infoscience
 
conference paper

Genuinely Distributed Byzantine Machine Learning

El Mhamdi, El Mahdi • Guerraoui, Rachid • Guirguis, Arsany Hany Abdelmessih • Hoang, Le Nguyen • Rouault, Sébastien Louis Alexandre
August 3, 2020
PODC '20: Proceedings of the 39th Symposium on Principles of Distributed Computing
The ACM Symposium on Principles of Distributed Computing (PODC)

Machine Learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. Clearly, such an architecture is prone to various types of component failures, all of which can be encompassed within the spectrum of Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers. Yet all require trusting a central parameter server. We initiate in this paper the study of the "general" Byzantine-resilient distributed machine learning problem where no individual component is trusted. In particular, we distribute the parameter server computation on several nodes. We show that this problem can be solved in an asynchronous system, despite the presence of ⅓ Byzantine parameter servers and ⅓ Byzantine workers (which is optimal). We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes. The first, Scatter/Gather, is a communication scheme whose goal is to bound the maximum drift among models on correct servers. The second, Distributed Median Contraction (DMC), leverages the geometric properties of the median in high-dimensional spaces to bring the parameters on correct servers back close to each other, ensuring learning convergence. The third, Minimum-Diameter Averaging (MDA), is a statistically-robust gradient aggregation rule whose goal is to tolerate Byzantine workers. MDA requires a loose bound on the variance of non-Byzantine gradient estimates, compared to existing alternatives (e.g., Krum [12]). Interestingly, ByzSGD ensures Byzantine resilience without adding communication rounds (on a normal path), compared to vanilla non-Byzantine alternatives. ByzSGD requires, however, a larger number of messages, which, we show, can be reduced if we assume synchrony. We implemented ByzSGD on top of TensorFlow, and we report on our evaluation results. In particular, we show that ByzSGD achieves convergence in Byzantine settings with around 32% overhead compared to vanilla TensorFlow. Furthermore, we show that ByzSGD's throughput overhead is 24–176% in the synchronous case and 28–220% in the asynchronous case.
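As an illustration of the aggregation rules named in the abstract, the sketch below shows, in NumPy, one way Minimum-Diameter Averaging and the per-coordinate median behind Distributed Median Contraction could be realized. It is not the paper's implementation (which builds on TensorFlow); the function names, the brute-force subset search, and the n > 2f assumption are choices made here for readability only.

# Minimal sketch of the two aggregation primitives described in the abstract.
# Assumptions (not from the paper's code): NumPy arrays, exhaustive subset
# search, and at most f Byzantine workers with n > 2f workers in total.
from itertools import combinations
import numpy as np

def minimum_diameter_averaging(gradients, f):
    """Average the subset of n - f gradients with the smallest pairwise
    L2 diameter, discarding up to f (possibly Byzantine) outliers."""
    n = len(gradients)
    assert n > 2 * f, "this sketch assumes n > 2f workers"
    best_subset, best_diameter = None, float("inf")
    # Enumerate every candidate subset of size n - f (exponential in n;
    # acceptable only for small worker counts, as in this illustration).
    for subset in combinations(range(n), n - f):
        diameter = max(
            (np.linalg.norm(gradients[i] - gradients[j])
             for i, j in combinations(subset, 2)),
            default=0.0,
        )
        if diameter < best_diameter:
            best_subset, best_diameter = subset, diameter
    # Average the selected, mutually close gradients.
    return np.mean([gradients[i] for i in best_subset], axis=0)

def coordinate_wise_median(models):
    """Coordinate-wise median of parameter vectors: the primitive that
    Distributed Median Contraction uses to pull the correct servers'
    models back toward each other."""
    return np.median(np.stack(models), axis=0)

The exhaustive subset enumeration is only meant to convey the selection criterion of MDA (keep the tightest cluster of gradients, then average it); any practical deployment would use the paper's optimized aggregation rather than this quadratic-distance, exponential-subset sketch.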

Type
conference paper
DOI
10.1145/3382734.3405695
Author(s)
El Mhamdi, El Mahdi  
Guerraoui, Rachid  
Guirguis, Arsany Hany Abdelmessih  
Hoang, Le Nguyen  
Rouault, Sébastien Louis Alexandre  
Date Issued

2020-08-03

Publisher

Association for Computing Machinery

Published in
PODC '20: Proceedings of the 39th Symposium on Principles of Distributed Computing
ISBN of the book

978-1-4503-7582-5

Total of pages

10

Subjects

ml-ai • distributed machine learning • Byzantine fault tolerance • Byzantine parameter servers
Note

Published in the proceedings of The ACM Symposium on Principles of Distributed Computing (PODC) 2020.

URL
https://dl.acm.org/doi/pdf/10.1145/3382734.3405695
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
DCL  
Event name
The ACM Symposium on Principles of Distributed Computing (PODC)
Event place
Salerno, Italy
Event date
August 3–7, 2020

Available on Infoscience
October 7, 2020
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/172271