GARFIELD: System Support for Byzantine Machine Learning (Regular Paper)

Guerraoui, Rachid; Guirguis, Arsany; Plassmann, Jeremy; Ragot, Anton; Rouault, Sébastien

doi:10.1109/DSN48987.2021.00021

2021

Abstract

We present GARFIELD, a library that transparently makes machine learning (ML) applications, initially built with popular (but fragile) frameworks such as TensorFlow and PyTorch, Byzantine-resilient. GARFIELD relies on a novel object-oriented design that reduces the coding effort and addresses the vulnerability of the shared-graph architecture followed by classical ML frameworks. GARFIELD encompasses various communication patterns and supports computations on CPUs and GPUs, which makes it possible to address the general question of the practical cost of Byzantine resilience in ML applications. We report on the usage of GARFIELD on three main ML architectures: (a) a single server with multiple workers, (b) several servers and workers, and (c) peer-to-peer settings. Using GARFIELD, we highlight interesting facts about the cost of Byzantine resilience. In particular, (a) Byzantine resilience, unlike crash resilience, induces an accuracy loss, (b) the throughput overhead comes more from communication than from robust aggregation, and (c) tolerating Byzantine servers costs more than tolerating Byzantine workers.
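
For readers unfamiliar with the term, the following minimal sketch illustrates what "robust aggregation" means in the single-server, multiple-workers setting. It is not GARFIELD's API, and the abstract does not say which aggregation rules the library implements. The coordinate-wise median used here is just one well-known rule, and the function names and gradient values are hypothetical.

```python
from statistics import mean, median
from typing import List, Sequence


def average(gradients: Sequence[Sequence[float]]) -> List[float]:
    """Plain coordinate-wise mean: a single bad input can move it arbitrarily far."""
    return [mean(coords) for coords in zip(*gradients)]


def coordinate_wise_median(gradients: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise median: stays close to the honest values as long as
    honest inputs form a majority in every coordinate."""
    return [median(coords) for coords in zip(*gradients)]


# Hypothetical gradients: four honest workers and one Byzantine worker
# that sends an arbitrary, very large vector.
honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9], [1.0, -1.0]]
byzantine = [[1e6, 1e6]]
received = honest + byzantine

print("mean:  ", average(received))
print("median:", coordinate_wise_median(received))
```

In this toy run the mean is dragged far away from the honest gradients by the single Byzantine vector, while the median stays at the honest values. Rules of this kind typically trade some accuracy for that robustness, which is in line with finding (a) of the abstract.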

Details

Title GARFIELD: System Support for Byzantine Machine Learning (Regular Paper)

Author(s) Guerraoui, Rachid ; Guirguis, Arsany ; Plassmann, Jeremy ; Ragot, Anton ; Rouault, Sébastien

Published in 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Pages 39-51

Conference 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Taipei, Taiwan, June 21-24, 2021

Date 2021-06-21

Publisher IEEE

ISBN 978-1-665411-94-3

Keywords

ml-ai; Distributed Machine Learning; Byzantine Fault Tolerance; Robust Machine Learning

DOI https://doi.org/10.1109/DSN48987.2021.00021

Additional link Link to code

Laboratories DCL

Record Appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > DCL - Distributed Computing Laboratory
Peer-reviewed publications
Conference Papers
Work produced at EPFL

Record creation date 2021-09-07
