Practical Byzantine-resilient Stochastic Gradient Descent
Algorithms are everywhere.
The recipe for the frangipane cake is an algorithm.
If all the listed ingredients are available and the cook is sufficiently deft, after a finite number of small, well-defined steps a delicious dessert will exit the oven.
Now, what if the grocer who sold the cook the ingredients mislabeled some of them?
If salt had been mistakenly labeled and sold as sugar, those tasting the cake would probably be disgusted and stop eating.
If a dangerous drug had been maliciously sold as sugar, the consequences could be much more awful.
When it comes to security, machine learning is perhaps closer to cooking than it is to other families of computer programs.
Follow the recipe for the Margherita pizza using cream and onions instead of tomatoes and mozzarella, and you may pull a Flammekueche out of the oven.
Likewise, what the practitioner gets at the end of the training certainly depends on the relevance of the model and other hyperparameters, but ultimately the output parameters are defined by the inputs: the training set.
While for cookery we simply trust the supply chain and blame the cook for mistakes, we should expect machine learning algorithms to gracefully handle (partially) malformed and malicious datasets and inputs, as for any other computer program.
Byzantine machine learning studies malicious behaviors during the training phase.
A particular, growing body of work has been tackling Byzantine failures in the context of distributed Stochastic Gradient Descent (SGD).
A central server distributes and safely aggregates gradients from several worker nodes, some of them being adversarial.
On the theoretical side, this thesis tries to advance the state-of-the-art on two fronts.
Existing works had always considered a trusted, central parameter server.
We propose a novel algorithm that has the same proven guarantees as previous works, but does not require any node to be trusted and can operate under network asynchrony.
Two other contributions tackle problems either in the construction or an assumption common to statistically-robust Gradient Aggregation Rule (GAR), that made them very vulnerable to attacks in practice.
We put a substantial emphasis on devising pragmatic, easy to implement and computationally cheap techniques to address these issues.
On the system side, this thesis implements many of the existing and contributed GARs, integrated into two major machine learning frameworks (PyTorch and TensorFlow).
These algorithms are assesses from microbenchmarks on a single GPU to actual networked deployments on many nodes.
The associated code has been open-sourced.
EPFL_TH8999.pdf
n/a
openaccess
Copyright
7.12 MB
Adobe PDF
a246a2fe3d3d6520f4446440e2c2452c