In this thesis, we focus on the problem of achieving practical privacy guarantees in machine learning (ML), where classic differential privacy (DP) fails to maintain a good trade-off between user privacy and data utility. The differential privacy guarantee can be heavily influenced by extreme outliers or by samples outside the data distribution. For example, when protecting a classification model for magnetic resonance imaging (MRI), differentially private mechanisms add enough noise to hide any image in a space of the same dimensionality. That includes images that do not belong to the intended data distribution (cars, houses, animals, and so on). Such generality inevitably yields poor privacy guarantees. Based on these observations and the ideas of DP, we propose a data-aware approach to privacy in machine learning. We design two novel privacy notions, Average-Case Differential Privacy (ADP) and Bayesian Differential Privacy (BDP), which take information about the data distribution into account and significantly improve the privacy-utility balance.
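For reference, the worst-case nature of the classic guarantee is visible in the standard definition of (ε, δ)-differential privacy, restated here in common textbook notation (the symbols A, D, D', and S are not taken from the abstract itself). A randomized mechanism A is (ε, δ)-differentially private if

```latex
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(D') \in S] + \delta
\qquad \text{for all adjacent datasets } D, D' \text{ and all output sets } S .
```

Because the inequality must hold for every pair of adjacent datasets, it also covers datasets that differ in a record far outside the intended data distribution, which is what forces the noise to be calibrated to the worst case.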
First, we present average-case differential privacy, an empirical privacy notion designed for ex post privacy analysis of generative models and privacy-preserving data publishing. It relaxes the worst-case requirement of differential privacy to the average case and relies on empirical estimation to deal with undefined distributions. This notion can be regarded as a statistical sensitivity measure: it measures the expected change in the model outcomes given a change in the inputs generated by an observed distribution.
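The contrast between worst-case and average-case sensitivity can be illustrated with a toy query. The following is a minimal sketch and not the thesis's ADP definition or estimator: it assumes a mean query over records clipped to [0, 1], a synthetic Gaussian "observed distribution", and hypothetical helper names (`mean_query`, `replacement_changes`, `sample_record`) chosen only for illustration.

```python
import random

def mean_query(data):
    """Toy 'model outcome': the mean of the records."""
    return sum(data) / len(data)

def replacement_changes(data, candidates, query=mean_query):
    """Absolute change in the query output when one record is replaced
    by a candidate record (all record/candidate pairs are tried)."""
    base = query(data)
    changes = []
    for i in range(len(data)):
        for c in candidates:
            neighbour = data[:i] + [c] + data[i + 1:]
            changes.append(abs(query(neighbour) - base))
    return changes

def sample_record(rng):
    """Assumed data distribution: Gaussian around 0.5, clipped to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(0.5, 0.1)))

rng = random.Random(0)
data = [sample_record(rng) for _ in range(50)]
candidates = [sample_record(rng) for _ in range(20)]

# Worst case over the whole domain [0, 1]: replacing 0 by 1 (or vice versa).
worst_case_bound = 1.0 / len(data)

# Empirical average case: replacements drawn from the observed distribution.
changes = replacement_changes(data, candidates)
average_change = sum(changes) / len(changes)

print(f"worst-case sensitivity bound: {worst_case_bound:.5f}")
print(f"average empirical change:     {average_change:.5f}")
```

Under these assumptions the average change is much smaller than the worst-case bound, because replacements drawn from the observed distribution rarely sit at opposite ends of the domain.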
Second, we develop a more rigorous privacy notion, Bayesian differential privacy, based on the same high-level principle of a probabilistic sensitivity measure. As the main theoretical contributions of this thesis, we formulate and prove basic properties of Bayesian DP, such as composition, group privacy, and resistance to post-processing, and we develop a novel privacy accounting method for iterative algorithms based on the advanced composition theorem. Furthermore, we show connections between our accountant and the well-known moments accountant, as well as between Bayesian DP and other privacy definitions.
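To make the role of composition in iterative algorithms concrete, here is a small sketch of the standard advanced composition theorem for classic (ε, δ)-DP, compared with basic composition. It is not the Bayesian accountant developed in the thesis; the function names and the example parameters (per-step ε, δ, number of steps, and slack δ') are assumptions picked for illustration.

```python
import math

def basic_composition(eps, delta, k):
    """k-fold basic composition: the privacy parameters add up linearly."""
    return k * eps, k * delta

def advanced_composition(eps, delta, k, delta_prime):
    """k-fold advanced composition for (eps, delta)-DP mechanisms.

    The composed mechanism is (eps_total, k * delta + delta_prime)-DP with
    eps_total = sqrt(2 k ln(1 / delta_prime)) * eps + k * eps * (e^eps - 1).
    """
    eps_total = (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1))
    return eps_total, k * delta + delta_prime

# Hypothetical training run: 10,000 noisy steps, each (0.01, 1e-7)-DP.
eps_step, delta_step, steps = 0.01, 1e-7, 10_000

basic_eps, basic_delta = basic_composition(eps_step, delta_step, steps)
adv_eps, adv_delta = advanced_composition(eps_step, delta_step, steps,
                                          delta_prime=1e-5)

print(f"basic:    eps = {basic_eps:.3f}, delta = {basic_delta:.2e}")
print(f"advanced: eps = {adv_eps:.3f}, delta = {adv_delta:.2e}")
```

For many small steps, the advanced bound grows roughly with the square root of the number of steps instead of linearly, which is why accounting methods for iterative training build on it or tighten it further.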
Our practical contributions and evaluation branch into three main areas: (1) privacy-preserving data release using generative adversarial networks (GANs); (2) private classification using convolutional neural networks and other ML models; and (3) private federated learning (FL) for both discriminative and generative models. We demonstrate that both notions achieve considerably higher utility than differential privacy, and that Bayesian DP provides a superior trade-off between privacy guarantees and the output model quality in all settings.