Privacy and Confidentiality in Machine Learning and Data Analysis: Understanding Risks and Developing Protections
Without the ability to collect, access, and analyze data, most of today's research would be impossible. Without data to learn from, the field of machine learning (ML) would not exist.
However, much of the most useful data, such as medical records, human behavioral data, and communications data, is privacy-sensitive or confidential, and it has been shown that such data can be extracted from ML models and from the results of data analyses.
It is thus important to 1) study how privacy and confidentiality are affected when sensitive data is analyzed or used for model training, and 2) develop methods to prevent privacy and confidentiality violations.
In this thesis, we make progress on both of these problems.
A particularly successful framework for protecting privacy is differential privacy (DP). It allows for designing randomized database functions, so-called DP mechanisms, whose outputs reveal only a controlled amount of information about any given record in the database.
An important and widely used property of DP is that it supports composition, i.e., it allows for invoking multiple mechanisms on the same database while preserving some level of privacy. This ability to access a database multiple times is essential for many ML and data analysis tasks.
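As standard background rather than a contribution of this thesis, the $(\varepsilon, \delta)$-DP guarantee and the basic sequential composition bound can be stated as follows, where $M$ is a mechanism, $D$ and $D'$ are databases differing in a single record, and $S$ is any set of outputs:
\[
\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta,
\]
and invoking mechanisms $M_1, \dots, M_k$ that are $(\varepsilon_i, \delta_i)$-DP on the same database is, in the basic case, $\big(\sum_{i=1}^{k} \varepsilon_i,\; \sum_{i=1}^{k} \delta_i\big)$-DP, so the individual privacy costs add up.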
The first part of the thesis leverages new insights into how privacy degrades under composition to make DP applicable in novel scenarios and to improve the quality of mechanism outputs.
Our first result is based on the observation that, for many mechanisms, some outputs reveal less information about records in the input database than others. We show that whenever a mechanism produces such an output, the mechanisms that follow it in the composition may reveal correspondingly more information about database records, which improves output quality.
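The following Python sketch is only a hypothetical illustration of the bookkeeping that such adaptive accounting involves, not the accounting scheme developed in this thesis; the per-call charge is simply passed in as a number, whereas in the setting above it would be derived from the output the mechanism actually produced.

import numpy as np

class PrivacyBudget:
    """Toy (epsilon-only) budget tracker: each mechanism call is charged a
    cost, and the remaining budget determines how much later calls may spend.
    In the adaptive setting sketched above, the charge for a call could be
    lower when its realized output reveals less about individual records."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total - self.spent

    def charge(self, realized_epsilon):
        if self.spent + realized_epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += realized_epsilon


def laplace_count(data, predicate, epsilon, rng):
    """epsilon-DP counting query via the Laplace mechanism (sensitivity 1)."""
    true_count = sum(1 for record in data if predicate(record))
    return true_count + rng.laplace(scale=1.0 / epsilon)


rng = np.random.default_rng(0)
budget = PrivacyBudget(total_epsilon=1.0)
data = [31, 45, 27, 62, 55]  # hypothetical ages

# First query: spend half the budget.
eps1 = 0.5
over_40 = laplace_count(data, lambda age: age > 40, eps1, rng)
budget.charge(eps1)

# Second query: spend whatever is left; with output-dependent accounting,
# a "cheap" first output would leave more budget available at this point.
eps2 = budget.remaining()
under_30 = laplace_count(data, lambda age: age < 30, eps2, rng)
budget.charge(eps2)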
We further analyze the case where mechanisms access only a (deterministic or random) subset of the available database records.
We develop a framework that allows for easily expressing such constraints on the accessed records and for using the constraints to derive improved privacy guarantees.
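Two standard results of this flavor, included here only as background, are parallel composition (mechanisms that access disjoint subsets of the records contribute the maximum rather than the sum of their privacy costs) and privacy amplification by subsampling: if an $\varepsilon$-DP mechanism is run on a random subset that contains each record independently with probability $q$ (Poisson subsampling), the overall procedure satisfies
\[
\varepsilon' = \ln\!\big(1 + q\,(e^{\varepsilon} - 1)\big) \approx q\,\varepsilon \quad \text{for small } \varepsilon .
\]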
In the second part of the thesis, we consider the stricter local differential privacy (LDP) framework for protecting privacy even against the organization collecting the data. LDP is often less practical than DP with a trusted data collector since it makes accessing the data adaptively multiple times more difficult. We propose a solution to this problem. Instead of computing functions directly on the privacy-sensitive data, we use this data to reweight records in a dataset without privacy constraints, such as a public dataset. ML and data analysis can then be performed on this reweighted dataset without access restrictions.
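As a deliberately simplified, hypothetical illustration of this reweighting idea (not the protocol developed in the thesis), the Python sketch below uses $\varepsilon$-LDP randomized response to estimate the marginal of a single binary attribute from the private users and then reweights a public dataset so that its marginal matches the estimate; the attribute, dataset sizes, and two-bucket setup are assumptions made only for the example.

import numpy as np

def randomized_response(bit, epsilon, rng):
    """epsilon-LDP randomized response for one binary attribute:
    report the true bit with probability p = e^eps / (e^eps + 1), else flip it."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_marginal(reports, epsilon):
    """Unbiased estimate of the true fraction of ones from the noisy reports."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    observed = np.mean(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
epsilon = 1.0

# Hypothetical private bits (e.g., "has condition X") held by individual users;
# each user only ever sends a randomized report to the data collector.
private_bits = rng.binomial(1, 0.3, size=10_000)
reports = [randomized_response(b, epsilon, rng) for b in private_bits]
pi_private = np.clip(estimate_marginal(reports, epsilon), 0.0, 1.0)

# Public dataset with the same binary attribute but a different marginal.
public_bits = rng.binomial(1, 0.5, size=5_000)
pi_public = public_bits.mean()

# Importance-style weights that make the public marginal match the LDP estimate;
# downstream ML and analysis can then use (public_bits, weights) without further
# privacy restrictions, since the private data was only touched through LDP reports.
weights = np.where(public_bits == 1,
                   pi_private / pi_public,
                   (1.0 - pi_private) / (1.0 - pi_public))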
Lastly, we move beyond individual records and show how other types of information revealed by ML models can pose privacy and confidentiality risks. We first consider information about the training distribution and identify the sources through which it leaks, an analysis that enables principled mitigation strategies. We then turn to widely successful large language models, which can memorize and reveal yet other types of information about the data on which they were trained, such as facts or information about their alignment. We propose a taxonomy for these different types of memorization and analyze the implications for privacy, confidentiality, and beyond.