As large, data-driven artificial intelligence models become ubiquitous, guaranteeing high data quality becomes imperative for constructing them. Crowdsourcing, community sensing, and data filtering have long been the standard approaches to guaranteeing or improving data quality. The underlying theory, mainly incentive mechanism design, is often limited in its scope of applicability. In particular, the subset of incentive mechanisms designed to handle unverifiable or inherently subjective data, known as Peer Prediction mechanisms, is generally applicable only to settings where the data signal comes from a discrete distribution. In this thesis, we expand the scope of applicability of Peer Prediction mechanisms in two parts.
In the first part, we address a constrained extension of Peer Prediction that is applicable to machine learning. A data-collecting entity, known as a Center, may not need to learn the full joint distribution of (x, y) pairs; it may only need to learn a parameterized model that minimizes a loss function on that joint distribution. We analyze a statistical measure known as Influence, which can be interpreted as a form of Peer Prediction. We show that the Peer Truth Serum (PTS) is a special case of Influence, and that Influence has desirable game-theoretic properties as an incentive mechanism.
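For orientation, a minimal statement of the standard discrete Peer Truth Serum payment follows; the notation R for the public prior and the affine constants are assumptions for illustration, not necessarily the thesis's exact formulation. An agent reporting x_i who is matched with a peer reporting x_j receives

```latex
% Simplified discrete PTS payment (illustrative notation):
% R is the publicly known prior over report values, \alpha > 0 and \beta are constants.
\tau(x_i, x_j) =
\begin{cases}
  \alpha \, \dfrac{1}{R(x_i)} + \beta & \text{if } x_i = x_j, \\[4pt]
  \beta & \text{otherwise,}
\end{cases}
```

so that matching a peer on an a priori unlikely value is rewarded more than matching on a likely one.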
We then take the analysis of Influence into the regime of data filtering, which is uniquely challenging compared to crowdsourcing. Using asymptotic analysis, we show that, in the limit of infinite samples, the ability to filter training data using Influence is constrained by the degree of corruption in the validation data. However, a finite-sample analysis reveals that one can exceed the quality of the validation data when certain conditions on the higher moments of the data models are met.
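To make the filtering setting concrete, here is a minimal sketch; it is not the thesis's definition of Influence, but a common leave-one-out approximation of it. Each training point's influence is estimated as the change in loss on a (possibly corrupted) validation set when that point is removed, and points with negative influence are discarded. The model choice (LogisticRegression), the function names, and the placeholder variables X_train, y_train, X_valid, y_valid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def validation_loss(model, X_val, y_val):
    """Mean log-loss of `model` on the validation set (integer labels 0..K-1)."""
    proba = model.predict_proba(X_val)
    return -np.mean(np.log(proba[np.arange(len(y_val)), y_val] + 1e-12))

def leave_one_out_influence(X_tr, y_tr, X_val, y_val):
    """Approximate the influence of each training point as the change in
    validation loss when that point is removed from the training set."""
    base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    base_loss = validation_loss(base, X_val, y_val)
    influence = np.empty(len(X_tr))
    for i in range(len(X_tr)):
        mask = np.arange(len(X_tr)) != i
        model_i = LogisticRegression(max_iter=1000).fit(X_tr[mask], y_tr[mask])
        # Positive influence: removing the point increases validation loss,
        # i.e. the point was helping the model.
        influence[i] = validation_loss(model_i, X_val, y_val) - base_loss
    return influence

# Filtering: keep only training points with non-negative estimated influence.
# influence = leave_one_out_influence(X_train, y_train, X_valid, y_valid)
# keep_mask = influence >= 0
```

Note that the filter is only as reliable as the validation set it is computed against, which is the constraint the asymptotic analysis above formalizes.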
In the second part, we move from this constrained extension to the most general extension of Peer Prediction: learning arbitrary distributions. Many crowdsourcing problems involve absolutely continuous distributions, such as Gaussian distributions. The standard approach is to discretize the space and apply a discrete Peer Prediction mechanism. This approach has numerous issues: coarse discretizations result in inaccurate approximations of the distribution and loose incentives, while fine discretizations result in volatile payments, which tend to fail in real-world applications. Rather than seek a better implementation of the current theory, we expand the theory of Peer Prediction itself, considering two approaches.
In the first approach, we still discretize the space, which we call partitioning into bins, but pick from a set of partitions rather than a single one. In this regime, the notion of peer matching in Peer Prediction is generalized through the concept of Peer Neighborhoods. With a reasonable strengthening of the Agent update condition, we obtain a valid extension of the PTS to arbitrary distributions.
The partitioning approach for arbitrary distributions points toward a more precise theory. By changing perspective from partitioning according to the Lebesgue measure on the space of reports to partitioning according to the public probability measure, we obtain a payment function that does not rely on discretization. Using this function as the basis for a mechanism, the Continuous Truth Serum, also reveals solutions to other underlying problems with Peer Prediction, such as the unobserved category problem.