Multi-armed Bandits in Action

Making decisions is part and parcel of being human. Among a set of actions, we want to choose the one that has the highest reward. But the uncertainty of the outcome prevents us from always making the right decision. Making decisions under uncertainty can be studied in a principled way by the exploitation-exploration framework. The multi-armed bandit (MAB) framework is perhaps one of the simplest, and yet one of the most powerful settings to optimize a sequence of choices within an exploitation-exploration framework. In this thesis, I study several MAB related problems, from three different perspectives: I study (1) how machine-learning (ML) can benefit from MAB algorithms, (2) how MAB algorithms can affect humans, and (3) how human interactions can be studied in the MAB framework. (1) Optimization lies at the heart of almost all ML algorithms. Stochastic-gradient descent (SGD) and stochastic-coordinate descent (CD) are perhaps two of the most well known and widely used optimization algorithms. In Chapters 2 and 3, I revisit the datapoint and coordinate-selection procedure of SGD and CD from an MAB point of view. The goal is to reduce the training time of ML models. SGD works by estimating, on the fly, the gradient of the cost function by sampling a datapoint uniformly at random from a training set, and CD works by updating only a single decision variable (coordinate) sampled at random. Updating the model’s parameters based on, respectively, different datapoints or different coordinates, yields various improvements. However, a priori, it is not clear which datapoint or coordinate improves the model the most. I address this challenge by studying these problems in an MAB setting, and I develop algorithms to learn the optimal datapoint or coordinate selection strategies. Our methods often significantly reduce the training time of several machine-learning models. (2) Although some MAB algorithms are designed to improve ML algorithms, they can affect humans’ opinions about the outside world. In the personalized recommender systems, the goal is to predict the preference of a user and to suggest the best content to them. However, recent studies suggest that a personalized algorithm can learn and propagate systematic biases and polarize opinions [Pariser, 2011]. In Chapter 4, to combat bias, I propose to use constraints on the distribution from which a content is selected. The constraints can be designed to ameliorate polarization and biases. I combine the classic MAB setting with these constraints and show how an adaptation of an MAB algorithm can lead to a scalable algorithm with provable guarantees for the constrained setting. Furthermore, I show that our algorithms often significantly outperform the existing algorithms that we could apply to this setting. (3) Interacting with others is one of the main sources of information for us. In Chapter 5, I study how this natural setting in the human world can be studied in the bandit framework. I extend the classic single decision-maker setting of MAB to multiple decision-makers, where a decision-maker observes her neighbors’ decisions and rewards. Presumably, the additional information of the neighbors should improve the decisions. I show how to model such a decision-making process that appeals to the classic MAB framework. I study the new setting, in both stochastic and adversarial MAB frameworks, and I develop algorithms that incorporate the additional knowledge of the peers.


Advisor(s):
Thiran, Patrick
Celis, Laura Elisa
Year:
2020
Publisher:
Lausanne, EPFL
Keywords:
Laboratories:
INDY2




 Record created 2020-02-17, last modified 2020-02-17


Rate this document:

Rate this document:
1
2
3
 
(Not yet reviewed)