Abstract

We consider online convex optimization in the bandit setting. The decision maker knows neither the time-varying cost functions nor their gradients. At each time step, she observes only the value of the cost function at her chosen action. The objective is to minimize the regret, that is, the difference between the cumulative cost she incurs and the cost of the optimal action she could have computed had she known the cost functions a priori. We present a novel algorithm that minimizes the regret over an unconstrained action space. Our algorithm hinges on introducing randomization to approximate the gradients of the cost functions using only their observed values. We establish an almost sure regret bound for the mean values of the actions and an expected regret bound for the actions themselves.
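The core randomization idea can be illustrated with a standard one-point gradient estimator: perturb the action by a small random direction, observe the single cost value, and rescale to obtain a gradient estimate. The function, step size, and dimension below are illustrative assumptions, not the paper's specific algorithm.

```python
import numpy as np

def one_point_gradient_estimate(f, x, delta, rng):
    """Estimate the gradient of f at x from a single function evaluation.

    Samples u uniformly from the unit sphere and returns
    (d / delta) * f(x + delta * u) * u, an unbiased estimate of the
    gradient of a delta-smoothed version of f (and of f itself when
    f is quadratic).
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    return (d / delta) * f(x + delta * u) * u

# Illustrative check: for f(x) = ||x||^2 the true gradient is 2x,
# and averaging many one-point estimates should approach it.
rng = np.random.default_rng(0)
f = lambda x: float(x @ x)
x = np.array([1.0, -2.0])
est = np.mean(
    [one_point_gradient_estimate(f, x, delta=0.1, rng=rng)
     for _ in range(200_000)],
    axis=0,
)
```

Note the bias/variance trade-off: a smaller `delta` reduces the smoothing bias but inflates the variance of each single-sample estimate, which is why bandit algorithms built on such estimators must tune the perturbation radius over time.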

Details