Adversarial linear bandits

In the next few posts we will consider adversarial linear bandits, which, up to a crude first approximation, can be thought of as the adversarial version of stochastic linear bandits. The discussion of the exact nature of the relationship between adversarial and stochastic linear bandits is postponed until a later post. In the present post we consider only the finite action case with $K$ actions.

The model

The learner is given a finite action set $\cA\subset \R^d$ and the number of rounds $n$. An instance of the adversarial problem is a sequence of loss vectors $y_1,\dots,y_n$ where for each $t\in [n]$, $y_t\in \R^d$ (as usual in the adversarial setting, it is convenient to switch to losses). Our standing assumption will be that the scalar loss of any of the actions is in $[-1,1]$:

Assumption (bounded loss): The losses are such that for any $t\in [n]$ and action $a\in \cA$, $| \ip{a,y_t} | \le 1$.

As usual, in each round $t$ the learner selects an action $A_t \in \cA$ and receives and observes a loss $Y_t = \ip{A_t,y_t}$. The learner does not observe the loss vector $y_t$ (if the loss vector is observed, then we call it the full information setting, but this is a topic for another blog). The regret of the learner after $n$ rounds is
\begin{align*}
R_n = \EE{\sum_{t=1}^n Y_t} - \min_{a\in \cA} \sum_{t=1}^n \ip{a,y_t}\,.
\end{align*}
Clearly the finite-armed adversarial bandit framework discussed in a previous post is a special case of adversarial linear bandits corresponding to the choice $\cA = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ are the unit vectors of the $d$-dimensional standard Euclidean basis. The strategy will also be an adaptation of the exponential weights strategy that was so effective in the standard finite-action adversarial bandit problem.

Strategy

As noted, we will adapt the exponential weights algorithm. Like in that setting we need a way to estimate the individual losses for each action, but now we make use of the linear structure to share information between the arms and decrease the variance of our estimators. Let $t\in [n]$ be the index of the current round. Assuming that the loss estimates for action $a\in \cA$ are $(\hat Y_s(a))_{s < t}$, the distribution that the exponential weights algorithm proposes is
\begin{align*}
\tilde P_t(a) = \frac{ \exp\left( -\eta \sum_{s=1}^{t-1} \hat Y_s(a) \right)}
{\sum_{a'\in \cA} \exp\left( -\eta \sum_{s=1}^{t-1} \hat Y_s(a') \right)}\,,
\qquad a\in \cA
\end{align*}
To control the variance of the loss estimates, it will be useful to mix this distribution with an exploration distribution, which we will denote by $\pi$. In particular, $\pi: \cA \to [0,1]$ with $\sum_a\pi(a)=1$. The mixture distribution is
\begin{align*}
P_t(a) = (1-\gamma) \tilde P_t(a) + \gamma \pi(a)\,,
\end{align*}
where $\gamma$ is a constant mixing factor to be chosen later. Then, in round $t$, the algorithm draws an action from $P_t$:
\begin{align*}
A_t \sim P_t(\cdot)\,.
\end{align*}

Recall that $Y_t = \ip{A_t,y_t}$ is the observed loss after taking action $A_t$. The next question is how to estimate $y_t(a) \doteq \ip{a,y_t}$. The idea is to estimate $y_t$ with some vector $\hat Y_t\in \R^d$ and then let $\hat Y_t(a) = \ip{a,\hat Y_t}$.

It remains to construct $\hat Y_t$. For this we will use the least-squares estimator $\hat Y_t = R_t A_t Y_t$, where $R_t$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. In particular, letting $\E_t[X] = \EE{ X \,|\, A_1,\dots,A_{t-1} }$, we calculate
\begin{align*}
\E_t [\hat Y_t ] = R_t \E_t [ A_t A_t^\top ] y_t = R_t \underbrace{\left(\sum_a P_t(a) a a^\top\right)}_{Q_t} y_t\,.
\end{align*}
Hence, using $R_t = Q_t^{-1}$, we get $\E_t [ \hat Y_t ] = y_t$. There is one minor technical detail, which is that $Q_t$ should be non-singular. This means that both the action set $\cA$ and the support of $P_t$ must span $\R^d$. To ensure this we will simply assume that $\cA$ spans $\R^d$, and eventually we will choose $P_t$ in such a way that the span of its support is indeed $\R^d$. The assumption that $\cA$ spans $\R^d$ is non-restrictive, since if it fails we can simply adjust the coordinate system and reduce the dimension.

To summarize, in each round $t$ the algorithm estimates the unknown loss vector $y_t$ by the least-squares estimator $\hat Y_t$, which is then used to estimate the loss of each action:
\begin{align*}
\hat Y_t = Q_t^{-1} A_t Y_t\,,\qquad \hat Y_t(a) = \ip{a,\hat Y_t}\,.
\end{align*}
Given the loss estimates all that remains is to apply the exponential weights algorithm with an appropriate exploration distribution and we will be done. Of course, the devil is in the details, as we shall see.
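To make the strategy concrete, here is a minimal Python sketch of one possible implementation. The function and variable names are our own, the exploration distribution $\pi$ is simply passed in as an argument, and the losses are given as plain arrays; this is a simulation sketch, not a reference implementation.

```python
import numpy as np

def exp2(actions, loss_vectors, eta, gamma, pi, seed=0):
    """Sketch of the exponential weights strategy for adversarial linear bandits.

    actions:      (K, d) array, one row per action in the finite set A
    loss_vectors: (n, d) array, the adversary's loss vectors y_1, ..., y_n
    eta, gamma:   learning rate and exploration mixing factor
    pi:           (K,) exploration distribution over the actions
    """
    rng = np.random.default_rng(seed)
    K, _ = actions.shape
    cum_est = np.zeros(K)            # running sums of estimated losses, one per action
    total_loss = 0.0
    for y in loss_vectors:
        w = np.exp(-eta * (cum_est - cum_est.min()))   # shift for numerical stability
        p = (1.0 - gamma) * w / w.sum() + gamma * pi   # mixture distribution P_t
        p = p / p.sum()
        idx = rng.choice(K, p=p)
        a = actions[idx]
        loss = a @ y                                   # observed scalar loss Y_t
        total_loss += loss
        Q = (actions.T * p) @ actions                  # Q_t = sum_a P_t(a) a a^T
        y_hat = np.linalg.solve(Q, a) * loss           # hat Y_t = Q_t^{-1} A_t Y_t
        cum_est += actions @ y_hat                     # hat Y_t(a) = <a, hat Y_t>
    return total_loss
```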

Regret analysis

By modifying our previous regret proof, one can show that if for each $t\in [n]$,
\begin{align}
\label{eq:exp2constraint}
\eta \hat Y_t(a) \ge -1, \qquad \forall a\in \cA\,,
\end{align}
then the regret is bounded by
\begin{align}
R_n \le \frac{\log K}{\eta} + 2\gamma n + \eta \sum_t \EE{ \sum_a P_t(a) \hat Y_t^2(a) }\,.
\label{eq:advlinbanditbasicregretbound}
\end{align}
Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates will be upper bounded by one. To get a regret bound it remains to set $\gamma, \eta$ so that \eqref{eq:exp2constraint} is satisfied, and we also need to bound $\EE{ \sum_a P_t(a) \hat Y_t^2(a) }$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat Y_t^2(a)$. Since
\begin{align*}
\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2 A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,
\end{align*}
we have $M_t = \sum_a P_t(a) \hat Y_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t\le A_t^\top Q_t^{-1} A_t$ and so
\begin{align*}
\E_t[ M_t ] \le \trace \left(\sum_a P_t(a) a a^\top Q_t^{-1} \right) = d\,.
\end{align*}
It remains to choose $\gamma$ and $\eta$. To begin, we strengthen \eqref{eq:exp2constraint} to $|\eta \hat Y_t(a) | \le 1$ and note that since $|Y_t| \leq 1$,
\begin{align*}
|\eta \hat Y_t(a) | = |\eta a^\top Q_t^{-1} A_t Y_t| \le \eta |a^\top Q_t^{-1} A_t|\,.
\end{align*}
Let $Q(\pi) = \sum_{\nu \in \cA} \pi(\nu) \nu \nu^\top$; then $Q_t \succeq \gamma Q(\pi)$ and so by Cauchy-Schwarz we have
\begin{align*}
|a^\top Q_t^{-1} A_t| \leq \norm{a}_{Q_t^{-1}} \norm{A_t}_{Q_t^{-1}}
\leq \max_{\nu \in \cA} \nu^\top Q_t^{-1} \nu \leq \frac{1}{\gamma}\max_{\nu \in \cA} \nu^\top Q^{-1}(\pi) \nu\,,
\end{align*}
which implies that
\begin{align*}
|\eta \hat Y_t(a)| \leq \frac{\eta}{\gamma} \max_{\nu \in \cA} \nu^\top Q^{-1}(\pi)\nu\,.
\end{align*}
Since this quantity only depends on the action set and the choice of $\pi$, we can make it small by solving a design problem.
\begin{align}
\begin{split}
\text{Find a sampling distribution } \pi \text{ to minimize } \\
\max_{v \in \cA} v^\top Q^{-1}(\pi) v\,.
\end{split}
\label{eq:adlinbanditoptpbl}
\end{align}
If $D$ is the optimal value of the above minimization problem, the constraint \eqref{eq:exp2constraint} will hold whenever $\eta D \le \gamma$. Since the bound \eqref{eq:advlinbanditbasicregretbound} only gets worse as $\gamma$ increases, this suggests choosing $\gamma = \eta D$. Plugging into \eqref{eq:advlinbanditbasicregretbound}, we get
\begin{align*}
R_n \le \frac{\log K}{\eta} +\eta n(2D+d) = 2 \sqrt{ (2D+d) \log(K) n }\,,
\end{align*}
where for the last equality we chose $\eta = \sqrt{\log(K)/(n(2D+d))}$ to minimize the upper bound. Hence, it remains to calculate, or at least bound, $D$. In the next section we will show that $D\le d$ (in fact, equality also holds), which leads to the following result:

Theorem (Worst-case regret of adversarial linear bandits):
Let $R_n$ be the expected regret of the exponential weights algorithm as described above. Then, for an appropriate choice of $\pi$ and an appropriate shifting of $\cA$, if the bounded-loss assumption holds for the shifted action set,
\begin{align*}
R_n \le 2 \sqrt{3dn \log(K)}\,.
\end{align*}

Optimal design, volume minimizing ellipsoids and John’s theorem

It remains to show that $D\le d$. For this we need to exhibit a distribution $\pi$ for which the objective of \eqref{eq:adlinbanditoptpbl} is at most $d$. Consider thus the problem of choosing a distribution $\pi$ on $\cA$ to minimize
\begin{align}
g(\pi) \doteq \max_{v\in \cA} v^\top Q^{-1}(\pi) v\,.
\label{eq:gcrit}
\end{align}
This is known as the $G$-optimal design problem in statistics, studied by Kiefer and Wolfowitz in 1960. In statistics, more precisely in optimal experiment design, $\pi$ would be called a “design”. A $G$-optimal design $\pi$ is one that minimizes the maximum variance of least-squares predictions over the “design space” $\cA$ when the independent variables are chosen with frequencies proportional to the probabilities given by $\pi$ and the response follows a linear model with independent zero-mean noise and constant variance. After all the work we have done for stochastic linear bandits, the reader hopefully recognizes that $v^\top Q^{-1}(\pi) v$ is indeed the variance of the prediction under a least-squares predictor.

Theorem (Kiefer-Wolfowitz, 1960): The following are equivalent:

  • $\pi$ is a minimizer of $f(\pi)\doteq -\log \det Q(\pi)$;
  • $\pi$ is a minimizer of $g$;
  • $g(\pi) = d$.

So there we have it. $D = d$ (with equality!) and our proof of the theorem is complete. A disadvantage of this approach is that it is not very apparent from the above theorem what the optimal sampling distribution $\pi$ actually looks like. We now make an effort to clarify this, and to briefly discuss the computational issues.

Consider the first equivalent claim of the Kiefer-Wolfowitz result, which concerns the problem of minimizing $-\log \det Q(\pi)$, also known as the $D$-optimal design problem ($D$ after “determinant”). The criterion in $D$-optimal design can be given an information theoretic interpretation, but we will find it more useful to consider it from a geometric angle. For this, we will need to consider ellipsoids.

An ellipsoid $E$ with center zero (“central ellipsoid”) is simply the image of the unit ball $B_2^d$ under some nonsingular linear map $L$: $E = L B_2^d$ where for a set $S\subset \R^d$, $LS = \{Lx \,:\, x\in S\}$. Let $H= (LL^\top)^{-1}$. Then, for any vector $y\in E$, $L^{-1}y\in B_2^d$, or $\norm{L^{-1}y}_2^2 = \norm{y}_{H}^2 \le 1$. We shall denote the ellipsoid $LB_2^d$ by $E(0,H)$ (the $0$ signifies that the center of the ellipsoid is at zero). Now $\vol(E(0,H)) = |\det L| \vol(B_2^d) = \mathrm{const}(d)/\sqrt{\det H} = \mathrm{const}(d) \exp(-\frac12 \log \det H)$. Hence, minimizing $-\log \det Q(\pi)$ is equivalent to maximizing the volume of the ellipsoid $E(0,Q^{-1}(\pi))$. By convex duality, one can then show that this is exactly the same problem as minimizing the volume of the central ellipsoid $E(0,H)$ that contains $\cA$. Fritz John’s celebrated result, which concerns minimum-volume enclosing ellipsoids (MVEEs) with no restriction on their center, gives the missing piece:

John’s theorem (1948): Let $\cK\subset \R^d$ be convex, closed and assume that $\mathrm{span}(\cK) = \R^d$. Then there exists a unique MVEE of $\cK$. Furthermore, this MVEE is the unit ball $B_2^d$ if and only if for some $m\in \N$ there exist contact points (“the core set”) $u_1,\dots,u_m$ that belong to both $\cK$ and the surface of $B_2^d$, as well as positive reals $c_1,\dots, c_m$, such that
\begin{align}
\sum_i c_i u_i=0 \quad \text{ and } \quad \sum_i c_i u_i u_i^\top = I.
\label{eq:johncond}
\end{align}

Figure: John’s ellipsoid and the core set.


Note that John’s theorem allows the center of the MVEE to be arbitrary. Again, by shifting the set $\cK$, we can always arrange for the center to be at zero. So how can we use this result? Let $\cK = \co(\cA)$ be the convex hull of $\cA$ and let $E = L B_2^d$ be its MVEE with some nonsingular $L$. Without loss of generality, we can shift $\cA$ so that the center of this MVEE is at zero (why? actually, here we lose a factor of two: think of the bounded loss assumption).

To pick $\pi$, note that the MVEE of $L^{-1} \cK$ is $L^{-1} L B_2^d = B_2^d$. Hence, by the above result, there exist $u_1,\dots,u_m\in L^{-1}\cK \cap \partial B_2^d$ and positive reals $c_1,\dots,c_m$ such that \eqref{eq:johncond} holds. Note that $u_1,\dots,u_m$ must also be elements of $L^{-1} \cA$ (why?). Therefore, $Lu_1,\dots,L u_m\in \cA \cap \partial E$. Taking the trace of the second equality in \eqref{eq:johncond} and using the cyclic property of the trace together with $\norm{u_i}_2 = 1$, we see that $\sum_i c_i = \trace\left(\sum_i c_i u_i u_i^\top\right) = \trace(I) = d$. Hence, we can choose $\pi(L u_i) = c_i/d$, $i\in [m]$ and set $\pi(a)=0$ for all $a\in \cA \setminus \{ L u_1,\dots, L u_m \}$. Therefore
\begin{align*}
Q(\pi) = \frac{1}{d} L \left(\sum_i c_i u_i u_i^\top \right) L^\top = \frac{1}{d} L L^\top\,
\end{align*}
and so
\begin{align}
\sup_{v \in \cA} v^\top Q^{-1}(\pi) v
\leq \sup_{v\in E=LB_2^d} v^\top Q^{-1}(\pi) v
= d \sup_{u: \norm{u}_2\le 1} (L u)^\top (L L^\top)^{-1} (L u) = d\,.
\label{eq:ellipsoiddesign}
\end{align}

Notes and departing thoughts

Note 1: In linear algebra, a frame of a vector space $V$ with an inner product can be seen as a generalization of the idea of a basis to sets which may be linearly dependent. In particular, given $0<A\le B<+\infty$, the vectors $v_1,\dots,v_m \in V$ are said to form an $(A,B)$-frame in $V$ if for any $x\in V$, $A\norm{x}^2 \le \sum_k |\ip{x,v_k}|^2 \le B \norm{x}^2$ where $\norm{x}^2 = \ip{x,x}$. Here, $A$ and $B$ are called the lower and upper frame bounds. When $\{v_1,\dots,v_m\}$ forms a frame, it must span $V$ (why?). However, $\span(\{v_1,\dots,v_m\}) = V$ is not a sufficient condition for $\{v_1,\dots,v_m\}$ to be a frame. The frame is called tight if $A=B$ and in this case the frame obeys a generalized Parseval identity. If $A = B = 1$ then the frame is called Parseval or normalized. A frame can be used almost as a basis. John’s theorem can be seen as asserting that for any convex body, after an appropriate affine transformation of the body, there exists a tight frame inside the transformed convex body. Indeed, if $u_1,\dots,u_m$ are the vectors whose existence is guaranteed by John’s theorem and $\{c_i\}$ are the corresponding coefficients, then $v_i = \sqrt{c_i/d}\, u_i$ form a tight frame inside the transformed body. Given a frame $\{v_1,\dots,v_m\}$, if for $x\in V$ we define $\norm{x}^2 \doteq \sum_k |\ip{x,v_k}|^2$, this is indeed a norm. For the frame coming from John’s theorem, for any $x\in \cK$, $\norm{x}^2\le 1$. Frames are widely used in harmonic (e.g., wavelet) analysis.
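For a concrete illustration, take $\cK\subset \R^2$ to be the equilateral triangle inscribed in the unit circle. Its MVEE is the unit disk, the three vertices form the core set, and $c_1=c_2=c_3=2/3$ satisfies \eqref{eq:johncond}:
\begin{align*}
u_1 = (0,1),\quad u_2 = \left(-\tfrac{\sqrt3}{2},-\tfrac12\right),\quad u_3 = \left(\tfrac{\sqrt3}{2},-\tfrac12\right),
\qquad \sum_{i=1}^3 \tfrac23\, u_i = 0,\quad \sum_{i=1}^3 \tfrac23\, u_i u_i^\top = I\,.
\end{align*}
The rescaled vectors $v_i = \sqrt{c_i/d}\,u_i = u_i/\sqrt{3}$ then form a tight frame as described above.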

Note 2: The prediction with expert advice problem can also be framed as a linear prediction problem with changing action sets. In this generalization, in every round the learner is given $M$ vectors, $a_{1t},\dots,a_{Mt}\in \R^d$, where $a_{it}$ is the recommendation of expert $i$. The environment’s choice for the round is $y_t\in \R^d$. The loss suffered by expert $i$ is $\ip{a_{it},y_t}$. The goal is to compete with the best expert in hindsight by selecting, in every round and based on past information, one of the experts. If $A_t\in \{a_{1t},\dots,a_{Mt}\}$ is the vector recommended by the selected expert, the regret is
\begin{align*}
R_n = \EE{ \sum_{t=1}^n \ip{A_t,y_t} } - \min_{i\in [M]} \sum_{t=1}^n \ip{ a_{it}, y_t }\,.
\end{align*}
The algorithm discussed can be easily adapted to this case. The only change needed is that in every round one needs to choose a new exploration distribution $\pi_t$ that is adapted to the current “action set” $\cA_t = \{ a_{1t},\dots,a_{Mt} \}$. The regret bound proved still holds for this case. The setting of Exp4 discussed earlier corresponds to the case when $\cA_t$ is a subset of the $d$-dimensional simplex. The price of the increased generality of the current result is that while Exp4 enjoyed (in the notation of the current post) a regret of $\sqrt{2n (M \wedge d) \log(M) }$, here we would get $O(\sqrt{n d \log(M)})$ with slightly larger constants.

Note 3 (Computation): The computation of the MVEE is a convex problem when the center of the ellipsoid is fixed, and numerous algorithms with polynomial runtime guarantees have been developed for it. In general, the computation of the MVEE is hard. The approximate computation of optimal designs is also widely researched; one can use, e.g., the Frank-Wolfe algorithm for this purpose, which is known as Wynn’s method in optimal experimental design. John has also shown (implicitly) that the cardinality of the core set is at most $d(d+3)/2$.
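For illustration, here is a minimal Python sketch of the Frank-Wolfe (Wynn) iteration for the design problem. The step size below is the standard exact line-search choice for the $D$-optimal (log-determinant) criterion; the fixed iteration budget, tolerance and names are our own choices.

```python
import numpy as np

def g_optimal_design(actions, max_iters=1000, tol=1e-6):
    """Frank-Wolfe (Wynn) sketch for approximating the G/D-optimal design.
    actions: (K, d) array whose rows are the actions and span R^d."""
    K, d = actions.shape
    pi = np.full(K, 1.0 / K)                     # start from the uniform design
    for _ in range(max_iters):
        Q = (actions.T * pi) @ actions            # Q(pi) = sum_a pi(a) a a^T
        g = np.einsum('ij,jk,ik->i', actions, np.linalg.inv(Q), actions)
        a_star = int(np.argmax(g))                # action with largest predicted variance
        if g[a_star] <= d + tol:                  # Kiefer-Wolfowitz: optimum is reached at d
            break
        # exact line-search step for the log det criterion
        step = (g[a_star] / d - 1.0) / (g[a_star] - 1.0)
        pi = (1.0 - step) * pi
        pi[a_star] += step
    return pi
```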

References

Using John’s theorem for guiding exploration in linear bandits was proposed by

where the exponential weights algorithm adapted to this setting is called Expanded Exp, or Exp2. Our proof departs from the proofs given there in some minor ways (we tried to make the argument a bit more pedagogical).

The review work

also discussed Exp2.

According to our best knowledge, the connection to optimal experimental design through the Kiefer-Wolfowitz theorem and the proof that solely relied on this result has not been pointed out in the literature beforehand, though the connection between the Kiefer-Wolfowitz theorem and MVEEs is well known. It is discussed, for example in the recent nice book of Michael J. Todd

This book also discusses algorithmic issues and the duality mentioned above.

The Kiefer-Wolfowitz theorem appeared in:

John’s theorem is due to Fritz John:

The duality mentioned in the text was proved by Sibson:

  • R. Sibson. Discussion on the papers by Wynn and Laycock. Journal of the Royal Statistical Society, 34:181–183, 1972.

See also:

Finally, one need not find the exact optimal solution to the design problem. An approximation up to reasonable accuracy is sufficient for small regret. The issues of finding a good exploration distribution efficiently (even in infinite action sets) are addressed in

  • Hazan and Karlin.

Sparse linear bandits

In the last two posts we considered stochastic linear bandits, where the actions are vectors in the $d$-dimensional Euclidean space. According to our previous calculations, under the condition that the expected reward of all the actions is in a fixed bounded range (say, $[-1,1]$), the regret of Ellipsoidal-UCB is $\tilde O(d \sqrt{n} )$ ($\tilde O(\cdot)$ hides logarithmic factors). This is definitely an advance when the number of actions $K$ is much larger than $d$, but $d$ itself can be quite large in some applications. In particular, in contextual bandits $d$ would be the dimension of the feature space that the actions are mapped into. The previous result indicates that there is a high price to be paid for having more features. This is not what we want. Ideally, one should be able to “add features” without suffering much additional regret if the added features do not contribute in a significant way. This can be captured by the notion of sparsity, which is the central theme of this post.

Sparse linear stochastic bandits

The sparse linear stochastic bandit problem is the same as the stochastic linear bandit problem with a small difference. Just like in the standard setting, at the beginning of a round with index $t\in \N$ the learner receives a decision set $\cD_t\subset \R^d$. Let $A_t\in \cD_t$ be the action chosen by the learner based on the information available to it. The learner then receives the reward
\begin{align}
X_t = \ip{A_t,\theta_*} + \eta_t\,,
\label{eq:splinmodel}
\end{align}
where $(\eta_t)_t$ is zero-mean noise (subject to the usual subgaussianity assumption) and $\theta_*\in \R^d$ is an unknown vector. The specialty of the sparse setting is that the parameter vector $\theta_*$ is assumed to have many zero entries. Introducing the “$0$-norm”
\begin{align*}
\norm{\theta_*}_0 = \sum_{i=1}^d \one{ \theta_{*,i}\ne 0 }\,,
\end{align*}
the assumption is that
\begin{align*}
p \doteq \norm{\theta_*}_0 \ll d\,.
\end{align*}

Much ink has been spilled on what can be said about the speed of learning in linear models like \eqref{eq:splinmodel} when $\{A_t\}_t$ are passively generated and the parameter vector is known to be sparse. Most results are phrased in terms of recovering $\theta_*$, but there also exist a few results that quantify the speed at which good predictions can be made. The ideal outcome would be that the learning speed depends mostly on $p$, while the dependence on $d$ becomes less severe (e.g., the learning speed is a function of $p \log(d)$). Almost all the results come under the assumption that the vectors $\{A_t\}_t$ are “incoherent”, which roughly means that the Grammian underlying $\{A_t\}_t$ should be well-conditioned. The details are a bit more complicated, but the main point is that these conditions are exactly what the action vectors resulting from a good bandit algorithm will not satisfy: Bandit algorithms want to choose the optimal action as often as possible, while for the standard theory on fast learning under sparsity one needs all directions of the space to be thoroughly explored. We need some approach that does not rely on such strong assumptions.

Elimination on the hypercube

As a warmup problem consider linear bandits when the decision set (of every round) is the $d$-dimensional hypercube: $\cD = [-1,1]^d$. To reduce clutter we will denote the “true” parameter vector by $\theta$. Thus, in every round $t$, $X_t = \ip{A_t,\theta}+\eta_t$. We assume that $\eta_t$ is conditionally $1$-subgaussian given the past:
\begin{align*}
\EE{ \exp(\lambda \eta_t ) | \cF_{t-1} } \le \exp( \frac{\lambda^2}{2} ) \quad \text{for all } \lambda \in \R\,.
\end{align*}
Here $\cF_{t-1} = \sigma( A_1,X_1,\dots,A_{t-1},X_{t-1}, A_t )$. Since conditional subgaussianity comes up frequently, we introduce a notation for it: When $X$ is $\sigma^2$-subgaussian given some $\sigma$-field $\cF$ we will write $X|\cF \sim \subG(\sigma^2)$. Our earlier statement that the sum of independent subgaussian random variables is subgaussian, with a subgaussianity factor that is the sum of the individual factors, also holds for conditionally subgaussian random variables.

We shall also assume that the mean rewards lie in the range $[-1,1]$. That is, $|\ip{a,\theta}|\le 1$ for all $a\in \cD$. Note that this is equivalent to $\norm{\theta}_1\le 1$.

Stochastic linear bandits on the hypercube is notable as it enjoys perfect “separability”: For each dimension $i\in [d]$, $A_{ti}$ can be selected regardless of the choice of $A_{tj}$ for the other dimensions $j\ne i$. It follows among other things that the optimal action can be computed dimension-wise. In particular, the optimal action for dimension $i$ is simply the sign of $\theta_i$. This is good and bad news. In the worst case, this structure means that each sign has to be learned independently. As we shall see later this implies that in the worst case the regret will be as large as $\Omega(d\sqrt{n})$ (our next post will expand on this). However, the separable structure is also good news: An algorithm can estimate $\theta_i$ for each dimension separately while paying absolutely no price for this experimentation when $\theta_i=0$ (because the choice of $A_{ti}$ does not restrict the choice of $A_{tj}$ with $j\ne i$). As we shall see, because of this we can design an algorithm whose regret scales with $O(\norm{\theta}_0\sqrt{n})$ even without knowing the value of $\norm{\theta}_0$.

The algorithm that achieves this is a randomized variant of Explore then Commit. For reasons that will become clear, we will call it “Selective Explore then Commit” or SETC. The algorithm works as follows: SETC keeps an interval estimate $U_{ti}\subset \R$ for each component $\theta_i$. These will be constructed such that $\theta_i$ is contained in $\cap_{t\in [n]} U_{ti}$ with high probability (for simplicity we assume that a horizon $n$ is given to the algorithm; lifting this assumption is left as an exercise to the reader).

Assume for a moment that we are given intervals $(U_{ti})_{t}$ that contain $\theta_i$ with high probability. Consider the event when the intervals actually do contain $\theta_i$. If at some point $t$ it holds that $0\not \in U_{ti}$, then the sign of $\theta_i$ can be determined by examining on which side of $0$ the interval $U_{ti}$ falls, which allows us to set $A_{ti}$ optimally! In particular, if $U_{ti}\subset (0,\infty)$ then $\theta_i>0$ must hold and we will set $A_{ti}$ to $1$, while if $U_{ti}\subset (-\infty,0)$ then $\theta_i<0$ must hold and we set $A_{ti}$ to $-1$. On the other hand, if at some round $t$, $0\in U_{ti}$, then the sign of $\theta_i$ is uncertain and more information needs to be collected on $\theta_i$, i.e., one needs more exploration. In this case, to maximize the information gained, $A_{ti}$ can be chosen at random to be either $1$ or $-1$ with equal probabilities (a random variable that takes on the values $+1$ and $-1$ with equal probabilities is called a Rademacher random variable). This leads to new information from which $U_{ti}$ can be refined (when no exploration is needed, $U_{ti}$ is not updated: $U_{t+1,i} = U_{ti}$). The algorithm, as far as the choice of $A_{ti}$ is concerned, is summarized in the figure shown below.

Figure: The action choices in the SETC algorithm. In each dimension $i$ an interval is kept that holds $\theta_i$ with high probability. If the interval for a given dimension does not contain $0$, the action for that dimension is chosen as $+1$ or $-1$ depending on which side of zero the interval lies. When the interval contains zero, the action is selected at random to be $+1$ or $-1$ with equal probabilities. In this, and only in this case, the interval for $\theta_i$ is updated. For a detailed explanation, see the text.

To derive the form of $U_{t+1,i}$ consider $A_{ti} X_t$, the correlation of $A_{ti}$ and the reward observed. Expanding on the definition of $X_t$, using that $A_{ti}^2=1$, we see that
\begin{align*}
A_{ti} X_t = \theta_i + \underbrace{A_{ti} \sum_{j\ne i} A_{tj} \theta_j + A_{ti} \eta_t}_{Z_{ti}}\,.
\end{align*}
This suggests setting
\begin{align*}
\hat \theta_{ti} = \frac{1}{T_i(t)} \sum_{s=1}^t E_{si} A_{si} X_s\,,\quad
c_{ti} = 2\sqrt{ \frac{ \log(2n^2)}{T_i(t)} }\, \text{ and }\,\,
U_{t+1,i} = [\hat \theta_{ti} - c_{ti}, \hat \theta_{ti} + c_{ti} ]\,,
\end{align*}
where $E_{si}\in\{0,1\}$ is the indicator of whether dimension $i$ is explored in round $s$ and $T_i(t) = \sum_{s=1}^t E_{si}$ is the number of such exploration rounds up to round $t$ (recall that the interval is only updated after exploration rounds).
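The following Python snippet sketches one way to simulate SETC as just described; Gaussian noise and the variable names are our own choices made for concreteness, not part of any reference implementation.

```python
import numpy as np

def setc(theta, n, rng=None):
    """Sketch of Selective Explore-then-Commit on the hypercube [-1,1]^d.
    theta: true parameter with ||theta||_1 <= 1; returns the realized pseudo-regret."""
    rng = rng or np.random.default_rng(0)
    d = len(theta)
    sums = np.zeros(d)                # sum of A_{si} X_s over exploration rounds
    counts = np.zeros(d)              # T_i(t): number of exploration rounds so far
    lo, hi = np.full(d, -np.inf), np.full(d, np.inf)   # current intervals U_{ti}
    reward = 0.0
    for _ in range(n):
        explore = (lo <= 0) & (hi >= 0)                # sign of theta_i still uncertain
        commit = np.where(lo > 0, 1.0, -1.0)           # sign determined by the interval
        A = np.where(explore, rng.choice([-1.0, 1.0], size=d), commit)
        X = A @ theta + rng.standard_normal()          # reward with 1-subgaussian noise
        reward += A @ theta
        # update the interval estimates only in the explored dimensions
        sums[explore] += A[explore] * X
        counts[explore] += 1.0
        width = 2.0 * np.sqrt(np.log(2 * n**2) / counts[explore])
        mean = sums[explore] / counts[explore]
        lo[explore], hi[explore] = mean - width, mean + width
    return n * np.abs(theta).sum() - reward
```

For example, `setc(np.array([0.5, -0.3, 0.0, 0.0]), n=10000)` keeps exploring the last two coordinates throughout (at no cost, since the corresponding $\theta_i$ are zero) while quickly committing in the first two.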

To explain the choice of $c_{ti}$ we need a version of the subgaussian concentration inequality, which was stated earlier for sequences of independent subgaussian random variables. In our case $(Z_{ti})_t$, as we shall soon see, are conditionally subgaussian. Using Chernoff’s method it is not hard to prove the following result:

Lemma (Hoeffding-Azuma): Let $\cF\doteq (\cF_t)_{0\le t\le n}$ be a filtration, $(Z_t)_{t\in [n]}$ be an $\cF$-adapted martingale difference sequence such that $Z_t$ is conditionally $\sigma^2$-subgaussian given $\cF_{t-1}$: $Z_t|\cF_{t-1} \sim \subG(\sigma^2)$. Then, $\sum_{t=1}^n Z_t$ is $n\sigma^2$ subgaussian and for any $\eps>0$,
\begin{align*}
\Prob{ \sum_{t=1}^n Z_t \ge \eps } \le \exp( – \frac{ \eps^2}{2n\sigma^2} )\,.
\end{align*}

Proof: Let $S_t = \sum_{s=1}^t Z_s$ with $S_0 = 0$. It suffices to show that $S_n$ is $n\sigma^2$-subgaussian. To show this, note that for $t\ge 1$, $\lambda\in \R$,
\begin{align*}
\EE{ \exp(\lambda S_t) } = \EE{\exp(\lambda S_{t-1}) \EE{ \exp(\lambda Z_t) | \cF_{t-1}} } \le \exp(\frac{\lambda^2\sigma^2}{2}) \EE{ \exp(\lambda S_{t-1}) }\,.
\end{align*}
Since $\exp(\lambda S_0)=1$, chaining the inequalities obtained we get the desired bound.
QED.

Sometimes, it is more convenient to use the “deviation” form of the above inequality which states that for any $\delta\in (0,1)$, with probability $1-\delta$,
\begin{align*}
\sum_{t=1}^n Z_t \le \sqrt{ 2 n \sigma^2 \log(1/\delta) }\,.
\end{align*}
This follows from the tail inequality above by setting $\eps = \sqrt{2n\sigma^2\log(1/\delta)}$.

The Hoeffding-Azuma inequality together with some union bounds leads to the following lemma:

Lemma (Reliable Intervals): Let $F_i = \{ \exists t\in [n] \text{ s.t. } \theta_i \not \in U_{ti} \}$ be the “failure” event that some of the intervals $U_{ti}$ fails to include $\theta_i$. Then,
\begin{align*}
\Prob{F_i} \le \frac{1}{n}\,.
\end{align*}

The proof will be provided at the end of the section.

Let $R_{ni} = n|\theta_i|-\EE{\sum_{t=1}^n A_{ti}\theta_i }$ and let $R_n = \max_{a\in \cD} n \ip{a,\theta} - \EE{ \sum_{t=1}^n \ip{A_t,\theta}}$ be the regret of SETC. Then, $R_n = \sum_{i:\theta_i\ne 0} R_{ni}$. Hence, it suffices to bound $R_{ni}$ for $i$ such that $\theta_i\ne 0$. Now, when the confidence intervals for dimension $i$ do not fail, i.e., on $F_i^c$, the regret is at most $2|\theta_i| \tau_i$ where $\tau_i = \sum_{s=1}^n E_{si}$ is the number of exploration steps for dimension $i$. On the same event, we claim that $\tau_i\le 1 + 16 \log(2n^2)/|\theta_i|^2$ also holds. To see this it is enough to show that $\tau_i \le t$ for any $t>t_0\doteq 16 \log(2n^2) / |\theta_i|^2$. Pick such a $t$ and observe that $t>t_0$ is equivalent to $4 \sqrt{ \log(2n^2)/t} < | \theta_i|$. By definition, $c_{ti} = 2 \sqrt{ \log(2n^2)/(\tau_i\wedge t) }$. If $\tau_i\le t$ there is nothing to prove. Hence, assume that $\tau_i>t$. Then, $\tau_i \wedge t = t$ and thus we get that $2c_{ti}<|\theta_i|$. Then, $|\hat \theta_{ti} | - c_{ti} \ge |\theta_i| - 2c_{ti}>0$, where the first inequality follows from $|\theta_i-\hat\theta_{ti}|\le c_{ti}$. Since $|\hat \theta_{ti}|-c_{ti}>0$, $0\not\in U_{t+1,i}$ and hence $E_{t+1,i}=\dots = E_{ni}=0$, which implies that $\tau_i\le t$, contradicting $\tau_i>t$ and finishing the proof of the claim. Putting things together we get
\begin{align*}
R_{ni}
& = \EE{ \one{F_i} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }
+ \EE{ \one{F_i^c} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }\\
& \le 2n |\theta_i| \Prob{F_i} + \EE{ \one{F_i^c} 2|\theta_i| \tau_i } \\
& \le 2n|\theta_i| \Prob{F_i} + |\theta_i| (1+ \frac{16 \log(2n^2) }{|\theta_i|^2}) \\
&\le 3|\theta_i| + \frac{16 \log(2n^2) }{|\theta_i|}\,,
\end{align*}
yielding the following theorem:

Theorem (Regret of SETC): Let $\norm{\theta}_1\le 1$. Then, the regret $R_n$ of SETC satisfies
\begin{align*}
R_n \le 3 \norm{\theta}_1 + 16 \sum_{i:\theta_i\ne 0} \frac{\log(2n^2) }{|\theta_i|}\,.
\end{align*}

Since $R_{ni} \le 2 n |\theta_i|$ also holds, we immediately get a bound on the worst-case regret of SETC for $p$-sparse vectors:

Corollary (Worst-case regret of SETC): Let $n\ge 2$. The minimax regret $R_n^*$ of SETC over $\norm{\theta}_1\le 1$, $\norm{\theta}_0\le p$ satisfies
\begin{align*}
R_n^* \le 3p + 8 p \sqrt{ n \log(2n^2)} \,.
\end{align*}

Thus we see that SETC successfully trades the dimension $d$ for the sparsity index $p$ without ever needing the knowledge of $p$. This should be encouraging! Can this result be generalized beyond the hypercube? This is the question we investigate next.

However, we first finish the proof of the theorem by providing the proof of the lemma that claimed that the event $F_i$ has a small probability.

Proof of $\Prob{F_i}\le 1/n$.

Proof: Setting $\cF_t = \sigma(A_1,X_1,\dots,A_t,X_t)$, we see that $(Z_{ti})_t$ is $(\cF_t)_t$-adapted (note that $\cF_{t-1}$ now excludes $A_t$!). Letting $S_{ti} = \sum_{j\ne i} A_{tj} \theta_j$, we can write $Z_{ti} = A_{ti} S_{ti} + A_{ti}\eta_t$. We check that $Z_{ti}|\cF_{t-1}\sim \subG(2)$. With repeated conditioning we calculate
\begin{align*}
\EE{ \exp(\lambda Z_{ti} )|\cF_{t-1} }
&= \EE{ \EE{ \exp(\lambda Z_{ti} )|\cF_{t-1},A_t} |\cF_{t-1} } \\
&= \EE{ \exp(\lambda A_{ti} S_{ti} ) \EE{ \exp(\lambda A_{ti} \eta_t )|\cF_{t-1},A_t} |\cF_{t-1} } \\
&\le \EE{ \exp(\lambda A_{ti}S_{ti} ) \exp(\frac{\lambda^2}{2}) |\cF_{t-1} } \\
&= \exp(\frac{\lambda^2}{2})
\EE{ \EE{\exp(\lambda A_{ti} S_{ti} )|\cF_{t-1},S_{ti}} |\cF_{t-1} } \\
&\le \exp(\frac{\lambda^2}{2})
\EE{ \exp( \frac{\lambda^2 S_{ti}^2}{2} ) |\cF_{t-1} } \\
&\le \exp(\lambda^2)\,,
\end{align*}
where the first inequality used that $\eta_t$ and $A_t$ are conditionally independent given $\cF_{t-1}$ and that $\eta_t|\cF_{t-1}\sim \subG(1)$, the second to last inequality used that $S_{ti}$ and $A_{ti}$ are conditionally independent given $\cF_{t-1}$ and that $A_{ti}|\cF_{t-1}\sim \subG(1)$, and the last step used that $|S_{ti}|\le 1$. From this, we conclude that $Z_{ti}|\cF_{t-1} \sim \subG(2)$. Recall that $E_{si}\in \{0,1\}$ is the indicator of whether dimension $i$ is “explored” in round $s$, and that for any $t\in [n]$,
\begin{align*}
\hat \theta_{ti} = \frac{\sum_{s=1}^t E_{si} A_{si} X_s}{\sum_{s=1}^t E_{si}}\,,
\qquad
c_{ti} = 2\sqrt{ \frac{ \log(2n^2) }{ \sum_{s=1}^t E_{si} } }\,.
\end{align*}
Then, since the interval for dimension $i$ is only updated after exploration rounds, letting $\tilde Z_{pi}$ denote the value of $Z_{si}$ in the round when dimension $i$ is explored for the $p$-th time (a standard optional skipping argument shows that $(\tilde Z_{pi})_p$ is again a martingale difference sequence whose elements are conditionally $\subG(2)$, so the Hoeffding-Azuma inequality applies to it),
\begin{align*}
\Prob{ \exists t\in [n] \,:\, \theta_i \not\in U_{ti} }
& \le \sum_{p=1}^n \Prob{ \exists t\in [n] \,:\, |\hat\theta_{ti} - \theta_i| > c_{ti} ,\, \textstyle\sum_{s=1}^t E_{si}=p } \\
& \le \sum_{p=1}^n \Prob{ \left| \tilde Z_{1i} + \dots + \tilde Z_{pi} \right| > 2\sqrt{ p \log(2n^2) } } \\
& \le \sum_{p=1}^n 2\exp\left( -\frac{4 p \log(2n^2)} { 2 (2p) } \right)
= \sum_{p=1}^n \frac{1}{n^2} = \frac{1}{n}\,.
\end{align*}
QED.

UCB with sparsity

Let us now tackle the question of how to exploit sparsity when there is no restriction on the action set $\cD_t$. The plan is to use UCB where the ellipsoidal confidence set used earlier is replaced by another confidence set, which is also ellipsoidal but has a smaller radius when the parameter vector is sparse.

Our starting point is the generic regret bound for UCB, which we replicate here for the reader’s convenience.

Let $A_t$ be the action chosen by UCB in round $t$: $A_t = \argmax_{a\in \cD_t} \UCB_t(a)$ where $\UCB_t(a)$ is the UCB index of action $a$ and $\cD_t\subset \R^d$ is the set of actions available in round $t$. Define
\begin{align}
V_t=V_0 + \sum_{s=1}^{t} A_s A_s^\top, \qquad t\in [n]\,,
\label{eq:sparselinbanditreggrammian}
\end{align}
where $V_0\succ 0$ is a fixed positive definite matrix, which is often set to $\lambda I$ with some tuning parameter $\lambda>0$. Let the following conditions hold:

  • Bounded scalar mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$;
  • Bounded actions: for any $a\in \cup_t \cD_t$, $\norm{a}_2 \le L$;
  • Honest confidence intervals: With probability $1-\delta$, for all $t\in [n]$, $a\in \cD_t$, $\ip{a,\theta_*}\in [\UCB_t(a)-2\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}},\UCB_t(a)]$ where $(V_{t})_t$ are given by \eqref{eq:sparselinbanditreggrammian}.

Our earlier generic result for UCB for linear bandits was as follows:

Theorem (Regret of UCB for Linear Bandits): Let the conditions listed above hold. Further, assume that $(\beta_t)_t$ is nondecreasing and $\beta_n\ge 1$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ of UCB satisfies
\begin{align*}
\hat R_n
% \le \sqrt{ 8 n \beta_{n-1} \, \log \frac{\det V_{n}}{ \det V_0 } }
\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{\trace(V_0)+n L^2}{ d\det V_0 } }\,.
\end{align*}

As noted earlier the half-width $\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}}$ used in the assumption is the same as the one that we get when a confidence set $\cC_t$ is used which satisfies
\begin{align}
\cC_t \subset
\cE_t \doteq \{ \theta \in \R^d \,:\,
\norm{\theta-\hat \theta_{t-1}}_{ V_{t-1}}^2 \le \beta_{t-1} \}\,
\label{eq:ellconf}
\end{align}
with some $\hat \theta_{t-1}\in \R^d$. Our approach will be to identify some $\hat \theta_{t-1}$ and $\beta_{t-1}$ such that $\beta_{t-1}$ will scale with the sparsity of $\theta_*$.

In particular, we will prove the following result:

Theorem (Sparse Bandit UCB): There exists a strategy to compute the $\UCB_t(\cdot)$ indices such that the expected regret $R_n$ of the resulting policy satisfies $R_n = \tilde{O}(\sqrt{dpn})$.

Before presenting how this is done, note that the best regret we can hope for is $\tilde{O}(\sqrt{ d n })$ with some additional terms depending on $p = \norm{\theta_*}_0$. This may be a bit disappointing. However, it is easy to see that the appearance of $d$ is the price one must pay for the increased generality of the result. In particular, we know that the regret for the standard $d$-action stochastic bandit problem is $\Omega(\sqrt{dn})$. Recall that $d$-action bandits can be represented as linear bandits with $\cD_t = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ is the standard Euclidean basis. Furthermore, checking the proof of the lower bound reveals that the proof uses $2$-sparse parameter vectors when the problem is represented as a linear bandit. Thus the appearance of $\sqrt{d}$, however unfortunate, is unavoidable in the general case. Later we will return to the question of lower bounds for sparse linear bandits.

Online to confidence set conversion

The idea of the construction that we present here is to build a confidence set around a prediction method that enjoys strong learning guarantees in the sparse case. The construction is actually generic and can be applied beyond sparse problems.

The prediction problem considered is online linear prediction where the prediction error is measured in the squared loss. This is also known as the online linear regression problem. In this problem setting a learner interacts with a strategic environment in a sequential manner in discrete rounds. In round $t\in \N$ the learner is first presented with a vector $A_t\in \R^d$ that is chosen by the environment ($A_t$ is allowed to depend on past choices of the learner) and the learner needs to produce a real-valued prediction $\hat X_t$, which is then compared to the target $X_t\in \R$, also chosen by the environment. The learner’s goal is to produce predictions whose total loss is not much worse than the loss suffered by any of the linear predictors in some set. The loss in a given round is the squared difference. The regret of the learner against a linear predictor that uses the “weights” $\theta\in \R^d$ is
\begin{align}
\rho_n(\theta) \doteq \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \ip{A_t,\theta})^2
\label{eq:olrregretdef}
\end{align}
and we say that the learner enjoys a regret guarantee $B_n$ against some $\Theta \subset \R^d$ if no matter the environment’s strategy,
\begin{align*}
\sup_{\theta\in \Theta} \rho_n(\theta)\le B_n\,.
\end{align*}

The online learning literature in machine learning has a number of powerful algorithms for this learning problem with equally powerful regret guarantees. Later we will give a specific result for the sparse case.

Now take any learner for online linear regression and assume that the environment generates $X_t$ in a stochastic manner like in linear bandits:
\begin{align}
X_t = \ip{A_t,\theta_*} + \eta_t\,,
\label{eq:sparselbmodel234}
\end{align}
where $\eta_t|\cF_{t-1} \sim \subG(1)$ and $\cF_{t} = \sigma(A_1,X_1,\dots,A_t,X_t,A_{t+1})$. Observing that a confidence set for $\theta_*$ is nothing but a constraint on $\theta_*$ in terms of known quantities, combining \eqref{eq:olrregretdef} and \eqref{eq:sparselbmodel234} by elementary algebra we derive
\begin{align}
\sum_t (\hat X_t - \ip{A_t,\theta_*})^2 = \rho_n(\theta_*) + 2 \sum_t \eta_t (\hat X_t - \ip{A_t,\theta_*})\,.
\label{eq:spb_regret_identity}
\end{align}
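Indeed, substituting $X_t = \ip{A_t,\theta_*}+\eta_t$ into the definition \eqref{eq:olrregretdef} of $\rho_n(\theta_*)$ and expanding the square,
\begin{align*}
\rho_n(\theta_*)
= \sum_t \left( \ip{A_t,\theta_*} + \eta_t - \hat X_t \right)^2 - \sum_t \eta_t^2
= \sum_t (\hat X_t - \ip{A_t,\theta_*})^2 - 2 \sum_t \eta_t (\hat X_t - \ip{A_t,\theta_*})\,,
\end{align*}
which, after rearranging, is exactly \eqref{eq:spb_regret_identity}.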
Let $Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \ip{A_s,\theta_*})$ with $Z_0=0$. Since $\hat X_t$ is chosen based on information available at the beginning of round $t$, $\hat X_t$ is $\cF_{t-1}$-measurable. Hence, $(Z_t - Z_{t-1})| \cF_{t-1} \sim \subG( \sigma_t^2 )$ where $\sigma_t^2 = (\hat X_t - \ip{A_t,\theta_*})^2$. Now define $Q_t = \sum_{s=1}^t \sigma_s^2$. The uniform self-normalized tail bound at the end of our previous post implies that for any $\lambda>0$,
\begin{align*}
\Prob{ \exists t\ge 0 \text{ such that }
|Z_t| \ge \sqrt{ (\lambda+Q_t) \log \frac{(\lambda+Q_t)}{\delta^2\lambda} }
} \le \delta\,.
\end{align*}
Choose $\lambda=1$. On the event $\cE$ when $|Z_t|$ is bounded by $\sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }$, from \eqref{eq:spb_regret_identity} and $\rho_t(\theta_*)\le B_t$ we get
\begin{align*}
Q_t \le B_t + 2 \sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }\,.
\end{align*}
While both sides depend on $Q_t$, the left-hand side grows linearly, while the right-hand side grows sublinearly in $Q_t$. This means that the largest value of $Q_t$ that satisfies the above inequality is finite. A tedious calculation then shows this value must be less than
\begin{align}
\beta_t(\delta) \doteq 1 + 2 B_t + 32 \log\left( \frac{\sqrt{8}+\sqrt{1+B_t}}{\delta} \right)\,.
\label{eq:betadef}
\end{align}
As a result on $\cE$, $Q_t \le \beta_t(\delta)$, implying the following result:

Theorem (Sparse confidence set): Assume that for some strategy for online linear regression with $\theta\in \Theta$, $\rho_t(\theta)\le B_t$. Let $(A_t,X_t,\hat X_t)_t$, $(\cF_t)_t$ be such that $A_t,\hat X_t$ are $\cF_{t-1}$ measurable, $X_t$ is $\cF_{t}$-measurable and for some $\theta_*\in \Theta$, $X_t = \ip{A_t,\theta_*}+\eta_t$ where $\eta_t|\cF_{t-1}\sim \subG(1)$. Fix $\delta\in (0,1)$ and define
\begin{align*}
C_{t+1} = \{ \theta\in \R^d \,:\, \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2 \le \beta_t(\delta) \}\,.
\end{align*}
Then, $\Prob{ \exists t\in \N \text{ s.t. } \theta_* \not\in C_{t+1} }\le \delta$.

To recap, our online linear regression based UCB (OLR-UCB) works as follows in round $t$. For specificity, let us call $\pi$ the online linear regression method; a code sketch is given after the list.

  1. Receive the action set $\cD_t\subset \R^d$.
  2. Choose $A_t = \argmax_{a\in \cD_t} \max_{\theta\in C_t} \ip{a,\theta}$
  3. Feed $A_t$ to $\pi$ and get its prediction $\hat X_t$.
  4. Construct $C_{t+1}$ via the help of $A_t$ and $\hat X_t$.
  5. Receive the reward $X_t = \ip{A_t,\theta_*} + \eta_t$
  6. Feed $X_t$ to $\pi$ as feedback.

The set $C_{t+1}$ is an ellipsoid with underlying Grammian $V_t = I + \sum_{s=1}^t A_s A_s^\top$ and center
\begin{align*}
\hat \theta_t = \argmin_{\theta\in \R^d} \norm{\theta}_2^2 + \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2\,.
\end{align*}
On the event when $\theta_*\in C_{t+1}$, $C_{t+1}\not=\emptyset$ and hence $\hat\theta_t\in C_{t+1}$. Thus, we can write
\begin{align*}
C_{t+1} =
\{ \theta\in \R^d \,:\,
\norm{\theta-\hat\theta_t}_{V_t}^2 + \norm{\hat\theta_t}_2^2
+ \sum_{s=1}^t (\hat X_s - \ip{A_s,\hat \theta_t})^2 \le \beta_t(\delta) \}\,.
\end{align*}
Thus,
\begin{align*}
C_{t+1} \subset
\{ \theta\in \R^d \,:\, \norm{\theta-\hat\theta_t}_{V_t}^2 \le \beta_t(\delta) \}\,.
\end{align*}
Hence, our general theorem applies and gives that the UCB algorithm which uses $C_{t+1}$ as its confidence bound enjoys the following regret bound:

Theorem (UCB for sparse linear bandits): Fix some $\Theta\subset \R^d$ and assume that the strategy $\pi$ enjoys the regret bounds $(B_t)_t$ against $\Theta$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ of OLR-UCB satisfies
\begin{align*}
\hat R_n
\le \sqrt{ 8 d n \beta_{n-1}(\delta) \, \log\left( 1+ \tfrac{n L^2}{ d }\right) }\,,
\end{align*}
where $(\beta_{t})_t$ is given by \eqref{eq:betadef}.

Note that $\beta_n = \tilde{O}(B_n)$ and hence $\hat R_n = \tilde{O}( \sqrt{ d B_n n } )$.

To finish, let us quote a result for online sparse linear regression:

Theorem (Online sparse linear regression): There exists a strategy $\pi$ for the learner such that for any $\theta\in \R^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment such that $\max_{t\in [n]}\norm{A_t}_2\le L$ and $\max_{t\in [n]}|X_t|\le X$ satisfies
\begin{align*}
\rho_n(\theta) \le c X^2 \norm{\theta}_0
\left\{\log(e+n^{1/2}L) + C_n \log(1+\tfrac{\norm{\theta}_1}{\norm{\theta}_0})\right\}
+ (1+X^2)C_n\,,
\end{align*}
where $c>0$ is some universal constant and $C_n = 2+ \log_2 \log(e+n^{1/2}L)$.

The strategy is a variant of the exponential weights method except that the method is now adjusted so that the set of experts is now $\R^d$. An appropriate sparsity prior is used and when predicting an appropriate truncation strategy is used. The details of the procedure are less important at this stage for our purposes and are thus left out. A reference to the work containing the missing details will be given at the end of the post.

Note that $C_n = O(\log \log(n))$. Hence, dropping the dependence on $X$ and $L$, for $p>0$, $\sup_{\theta: \norm{\theta}_0\le p, \norm{\theta}_2\le L} \rho_n(\theta) = O(p \log(n))$. Note how strong this is: The guarantee holds no matter what strategy the environment uses!

Now, in sparse linear bandits with subgaussian noise, the noise $(\eta_t)_t$ is not necessarily bounded, and as a consequence the rewards $(X_t)_t$ are also not necessarily bounded. However, the subgaussian property implies that for any fixed $t$, with probability $1-\delta$, $|\eta_t| \le \sqrt{2\log(2/\delta)}$. Choosing $\delta = 1/n^2$ and taking a union bound over the rounds, we thus see that for problems with bounded mean reward, $\max_{t\in [n]} |X_t| \le X \doteq 1+\sqrt{2\log(2n^2)}$ with probability at least $1-1/n$. Putting things together then yields the announced result:
The expected regret of OLR-UCB when using the strategy $\pi$ from above satisfies
\begin{align*}
R_n = \tilde{O}( \sqrt{ d p n } )\,.
\end{align*}

Two important notes are in order: Firstly, while here we focused on the sparse case the results and techniques apply to other settings. For example, we can also get alternative confidence sets from results in online learning even for the standard non-sparse case. Or one may consider additional or different structural assumptions on $\theta$ (e.g., $\theta$, when reshaped into a matrix, could have a low spectral norm). Secondly, when the online linear regression results are applied it is important to use the tightest possible, data-dependent regret bounds $B_n$. In online learning most regret bounds start as tight, data-dependent bounds, which are then loosened to get further insight into the structure of problems. For our application, naturally one should use the tightest available regret bounds (or one should attempt to modify the existing proofs to get tighter data-dependent bounds). The gains from using data-dependent bounds can be very significant.

Finally, we need to emphasize that the sparsity parameter $p$ must be known in advance: no algorithm can enjoy a regret of $O(\sqrt{d p n})$ for all $p$ simultaneously. This will be seen shortly in a post focusing exclusively on lower bounds for stochastic linear bandits.

Summary

In this post we considered sparse linear bandits in the stochastic setting. When the action set is the hypercube, we presented a simple variant of the explore-then-commit strategy which selectively stops exploration in each dimension, independently of the others. We have shown that this strategy is able to adapt to the unknown sparsity of the parameter vector defining the linear bandit environment.

In the second part of the post we considered the more general setting when the action sets can change in time and they may lack the nice separable structure of the hypercube which made the first result possible. In this case we first argued that the dependence on the full dimensionality of the parameter vector is unavoidable. Then we constructed a method, OLR-UCB, which is a variant of UCB and which uses as a subroutine an online linear regression procedure. We showed that in the case of sparse linear bandits this gives a regret bound of $\tilde{O}(\sqrt{d p n })$. In our next post we will consider lower bounds for linear bandits and we will actually show that this bound is unimprovable in general.

References

The SETC algorithm is from a paper of the authors:

Look at Appendix E if you want to see how things are done in this paper.

The construction for the sparse case is from another paper co-authored by one of the authors:

This paper has the few details that we have skipped in this blog.

Independently and simultaneously to this paper, Alexandra Carpentier and Remi Munos have also published a paper on sparse linear stochastic bandits:

Their setting is considerably different from the one studied here. In particular, they rely on the action set being nicely rounded and containing zero, while, perhaps most importantly, the noise in their model affects the parameter vector: $X_t = \ip{A_t, \theta_*+\eta_t}$. Just like in the case of the hypercube, this makes it possible to avoid the poor dependence on the dimension $d$: Their regret bounds take the form $R_n = O( p\sqrt{\log(d)n})$. We hope to return to discuss the differences and similarities between these two noise models in some later post.

The theorem on online sparse linear regression that we cited is due to Sébastien Gerchinovitz. The reference is

The theorem cited here is Theorem 10 in the above paper.

A very recent paper by Alexander Rakhlin and Karthik Sridharan also discusses the relationship between online learning regret bounds and self-normalized tail bounds of the type given here:

Interestingly, what they show is that the relationship goes in both directions: Tail inequalities imply regret bounds and regret bounds imply tail inequalities.

I am told by Francesco Orabona that techniques similar to the ones used here for constructing confidence bounds have been used earlier in a series of papers by Claudio Gentile and friends. For completeness, here is the list for further exploration: