Category Archives: Bandits

First order bounds for k-armed adversarial bandits

To revive the content on this blog a little we have decided to highlight some of the new topics covered in the book that we are excited about and that were not previously covered in the blog. In this post we are going to talk about first-order bounds for $k$-armed adversarial bandits. Just to wet your appetite, in future posts we plan to talk about the variance of Exp3, the relationship between stochastic and adversarial linear bandits, minimax duality, Bayesian methods and much more!

Recall that in the $k$-armed adversarial bandit problem the adversary secretly chooses a sequence $(y_t)_{t=1}^n$ with $y_t \in [0,1]^k$. Then in each round the learner chooses a distribution $P_t$ over $[k]$ and samples an action $A_t \sim P_t$. The loss of the learner is $y_{tA_t}$, which is the only thing the learner learns about the vector $y_t$ and the regret, which the learner should minimize, is
R_n = \max_{a \in [k]} \E\left[\sum_{t=1}^n y_{tA_t} – y_{ta}\right]\,.
We proved in an earlier post that Exp3 ensures that $R_n \leq \sqrt{2nk \log(k)}$, which can be improved to $\sqrt{8nk}$ using online mirror descent, which we’ll get to in a future post.

First-order regret bounds

These results are fine, but they do not exhibit adaptivity. In stochastic bandits we proved bounds that depend on the bandit (the suboptimality gaps of the arms) and argued that (for our best algorithms) in fact this dependence is unimprovable. These algorithm can be said to adapt to the difficulty of the bandit problem. In this post we ask whether something similar can be done in adversarial bandits?

There are many ways to make regret bounds bandit (or, data-) dependent. One possibility, leading to what is called first-order regret bounds, is to replace $n$, the number of rounds, with $L^* = \min_{a \in [k]} \sum_{t=1}^n y_{ta}$, aiming for bounds of the form $R_n = O(\sqrt{k (1+L^*) \log(n)})$. That is, the bound improves as the optimal loss decreases, so one would employ the algorithm considered here when expecting $L^*\ll n$. In the worst-case, $L^* = n$, so the new bound is just a little worse than our previous best bound (because of the additional $\log(n)$ term), so adaptation (up to log terms) is free. We will return to the question of the extra log term in the notes.

The reader at this stage may wonder why we (or anyone would ever) care about this problem? First, if it is possible to adapt, one should — we hope you agree (who wants to throw away free lunches?). But why adapt to $L^*$? In short, because we can. It is also quite reasonable that problems can differ in the magnitude of $L^*$. Historically, that adapting to $L^*$ is possible was first observed in the full-information setting, which, naturally led to the question of whether this is still possible in the bandit setting. But what is the insight in the full-information setting? The insight is simple: Actions whose total loss at any point is larger than $L^*$ can be discarded. Of course, our algorithms need to make this “smoother” and without knowing $L^*$, but this is what makes the problem fun. The approach we follow here is to use follow-the-regularized-leader (FTRL) with an adaptive learning rate and logarithmic barrier.

Follow-the-regularized-leader (FTRL)

Given a “nice” convex function $F : \cA \to \R$ with non-empty convex domain $\cD\subset \R^k$, FTRL for $k$-armed bandits chooses
P_t = \argmin_{p \in \cA} \eta_t \sum_{s=1}^{t-1} \ip{p, \hat Y_s} + F(p)\,,
where $\cA \subset \cD \cap \cP_{k-1}$ is a subset of the probability simplex $\cP_{k-1} = \{x \in [0,1]^k : \norm{x}_1 = 1\}$ and $\hat Y_{sa} = \one{A_s = a} y_{sa} / P_{sa}$ is the importance-weighted estimator of $y_{sa}$ and $\eta_t\ge 0$ is a weighting factor that determines the importance of the data received up to round $t$. That $\cA$ may be a strict subset of $\cP_{k-1}$ is a subtlety that we will come back to later. The reason we allow this possibility is because we may need $F$ to “blow up” as the boundary of $\cP_{k-1}$ is the boundary is approached from inside.

Besides Mirror Descent, FTRL is the second, equally important algorithm for online convex optimization. Just like Mirror Descent, FTRL uses a regularizer (function $F$) to stabilize predictions. If $F=0$, we get the follow-the-leader (FTL) algorithm, which, in the lack of extra assumptions on the data, is a terrible choice (the reader is invited to check that the worst-case regret of FTL is linear, even under the full information case, i.e., when $\hat Y_s=y_s$).

There are many parallels between Mirror Descent and FTRL, but also some important differences. As we know, Mirror Descent can recover Exp3. So can FTRL. And in fact, this happens when $F$ is chosen to be the unnormalized negentropy function, which is also what made Mirror Descent recover Exp3. So in this case the two algorithms become identical. Just to recall, the unnormalized negentropy is defined through $F(p) = -\sum_{i=1}^k p_i \log(p_i) – p_i$, where the domain $\cA$ is chosen to be equal to the probability simplex. Interestingly, to achieve our goal, we need to choose a different regularizer, the so-called log-barrier function of the positive quadrant: $F(p) = -\sum_{i=1}^k \log(p_i)$, $p\in \cD=(0,\infty)^k$. One may think that we should then choose $\cA=\cD\cap \cP_{k-1}$, but it will turn out that we need to make $\cA$ a little smaller than this set.

Let $R_n(p)=\E\left[\sum_{t=1}^n \ip{y_{t},P_t-p}\right]$ denote the expected regret against $p\in \cP_{k-1}$. Note that $R_n = \max_{p\in \cP_{k-1}} R_n(p)$. To analyze the regret of this algorithm we need a bound on the regret for follow the regularized leader with changing learning rates. The following theorem holds in this regard, whose proof we only cite, but will not include in this post:

Theorem (FTRL Regret Bound):
Let $(\eta_t)_{t=1}^n$ be almost surely decreasing, $\eta_{n+1}=\eta_n$ (needed in the definition of $P_{n+1}$), $F$ Legendre and assume that the closure of $F$’s domain includes $\cA$. Then, for any $a\in \cA$,
R_n(a) \leq \E\left[\frac{\diam_F(\cA)}{\eta_n}\right] + \sum_{t=1}^n \E\left[\ip{P_t – P_{t+1}, \hat Y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t}\right]\,,
where $\diam_F(\cA) = \sup_{a,b \in \cA} F(a) – F(b)$.

Note that the theorem allows for quite a bit of flexibility in selecting the “learning rates” $(\eta_t)$. In particular, as long as they are kept decreasing, the learning rates can be selected in a data-dependent manner. As we will see this will prove to be essential in proving the first-order regret bound. We have briefly touched upon Legendre functions earlier when discussing Mirror Descent. Just to recap, a strictly convex map $F:\cD \to \R$ is Legendre if the interior $\cD^\circ$ of $\cD$ is nonempty, $\mathrm{dom}(\nabla F) = \cD^\circ$ ($F$ is differentiable over the interior of its domain) and $\nabla F(x)$ diverges as $x$ approaches the boundary of $\cD$.

Achieving first-order regret bounds with FTRL

Let us return now to the choice of $F$ and $\cA$. First, it is obvious that $F$ is Legendre. Now, to choose $\cA$, we look into how to control the first term in the above theorem. Because $\log(1/x)$ blows up as $x$ tends to zero, the diameter of the probability simplex is infinite for this potential. For this reason we choose $\cA = \{x \in [1/n,1]^k : \norm{x}_1 = 1\}$, which is a trick we also used when analyzing adversarial linear bandits on the unit ball. Of course the optimal action is a corner of $\cP_{k-1}$, which is not in $\cA$, but $R_n \le \max_{p\in \cA} R_n(p) +1$, and so from now on can focus on controlling $\max_{p\in \cA} R_n(p)$. The diameter of $\cA$ with respect to $F$ is bounded by $\diam_F(\cA) \leq k \log(n)$.

We now show how to control the second term in the regret bound. The argument that follows is quite standard and is worth knowing independently of this proof. We remind you that for positive definite matrix $A$ we define $\norm{x}_A = \sqrt{x^\top A x}$. A second-order Taylor series expansion (along the segment connecting $P_{t+1}$ and $P_t$) shows that
D_F(P_{t+1}, P_t) = \frac{1}{2}\norm{P_{t+1} – P_t}_{H}^2\,,
where $H=\nabla^2 F(q)=\mathrm{diag}(1/q_1^2,\dots,1/q_k^2)$ for some point $q$ on the line segment connecting $P_t$ and $P_{t+1}$. We also have
\ip{P_t – P_{t+1}, \hat Y_t} \leq \norm{P_t – P_{t+1}}_{H} \norm{\hat Y_t}_{H^{-1}}\,,
which follows because the dual norm of $\norm{\cdot}_A$ for positive definite matrix $A$ is $\norm{\cdot}_{A^{-1}}$.
Basic calculus shows that $bx – ax^2/2 \leq \frac{b^2}{2a}$ for all $x$, which means that
\ip{P_t – P_{t+1}, \hat Y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t} \leq \frac{\eta_t}{2} \norm{\hat Y_t}^2_{H^{-1}} =
\frac{\eta_t y_{tA_t}^2\, q_{A_t}^2 }{2\, P_{tA_t}^2}\,.
Now we make use of a handy trick. If $P_{t+1,A_t} \geq P_{tA_t}$, then the left-hand side of the above display is less than zero because the loss estimator is non-negative. On the other hand, if $P_{t+1,A_t} < P_{tA_t}$, then $q_{A_t} \leq P_{tA_t}$ and hence we conclude that \begin{align*} \ip{P_t - P_{t+1}, \hat Y_t} - \frac{D_F(P_{t+1}, P_t)}{\eta_t} \leq \frac{\eta_t y_{tA_t}^2}{2}\,. \end{align*} It follows that the regret is bounded by \begin{align*} R_n \leq 1+\E\left[\frac{\diam_F(\cA)}{\eta_n}\right] + \E\left[\sum_{t=1}^n \frac{\eta_t y_{tA_t}^2}{2}\right]\,. \end{align*} The last step is to select the learning rates. For this we choose \begin{align*} \eta_t = \frac{\sqrt{k \log(n)}}{\sqrt{1 + \sum_{s=1}^{t-1} y_{sA_s}^2}}\,. \end{align*} This the algorithm can indeed compute based on the information it has and is indeed decreasing. A simple calculation shows that \begin{align*} \frac{1}{2} \sum_{t=1}^n \eta_t y_{tA_t}^2 = \frac{1}{2} \sum_{t=1}^n \frac{y_{tA_t}^2}{\sqrt{1 + \sum_{s=1}^{t-1} y_{sA_s}^2}} \leq \sqrt{\sum_{t=1}^n y_{tA_t}^2}\,. \end{align*} We won't dwell on why this holds for now, but the idea is to bound the sum in terms of an integral and notice that $\frac{1}{2}\int f'(t) / \sqrt{f(t)} dt = \sqrt{f(t)}$. Combining this with the regret bound shows that \begin{align*} R_n = O\left(\sqrt{k \log(n) \left(1 + \E\left[\sum_{t=1}^n y_{tA_t}^2\right]\right)}\right)\,, \end{align*} where we used Jensen's inequality to move the expectation inside the square root. This is already quite a fine bound, but it is not so nice that the right-hand side depends on the algorithm. Note that $y_{tA_t} \leq 1$, which means that \begin{align*} \E\left[\sum_{t=1}^n y_{tA_t}^2\right] \leq \E\left[\sum_{t=1}^n y_{tA_t}\right] = R_n + \min_{a \in [k]} \sum_{t=1}^n y_{ta}\,. \end{align*} We're nearly done now. Substituting this into the regret gives \begin{align*} R_n = O\left(\sqrt{k \log(n)\left(1 + R_n + L^*\right)}\right)\,. \end{align*} Squaring both sides and bounding the quadratic shows that \begin{align*} R_n = O\left(\sqrt{k (1 + L^*) \log(n)}\right)\,, \end{align*} which is the promised result.


Note 1: In our analysis we restricted the action set to ensure the diameter with respect to the logarithmic barrier is bounded by $k \log(n)$. This is not strictly necessary. Even when the action set is the whole probability simplex, the regret relative to some $a$ in the interior of the probability simplex can be bounded by
R_n(a) = \E\left[\sum_{t=1}^n \ip{P_t – a, y_t}\right]
\leq \E\left[\frac{F(a) – F(P_1)}{\eta_n} + \sum_{t=1}^n \left(\ip{P_t – P_{t+1}, \hat y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t}\right)\right]\,.
Weakening the first term to the diameter is where we went wrong in the analysis above. The desired result is proven by bounding the regret by
R_n = \max_{a \in \cA} \E\left[\sum_{t=1}^n \ip{P_t – a, y_t}\right]
\leq k + \max_{a \in \cA \cap [1/n,1]^k} R_n(a)\,.

Note 2: There are other kinds of adaptivity. The quadratic variation is
Q_n = \sum_{t=1}^n \norm{y_t – \mu}^2_2\,,
where $\mu = \frac{1}{n} \sum_{t=1}^n y_t$. There exist an algorithm for which $R_n = \tilde O(\sqrt{Q_n})$.
For details see the recent paper by Sebastien Bubeck, Michael B. Cohen and Yuanzhi Li.

Note 3: The bound on $\ip{p – q, y} – D_F(p, q) / \eta$ that uses the Taylor expansion of $F$ is standard. See Theorem 26.12 in the book.

Note 4: We might cover the analysis of FTRL with adaptive learning rates in a future post. There are several ways to write these bounds, with some forms more convenient than others. A few varieties are given in the exercises and solutions in Chapter 28 of the book, including the above theorem.

Note 5: Mirror descent and FTRL are often either equivalent, or enjoy more-or-less equivalent bounds. We plan a post dedicated to the similarities and differences between these algorithms for a future post. For now we mention only that FTRL seems to be more well-behaved when using an adaptive learning rate.

Note 6: In practice, FTRL with the log barrier and learning rate tuning given in the post almost always seems to perform better than Exp3. In bandits where the loss of the optimal arm is small, the improvement can be quite enormous. In the extreme case where losses from the first arm are sampled from a Bernoulli with bias 1/2 and losses from the second arm are always zero, the expected regret of Exp3 over 1000 rounds is about $R_n \approx 30$, while for the log barrier with an adaptive learning rate it is about $R_n \approx 5$.

Bibliographic remarks

This post is inspired by a paper by Chen-Yu Wei, Haipeng Luo, who prove more sophisticated results with a more complicated argument and algorithm. First-order bounds for bandits were first provided by Chamy Allenberg, Peter Auer, Laszlo Gyorfi and Gyorgy Ottucsak. These ideas have been generalized to more complex models such as semi-bandits by Gergely Neu. The results in the latter paper also replace the dependence on $\log(n)$ with a dependence on $\log(k)$.

The algorithm/results appear in the exercises in Chapter 28 of the book.

Bandit Algorithms Book

Dear readers

After nearly two years since starting to write the blog we have at last completed a first draft of the book, which is to be published by Cambridge University Press.
The book is available for free as a PDF and will remain so after publication. We’re grateful to Cambridge for allowing this.

Without further ado, here is the link.

Although we still have a few things we want to do, the manuscript is sufficiently polished to be useful. Of course we would greatly appreciate any comments you might have, including typos, errors in the proofs, missing references, confusing explanations or anything else you might notice. We will periodically update the book, so it would be helpful if you could quote the revision number on the cover when sending us your comments (

The manuscript includes a lot of material not in the blog. The last seven chapters are all new, covering combinatorial (semi-)bandits, non-stationary bandits, ranking, pure exploration, Bayesian methods, Thompson sampling, partial monitoring and an introduction to learning in Markov decision processes. Those chapters that are based on blog posts have been cleaned up and often we have added significant depth. There is a lot of literature that we have not covered. Some of these missing topics are discussed in extreme brevity in the introduction to Part VII. It really is amazing how large the bandit literature has become and we’re sorry not to have found space for everything.

The book includes around 250 exercises, some of which have solutions. On average the exercises have been proofread less carefully than the rest of the book, so some caution is advised. The solutions to selected exercises are available here.

Finally, we’re very thankful for all the feedback already received, both on the blog and early drafts of the book.