Category Archives: Finite-armed bandits

First order bounds for k-armed adversarial bandits

To revive the content on this blog a little we have decided to highlight some of the new topics covered in the book that we are excited about and that were not previously covered in the blog. In this post we are going to talk about first-order bounds for $k$-armed adversarial bandits. Just to wet your appetite, in future posts we plan to talk about the variance of Exp3, the relationship between stochastic and adversarial linear bandits, minimax duality, Bayesian methods and much more!

Recall that in the $k$-armed adversarial bandit problem the adversary secretly chooses a sequence $(y_t)_{t=1}^n$ with $y_t \in [0,1]^k$. Then in each round the learner chooses a distribution $P_t$ over $[k]$ and samples an action $A_t \sim P_t$. The loss of the learner is $y_{tA_t}$, which is the only thing the learner learns about the vector $y_t$ and the regret, which the learner should minimize, is
R_n = \max_{a \in [k]} \E\left[\sum_{t=1}^n y_{tA_t} – y_{ta}\right]\,.
We proved in an earlier post that Exp3 ensures that $R_n \leq \sqrt{2nk \log(k)}$, which can be improved to $\sqrt{8nk}$ using online mirror descent, which we’ll get to in a future post.

First-order regret bounds

These results are fine, but they do not exhibit adaptivity. In stochastic bandits we proved bounds that depend on the bandit (the suboptimality gaps of the arms) and argued that (for our best algorithms) in fact this dependence is unimprovable. These algorithm can be said to adapt to the difficulty of the bandit problem. In this post we ask whether something similar can be done in adversarial bandits?

There are many ways to make regret bounds bandit (or, data-) dependent. One possibility, leading to what is called first-order regret bounds, is to replace $n$, the number of rounds, with $L^* = \min_{a \in [k]} \sum_{t=1}^n y_{ta}$, aiming for bounds of the form $R_n = O(\sqrt{k (1+L^*) \log(n)})$. That is, the bound improves as the optimal loss decreases, so one would employ the algorithm considered here when expecting $L^*\ll n$. In the worst-case, $L^* = n$, so the new bound is just a little worse than our previous best bound (because of the additional $\log(n)$ term), so adaptation (up to log terms) is free. We will return to the question of the extra log term in the notes.

The reader at this stage may wonder why we (or anyone would ever) care about this problem? First, if it is possible to adapt, one should — we hope you agree (who wants to throw away free lunches?). But why adapt to $L^*$? In short, because we can. It is also quite reasonable that problems can differ in the magnitude of $L^*$. Historically, that adapting to $L^*$ is possible was first observed in the full-information setting, which, naturally led to the question of whether this is still possible in the bandit setting. But what is the insight in the full-information setting? The insight is simple: Actions whose total loss at any point is larger than $L^*$ can be discarded. Of course, our algorithms need to make this “smoother” and without knowing $L^*$, but this is what makes the problem fun. The approach we follow here is to use follow-the-regularized-leader (FTRL) with an adaptive learning rate and logarithmic barrier.

Follow-the-regularized-leader (FTRL)

Given a “nice” convex function $F : \cA \to \R$ with non-empty convex domain $\cD\subset \R^k$, FTRL for $k$-armed bandits chooses
P_t = \argmin_{p \in \cA} \eta_t \sum_{s=1}^{t-1} \ip{p, \hat Y_s} + F(p)\,,
where $\cA \subset \cD \cap \cP_{k-1}$ is a subset of the probability simplex $\cP_{k-1} = \{x \in [0,1]^k : \norm{x}_1 = 1\}$ and $\hat Y_{sa} = \one{A_s = a} y_{sa} / P_{sa}$ is the importance-weighted estimator of $y_{sa}$ and $\eta_t\ge 0$ is a weighting factor that determines the importance of the data received up to round $t$. That $\cA$ may be a strict subset of $\cP_{k-1}$ is a subtlety that we will come back to later. The reason we allow this possibility is because we may need $F$ to “blow up” as the boundary of $\cP_{k-1}$ is the boundary is approached from inside.

Besides Mirror Descent, FTRL is the second, equally important algorithm for online convex optimization. Just like Mirror Descent, FTRL uses a regularizer (function $F$) to stabilize predictions. If $F=0$, we get the follow-the-leader (FTL) algorithm, which, in the lack of extra assumptions on the data, is a terrible choice (the reader is invited to check that the worst-case regret of FTL is linear, even under the full information case, i.e., when $\hat Y_s=y_s$).

There are many parallels between Mirror Descent and FTRL, but also some important differences. As we know, Mirror Descent can recover Exp3. So can FTRL. And in fact, this happens when $F$ is chosen to be the unnormalized negentropy function, which is also what made Mirror Descent recover Exp3. So in this case the two algorithms become identical. Just to recall, the unnormalized negentropy is defined through $F(p) = -\sum_{i=1}^k p_i \log(p_i) – p_i$, where the domain $\cA$ is chosen to be equal to the probability simplex. Interestingly, to achieve our goal, we need to choose a different regularizer, the so-called log-barrier function of the positive quadrant: $F(p) = -\sum_{i=1}^k \log(p_i)$, $p\in \cD=(0,\infty)^k$. One may think that we should then choose $\cA=\cD\cap \cP_{k-1}$, but it will turn out that we need to make $\cA$ a little smaller than this set.

Let $R_n(p)=\E\left[\sum_{t=1}^n \ip{y_{t},P_t-p}\right]$ denote the expected regret against $p\in \cP_{k-1}$. Note that $R_n = \max_{p\in \cP_{k-1}} R_n(p)$. To analyze the regret of this algorithm we need a bound on the regret for follow the regularized leader with changing learning rates. The following theorem holds in this regard, whose proof we only cite, but will not include in this post:

Theorem (FTRL Regret Bound):
Let $(\eta_t)_{t=1}^n$ be almost surely decreasing, $\eta_{n+1}=\eta_n$ (needed in the definition of $P_{n+1}$), $F$ Legendre and assume that the closure of $F$’s domain includes $\cA$. Then, for any $a\in \cA$,
R_n(a) \leq \E\left[\frac{\diam_F(\cA)}{\eta_n}\right] + \sum_{t=1}^n \E\left[\ip{P_t – P_{t+1}, \hat Y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t}\right]\,,
where $\diam_F(\cA) = \sup_{a,b \in \cA} F(a) – F(b)$.

Note that the theorem allows for quite a bit of flexibility in selecting the “learning rates” $(\eta_t)$. In particular, as long as they are kept decreasing, the learning rates can be selected in a data-dependent manner. As we will see this will prove to be essential in proving the first-order regret bound. We have briefly touched upon Legendre functions earlier when discussing Mirror Descent. Just to recap, a strictly convex map $F:\cD \to \R$ is Legendre if the interior $\cD^\circ$ of $\cD$ is nonempty, $\mathrm{dom}(\nabla F) = \cD^\circ$ ($F$ is differentiable over the interior of its domain) and $\nabla F(x)$ diverges as $x$ approaches the boundary of $\cD$.

Achieving first-order regret bounds with FTRL

Let us return now to the choice of $F$ and $\cA$. First, it is obvious that $F$ is Legendre. Now, to choose $\cA$, we look into how to control the first term in the above theorem. Because $\log(1/x)$ blows up as $x$ tends to zero, the diameter of the probability simplex is infinite for this potential. For this reason we choose $\cA = \{x \in [1/n,1]^k : \norm{x}_1 = 1\}$, which is a trick we also used when analyzing adversarial linear bandits on the unit ball. Of course the optimal action is a corner of $\cP_{k-1}$, which is not in $\cA$, but $R_n \le \max_{p\in \cA} R_n(p) +1$, and so from now on can focus on controlling $\max_{p\in \cA} R_n(p)$. The diameter of $\cA$ with respect to $F$ is bounded by $\diam_F(\cA) \leq k \log(n)$.

We now show how to control the second term in the regret bound. The argument that follows is quite standard and is worth knowing independently of this proof. We remind you that for positive definite matrix $A$ we define $\norm{x}_A = \sqrt{x^\top A x}$. A second-order Taylor series expansion (along the segment connecting $P_{t+1}$ and $P_t$) shows that
D_F(P_{t+1}, P_t) = \frac{1}{2}\norm{P_{t+1} – P_t}_{H}^2\,,
where $H=\nabla^2 F(q)=\mathrm{diag}(1/q_1^2,\dots,1/q_k^2)$ for some point $q$ on the line segment connecting $P_t$ and $P_{t+1}$. We also have
\ip{P_t – P_{t+1}, \hat Y_t} \leq \norm{P_t – P_{t+1}}_{H} \norm{\hat Y_t}_{H^{-1}}\,,
which follows because the dual norm of $\norm{\cdot}_A$ for positive definite matrix $A$ is $\norm{\cdot}_{A^{-1}}$.
Basic calculus shows that $bx – ax^2/2 \leq \frac{b^2}{2a}$ for all $x$, which means that
\ip{P_t – P_{t+1}, \hat Y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t} \leq \frac{\eta_t}{2} \norm{\hat Y_t}^2_{H^{-1}} =
\frac{\eta_t y_{tA_t}^2\, q_{A_t}^2 }{2\, P_{tA_t}^2}\,.
Now we make use of a handy trick. If $P_{t+1,A_t} \geq P_{tA_t}$, then the left-hand side of the above display is less than zero because the loss estimator is non-negative. On the other hand, if $P_{t+1,A_t} < P_{tA_t}$, then $q_{A_t} \leq P_{tA_t}$ and hence we conclude that \begin{align*} \ip{P_t - P_{t+1}, \hat Y_t} - \frac{D_F(P_{t+1}, P_t)}{\eta_t} \leq \frac{\eta_t y_{tA_t}^2}{2}\,. \end{align*} It follows that the regret is bounded by \begin{align*} R_n \leq 1+\E\left[\frac{\diam_F(\cA)}{\eta_n}\right] + \E\left[\sum_{t=1}^n \frac{\eta_t y_{tA_t}^2}{2}\right]\,. \end{align*} The last step is to select the learning rates. For this we choose \begin{align*} \eta_t = \frac{\sqrt{k \log(n)}}{\sqrt{1 + \sum_{s=1}^{t-1} y_{sA_s}^2}}\,. \end{align*} This the algorithm can indeed compute based on the information it has and is indeed decreasing. A simple calculation shows that \begin{align*} \frac{1}{2} \sum_{t=1}^n \eta_t y_{tA_t}^2 = \frac{1}{2} \sum_{t=1}^n \frac{y_{tA_t}^2}{\sqrt{1 + \sum_{s=1}^{t-1} y_{sA_s}^2}} \leq \sqrt{\sum_{t=1}^n y_{tA_t}^2}\,. \end{align*} We won't dwell on why this holds for now, but the idea is to bound the sum in terms of an integral and notice that $\frac{1}{2}\int f'(t) / \sqrt{f(t)} dt = \sqrt{f(t)}$. Combining this with the regret bound shows that \begin{align*} R_n = O\left(\sqrt{k \log(n) \left(1 + \E\left[\sum_{t=1}^n y_{tA_t}^2\right]\right)}\right)\,, \end{align*} where we used Jensen's inequality to move the expectation inside the square root. This is already quite a fine bound, but it is not so nice that the right-hand side depends on the algorithm. Note that $y_{tA_t} \leq 1$, which means that \begin{align*} \E\left[\sum_{t=1}^n y_{tA_t}^2\right] \leq \E\left[\sum_{t=1}^n y_{tA_t}\right] = R_n + \min_{a \in [k]} \sum_{t=1}^n y_{ta}\,. \end{align*} We're nearly done now. Substituting this into the regret gives \begin{align*} R_n = O\left(\sqrt{k \log(n)\left(1 + R_n + L^*\right)}\right)\,. \end{align*} Squaring both sides and bounding the quadratic shows that \begin{align*} R_n = O\left(\sqrt{k (1 + L^*) \log(n)}\right)\,, \end{align*} which is the promised result.


Note 1: In our analysis we restricted the action set to ensure the diameter with respect to the logarithmic barrier is bounded by $k \log(n)$. This is not strictly necessary. Even when the action set is the whole probability simplex, the regret relative to some $a$ in the interior of the probability simplex can be bounded by
R_n(a) = \E\left[\sum_{t=1}^n \ip{P_t – a, y_t}\right]
\leq \E\left[\frac{F(a) – F(P_1)}{\eta_n} + \sum_{t=1}^n \left(\ip{P_t – P_{t+1}, \hat y_t} – \frac{D_F(P_{t+1}, P_t)}{\eta_t}\right)\right]\,.
Weakening the first term to the diameter is where we went wrong in the analysis above. The desired result is proven by bounding the regret by
R_n = \max_{a \in \cA} \E\left[\sum_{t=1}^n \ip{P_t – a, y_t}\right]
\leq k + \max_{a \in \cA \cap [1/n,1]^k} R_n(a)\,.

Note 2: There are other kinds of adaptivity. The quadratic variation is
Q_n = \sum_{t=1}^n \norm{y_t – \mu}^2_2\,,
where $\mu = \frac{1}{n} \sum_{t=1}^n y_t$. There exist an algorithm for which $R_n = \tilde O(\sqrt{Q_n})$.
For details see the recent paper by Sebastien Bubeck, Michael B. Cohen and Yuanzhi Li.

Note 3: The bound on $\ip{p – q, y} – D_F(p, q) / \eta$ that uses the Taylor expansion of $F$ is standard. See Theorem 26.12 in the book.

Note 4: We might cover the analysis of FTRL with adaptive learning rates in a future post. There are several ways to write these bounds, with some forms more convenient than others. A few varieties are given in the exercises and solutions in Chapter 28 of the book, including the above theorem.

Note 5: Mirror descent and FTRL are often either equivalent, or enjoy more-or-less equivalent bounds. We plan a post dedicated to the similarities and differences between these algorithms for a future post. For now we mention only that FTRL seems to be more well-behaved when using an adaptive learning rate.

Note 6: In practice, FTRL with the log barrier and learning rate tuning given in the post almost always seems to perform better than Exp3. In bandits where the loss of the optimal arm is small, the improvement can be quite enormous. In the extreme case where losses from the first arm are sampled from a Bernoulli with bias 1/2 and losses from the second arm are always zero, the expected regret of Exp3 over 1000 rounds is about $R_n \approx 30$, while for the log barrier with an adaptive learning rate it is about $R_n \approx 5$.

Bibliographic remarks

This post is inspired by a paper by Chen-Yu Wei, Haipeng Luo, who prove more sophisticated results with a more complicated argument and algorithm. First-order bounds for bandits were first provided by Chamy Allenberg, Peter Auer, Laszlo Gyorfi and Gyorgy Ottucsak. These ideas have been generalized to more complex models such as semi-bandits by Gergely Neu. The results in the latter paper also replace the dependence on $\log(n)$ with a dependence on $\log(k)$.

The algorithm/results appear in the exercises in Chapter 28 of the book.

Instance dependent lower bounds

In the last post we showed that under mild assumptions ($n = \Omega(K)$ and Gaussian noise), the regret in the worst case is at least $\Omega(\sqrt{Kn})$. More precisely, we showed that for every policy $\pi$ and $n\ge K-1$ there exists a $K$-armed stochastic bandit problem $\nu\in \mathcal E_K$ with unit variance Gaussian rewards and means in $[0,1]$ such that the regret $R_n(\pi,\nu)$ of $\pi$ on $\nu$ satisfies $R_n(\pi,\nu) \geq c\sqrt{nK}$ for some universal constant $c > 0$, or
R_n^*(\cE_K)\doteq\inf_\pi \sup_{\nu \in \cE_K} R_n(\pi,\nu)\ge c \sqrt{nK}\,.
Earlier we have also seen that the regret of UCB on such problems is at most $C \sqrt{nK \log(n)}$ with some universal constant $C>0$: $R_n(\mathrm{UCB},\cE_K)\le C \sqrt{nK\log(n)}$. Thus, UCB is near minimax-optimal, in the sense that except for a logarithmic factor its upper bound matches the lower bound for $\cE_K$.

Does this mean that UCB is necessarily a good algorithm for class $\cE_K$? Not quite! Consider the following modification of UCB, which we will call by the descriptive, but not particularly flattering name UCBWaste. Choose $0 < C' \ll C$, where $C$ is the universal constant in the upper bound on the regret of UCB (if we actually calculated the value of $C$, we could see that e.g. $C'=0.001$ would do). Then, in the first $m=C'\sqrt{Kn}$ rounds, UCBWaste will explore each of the $K$ actions equally often, after which it switches to the UCB strategy. It is easy to see that from the perspective of worst-case regret, UCBWaste is almost as good as UCB (being at most $C'\sqrt{Kn}$ worse). Would you accept UCBWaste as a good policy?

Perhaps not. To see why, consider for example a problem $\nu\in \cE_K$ such that on $\nu$ the immediate regret for choosing any suboptimal action is close to one. From the regret analysis of UCB we can show that after seeing a few rewards from some suboptimal action, with high probability UCB will stop using that action. As a result, UCB on $\nu$ would achieve a very small regret. In fact, it follows from our previous analysis that for sufficiently large $n$, UCB will achieve a regret of $\sum_{i:\Delta_i(\nu)>0} \frac{C\log(n)}{\Delta_i(\nu)} \approx C K \log(n)$, where $\Delta_i(\nu)$ is the suboptimality gap of action $i$ in $\nu$. However, UCBWaste, true to its name, will suffer a regret of at least $C’\sqrt{Kn/2}$ on $\nu$, a quantity that for $n$ large is much larger than the logarithmic bound on the regret of UCB.

Thus, on the “easy”, or “nice” instances, UCBWaste’s regret seems to be unnecessarily large at least as $n\to\infty$, even though UCBWaste is a near worst-case optimal algorithm for any $n$. If we care about what happens for $n$ large (and why should not we — after all, having higher standards should not hurt), it may be worthwhile to look beyond finite-time worst-case optimality.

Algorithms that are nearly worst-case optimal may fail to take advantage of environments that are not the worst case. What is more desirable is to have algorithms which are near worst-case optimal, while their performance gets better on “nicer” instances.

But what makes an instance “nice”? For a fixed $K$-action stochastic bandit environment $\nu$ with $1$-subgaussian reward-noise, the regret of UCB was seen to be logarithmic and in particular
\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq c^*(\nu) \doteq \sum_{i:\Delta_i(\nu) > 0} \frac{2}{\Delta_i(\nu)}\,.
Then perhaps $c^*(\nu)$ could be used as an intrinsic measure of the difficulty of learning on $\nu$. Or perhaps there exist some constant $c'(\nu)$ that is even smaller than $c^*(\nu)$ such that some strategy suffers at most $c'(\nu) \log(n)$ regret asymptotically? If, for example, $c'(\nu)$ is smaller by a factor of two than $c^*(\nu)$, then this could be a big difference!

In this post we will show that, in some sense, $c^*(\nu)$ is the best possible. In particular, we will show that no reasonable strategy can outperform UCB in the asymptotic sense above on any problem instance $\nu$. This is a big difference from the previous approach because the “order of quantifiers” is different: In the minimax result proven in the last post we showed that for any policy $\pi$ there exists a bandit instance $\nu \in \cE_K$ on which $\pi$ suffers large regret. Here we show that for all reasonable policies $\pi$, the policy’s asymptotic regret is always at least as large as that of UCB.

We have not yet said what we mean by reasonable. Clearly for any $\nu$, there are policies for which $\limsup_{n\to\infty} R_n / \log(n) = 0$. For example, the policy that chooses $A_t = \argmax_i \mu_i(\nu)$. But this is not a reasonable policy because it suffers linear regret for any $\nu’$ such that the optimal action of $\nu$ is not optimal in $\nu’$. At least, we should require that the policy achieves sublinear regret over all problems. This may still look a bit wasteful because UCB achieves logarithmic asymptotic regret for any $1$-subgaussian problem. A compromise between the undemanding sublinear and the perhaps overly restrictive logarithmic growth is to require that the policy has subpolynomial regret growth, a choice that has historically been the most used, and which we will also take. The resulting policies are said to be consistent.

Asymptotic lower bound

To firm up the ideas, define $\cE_K’$ as the class of $K$-action stochastic bandits where each action has a reward distribution with $1$-subgaussian noise. Further, let $\cE_K'(\mathcal N)\subset \cE_K’$ be those environments in $\cE_K’$ where the reward distribution is Gaussian. Note that for minimax lower bounds it is important to assume the sub-optimality gaps are bounded from above, this is not required in the asymptotic setting. The reason is that any reasonable (that word again) strategy should choose each arm at least once, which does not affect the asymptotic regret at all, but does affect the finite-time minimax regret.

Definition (consistency): A policy $\pi$ is consistent over $\cE$ if for all bandits $\nu \in \mathcal \cE$ and for all $p>0$ it holds that
R_n(\pi,\nu) = O(n^p)\, \quad \text{ as } n \to\infty\,.
We denote the class of consistent policies over $\cE$ by $\Pi_{\text{cons}}(\cE)$.

Note that $\eqref{eq:subpoly}$ trivially holds for $p\ge 1$. Note also that because $n \mapsto n^p$ is positive-valued, the above is equivalent to saying that for each $\nu \in \cE$ there exists a constant $C_\nu>0$ such that $R_n(\pi,\nu) \le C_\nu n^p$ for all $n\in \N$. In fact, one can also show that above we can also replace $O(\cdot)$ with $o(\cdot)$ with no effect on whether a policy is called consistent or not. Just like in statistics, consistency is a purely asymptotic notion.

We can see that UCB is consistent (its regret is logarithmic) over $\cE_K’$. On the other hand, the strategy that always chooses a fixed action $i\in K$ is not consistent (if $K > 1$) over any reasonable $\cE$, since its regret is linear in any $\nu$ where $i$ is suboptimal.

The main theorem of this post is that no consistent strategy can outperform UCB asymptotically in the Gaussian case:

Theorem (Asymptotic lower bound for Gaussian environments): For any policy $\pi$ consistent over the $K$-action unit-variance Gaussian environments $\cE_K'(\mathcal N)$ and any $\nu\in \cE_K'(\mathcal N)$, it holds that
\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \geq c^*(\nu)\,,
where recall that $c^*(\nu)= \sum_{i:\Delta_i(\nu)>0} \frac{2}{\Delta_i(\nu)}$.

Since UCB is consistent for $\cE_K’\supset \cE_K'(\mathcal N)$,
c^*(\nu) = \inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)}\le c^*(n)\,,
that is
\inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)} = c^*(\nu)\,.
Thus, we see that UCB is asymptotically optimal over the class of unit variance Gaussian environments $\cE_K'(\mathcal N)$.

Pick a $\cE_K'(\mathcal N)$-consistent policy $\pi$ and a unit variance Gaussian environment $\nu =(P_1,\dots,P_K)\in \cE_K'(\mathcal N)$ with mean rewards $\mu \in \R^K$. We need to lower bound $R_n \doteq R_n(\pi,\nu)$. Let $\Delta_i \doteq \Delta_i(\nu)$. Based on the basic regret decomposition identity $R_n = \sum_i \Delta_i \E_{\nu,\pi}[T_i(n)]$, it is not hard to see that the result will follow if for any action $i\in [K]$ that is suboptimal in $\nu$ we prove that
\liminf_{n\to\infty} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge \frac{2}{\Delta_i^2}\,.
As in the previous lower bound proof, we assume that all random variables are defined over the canonical bandit measurable space $(\cH_n,\cF_n)$, where $\cH_n = ([K]\times \R)^n$, $\cF_n$ is the restriction of $\mathcal L(R^{2n})$ to $\cH_n$, and $\PP_{\nu,\pi}$ (which induces $\E_{\nu,\pi}$) is the distribution of $n$-round action-reward histories induced by the interconnection of $\nu$ and $\pi$.

The intuitive logic of the lower bound is as follows: $\E_{\nu,\pi}[T_i(n)]$ must be (relatively) large because the regret in all environments $\nu’$ where (as opposed to what happens in $\nu$) action $i$ is optimal must be small by our assumption that $\pi$ is a “reasonable” policy and for this $\E_{\nu’,\pi}[T_i(n)]$ must be large. However, if the gap $\Delta_i$ is small and $\nu’$ is close to $\nu$, $\E_{\nu,\pi}[T_i(n)]$ will necessarily be large when $\E_{\nu’,\pi}[T_i(n)]$ is large because the distributions $\PP_{\nu,\pi}$ and $\PP_{\nu’,\pi}$ will also be close.

As in the previous lower bound proof, the divergence decomposition identity and the high-probability Pinsker inequality will help us to carry out this argument. In particular, from the divergence decomposition lemma, we get that for any $\nu’=(P’_1,\dots,P_K’)$ such that $P_j=P_j’$ except for $j=i$,
\KL(\PP_{\nu,\pi}, \PP_{\nu’,\pi}) = \E_{\nu,\pi}[T_i(n)] \, \KL(P_i, P’_i)\,,
while the high-probability Pinsker inequality gives that for any event $A$,
\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c)\ge \frac12 \exp(-\KL(\PP_{\nu,\pi}, \PP_{\nu’,\pi}) )\,,
or equivalently,
\KL(\PP_{\nu,\pi},\PP_{\nu’,\pi}) \ge \log \frac{1}{2 (\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c))}\,.
We now upper bound $\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c)$ in terms of $R_n+R_n’$ where $R_n’ = R_n(\nu’,\pi)$. For this choose $P_i’ = \mathcal N(\mu_i+\lambda,1)$ with $\lambda>\Delta_i$ to be selected later. Further, we choose $A= \{ T_i(n)\ge n/2 \}$ and thus $A^c = \{ T_i(n) < n/2 \}$. These choices make $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ small if $R_n$ and $R_n'$ are both small. Indeed, since $i$ is suboptimal in $\nu$ and $i$ is optimal in $\nu'$ and for any $j\ne i$, $\Delta_j'(\nu) \ge \lambda-\Delta_i$, \begin{align*} R_n \ge \frac{n\Delta_i}{2} \PP_{\nu,\pi}(T_i(n)\ge n/2 ) \, \text{ and } R_n' \ge \frac{n (\lambda-\Delta_i) }{2} \PP_{\nu',\pi}( T_i(n) < n/2 )\,. \end{align*}
Summing these up, lower bounding $\Delta_i/2$ and $(\lambda-\Delta_i)/2$ by $\kappa(\Delta_i,\lambda) \doteq \frac{\min( \Delta_i, \lambda-\Delta_i )}{2}$, and then reordering gives
\PP_{\nu,\pi}( T_i(n) \ge n/2 ) + \PP_{\nu’,\pi}( T_i(n) < n/2 ) \le \frac{R_n+R_n'}{ \kappa(\Delta_i,\lambda) \,n} \end{align*} as promised. Combining this inequality with $\eqref{eq:infoplem}$ and $\eqref{eq:pinsker}$ and using $\KL(P_i,P'_i) = \lambda^2/2$ we get \begin{align} \frac{\lambda^2}{2}\E_{\nu,\pi}[T_i(n)] \ge \log\left( \frac{\kappa(\Delta_i,\lambda)}{2} \frac{n}{R_n+R_n'} \right) = \log\left( \frac{\kappa(\Delta_i,\lambda)}{2}\right) + \log(n) - \log(R_n+R_n')\,. \end{align} Dividing both sides by $\log(n)$, we see that it remains to upper bound $\frac{\log(R_n+R_n')}{\log(n)}$. For this, note that for any $p>0$ $R_n+ R_n’ \le C n^p$ for all $n$ and some constant $C>0$. Hence, $\log(R_n+R_n’) \le \log(C) + p\log(n)$, from which we get that $\limsup_{n\to\infty} \frac{\log(R_n+R_n’)}{\log(n)} \le p$. Since $p>0$ was arbitrary, it follows that this also holds for $p=0$. Hence,
\frac{\lambda^2}{2} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge
1 – \limsup_{n\to\infty} \frac{\log(R_n+R_n’)}{\log(n)} \ge 1\,.
Taking the infimum of both sides over $\lambda>\Delta_i$ and reordering gives $\eqref{eq:armlowerbound}$, thus finishing the proof.


Instance-dependent finite-time lower bounds

Is it possible to go beyond asymptotic instance optimality? Yes, of course! All we have to do is to redefine what is meant by a “reasonable” algorithm. One idea is to call an algorithm reasonable when its finite-time(!) regret is close to minimax optimal. The situation is illustrated in the graph below:

Illustration of various optimality concepts.

Illustration of optimality concepts. The $x$ axis lists instances, ordered so that the “green” graph looks nice. Shown are the regret of various policies in different colors. While a near minimax algorithm can perform equally over all instances (and never much worse than the largest of the curves’ minimum values), a near instance optimal algorithm is required to come close to what is best on every instance using reasonable algorithms. The reasonable class can be the algorithms which are near minimax optimal.

In this part we will show instance-optimal finite-time lower bounds building on the ideas of the previous proof. This proof by and large uses the same ideas as the proof of the minimax result in the last post. This suggests that for future reference it may be useful to extract the common part, which summarizes what can be obtained by chaining the high-probability Pinsker inequality with the divergence decomposition lemma:

Lemma (Instance-dependent lower bound): Let $\nu = (P_i)_i,\nu’=(P’_i)$ be $K$-action stochastic bandit environments that differ only the distribution of the rewards for action $i\in [K]$. Further, assume that $i$ is suboptimal in $\nu$ and is optimal in $\nu’$, and in particular $i$ is the unique optimal action in $\nu’$. Let $\lambda = \mu_i(\nu’)-\mu_i(\nu)$. Then, for any policy $\pi$,
D(P_i,P_i’) \E_{\nu,\pi} [T_i(n)] \ge \log\left( \frac{\min\set{\lambda – \Delta_i, \Delta_i}}{4}\right) + \log(n) – \log(R_n+R_n’)\,.

Note that the lemma holds for finite $n$ and any $\nu$ and can be used to derive finite-time instance-dependent lower bounds for any environment class $\cE$ that is rich enough to include a pair $\nu’$ for any $\nu \in \cE$ and for policies that are uniformly good over $\cE$. For illustration consider the Gaussian unit variance environments $\cE\doteq \cE_K(\mathcal N)$ and policies $\pi$ whose regret, on any $\nu\in \cE$, is bounded by $C n^p$ for some $C>0$ and $p>0$. Call this class $\Pi(\cE,C,n,p)$, the class of $p$-order policies. Note that UCB is in this class with $C = C’\sqrt{K \log(n)}$ and $p=1/2$ with some $C’>0$. Thus, for $p\ge 1/2$ and suitable $C$ this class is not empty.

As an immediate corollary of the previous lemma we get the following result:

Theorem (Finite-time, instance-dependent lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$ and $\nu \in \cE$, the regret of $\pi$ on $\nu$ satisfies
R_n(\pi,\nu) \ge
\frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{(1-p)\log(n) + \log(\frac{\Delta_i}{8C})}{\Delta_i}\right)^+\,,
where $(x)^+ = \max(x,0)$ is the positive part of $x\in \R$.

Proof: Given $\nu=(\mathcal N(\mu_i,1))_i$, $i$ such that action $i$ is suboptimal in $\nu$, choose $\nu_i’ = (\mathcal N(\mu_j’,1))_j$ such that $\mu_j’=\mu_j$ unless $j=i$, in which make $\mu_i’ = \mu_i+\lambda$, $\lambda\le 2\Delta_i(\nu)$. Then, $\nu’\in \cE$ and from $\log(R_n+R_n’) \le \log(2C)+p\log(n)$ and $\eqref{eq:idalloclb}$, we get
\E_{\nu,\pi} [T_i(n)]
\ge \frac{2}{\lambda^2}\left(\log\left(\frac{n}{2Cn^p}\right) + \log\left(\frac{\min\set{\lambda – \Delta_i, \Delta_i}}{4}\right)\right)
Choosing $\lambda = 2\Delta_i$ and plugging this into the basic regret decomposition identity gives the result.


With $p=1/2$, $C = C’\sqrt{K}$ with $C’>0$ suitable so that $\Pi(\cE,C’\sqrt{K},n,1/2)\ne \emptyset$ we get for any policy $\pi$ that is “near-minimax optimal” for $\cE$ that
R_n(\pi,\nu) \ge \frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{\frac12\log(n) + \log(\frac{\Delta_i}{8C’ \sqrt{K}})}{\Delta_i}\right)^+\,.
The main difference to the asymptotic lower bound is twofold: First, when $\Delta_i$ is very small, the corresponding term is eliminated in the lower bound: For small $\Delta_i$, the impact of choosing for $n$ small action $i$ is negligible. Second, even when $\Delta_i$ are all relatively large, the leading term in this lower bound is approximately half that of $c^*(\nu)$. This effect may be real: The class of policies considered is larger than in the asymptotic lower bound, hence there is the possibility that the policy that is best tuned for a given environment achieves a smaller regret. At the same time, we only see a constant factor difference.

The lower bound $\eqref{eq:indnearminimaxlb}$ should be compared to the upper bound derived for UCB. In particular, recall that the simplified form of an upper bound was $\sum_{i:\Delta_i>0} \frac{C \log(n)}{\Delta_i}$. We see that the main difference to the above bound is the lack of the second term in $\eqref{eq:indnearminimaxlb}$. With some extra work, this gap can also be closed.

We now argue that the previous lower bound is not too weak in that in some sense it allows us to recover our previous minimax lower bound. In particular, choosing $\Delta_i= 8 \rho C n^{p-1}$ uniformly for all but one optimal action, we get $(1-p)\log(n)+\log(\frac{\Delta_i}{8C}) = \log(\rho)$. Plugging this into $\eqref{eq:finiteinstancebound}$ and lower bounding $\sup_{\rho>0}\log(\rho)/\rho = \exp(-1)$ by $0.35$ gives the following corollary:

Corollary (Finite-time lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$,
\sup_{\nu\in \cE} R_n(\pi,\nu) \ge \frac{0.35}{16} \frac{K-1}{C} n^{1-p} \,.

When $C = C’\sqrt{K}$ and $p=1/2$, we get the familiar $\Omega( \sqrt{Kn} )$ lower bound. However, note the difference: Whereas the previous lower bound was true for any policy, this lower bound holds only for policies in $\Pi(\cE,C’\sqrt{K},n,1/2)$. Nevertheless, it is reassuring that the instance-dependent lower bound is able to recover the minimax lower bound for the “near-minimax” policies. And to emphasize the distinction even further. In our minimax lower bound we only showed there exists an environment for which the lower bound holds, while the result above holds no matter which arm we choose to be the optimal one. This is only possible because we restricted the class of algorithms.


So UCB is close to minimax optimal (last post) and now we have seen that at least among the class of consistent strategies it is also asymptotically optimal and also almost optimal for any instance amongst the near-minimax optimal algorithm even in a non-asymptotic sense. One might ask if there is anything left to do, and like always the answer is: Yes, lots!

For one, we have only stated the results for the Gaussian case. Similar results are easily shown for alternative classes. A good exercise is to repeat the derivations in this post for the class of Bernoulli bandits, where the rewards of all arms are sampled from a Bernoulli distribution.

There is also the question of showing bounds that hold with high probability rather than in expectation. So far we have not discussed any properties of the distribution of the regret other than its mean, which we have upper bounded for UCB and lower bounded here and in the last post. In the next few posts we will develop algorithms for which the regret is bounded with high probability and so the lower bounds of this nature will be delayed for a little while.

Notes, references

There are now a wide range of papers with lower bounds for bandits. For asymptotic results the first article below was the earliest that we know of, while others are generalizations.

For non-minimax results that hold in finite time there are very recently a variety of approaches (and one older one). They are:

Finally, for the sake of completeness we note that minimax bounds are available in the following articles. The second of which deals with the high-probability case mentioned above.

Note 1. All of the above results treat all arms symmetrically. Sometimes one might want an algorithm that treats the arms differently and for which the regret guarantee depends on which arm is optimal. We hope eventually to discuss the most natural approach to this problem, which is to use a Bayesian strategy (though we will be interested in it for other reasons). Meanwhile, there are modifications of UCB that treat the actions asymmetrically so that its regret is small if some arms are optimal and large otherwise.

We simply refer the interested reader to a paper on the Pareto regret frontier for bandits, which fully characterizes the trade-offs available to asymmetric algorithms in terms of the worst-case expected regret when comparing to a fixed arm.

Note 2. The leading constant in the instance dependent bound is sub-optimal. This can be improved by a more careful choice of $\lambda$ that is closer to $\Delta_i$ than $2\Delta_i$ and dealing with the lower-order terms that this introduces.