
High probability lower bounds

In the post on adversarial bandits we proved two high probability upper bounds on the regret of Exp-IX. Specifically, we showed:

Theorem: There exists a policy $\pi$ such that for all $\delta \in (0,1)$ and all adversarial environments $\nu\in [0,1]^{nK}$, with probability at least $1-\delta$,
\begin{align}
\label{eq:high-prob}
\hat R_n(\pi,\nu) = O\left(\sqrt{Kn \log(K)} + \sqrt{\frac{Kn}{\log(K)}} \log\left(\frac{1}{\delta}\right)\right)\,.
\end{align}

We also gave a version of the algorithm that depended on $\delta \in (0,1)$:

Theorem: For all $\delta \in (0,1)$ there exists a policy $\pi$ such that for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1- \delta$,
\begin{align}
\label{eq:high-prob2}
\hat R_n(\pi,\nu) = O\,\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,.
\end{align}

The key difference between these results is the order of quantifiers. In the first we have a single algorithm and a high-probability guarantee that holds simultaneously for any confidence level. For the second result the confidence level must be specified in advance. The price for using the generic algorithm appears to be $\sqrt{\log(1/\delta)}$, which is usually quite small but not totally insignificant. The purpose of this post is to show that both bounds are tight up to constant factors, which implies that algorithms knowing the confidence level in advance really do have an advantage in terms of the achievable regret.

One reason why choosing the confidence level in advance is not ideal is that the resulting high-probability bound cannot be integrated to prove a bound in expectation. For algorithms satisfying (\ref{eq:high-prob}) the expected regret can be bounded by
\begin{align}
R_n \leq \int^\infty_0 \Prob{\hat R_n \geq x} dx = O(\sqrt{Kn \log(K)})\,.
\label{eq:integratedbound}
\end{align}
On the other hand, if the high-probability bound only holds for a single $\delta$ as in \eqref{eq:high-prob2}, then it seems hard to do much better than
\begin{align*}
R_n \leq n \delta + O\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,,
\end{align*}
which with the best choice of $\delta$ leads to a bound where an extra $\log(n)$ factor appears compared to \eqref{eq:integratedbound}. In fact, it turns out that this argument cannot be strengthened and algorithms with the strong high-probability regret cannot be near-optimal in expectation.
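
To make the gap concrete, here is a small numerical illustration (with all universal constants suppressed, so the numbers are indicative only) comparing the expected-regret bound obtained by integrating \eqref{eq:high-prob} with the best that the single-confidence bound \eqref{eq:high-prob2} yields via the argument above; the grid over $\delta$ is a crude stand-in for optimizing it.

```python
import math

# Illustrative comparison of the two routes to an expected-regret bound.
# All universal constants are suppressed, so only the growth rates are meaningful.
K, n = 10, 10**6

# Integrating the anytime high-probability bound: O(sqrt(K n log K)).
anytime = math.sqrt(K * n * math.log(K))

# Single-delta bound: R_n <= n*delta + sqrt(K n log(K/delta)), optimised by a crude grid.
def single_delta_bound(delta):
    return n * delta + math.sqrt(K * n * math.log(K / delta))

best = min(single_delta_bound(10.0 ** -e) for e in range(1, 12))

print(f"integrated anytime bound ~ {anytime:,.0f}")
print(f"best single-delta bound  ~ {best:,.0f}")
print(f"ratio ~ {best / anytime:.2f} (roughly the extra sqrt(log(n)/log(K)) factor)")
```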

Our approach for proving this fact is very similar to what we did for the minimax lower bounds for stochastic bandits in a previous post. There are two differences between the adversarial setting and the stochastic setting that force us to work a little harder. The first is that for the adversarial-bandit upper bounds we assumed the rewards are bounded in $[0,1]$, which was necessary in order to say anything at all. This means that our lower bounds should also satisfy this requirement, while in the stochastic lower bounds we used Gaussian rewards with an unbounded range. The second difference is that the stochastic lower bounds were given not for the regret but for what in the adversarial framework is known as the pseudo-regret, which reads
\begin{align*}
\bar R_n \doteq \max_i \EE{ \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_t }\,.
\end{align*}
In the stochastic setting, we have $\bar R_n = \sum_{i=1}^K \E[T_i(n)] \Delta_i$ and thus bounds on the pseudo-regret are possible by lower bounding the expected number of times an algorithm chooses sub-optimal arms. This is not enough to bound the regret, which depends also on the actual samples.

A lower bound on tail probabilities of pseudo-regret in stochastic bandits

Before we overcome these technicalities we describe the simple intuition by returning to the stochastic setting, using the pseudo-regret and relaxing the assumption that the rewards are bounded. It is important to remember that $\bar R_n$ is not a random variable because all the randomness is integrated away by the expectation. This means that it does not make sense to talk about high-probability results for $\bar R_n$, so we introduce another quantity,
\begin{align*}
\tilde R_n = \sum_{i=1}^K T_i(n) \Delta_i\,,
\end{align*}
which is a random variable through the pull counts $T_i(n)$ and which, for lack of a better name, we call the random pseudo-regret. For the subsequent results we let $\cE$ denote the set of Gaussian bandits with sub-optimality gaps bounded by one.

Theorem: Fix the horizon $n>0$, the number of arms $K>1$, a constant $C>0$ and a policy $\pi$. Assume that for any bandit environment $\nu \in \cE$,
\begin{align*}
R_n(\pi,\nu) \leq C \sqrt{(K-1)n}\,.
\end{align*}
Let $\delta \in (0,1)$. Then, there exists a bandit $\nu$ in $\cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}

It follows that if the result can be transferred to the adversarial setting, there will be little or no room for improving \eqref{eq:high-prob}.
Proof: Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and define
\begin{align*}
\mu_i = \begin{cases}
\Delta, & \text{if } i = 1\,; \\
0, & \text{otherwise}\,,
\end{cases}
\end{align*}
As usual, let $R_n = R_n(\pi,\nu)$ for $\nu = (\cN(\mu_i,1))_{i\in [K]}$. Let $\PP=\PP_{\nu,\pi}$ and $\E=\E_{\nu,\pi}$. Let $i = \argmin_{j>1} \E[T_j(n)]$. Then, thanks to
\begin{align*}
C \sqrt{(K-1)n} \ge R_n = \Delta \sum_{j>1} \E[T_j(n)] \ge \Delta (K-1)\min_{j>1} \E[T_j(n)]
\end{align*}
we find that
\begin{align}
\E[T_i(n)] \leq \frac{C}{\Delta} \sqrt{\frac{n}{K-1}}\,.
\label{eq:hpproofarmusage}
\end{align}

Define $\mu' \in \R^K$ by
\begin{align*}
\mu'_j = \begin{cases}
\mu_j\,, & \text{if } j \neq i; \\
2\Delta\,, & \text{otherwise}
\end{cases}
\end{align*}
and let $\nu'=(\cN(\mu_j',1))_{j\in [K]}$ be the Gaussian bandit with means $\mu'$ and abbreviate $\PP'=\PP_{\nu',\pi}$ and $\E'=\E_{\nu',\pi}$. Thus in $\nu'$ action $i$ is better than any other action by at least $\Delta$. Let $\tilde R_n = \sum_{j=1}^K T_j(n) \Delta_j$ and $\tilde R_n' = \sum_{j=1}^K T_j(n) \Delta_j'$ be the random pseudo-regret in $\nu$ and $\nu'$ respectively, where $\Delta_j = \max_k \mu_k-\mu_j = \one{j\ne 1}\Delta$ and $\Delta_j'=\max_k \mu_k'-\mu_j' \ge \one{j\ne i} \Delta$. Hence,
\begin{align*}
\tilde R_n & \ge T_i(n) \Delta_i \ge \one{T_i(n)\ge n/2} \frac{\Delta n}{2}\,, \qquad\text{ and }\\
\tilde R_n' & \ge \Delta \sum_{j\ne i} T_j(n) = \Delta (n-T_i(n)) \ge \one{T_i(n) < n/2} \frac{\Delta n}{2}\,.
\end{align*}
Hence, $T_i(n)\ge n/2 \Rightarrow \tilde R_n \ge \frac{\Delta n}{2}$ and $T_i(n) < n/2 \Rightarrow \tilde R_n' \ge \frac{\Delta n}{2}$, implying that
\begin{align}
\max\left(\Prob{ \tilde R_n \ge \frac{\Delta n}{2} }, \PP'\left(\tilde R_n' \ge \frac{\Delta n}{2} \right)\right) \ge \frac12 \left( \Prob{T_i(n) \ge n/2} +\PP'\left(T_i(n) < n/2 \right) \right)\,.
\label{eq:hprlb}
\end{align}
By the high-probability Pinsker lemma, the divergence decomposition identity (earlier we called this the information processing lemma) and \eqref{eq:hpproofarmusage} we have
\begin{align}
\Prob{T_i(n) \geq n/2} + \PP'\left(T_i(n) < n/2\right)
&\geq \frac{1}{2} \exp\left(-\KL(\PP, \PP')\right) \nonumber \\
&= \frac{1}{2} \exp\left(-2\E[T_i(n)] \Delta^2\right) \nonumber \\
&\geq \frac{1}{2} \exp\left(-2C \Delta\sqrt{\frac{n}{K-1}}\right)\,.
\label{eq:hppinskerlb}
\end{align}
To enforce that the right-hand side of the above display is at least $2\delta$, we choose
\begin{align*}
\Delta = \min\set{\frac{1}{2},\,\frac{1}{2C} \sqrt{\frac{K-1}{n}} \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Putting \eqref{eq:hprlb} and \eqref{eq:hppinskerlb} together we find that either
\begin{align*}
&\Prob{\tilde R_n \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta \\
\text{or}\qquad &
\PP'\left(\tilde R'_n \geq \frac{1}{4}\min\set{n,\,\frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}\right) \geq \delta\,.
\end{align*}
QED.
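
As a sanity check on the tuning in the proof, the sketch below verifies numerically, for a few arbitrary test values of $C$, $n$, $K$ and $\delta$ (not taken from the post), that the chosen $\Delta$ makes the right-hand side of \eqref{eq:hppinskerlb} at least $2\delta$ and that the level $\Delta n/2$ dominates the threshold appearing in the theorem.

```python
import math

# Sanity check of the choice of Delta in the proof above.
# The parameter grids are arbitrary test points, not values from the post.
def check(C, n, K, delta):
    Delta = min(0.5, (1 / (2 * C)) * math.sqrt((K - 1) / n) * math.log(1 / (4 * delta)))
    pinsker_lb = 0.5 * math.exp(-2 * C * Delta * math.sqrt(n / (K - 1)))  # RHS of (hppinskerlb)
    level = Delta * n / 2                                                 # regret level in (hprlb)
    target = 0.25 * min(n, (1 / C) * math.sqrt((K - 1) * n) * math.log(1 / (4 * delta)))
    assert pinsker_lb >= 2 * delta * (1 - 1e-9)
    assert level >= target * (1 - 1e-9)

for C in (0.5, 1.0, 5.0):
    for n in (100, 10_000, 1_000_000):
        for K in (2, 10, 50):
            for delta in (0.1, 0.01, 1e-4):
                check(C, n, K, delta)
print("tuning checks passed")
```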

From this theorem we can derive two useful corollaries.

Corollary: Fix $n>0$, $K>1$. For any policy $\pi$ and $\delta \in (0,1)$ small enough that
\begin{align}
n\delta \leq \sqrt{n (K-1) \log\left(\frac{1}{4\delta}\right)} \label{eq:hplbdelta}
\end{align}
there exists a bandit problem $\nu\in\cE$ such that
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}}} \geq \delta\,.
\label{eq:hplbsmalldelta}
\end{align}

Proof: We proceed by contradiction: assume that the conclusion does not hold for $\pi$ and take $\delta$ satisfying \eqref{eq:hplbdelta}. Then, for any bandit problem $\nu\in \cE$ the expected regret of $\pi$ is bounded by
\begin{align*}
R_n(\pi,\nu) \leq n\delta + \frac{1}{4}\sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}
\leq \sqrt{2n(K-1) \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Therefore $\pi$ satisfies the conditions of the previous theorem with $C =\sqrt{2 \log(\frac{1}{4\delta})}$, which implies that there exists some bandit problem $\nu\in\cE$ such that \eqref{eq:hplbsmalldelta} holds, contradicting our assumption.
QED

Corollary: Fix any $K>1$, $p \in (0,1)$ and $C > 0$. Then, for any policy $\pi$ there exists $n>0$ large enough, $\delta\in(0,1)$ small enough and a bandit environment $\nu\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} \geq \delta\,.
\end{align*}

Proof: Again, we proceed by contradiction. Suppose that a policy $\pi$ exists for which the conclusion does not hold. Then, for any $n>0$, $\delta\in(0,1)$ and environment $\nu\in \cE$,
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} < \delta \label{eq:hpexplbp}
\end{align}
and therefore, for any $n>0$, the expected $n$-round regret of $\pi$ on $\nu$ is bounded by
\begin{align*}
R_n(\pi,\nu)
\leq \int^\infty_0 \Prob{ \tilde R_n(\pi,\nu) \geq x} dx
\leq C \sqrt{n(K-1)} \int^\infty_0 \exp\left(-x^{1/p}\right) dx
\leq C \sqrt{n(K-1)}\,.
\end{align*}
where the last inequality uses $\int_0^\infty \exp(-x^{1/p})\, dx = \Gamma(1+p) \le 1$ for $p\in(0,1)$. Therefore, by the previous theorem, for any $n>0$ and $\delta\in (0,1)$ there exists a bandit $\nu_{n,\delta}\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu_{n,\delta}) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}
For $\delta$ small enough, $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right) \ge C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)$ and then choosing $n$ large enough so that $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)\le n$, we find that on the environment $\nu=\nu_{n,\delta}$,
\begin{align*}
\delta &\le \Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \\
& \le \Prob{ \tilde R_n(\pi,\nu) \geq C\sqrt{(K-1)n} \log^p\left(\frac1\delta \right) }\,,
\end{align*}
contradicting \eqref{eq:hpexplbp}.
QED
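
The analytic fact used in the integral step of the proof, $\int_0^\infty \exp(-x^{1/p})\,dx = \Gamma(1+p) \le 1$ for $p\in(0,1)$, is easy to confirm numerically; the snippet below assumes SciPy is available.

```python
import math
from scipy import integrate  # SciPy is assumed to be available

# The proof above bounds int_0^infty exp(-x^(1/p)) dx by one; in fact the integral
# equals Gamma(1+p), which is at most 1 for p in (0,1). A quick numerical check
# (truncating the integral at 20, where the integrand is already negligible):
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    val, _ = integrate.quad(lambda x: math.exp(-x ** (1 / p)), 0, 20)
    print(f"p={p:4}: integral ~ {val:.4f}, Gamma(1+p) = {math.gamma(1 + p):.4f}")
    assert val <= 1.0
```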

A lower bound on tail probabilities of regret in adversarial bandits

So how do we transfer this argument to the case where the rewards are bounded and the regret is used, rather than the pseudo-regret? For the first issue we can simply shift the means to be close to $1/2$ and truncate or “clip” the rewards that (by misfortune) end up outside the allowed range. To deal with the regret there are two options. Either one adds an additional step to show that the regret and the pseudo-regret concentrate sufficiently fast (which they do), or one correlates the rewards across the actions. The latter is the strategy that we will follow here.

We start with an observation. Our goal is to show that there exists a reward sequence $x=(x_1,\dots,x_n)\in [0,1]^{nK}$ such that the regret $\hat R_n=\max_i \sum_t x_{ti} - \sum_t x_{t,A_t}$ is above some threshold $u>0$ with a probability exceeding a prespecified value $\delta\in (0,1)$. For this we want to argue that it suffices to show this when the rewards are randomly chosen. Similarly to the stochastic case we define the “extended” canonical bandit probability space. Since the regret in adversarial bandits depends on non-observed rewards, the outcome space of the extended canonical probability space is $\Omega_n = \R^{nK}\times [K]^n$ and now $X_t: \Omega_n \to \R^K$ and $A_t: \Omega_n \to [K]$ are given by $X_t(x,a) = x_t$ and $A_t(x,a) = a_t$, where we use the convention that $x= (x_1,\dots,x_n)$ and $a=(a_1,\dots,a_n)$. We also let $\hat R_n = \max_i \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_{t,A_t}$ and define $\PP_{Q,\pi}$ to be the joint distribution of $(X_1,\dots,X_n,A_1,\dots,A_n)$ arising from the interaction of $\pi$ with $X\sim Q$. Finally, as we often need it, for a fixed $\nu\in \R^{nK}$ we abbreviate $\PP_{\delta_\nu,\pi}$ to $\PP_{\nu,\pi}$ where $\delta_\nu$ is the Dirac (probability) measure over $\R^{nK}$ (i.e., $\delta_{\nu}(U) = \one{\nu \in U}$ for $U\subset \R^{nK}$ Borel).

Lemma (Randomization device): For any probability measure $Q$ over $\R^{nK}$, any policy $\pi$, $u\in \R$ and $\delta\in (0,1)$,
\begin{align}\label{eq:pqpidelta}
\PP_{Q,\pi}( \hat R_n \geq u ) \geq \delta \implies
\exists \nu \in\mathrm{support}(Q) \text{ such that } \PP_{\nu,\pi}(\hat R_n \geq u) \geq \delta\,.
\end{align}

The lemma is proved by noting that $\PP_{Q,\pi}$ can be disintegrated into the “product” of $Q$ and $\{\PP_{\nu,\pi}\}_{\nu}$. The proof is given at the end of the post.

Given this machinery, let us get into the proof. Fix a policy $\pi$, $n>0$, $K>1$ and a $\delta\in (0,1)$. Our goal is to find some $u>0$ and a reward sequence $x\in [0,1]^{nK}$ such that the random regret of $\pi$ while interacting with $x$ is above $u$ with probability exceeding $\delta$. For this, we will define two reward distributions $Q$ and $Q’$, and show for (at least) one of $\PP_{Q,\pi}$ or $\PP_{Q’,\pi}$ that the probability of $\hat R_n \ge u$ exceeds $\delta$.

Instead of the canonical probability models we will find it more convenient to work with two sequences $(X_t,A_t)_t$ and $(X_t’,A_t’)_t$ of reward-action pairs defined over a common probability space. These are constructed as follows: We let $(\eta_t)_t$ be an i.i.d. sequence of $\mathcal N(0,\sigma^2)$ Gaussian variables and then let
\begin{align*}
X_{tj} = \clip( \mu_j + \eta_t )\,, \qquad X_{tj}' = \clip( \mu_j' + \eta_t)\,, \qquad (t\in [n],j\in [K])\,,
\end{align*}
where $\clip(x) = \max(\min(x,1),0)$ clips its argument to $[0,1]$, and for some $\Delta\in (0,1/4]$ to be chosen later,
\begin{align*}
\mu_j = \frac12 + \one{j=1} \Delta, \qquad (j\in [K])\,.
\end{align*}
The “means” $(\mu_j’)_j$ will also be chosen later. Note that apart from clipping, $(X_{ti})_t$ (and also $(X_{ti}’)_t$) is merely a shifted version of $(\eta_t)_t$. In particular, thanks to $\Delta>0$, $X_{t1}\ge X_{tj}$ for any $t,j$. Moreover, $X_{t1}$ exceeds $X_{tj}$ by $\Delta$ when none of them is clipped:
\begin{align}
X_{t1}\ge X_{tj} + \Delta\, \one{\eta_t\in [-1/2,1/2-\Delta]}\,, \quad t\in [n], j\ne 1\,.
\label{eq:hplb_rewardgapxt}
\end{align}

Now, define $(A_t)_t$ to be the random actions that arise from the interaction of $\pi$ and $(X_t)_t$ and let $i = \argmin_{j>1} \EE{ T_j(n) }$ where $T_i(n) = \sum_{t=1}^n \one{A_t=i}$. As before, $\EE{T_i(n)}\le n/(K-1)$. Choose
\begin{align*}
\mu_j' = \mu_j + \one{j=i}\, 2\Delta\,, \quad j\in [K]\,,
\end{align*}
so that $X_{ti}'\ge X_{tj}'$ for all $j\in [K]$ and furthermore
\begin{align}
X_{ti}'\ge X_{tj}' + \Delta\, \one{\eta_t\in [-1/2,1/2-2\Delta]}\,, \quad t\in [n], j\ne i\,.
\label{eq:hplb_rewardgapxtp}
\end{align}
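
For concreteness, here is a minimal simulation of the correlated, clipped rewards just constructed, together with a check of the gap properties \eqref{eq:hplb_rewardgapxt} and \eqref{eq:hplb_rewardgapxtp}; the values of $n$, $K$, $\sigma$, $\Delta$ and the index $i$ are illustrative only.

```python
import numpy as np

# A minimal simulation of the correlated, clipped Gaussian rewards constructed above.
# The values of n, K, sigma, Delta and the arm index i are illustrative only.
rng = np.random.default_rng(0)
n, K, sigma, Delta, i = 1000, 5, 0.1, 0.05, 2

eta = rng.normal(0.0, sigma, size=n)                  # one shared noise term per round
mu = np.full(K, 0.5); mu[0] += Delta                  # arm 1 (index 0) is best under mu
mu_p = mu.copy(); mu_p[i] += 2 * Delta                # arm i is best under mu'

X = np.clip(mu[None, :] + eta[:, None], 0.0, 1.0)     # rewards in the first environment
Xp = np.clip(mu_p[None, :] + eta[:, None], 0.0, 1.0)  # rewards in the second environment

# Gap checks: arm 1 (resp. arm i) never trails, and leads every other arm by at
# least Delta on the rounds where eta_t falls in the relevant "unclipped" range.
U = (eta >= -0.5) & (eta <= 0.5 - Delta)
Up = (eta >= -0.5) & (eta <= 0.5 - 2 * Delta)
gap = X[:, 0][:, None] - np.delete(X, 0, axis=1)
gap_p = Xp[:, i][:, None] - np.delete(Xp, i, axis=1)
assert np.all(gap >= -1e-12) and np.all(gap_p >= -1e-12)
assert np.all(gap[U] >= Delta - 1e-9) and np.all(gap_p[Up] >= Delta - 1e-9)
print(f"fraction of unclipped rounds: {U.mean():.3f} (X), {Up.mean():.3f} (X')")
```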

Denote by $\hat R_n = \max_j \sum_t X_{tj} - \sum_t X_{t,A_t}$ the random regret of $\pi$ when interacting with $X = (X_1,\dots,X_n)$ and let $\hat R_n' = \max_j \sum_t X'_{tj} - \sum_t X'_{t,A_t'}$ be the random regret of $\pi$ when interacting with $X' = (X_1',\dots,X_n')$.
Thus, it suffices to prove that either $\Prob{ \hat R_n \ge u }\ge \delta$ or $\Prob{ \hat R_n’ \ge u} \ge \delta$.

Illustration of the idea of the lower bound proof. An elf randomly chooses a pair of environments from the bag of all possible clipped Gaussian environments and feeds the environment into the policy, which is also randomized. The policy spits out the random regret for both environments. For at least one of the two environments, many of the random regrets will be high with a high chance.

By our earlier remarks, $\hat R_n = \sum_t X_{t1} - \sum_t X_{t,A_t}$ and $\hat R_n' = \sum_t X_{ti}' - \sum_t X_{t,A_t}'$. Define $U_t =\one{\eta_t\in [-1/2,1/2-\Delta]}$, $U_t'=\one{\eta_t\in [-1/2,1/2-2\Delta]}$, $A_{tj} = \one{A_t=j}$ and $A_{tj}' = \one{A_t'=j}$. From \eqref{eq:hplb_rewardgapxt} we see that
\begin{align*}
\hat R_n \ge \Delta\, \sum_t \one{A_t\ne 1} U_t =\Delta\, \sum_t (1-A_{t1}) U_t \ge \Delta\,(U - T_1(n)) \ge \Delta\,(U + T_i(n) - n)\,,
\end{align*}
where we also defined $U = \sum_t U_t$ and used that $T_1(n)+T_i(n)\le n$. Note that $U_t=1$ indicates that $(X_{tj})_j$ are “unclipped”. Similarly, from \eqref{eq:hplb_rewardgapxtp} we see that
\begin{align*}
\hat R_n' \ge \Delta \, \sum_t \one{A_t'\ne i} U_t' =\Delta\, \sum_t (1-A_{ti}') U_t' \ge \Delta\,( U' - T_i'(n)) \,,
\end{align*}
where $T_i'(n)=\sum_t A_{ti}'$ and $U' = \sum_t U_t'$. Based on the lower bounds on $\hat R_n$ and $\hat R_n'$ we thus see that if $T_i(n)\ge n/2$ and $U\ge 3n/4$ then $\hat R_n \ge u\doteq \frac{n\Delta}{4}$, and if $T_i'(n) < n/2$ and $U'\ge 3n/4$ then $\hat R_n' \ge u$ holds, too. Thus, from union bounds,
\begin{align*}
\Probng{ \hat R_n \ge u } &\ge \Prob{ T_i(n)\ge n/2, U\ge 3n/4 } \ge \Prob{ T_i(n)\ge n/2 } - \Prob{U < 3n/4}\,,\\
\Probng{ \hat R_n' \ge u } &\ge \Prob{ T_i'(n) < n/2, U'\ge 3n/4 } \ge \Prob{ T_i'(n) < n/2 } - \Prob{U' < 3n/4}\,.
\end{align*}
Noticing that $U'\le U$ and hence $\Prob{U<3n/4}\le \Prob{U' <3n/4}$, we get
\begin{align*}
\max(\Probng{ \hat R_n \ge u },\Probng{ \hat R_n' \ge u }) & \ge \frac12 \Bigl(\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }\Bigr) - \Prob{U' < 3n/4}\,.
\end{align*}
The sum $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ will be lower bounded with the help of the high-probability Pinsker inequality. For an upper bound on $\Prob{U' < 3n/4}$, we have the following technical lemma:

Lemma (Control of the number of unclipped rounds): Assume that $\Delta\le 1/8$. If $n \ge 32 \log(2/\delta)$ and $\sigma\le 1/10$ then $\Prob{U’<3n/4} \le \delta/2$.

The proof is based on bounding the tail probability of the mean of the i.i.d. Bernoulli variables $(U_t')_t$ using Hoeffding's inequality and is given later. Intuitively, it is clear that by making $\sigma^2$ small, the number of times $\eta_t$ falls inside $[-1/2,1/4]\subset [-1/2,1/2-2\Delta]$ can be made arbitrarily large with arbitrarily high probability.
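
Before the formal proof (given later in the post), a quick Monte Carlo check of the lemma with illustrative parameters:

```python
import math
import numpy as np

# Monte Carlo sanity check of the lemma: with sigma = 1/10 and Delta <= 1/8,
# the number of unclipped rounds U' rarely drops below 3n/4. Parameters are
# illustrative; the actual proof is via Hoeffding's inequality.
rng = np.random.default_rng(1)
sigma, Delta, delta = 0.1, 0.125, 0.05
n = int(math.ceil(32 * math.log(2 / delta)))          # the condition n >= 32 log(2/delta)

eta = rng.normal(0.0, sigma, size=(10_000, n))        # 10,000 independent runs
Uprime = ((eta >= -0.5) & (eta <= 0.5 - 2 * Delta)).sum(axis=1)
freq = np.mean(Uprime < 0.75 * n)
print(f"n = {n}, empirical P(U' < 3n/4) = {freq:.4f}  (lemma guarantees <= {delta / 2})")
```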

Our goal now is to lower bound $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ by $3\delta$. As suggested before, we aim to use the high-probability Pinsker inequality. One difficulty that we face is that the events $\{T_i(n)\ge n/2\}$ and $\{T_i'(n) < n/2\}$ may not be complementary, as they are defined in terms of potentially distinct sets of random variables. This will be overcome by rewriting the above probabilities using the canonical bandit probability spaces. In fact, we will use the non-extended version of these probability spaces (as defined earlier in the context of stochastic bandits). The reason for this is that after Pinsker, we plan to apply the divergence decomposition identity, which decomposes the divergence between distributions of action-reward sequences.

To properly write things let $Q_j$ denote the distribution of $X_{tj}$ and similarly let $Q_j'$ be the distribution of $X_{tj}'$. These are well defined (why?). Define the stochastic bandits $\beta=(Q_1,\dots,Q_K)$ and $\beta'=(Q'_1,\dots,Q'_K)$. Let $\Omega_n = ([K]\times \R)^n$ and let $\tilde Y_t:\Omega_n \to \R$ and $\tilde A_t:\Omega_n \to [K]$ be the usual coordinate projection functions: $\tilde Y_t(a_1,y_1,\dots,a_n,y_n) = y_t$ and $\tilde A_t(a_1,y_1,\dots,a_n,y_n)=a_t$. Also, let $\tilde T_i(n) = \sum_{t=1}^n \one{\tilde A_t=i}$. Recall that $\PP_{\beta,\pi}$ denotes the probability measure over $\Omega_n$ that arises from the interaction of $\pi$ and $\beta$ (ditto for $\PP_{\beta',\pi}$). Now, since $T_i(n)$ is only a function of $(A_1,X_{1,A_1},\dots,A_n,X_{n,A_n})$, whose probability distribution is exactly $\PP_{\beta,\pi}$, we have
\begin{align*}
\Prob{ T_i(n) \ge n/2 } = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 )\,.
\end{align*}
Similarly,
\begin{align*}
\Prob{ T_i'(n) < n/2 } = \PP_{\beta',\pi}( \tilde T_i(n) < n/2 )\,.
\end{align*}
Now, by the high-probability Pinsker inequality and the divergence decomposition lemma,
\begin{align*}
\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }
& = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 ) + \PP_{\beta',\pi}( \tilde T_i(n) < n/2 ) \\
& \ge \frac12 \exp\left(- \KL( \PP_{\beta,\pi}, \PP_{\beta',\pi} ) \right) \\
& = \frac12 \exp\left(- \E_{\beta,\pi}[\tilde T_i(n)] \KL( Q_i, Q_i' ) \right) \\
& \ge \frac12 \exp\left(- \EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) ) \right)\,,
\end{align*}
where in the last equality we used that $Q_j=Q_j'$ unless $j=i$, while in the last step we used $\E_{\beta,\pi}[\tilde T_i(n)] = \EE{ T_i(n) }$ and also that $\KL( Q_i, Q_i' ) \le \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )$. Where does the last inequality come from, one might ask. The answer is the truncation, which always reduces information. More precisely, let $P$ and $Q$ be probability measures on the same probability space $(\Omega, \cF)$, let $X:\Omega \to \R$ be a random variable and let $P_X$ and $Q_X$ be the laws of $X$ under $P$ and $Q$ respectively. Then $\KL(P_X, Q_X) \leq \KL(P, Q)$.

Now, by the choice of $i$,
\begin{align*}
\EE{ T_i(n) } \KL\left( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) \right)
\le \frac{n}{K-1} \frac{2\Delta^2}{\sigma^2}\,.
\end{align*}
Plugging this into the previous display we get that if
\begin{align*}
\Delta = \sigma \sqrt{\frac{K-1}{2n} \log\frac{1}{6\delta}}
\end{align*}
then $\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }\ge 3\delta$ and thus, using the lemma above to bound $\Prob{U' < 3n/4}$, $\max(\Prob{ \hat R_n \ge u },\Prob{ \hat R_n'\ge u} )\ge \delta$. Recalling the definition $u = n\Delta/4$ and choosing $\sigma=1/10$ gives the following result:

Theorem (High probability lower bound for adversarial bandits): Let $K>1$, $n>0$ and $\delta\in (0,1)$ such that
\begin{align*}
n\ge \max\left( 32 \log \frac2{\delta}, (0.8)^2 \frac{K-1}{2} \log \frac{1}{6\delta}\right)
\end{align*}
holds. Then, for any bandit policy $\pi$ there exists a reward sequence $\nu = (x_1,\dots,x_n)\in [0,1]^{nK}$ such that if $\hat R_n$ is the random regret of $\pi$ when interacting with the environment $\nu$ then
\begin{align*}
\Prob{ \hat R_n \ge 0.025\sqrt{\frac{n(K-1)}{2}\log \frac{1}{6\delta}} } \ge \delta\,.
\end{align*}
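
For a sense of scale, the snippet below evaluates the regret threshold of the theorem for a few arbitrary $(n, K, \delta)$ triples that satisfy the condition on $n$.

```python
import math

# Evaluating the high-probability lower bound of the theorem for a few
# illustrative (n, K, delta) triples that satisfy the condition on n.
def threshold(n, K, delta):
    assert n >= max(32 * math.log(2 / delta), 0.8**2 * (K - 1) / 2 * math.log(1 / (6 * delta)))
    return 0.025 * math.sqrt(n * (K - 1) / 2 * math.log(1 / (6 * delta)))

for n, K, delta in [(10_000, 10, 0.01), (100_000, 50, 0.001), (1_000_000, 2, 0.05)]:
    u = threshold(n, K, delta)
    print(f"n={n:>9}, K={K:>3}, delta={delta}: regret exceeds {u:,.1f} w.p. at least {delta}")
```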

So what can one take away from this post? The main thing is to recognize that the upper bounds we proved in the previous post cannot be improved very much, at least in this worst case sense. This includes the important difference between the high-probability regret that is achievable when the confidence level $\delta$ is chosen in advance and what is possible if a single strategy must satisfy a high-probability regret guarantee for all confidence levels simultaneously.

Besides this result we also introduced some new techniques that will be revisited in the future, especially the randomization device lemma. The advantage of using clipped and correlated Gaussian rewards is that it ensures the same arm is always optimal, no matter how the noise behaves.

Technicalities

The purpose of this section is to lay to rest the two technical results required in the main body: first the proof of the lemma which gives us the randomization technique or “device”, and afterwards the proof of the lemma that controls the number of unclipped rounds.

Proof of the randomization device lemma

The argument underlying this goes as follows: If $A=(A_1,\dots,A_n)\in [K]^n$ are the actions of a stochastic policy $\pi=(\pi_1,\dots,\pi_n)$ when interacting with the environment where the rewards $X=(X_1,\dots,X_n)\in \R^{nK}$ are drawn from $Q$ then for $t=1,\dots,n$, $A_t\sim \pi_t(\cdot|A_1,X_{1,A_1},\dots,A_{t-1},X_{t-1,A_{t-1}})$ and thus the distribution $\PP_{Q,\pi}$ of $(X,A)$ satisfies
\begin{align*}
d\PP_{Q,\pi}(x,a)
&= \pi(a|x) d\rho^{\otimes n}(a) dQ(x) \,,
\end{align*}
where $a=(a_1,\dots,a_n)\in [K]^n$, $x=(x_1,\dots,x_n)\in \R^{nK}$,
\begin{align*}
\pi(a|x)\doteq
\pi_1(a_1) \pi_2(a_2|a_1,x_{1,a_1}) \cdots \pi_n(a_n|a_1,x_{1,a_1}, \dots,a_{n-1},x_{n-1,a_{n-1}})
\end{align*}
and $\rho^{\otimes n}$ is the $n$-fold product of $\rho$ with itself, where $\rho$ is the counting measure on $[K]$. Letting $\delta_x$ be the Dirac (probability) measure on $\R^{nK}$ concentrated at $x$ (i.e., $\delta_x(U) = \one{x\in U}$), we have that $\PP_{Q,\pi}$ can be disintegrated into $Q$ and $\{\PP_{\delta_x,\pi}\}_x$. In particular, a direct calculation verifies that
\begin{align}
d\PP_{Q,\pi}(x,a) = \int_{y\in \R^{nK}} dQ(y) \, d\PP_{\delta_y,\pi}(x,a) \,.
\label{eq:disintegration}
\end{align}

Let $(X_t,A_t)$ be the reward and action of round $t$ in the extended canonical bandit probability space and $\hat R_n$ the random regret defined in terms of these random variables. For any Borel $U\subset \R$,
\begin{align*}
\PP_{Q,\pi}( \hat R_n \in U )
&= \int \one{\hat R_n(x,a)\in U} d\PP_{Q,\pi}(x,a) \\
&= \int_{\R^{nK}} dQ(y) \left( \int_{\R^{nK}\times [K]^n} \one{\hat R_n(x,a)\in U} d\PP_{\delta_y,\pi}(x,a) \right)\\
&= \int_{\R^{nK}} dQ(y) \PP_{\delta_y,\pi}( \hat R_n\in U )\,,
\end{align*}
where the second equality uses \eqref{eq:disintegration} and Fubini. From the above equality it is obvious that it is not possible that $\PP_{Q,\pi}( \hat R_n \in U )\ge \delta$ while for all $y\in \mathrm{support}(Q)$, $\PP_{\delta_y,\pi}( \hat R_n\in U )<\delta$, thus finishing the proof. QED

Proof of the lemma controlling the number of unclipped rounds: First note that $U'\le U$ and thus $\Prob{U < 3n/4}\le \Prob{U' < 3n/4}$, hence it suffices to control the latter. Since $\Delta\le 1/8$ and $\eta_t$ is a Gaussian with zero mean and variance $\sigma^2$, and in particular $\eta_t$ is $\sigma$-subgaussian, we have
\begin{align*}
1 - p = \Prob{U_t' = 0} & \leq \Prob{ |\eta_t| > 1/2-2\Delta }
\leq 2 \exp\left(-\frac{\left(1/2 - 2\Delta\right)^2}{2\sigma^2}\right)\\
& \leq 2 \exp\left(-\frac{1}{2 (4)^2 \sigma^2}\right)
\le \frac{1}{8}\,,
\end{align*}
where the last inequality follows whenever $\sigma^2 \le \frac{1}{32 \log 16}$ which is larger than $0.01$. Therefore $p \ge 7/8$ and
\begin{align*}
\Prob{\sum_{t=1}^n U_t' < \frac{3n}{4}} &= \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) < -\left(p-\frac{3}{4}\right) } \\
&\le \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) \le - \frac{1}{8}} \\
&\leq \exp\left(-\frac{n}{32}\right) \leq \frac{\delta}{2}\,,
\end{align*}
where the second to last inequality uses Hoeffding's bound together with the fact that $U_t'-p$ is $1/2$-subgaussian, and the last holds by our assumption on $n$.
QED

Notes

Note 1: It so happens that the counter-example construction we used means that the same arm has the best reward in every round (not just the best mean). It is perhaps a little surprising that algorithms cannot exploit this fact, in contrast to the experts setting where this knowledge enormously improves the achievable regret.

Note 2: Adaptivity is all the rage right now. Can you design an adversarial bandit algorithm that exploits “easy data” when available? For example, if the rewards don’t change much over time, or lie in a small range. There are still a lot of open questions in this area. The paper referenced below gives lower bounds for some of these situations.

References

Sebastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. 2016

Instance dependent lower bounds

In the last post we showed that under mild assumptions ($n = \Omega(K)$ and Gaussian noise), the regret in the worst case is at least $\Omega(\sqrt{Kn})$. More precisely, we showed that for every policy $\pi$ and $n\ge K-1$ there exists a $K$-armed stochastic bandit problem $\nu\in \mathcal E_K$ with unit variance Gaussian rewards and means in $[0,1]$ such that the regret $R_n(\pi,\nu)$ of $\pi$ on $\nu$ satisfies $R_n(\pi,\nu) \geq c\sqrt{nK}$ for some universal constant $c > 0$, or
\begin{align*}
R_n^*(\cE_K)\doteq\inf_\pi \sup_{\nu \in \cE_K} R_n(\pi,\nu)\ge c \sqrt{nK}\,.
\end{align*}
Earlier we have also seen that the regret of UCB on such problems is at most $C \sqrt{nK \log(n)}$ with some universal constant $C>0$: $R_n(\mathrm{UCB},\cE_K)\le C \sqrt{nK\log(n)}$. Thus, UCB is near minimax-optimal, in the sense that except for a logarithmic factor its upper bound matches the lower bound for $\cE_K$.

Does this mean that UCB is necessarily a good algorithm for class $\cE_K$? Not quite! Consider the following modification of UCB, which we will call by the descriptive, but not particularly flattering name UCBWaste. Choose $0 < C' \ll C$, where $C$ is the universal constant in the upper bound on the regret of UCB (if we actually calculated the value of $C$, we could see that e.g. $C'=0.001$ would do). Then, in the first $m=C'\sqrt{Kn}$ rounds, UCBWaste will explore each of the $K$ actions equally often, after which it switches to the UCB strategy. It is easy to see that from the perspective of worst-case regret, UCBWaste is almost as good as UCB (being at most $C'\sqrt{Kn}$ worse). Would you accept UCBWaste as a good policy?

Perhaps not. To see why, consider for example a problem $\nu\in \cE_K$ such that on $\nu$ the immediate regret for choosing any suboptimal action is close to one. From the regret analysis of UCB we can show that after seeing a few rewards from some suboptimal action, with high probability UCB will stop using that action. As a result, UCB on $\nu$ would achieve a very small regret. In fact, it follows from our previous analysis that for sufficiently large $n$, UCB will achieve a regret of $\sum_{i:\Delta_i(\nu)>0} \frac{C\log(n)}{\Delta_i(\nu)} \approx C K \log(n)$, where $\Delta_i(\nu)$ is the suboptimality gap of action $i$ in $\nu$. However, UCBWaste, true to its name, will suffer a regret of at least $C’\sqrt{Kn/2}$ on $\nu$, a quantity that for $n$ large is much larger than the logarithmic bound on the regret of UCB.
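
For concreteness, here is a minimal sketch of UCBWaste as described above. The UCB index, the Gaussian reward model and the constant $C'$ (exaggerated here so that the wasteful phase is visible at a small horizon) are illustrative choices rather than the exact algorithm of the earlier posts.

```python
import math
import numpy as np

def ucbwaste(means, n, c_prime=0.1, rng=None):
    # c_prime is exaggerated (the post suggests something like 0.001) so that the
    # wasteful round-robin phase is visible at a small horizon.
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(means)
    m = int(c_prime * math.sqrt(K * n))          # length of the wasteful phase
    counts = np.zeros(K)
    sums = np.zeros(K)
    total = 0.0
    for t in range(1, n + 1):
        if t <= m or np.any(counts == 0):
            a = (t - 1) % K                      # uniform exploration
        else:
            index = sums / counts + np.sqrt(2 * math.log(t) / counts)
            a = int(np.argmax(index))            # UCB phase
        r = rng.normal(means[a], 1.0)            # unit-variance Gaussian reward
        counts[a] += 1
        sums[a] += r
        total += r
    return n * max(means) - total                # random regret w.r.t. the best mean

means = [1.0, 0.0, 0.0, 0.0, 0.0]                # gaps equal to one: an "easy" instance
print("UCBWaste regret on an easy instance:", round(ucbwaste(means, 10_000), 1))
```

On such an instance the round-robin phase alone already contributes roughly $m(K-1)/K$ regret, which is the waste referred to above.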

Thus, on the “easy”, or “nice” instances, UCBWaste's regret seems to be unnecessarily large at least as $n\to\infty$, even though UCBWaste is a near worst-case optimal algorithm for any $n$. If we care about what happens for $n$ large (and why shouldn't we? After all, having higher standards should not hurt), it may be worthwhile to look beyond finite-time worst-case optimality.

Algorithms that are nearly worst-case optimal may fail to take advantage of environments that are not the worst case. What is more desirable is to have algorithms which are near worst-case optimal, while their performance gets better on “nicer” instances.

But what makes an instance “nice”? For a fixed $K$-action stochastic bandit environment $\nu$ with $1$-subgaussian reward-noise, the regret of UCB was seen to be logarithmic and in particular
\begin{align}
\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq c^*(\nu) \doteq \sum_{i:\Delta_i(\nu) > 0} \frac{2}{\Delta_i(\nu)}\,.
\label{eq:cnudef}
\end{align}
Then perhaps $c^*(\nu)$ could be used as an intrinsic measure of the difficulty of learning on $\nu$. Or perhaps there exists some constant $c'(\nu)$ that is even smaller than $c^*(\nu)$ such that some strategy suffers at most $c'(\nu) \log(n)$ regret asymptotically? If, for example, $c'(\nu)$ were smaller than $c^*(\nu)$ by a factor of two, then this could be a big difference!
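
Computing $c^*(\nu)$ requires nothing more than the positive sub-optimality gaps; a tiny illustration with arbitrary gap vectors:

```python
# c*(nu) for a couple of illustrative unit-variance Gaussian instances:
# the asymptotic constant depends only on the positive sub-optimality gaps.
def c_star(gaps):
    return sum(2.0 / d for d in gaps if d > 0)

print(c_star([0.0, 0.1, 0.2, 0.5]))   # one optimal arm, three suboptimal ones -> 34.0
print(c_star([0.0, 1.0, 1.0]))        # large gaps -> 4.0
```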

In this post we will show that, in some sense, $c^*(\nu)$ is the best possible. In particular, we will show that no reasonable strategy can outperform UCB in the asymptotic sense above on any problem instance $\nu$. This is a big difference from the previous approach because the “order of quantifiers” is different: In the minimax result proven in the last post we showed that for any policy $\pi$ there exists a bandit instance $\nu \in \cE_K$ on which $\pi$ suffers large regret. Here we show that for all reasonable policies $\pi$, the policy’s asymptotic regret is always at least as large as that of UCB.

We have not yet said what we mean by reasonable. Clearly for any $\nu$, there are policies for which $\limsup_{n\to\infty} R_n / \log(n) = 0$. For example, the policy that chooses $A_t = \argmax_i \mu_i(\nu)$. But this is not a reasonable policy because it suffers linear regret for any $\nu’$ such that the optimal action of $\nu$ is not optimal in $\nu’$. At least, we should require that the policy achieves sublinear regret over all problems. This may still look a bit wasteful because UCB achieves logarithmic asymptotic regret for any $1$-subgaussian problem. A compromise between the undemanding sublinear and the perhaps overly restrictive logarithmic growth is to require that the policy has subpolynomial regret growth, a choice that has historically been the most used, and which we will also take. The resulting policies are said to be consistent.

Asymptotic lower bound

To firm up the ideas, define $\cE_K'$ as the class of $K$-action stochastic bandits where each action has a reward distribution with $1$-subgaussian noise. Further, let $\cE_K'(\mathcal N)\subset \cE_K'$ be those environments in $\cE_K'$ where the reward distributions are Gaussian. Note that while for minimax lower bounds it is important to assume that the sub-optimality gaps are bounded from above, this is not required in the asymptotic setting. The reason is that any reasonable (that word again) strategy should choose each arm at least once, which does not affect the asymptotic regret at all, but does affect the finite-time minimax regret.

Definition (consistency): A policy $\pi$ is consistent over $\cE$ if for all bandits $\nu \in \cE$ and for all $p>0$ it holds that
\begin{align}\label{eq:subpoly}
R_n(\pi,\nu) = O(n^p)\, \quad \text{ as } n \to\infty\,.
\end{align}
We denote the class of consistent policies over $\cE$ by $\Pi_{\text{cons}}(\cE)$.

Note that $\eqref{eq:subpoly}$ trivially holds for $p\ge 1$. Note also that because $n \mapsto n^p$ is positive-valued, the above is equivalent to saying that for each $\nu \in \cE$ and $p>0$ there exists a constant $C_{\nu,p}>0$ such that $R_n(\pi,\nu) \le C_{\nu,p} n^p$ for all $n\in \N$. In fact, one can also show that replacing $O(\cdot)$ with $o(\cdot)$ above has no effect on whether a policy is called consistent or not. Just like in statistics, consistency is a purely asymptotic notion.

We can see that UCB is consistent (its regret is logarithmic) over $\cE_K'$. On the other hand, the strategy that always chooses a fixed action $i\in [K]$ is not consistent (if $K > 1$) over any reasonable $\cE$, since its regret is linear in any $\nu$ where $i$ is suboptimal.

The main theorem of this post is that no consistent strategy can outperform UCB asymptotically in the Gaussian case:

Theorem (Asymptotic lower bound for Gaussian environments): For any policy $\pi$ consistent over the $K$-action unit-variance Gaussian environments $\cE_K'(\mathcal N)$ and any $\nu\in \cE_K'(\mathcal N)$, it holds that
\begin{align*}
\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \geq c^*(\nu)\,,
\end{align*}
where recall that $c^*(\nu)= \sum_{i:\Delta_i(\nu)>0} \frac{2}{\Delta_i(\nu)}$.

Since UCB is consistent for $\cE_K'\supset \cE_K'(\mathcal N)$,
\begin{align*}
c^*(\nu) \le \inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
\le
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)}\le c^*(\nu)\,,
\end{align*}
that is
\begin{align*}
\inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
=
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)} = c^*(\nu)\,.
\end{align*}
Thus, we see that UCB is asymptotically optimal over the class of unit variance Gaussian environments $\cE_K'(\mathcal N)$.

Proof
Pick a $\cE_K'(\mathcal N)$-consistent policy $\pi$ and a unit variance Gaussian environment $\nu =(P_1,\dots,P_K)\in \cE_K'(\mathcal N)$ with mean rewards $\mu \in \R^K$. We need to lower bound $R_n \doteq R_n(\pi,\nu)$. Let $\Delta_i \doteq \Delta_i(\nu)$. Based on the basic regret decomposition identity $R_n = \sum_i \Delta_i \E_{\nu,\pi}[T_i(n)]$, it is not hard to see that the result will follow if for any action $i\in [K]$ that is suboptimal in $\nu$ we prove that
\begin{align}
\liminf_{n\to\infty} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge \frac{2}{\Delta_i^2}\,.
\label{eq:armlowerbound}
\end{align}
As in the previous lower bound proof, we assume that all random variables are defined over the canonical bandit measurable space $(\cH_n,\cF_n)$, where $\cH_n = ([K]\times \R)^n$, $\cF_n$ is the restriction of $\mathcal L(\R^{2n})$ to $\cH_n$, and $\PP_{\nu,\pi}$ (which induces $\E_{\nu,\pi}$) is the distribution of $n$-round action-reward histories induced by the interconnection of $\nu$ and $\pi$.

The intuitive logic of the lower bound is as follows: $\E_{\nu,\pi}[T_i(n)]$ must be (relatively) large because in any environment $\nu'$ where (as opposed to what happens in $\nu$) action $i$ is optimal, the regret must be small by our assumption that $\pi$ is a “reasonable” policy, and for this $\E_{\nu',\pi}[T_i(n)]$ must be large. However, if the gap $\Delta_i$ is small, then $\nu'$ can be chosen close to $\nu$, in which case the distributions $\PP_{\nu,\pi}$ and $\PP_{\nu',\pi}$ are also close, and then $\E_{\nu,\pi}[T_i(n)]$ will necessarily be large whenever $\E_{\nu',\pi}[T_i(n)]$ is large.

As in the previous lower bound proof, the divergence decomposition identity and the high-probability Pinsker inequality will help us to carry out this argument. In particular, from the divergence decomposition lemma, we get that for any $\nu'=(P'_1,\dots,P_K')$ such that $P_j=P_j'$ except for $j=i$,
\begin{align}
\KL(\PP_{\nu,\pi}, \PP_{\nu',\pi}) = \E_{\nu,\pi}[T_i(n)] \, \KL(P_i, P'_i)\,,
\label{eq:infoplem}
\end{align}
while the high-probability Pinsker inequality gives that for any event $A$,
\begin{align*}
\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)\ge \frac12 \exp(-\KL(\PP_{\nu,\pi}, \PP_{\nu',\pi}) )\,,
\end{align*}
or equivalently,
\begin{align}
\KL(\PP_{\nu,\pi},\PP_{\nu',\pi}) \ge \log \frac{1}{2 (\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c))}\,.
\label{eq:pinsker}
\end{align}
We now upper bound $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ in terms of $R_n+R_n'$, where $R_n' = R_n(\pi,\nu')$. For this, choose $P_i' = \mathcal N(\mu_i+\lambda,1)$ with $\lambda>\Delta_i$ to be selected later. Further, we choose $A= \{ T_i(n)\ge n/2 \}$ and thus $A^c = \{ T_i(n) < n/2 \}$. These choices make $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ small if $R_n$ and $R_n'$ are both small. Indeed, since $i$ is suboptimal in $\nu$, $i$ is optimal in $\nu'$ and for any $j\ne i$, $\Delta_j(\nu') \ge \lambda-\Delta_i$,
\begin{align*}
R_n \ge \frac{n\Delta_i}{2} \PP_{\nu,\pi}(T_i(n)\ge n/2 ) \qquad \text{ and } \qquad R_n' \ge \frac{n (\lambda-\Delta_i) }{2} \PP_{\nu',\pi}( T_i(n) < n/2 )\,.
\end{align*}
Summing these up, lower bounding $\Delta_i/2$ and $(\lambda-\Delta_i)/2$ by $\kappa(\Delta_i,\lambda) \doteq \frac{\min( \Delta_i, \lambda-\Delta_i )}{2}$, and then reordering gives
\begin{align*}
\PP_{\nu,\pi}( T_i(n) \ge n/2 ) + \PP_{\nu',\pi}( T_i(n) < n/2 ) \le \frac{R_n+R_n'}{ \kappa(\Delta_i,\lambda) \,n}
\end{align*}
as promised. Combining this inequality with $\eqref{eq:infoplem}$ and $\eqref{eq:pinsker}$ and using $\KL(P_i,P'_i) = \lambda^2/2$ we get
\begin{align}
\frac{\lambda^2}{2}\E_{\nu,\pi}[T_i(n)] \ge \log\left( \frac{\kappa(\Delta_i,\lambda)}{2} \frac{n}{R_n+R_n'} \right) = \log\left( \frac{\kappa(\Delta_i,\lambda)}{2}\right) + \log(n) - \log(R_n+R_n')\,.
\end{align}
Dividing both sides by $\log(n)$, we see that it remains to upper bound $\frac{\log(R_n+R_n')}{\log(n)}$. For this, note that by consistency, for any $p>0$ there exists a constant $C>0$ such that $R_n+ R_n' \le C n^p$ for all $n$. Hence, $\log(R_n+R_n') \le \log(C) + p\log(n)$, from which we get that $\limsup_{n\to\infty} \frac{\log(R_n+R_n')}{\log(n)} \le p$. Since $p>0$ was arbitrary, it follows that this also holds for $p=0$. Hence,
\begin{align}
\liminf_{n\to\infty}
\frac{\lambda^2}{2} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge
1 - \limsup_{n\to\infty} \frac{\log(R_n+R_n')}{\log(n)} \ge 1\,.
\end{align}
Taking the infimum of both sides over $\lambda>\Delta_i$ and reordering gives $\eqref{eq:armlowerbound}$, thus finishing the proof.

QED
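
The Gaussian identity used in the final step, $\KL(\mathcal N(\mu,1), \mathcal N(\mu+\lambda,1)) = \lambda^2/2$, can be confirmed numerically; the check below assumes SciPy is available.

```python
from scipy import integrate  # SciPy is assumed to be available
from scipy.stats import norm

# Numerical check of the Gaussian identity used above:
# KL(N(mu, 1), N(mu + lambda, 1)) = lambda^2 / 2 (here with mu = 0).
def kl_gauss(lam):
    f = lambda x: norm.pdf(x) * (norm.logpdf(x) - norm.logpdf(x, loc=lam))
    val, _ = integrate.quad(f, -12.0, 12.0 + lam)
    return val

for lam in (0.1, 0.5, 2.0):
    assert abs(kl_gauss(lam) - lam ** 2 / 2) < 1e-6
print("KL(N(mu,1), N(mu+lambda,1)) = lambda^2/2 confirmed numerically")
```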

Instance-dependent finite-time lower bounds

Is it possible to go beyond asymptotic instance optimality? Yes, of course! All we have to do is to redefine what is meant by a “reasonable” algorithm. One idea is to call an algorithm reasonable when its finite-time(!) regret is close to minimax optimal. The situation is illustrated in the graph below:

Illustration of optimality concepts. The $x$ axis lists instances, ordered so that the “green” graph looks nice. Shown are the regret of various policies in different colors. While a near minimax algorithm can perform equally over all instances (and never much worse than the largest of the curves’ minimum values), a near instance optimal algorithm is required to come close to what is best on every instance using reasonable algorithms. The reasonable class can be the algorithms which are near minimax optimal.

In this part we will show instance-optimal finite-time lower bounds building on the ideas of the previous proof. This proof by and large uses the same ideas as the proof of the minimax result in the last post. This suggests that for future reference it may be useful to extract the common part, which summarizes what can be obtained by chaining the high-probability Pinsker inequality with the divergence decomposition lemma:

Lemma (Instance-dependent lower bound): Let $\nu = (P_i)_i$ and $\nu'=(P_i')_i$ be $K$-action stochastic bandit environments that differ only in the distribution of the rewards for action $i\in [K]$. Further, assume that $i$ is suboptimal in $\nu$ and is the unique optimal action in $\nu'$. Let $\lambda = \mu_i(\nu')-\mu_i(\nu)$ and $\Delta_i = \Delta_i(\nu)$. Then, for any policy $\pi$, with $R_n = R_n(\pi,\nu)$ and $R_n' = R_n(\pi,\nu')$,
\begin{align}\label{eq:idalloclb}
\KL(P_i,P_i') \E_{\nu,\pi} [T_i(n)] \ge \log\left( \frac{\min\set{\lambda - \Delta_i, \Delta_i}}{4}\right) + \log(n) - \log(R_n+R_n')\,.
\end{align}

Note that the lemma holds for finite $n$ and any $\nu$, and can be used to derive finite-time instance-dependent lower bounds for any environment class $\cE$ that is rich enough to contain, for each $\nu \in \cE$, an alternative $\nu'$ of the above kind, and for policies that are uniformly good over $\cE$. For illustration consider the Gaussian unit-variance environments $\cE\doteq \cE_K'(\mathcal N)$ and policies $\pi$ whose regret, on any $\nu\in \cE$, is bounded by $C n^p$ for some $C>0$ and $p>0$. Call this class $\Pi(\cE,C,n,p)$, the class of $p$-order policies. Note that UCB is in this class with $C = C'\sqrt{K \log(n)}$ and $p=1/2$ for some $C'>0$. Thus, for $p\ge 1/2$ and suitable $C$ this class is not empty.

As an immediate corollary of the previous lemma we get the following result:

Theorem (Finite-time, instance-dependent lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$ and $\nu \in \cE$, the regret of $\pi$ on $\nu$ satisfies
\begin{align}
R_n(\pi,\nu) \ge
\frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{(1-p)\log(n) + \log(\frac{\Delta_i}{8C})}{\Delta_i}\right)^+\,,
\label{eq:finiteinstancebound}
\end{align}
where $(x)^+ = \max(x,0)$ is the positive part of $x\in \R$.

Proof: Given $\nu=(\mathcal N(\mu_i,1))_i$ and $i$ such that action $i$ is suboptimal in $\nu$, choose $\nu' = (\mathcal N(\mu_j',1))_j$ such that $\mu_j'=\mu_j$ unless $j=i$, in which case we set $\mu_i' = \mu_i+\lambda$ with $\Delta_i(\nu) < \lambda\le 2\Delta_i(\nu)$. Then, $\nu'\in \cE$ and from $\log(R_n+R_n') \le \log(2C)+p\log(n)$ and $\eqref{eq:idalloclb}$, we get
\begin{align*}
\E_{\nu,\pi} [T_i(n)]
\ge \frac{2}{\lambda^2}\left(\log\left(\frac{n}{2Cn^p}\right) + \log\left(\frac{\min\set{\lambda - \Delta_i, \Delta_i}}{4}\right)\right)\,.
\end{align*}
Choosing $\lambda = 2\Delta_i$ and plugging this into the basic regret decomposition identity gives the result.

QED

With $p=1/2$ and $C = C'\sqrt{K}$, where $C'>0$ is chosen so that $\Pi(\cE,C'\sqrt{K},n,1/2)\ne \emptyset$, we get for any policy $\pi$ that is “near-minimax optimal” for $\cE$ that
\begin{align}
R_n(\pi,\nu) \ge \frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{\frac12\log(n) + \log(\frac{\Delta_i}{8C' \sqrt{K}})}{\Delta_i}\right)^+\,.
\label{eq:indnearminimaxlb}
\end{align}
The main difference to the asymptotic lower bound is twofold: First, when $\Delta_i$ is very small, the corresponding term is eliminated from the lower bound: for small $\Delta_i$, the price of playing action $i$ over a horizon of $n$ rounds is negligible. Second, even when the $\Delta_i$ are all relatively large, the leading term in this lower bound is approximately half that of $c^*(\nu)$. This effect may be real: the class of policies considered here is larger than in the asymptotic lower bound, hence there is the possibility that the policy that is best tuned for a given environment achieves a smaller regret. At the same time, we only see a constant factor difference.

The lower bound $\eqref{eq:indnearminimaxlb}$ should be compared to the upper bound derived for UCB. In particular, recall that the simplified form of an upper bound was $\sum_{i:\Delta_i>0} \frac{C \log(n)}{\Delta_i}$. We see that the main difference to the above bound is the lack of the second term in $\eqref{eq:indnearminimaxlb}$. With some extra work, this gap can also be closed.
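
To see how the two sides compare numerically, the sketch below evaluates the lower bound \eqref{eq:indnearminimaxlb} next to a UCB-style upper bound of the form $\sum_{i:\Delta_i>0} C_{\mathrm{ucb}}\log(n)/\Delta_i$; the constants $C'$ and $C_{\mathrm{ucb}}$ and the gap vector are illustrative guesses, not values derived in the posts.

```python
import math

# Evaluating the finite-time, instance-dependent lower bound (near-minimax case,
# p = 1/2, C = C' sqrt(K)) next to a UCB-style upper bound. Constants are illustrative.
def instance_lower_bound(gaps, n, c_prime=1.0):
    K = len(gaps)
    return 0.5 * sum(
        max(0.0, 0.5 * math.log(n) + math.log(d / (8 * c_prime * math.sqrt(K)))) / d
        for d in gaps if d > 0)

def ucb_style_upper_bound(gaps, n, c_ucb=8.0):
    return sum(c_ucb * math.log(n) / d for d in gaps if d > 0)

gaps, n = [0.0, 0.2, 0.5, 1.0], 100_000
print("lower bound:", round(instance_lower_bound(gaps, n), 1))
print("upper bound:", round(ucb_style_upper_bound(gaps, n), 1))
```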

We now argue that the previous lower bound is not too weak, in the sense that it allows us to recover our previous minimax lower bound. In particular, choosing $\Delta_i= 8 \rho C n^{p-1}$ uniformly for all but a single optimal action, we get $(1-p)\log(n)+\log(\frac{\Delta_i}{8C}) = \log(\rho)$. Plugging this into $\eqref{eq:finiteinstancebound}$ and using $\sup_{\rho>0}\log(\rho)/\rho = \exp(-1) \ge 0.35$ gives the following corollary:

Corollary (Finite-time lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$,
\begin{align*}
\sup_{\nu\in \cE} R_n(\pi,\nu) \ge \frac{0.35}{16} \frac{K-1}{C} n^{1-p} \,.
\end{align*}

When $C = C'\sqrt{K}$ and $p=1/2$, we get the familiar $\Omega( \sqrt{Kn} )$ lower bound. However, note the difference: whereas the previous lower bound was true for any policy, this lower bound holds only for policies in $\Pi(\cE,C'\sqrt{K},n,1/2)$. Nevertheless, it is reassuring that the instance-dependent lower bound is able to recover the minimax lower bound for the “near-minimax” policies. And to emphasize the distinction even further: in our minimax lower bound we only showed that there exists an environment for which the lower bound holds, while the result above holds no matter which arm we choose to be the optimal one. This is only possible because we restricted the class of algorithms.
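
A small check that the substitution $\Delta_i = 8\rho C n^{p-1}$ with $\rho = e$ indeed recovers the rate claimed in the corollary; the $(K, n, C, p)$ values are arbitrary test points.

```python
import math

# Checking that Delta_i = 8*rho*C*n^(p-1) (with rho = e) plugged into the
# instance-dependent bound recovers the worst-case rate of the corollary.
def check(K, n, C, p):
    rho = math.e
    Delta = 8 * rho * C * n ** (p - 1)
    per_arm = max(0.0, (1 - p) * math.log(n) + math.log(Delta / (8 * C))) / Delta
    instance_bound = 0.5 * (K - 1) * per_arm           # eq. (finiteinstancebound)
    corollary_bound = 0.35 / 16 * (K - 1) / C * n ** (1 - p)
    assert instance_bound >= corollary_bound - 1e-9
    return instance_bound, corollary_bound

for K, n, C, p in [(10, 10_000, math.sqrt(10), 0.5), (2, 1_000_000, 1.0, 0.25)]:
    ib, cb = check(K, n, C, p)
    print(f"K={K}, n={n}: instance bound {ib:,.1f} >= corollary bound {cb:,.1f}")
```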

Summary

So UCB is close to minimax optimal (last post), and now we have seen that it is also asymptotically optimal among the class of consistent strategies, and almost optimal on every instance among the near-minimax optimal algorithms even in a non-asymptotic sense. One might ask whether there is anything left to do, and as always the answer is: yes, lots!

For one, we have only stated the results for the Gaussian case. Similar results are easily shown for alternative classes. A good exercise is to repeat the derivations in this post for the class of Bernoulli bandits, where the rewards of all arms are sampled from a Bernoulli distribution.

There is also the question of showing bounds that hold with high probability rather than in expectation. So far we have not discussed any properties of the distribution of the regret other than its mean, which we have upper bounded for UCB and lower bounded here and in the last post. In the next few posts we will develop algorithms for which the regret is bounded with high probability and so the lower bounds of this nature will be delayed for a little while.

Notes, references

There are now a wide range of papers with lower bounds for bandits. For asymptotic results the first article below was the earliest that we know of, while others are generalizations.

For non-minimax results that hold in finite time, a variety of approaches have appeared very recently (along with one older one). They are:

Finally, for the sake of completeness we note that minimax bounds are available in the following articles. The second of which deals with the high-probability case mentioned above.

Note 1. All of the above results treat all arms symmetrically. Sometimes one might want an algorithm that treats the arms differently and for which the regret guarantee depends on which arm is optimal. We hope eventually to discuss the most natural approach to this problem, which is to use a Bayesian strategy (though we will be interested in it for other reasons). Meanwhile, there are modifications of UCB that treat the actions asymmetrically so that the regret is small if certain arms are optimal and larger otherwise.

We simply refer the interested reader to a paper on the Pareto regret frontier for bandits, which fully characterizes the trade-offs available to asymmetric algorithms in terms of the worst-case expected regret when comparing to a fixed arm.

Note 2. The leading constant in the instance dependent bound is sub-optimal. This can be improved by a more careful choice of $\lambda$ that is closer to $\Delta_i$ than $2\Delta_i$ and dealing with the lower-order terms that this introduces.