
Lower Bounds for Stochastic Linear Bandits

Lower bounds for linear bandits turn out to be more nuanced than the finite-armed case. The big difference is that for linear bandits the shape of the action-set plays a role in the form of the regret, not just the distribution of the noise. This should not come as a big surprise because the stochastic finite-armed bandit problem can be modeled as a linear bandit with actions being the standard basis vectors, $\cA = \set{e_1,\ldots,e_K}$. In this case the actions are orthogonal, which means that samples from one action do not give information about the rewards for other actions. Other action sets such as the sphere ($\cA = S^d = \{x \in \R^d : \norm{x}_2 = 1\}$) do not share this property. For example, if $d = 2$ and $\cA = S^d$ and an algorithm chooses actions $e_1 = (1,0)$ and $e_2 = (0,1)$ many times, then it can deduce the reward it would obtain from choosing any other action.

We will prove a variety of lower bounds under different assumptions. The first three have a worst-case flavor showing what is (not) achievable in general, or under a sparsity constraint, or if the realizability assumption is not satisfied. All of these are proven using the same information-theoretic tools that we have seen for previous results, combined with careful choices of action sets and environment classes. The difficulty is always in guessing what the worst case is, which is followed by simply turning the crank on the usual machinery.

Besides the worst-case results we also give an optimal asymptotic lower bound for finite action sets that generalizes the asymptotic lower bound for finite-armed stochastic bandits given in a previous post. The proof of this result is somewhat more technical, but follows the same general flavor as the previous asymptotic lower bounds.

We use a simple model with Gaussian noise. For action $A_t \in \cA \subseteq \R^d$ the reward is $X_t = \shortinner{A_t, \theta} + \eta_t$ where $\eta_t\sim \mathcal N(0,1)$ is the standard Gaussian noise term and $\theta \in \Theta \subset \R^d$. The regret of a strategy is:
\begin{align*}
R_n(\cA, \theta) = \E\left[\sum_{t=1}^n \shortinner{x^* - A_t, \theta}\right]\,,
\end{align*}
where $x^* = \argmax_x \shortinner{x, \theta}$ is the optimal action. Note that the arguments to the regret function differ from those used in some previous posts. In general we include the quantities of interest, which in this case (with a fixed strategy and noise model) are the action set and the unknown parameter. As for finite-armed bandits we define the sub-optimality gap of arm $x \in \cA$ by $\Delta_x = \max_{y \in \cA} \shortinner{y – x, \theta}$ and $\Delta_{\min} = \inf \set{\Delta_x : x \in \cA \text{ and } \Delta_x > 0}$. Note that the latter quantity can be zero if $\cA$ is infinitely large, but is non-zero if there are finitely many arms and the problem is non-trivial (there exists a sub-optimal arm). If $A$ is a matrix, then $\lambda_{\min}(A)$ is its smallest eigenvalue. We also recall the notation used for finite-armed bandits by defining $T_x(t) = \sum_{s=1}^t \one{A_s = x}$.
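
To make the setup concrete, here is a minimal Python sketch of this model (not from the post). The `policy` interface is our own invention: it maps the observed history to an index into a finite action set, and we track the random pseudo-regret $\sum_{t=1}^n \Delta_{A_t}$.

```python
import numpy as np

def run_linear_bandit(actions, theta, policy, n, rng=None):
    """Simulate n rounds of the Gaussian linear bandit X_t = <A_t, theta> + eta_t.

    actions: (K, d) array of available actions (a finite action set).
    policy:  callable(history) -> row index into `actions`; `history` is a list of
             (action, reward) pairs observed so far (a hypothetical interface).
    Returns the history and the random pseudo-regret sum_t <x* - A_t, theta>.
    """
    rng = np.random.default_rng() if rng is None else rng
    best_value = max(a @ theta for a in actions)
    history, pseudo_regret = [], 0.0
    for _ in range(n):
        a = actions[policy(history)]
        reward = a @ theta + rng.standard_normal()   # eta_t ~ N(0, 1)
        history.append((a, reward))
        pseudo_regret += best_value - a @ theta      # gap Delta_{A_t}
    return history, pseudo_regret

# Example: the finite-armed case as a linear bandit with standard basis actions.
d = 3
actions = np.eye(d)
theta = np.array([0.5, 0.1, 0.0])
_, reg = run_linear_bandit(actions, theta, policy=lambda h: np.random.randint(d), n=1000)
print(reg)
```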

Worst case bounds

Our worst case bound relies on a specific action set and shows that the $\tilde O(d \sqrt{n})$ upper bound for the linear version of UCB cannot be improved in general, except (most likely) for the logarithmic factors.

Theorem Let the action set be $\mathcal A = \set{-1, 1}^d$. Then for any strategy there exists a $\theta \in \Theta = \set{-\sqrt{1/n}, \sqrt{1/n}}^d$ such that
\begin{align*}
R_n(\cA, \theta) \geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.
\end{align*}

Proof
For $\theta \in \Theta$ we denote $\mathbb{P}_\theta$ to be the measure on outcomes $A_1, X_1,\ldots,A_n,X_n$ induced by the interaction of the strategy and the bandit determined by $\theta$. By the relative entropy identity we have for $\theta, \theta’ \in \Theta$ that
\begin{align}
\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta’}) = \frac{1}{2} \sum_{t=1}^n \E\left[\shortinner{A_t, \theta – \theta’}^2\right]\,.
\label{eq:linear-kl}
\end{align}
For $1 \leq i \leq d$ and $\theta \in \Theta$ define
\begin{align*}
p_{\theta,i} = \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2\right)\,.
\end{align*}
Now let $1 \leq i \leq d$ and $\theta \in \Theta$ be fixed and let $\theta’ = \theta$ except for $\theta’_i = -\theta_i$. Then by the high probability version of Pinsker’s inequality and (\ref{eq:linear-kl}) we have
\begin{align}
\label{eq:sphere-kl}
p_{\theta,i} + p_{\theta’,i}
&\geq \frac{1}{2} \exp\left(-\frac{1}{2}\E\left[\sum_{t=1}^n \shortinner{A_t, \theta - \theta'}^2\right]\right)
= \frac{1}{2} \exp\left(-2\right)\,.
\end{align}
Therefore using the notation $\sum_{\theta_{-i}}$ as an abbreviation for $\sum_{\theta_1,\ldots,\theta_{i-1},\theta_{i+1},\ldots,\theta_d \in \set{\pm \sqrt{1/n}}^{d-1}}$,
\begin{align*}
\sum_{\theta \in \Theta} 2^{-d} \sum_{i=1}^d p_{\theta,i}
&= \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \sum_{\theta_i \in \set{\pm \sqrt{1/n}}} p_{\theta,i} \\
&\geq \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \cdot \frac{1}{2}\exp\left(-2\right) \\
&= \frac{d}{4} \exp\left(-2\right)\,.
\end{align*}
Therefore there exists a $\theta \in \Theta$ such that
\begin{align*}
\sum_{i=1}^d p_{\theta,i} \geq \frac{d}{4} \exp\left(-2\right)\,.
\end{align*}
Let $x^* = \argmax_{x \in \cA} \shortinner{x, \theta}$. Then by the definition of $p_{\theta,i}$, the regret for this choice of $\theta$ is at least
\begin{align*}
R_n(\cA, \theta)
&= \sum_{t=1}^n \E_\theta\left[\shortinner{x^* – A_t, \theta}\right] \\
&= 2\sqrt{\frac{1}{n}} \sum_{i=1}^d \sum_{t=1}^n \mathbb{P}_\theta\left(\sign(A_{t,i}) \neq \sign(\theta_i)\right) \\
&\geq \sqrt{n} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2 \right) \\
&= \sqrt{n} \sum_{i=1}^d p_{\theta,i}
\geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.
\end{align*}
QED
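
The two facts doing the work here are that $\shortinner{A_t, \theta - \theta'}^2 = 4/n$ no matter which action is chosen, and that the instantaneous regret decomposes over the coordinates with the wrong sign. The short numerical check below (ours, purely illustrative) verifies both on a random instance from the construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 100

# A hard instance from the proof: theta in {-1/sqrt(n), +1/sqrt(n)}^d,
# and theta' obtained by flipping a single coordinate i.
theta = rng.choice([-1.0, 1.0], size=d) / np.sqrt(n)
i = 2
theta_p = theta.copy()
theta_p[i] = -theta[i]

A = rng.choice([-1.0, 1.0], size=(n, d))       # arbitrary actions in {-1, 1}^d
x_star = np.sign(theta)                        # optimal action for theta

# KL exponent: sum_t <A_t, theta - theta'>^2 = 4, independent of the actions.
print(np.sum((A @ (theta - theta_p)) ** 2))    # -> 4.0 (up to floating point)

# Per-coordinate regret decomposition used in the proof:
# <x* - A_t, theta> = (2/sqrt(n)) * #{coordinates with the wrong sign}.
lhs = (x_star - A) @ theta
rhs = 2 / np.sqrt(n) * np.sum(np.sign(A) != np.sign(theta), axis=1)
print(np.allclose(lhs, rhs))                   # -> True
```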

Sparse case

We now tackle the sparse case where the underlying parameter $\theta$ is assumed to satisfy $\norm{\theta}_0 = \sum_{i=1}^d \one{\theta_i \neq 0} \leq p$ for some $p$ that is usually much smaller than $d$. An extreme case is when $p = 1$, which essentially reduces to the finite-armed bandit problem, where we observed the regret is at least $\Omega(\sqrt{dn})$ in the worst case. For this reason we cannot expect too much from sparsity. It turns out that the best one can hope for (in the worst case) is $\Omega(\sqrt{p d n})$, so again the lower bound nearly matches the upper bound of an existing algorithm.

Theorem
Let $2 \leq p\leq d$ be even and define the set of actions $\cA$ by
\begin{align*}
\mathcal A = \set{ x \in \set{0, 1}^d : \sum_{i=1}^d \one{x_i > 0} = \frac{p}{2}}.
\end{align*}
Then for any strategy there exists a $\theta$ with $\norm{\theta}_0 \leq p$ such that
\begin{align*}
R_n \geq \frac{\sqrt{2d p n}}{16} \exp(-1)\,.
\end{align*}

The assumption that $p$ is even is non-restrictive, since in case it is not even the following proof goes through for $p – 1$ and the regret only changes by a very small constant factor. The proof relies on a slightly different construction than the previous result, and is fractionally more complicated because of it.

Proof
Let $\epsilon = \sqrt{2d/(p n)}$ and $\theta$ be given by
\begin{align*}
\theta_i = \begin{cases}
\epsilon\,, & \text{if } i \leq p / 2\,; \\
0\,, & \text{otherwise}\,.
\end{cases}
\end{align*}
Given $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ define $\theta’$ by
\begin{align*}
\theta’_i = \begin{cases}
\epsilon\,, & \text{if } i \leq p / 2\,; \\
2\epsilon\,, & \text{if } i \in S \,; \\
0\,, & \text{otherwise}\,.
\end{cases}
\end{align*}
Let $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta’}$ be the measures on the sequence of observations when a fixed strategy interacts with the bandits induced by $\theta$ and $\theta’$ respectively. Then the usual computation shows that
\begin{align*}
\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta’}) = 2\epsilon^2 \E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right]\,.
\end{align*}
By the pigeonhole principle we can choose an $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ in such a way that
\begin{align*}
\E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right] \leq \frac{n p}{d}\,.
\end{align*}
Therefore, using this $S$ and the high-probability version of Pinsker's inequality, we have for any event $A$ that
\begin{align*}
\mathbb{P}_\theta(A) + \mathbb{P}_{\theta’}(A^c) \geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta’})\right)
\geq \frac{1}{2} \exp\left(-\frac{np\epsilon^2}{2d}\right)
\geq \frac{1}{2} \exp\left(-1\right)\,.
\end{align*}
Choosing $A = \set{\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} > 0} \geq np/4}$ leads to
\begin{align*}
R_n(\mathcal A, \theta) &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta}(A) &
R_n(\mathcal A, \theta’) &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta’}(A^c)\,.
\end{align*}
Therefore
\begin{align*}
\max\set{R_n(\cA, \theta),\, R_n(\cA, \theta’)}
\geq \frac{n\epsilon p}{8} \left(\mathbb{P}_{\theta}(A) + \mathbb{P}_{\theta’}(A^c)\right)
\geq \frac{\sqrt{2ndp}}{16} \exp(-1)\,.
\end{align*}
QED
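
The construction is easy to instantiate numerically. The sketch below (ours, with arbitrary parameter values) builds $\theta$ and one choice of $\theta'$, confirms which actions are optimal in each bandit, and checks that an action is informative about the change from $\theta$ to $\theta'$ only through its coordinates in $S$.

```python
import numpy as np
from itertools import combinations

d, p, n = 8, 4, 10_000
eps = np.sqrt(2 * d / (p * n))

theta = np.zeros(d); theta[: p // 2] = eps
S = [5, 6]                        # a subset of the last d - p/2 coordinates, |S| = p/2 (0-based)
theta_p = theta.copy(); theta_p[S] = 2 * eps

# Action set: all 0/1 vectors with exactly p/2 ones.
actions = np.array([[int(i in c) for i in range(d)]
                    for c in combinations(range(d), p // 2)], dtype=float)

# Optimal actions: the first p/2 coordinates under theta, the set S under theta'.
print(actions[np.argmax(actions @ theta)])    # -> e_1 + ... + e_{p/2}
print(actions[np.argmax(actions @ theta_p)])  # -> sum of e_i for i in S

# The two parameters differ only on S, so <x, theta' - theta> = 2*eps*sum_{i in S} x_i:
# an action only gains information about the change through its coordinates in S.
print(np.allclose(actions @ (theta_p - theta), 2 * eps * actions[:, S].sum(axis=1)))  # -> True
```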

Unrealizable case

An important generalization of the linear model is the unrealizable case, where the mean rewards are not assumed to follow a linear model exactly. Suppose that $\cA \subset \R^d$ and that the mean reward $\E[X_t|A_t = x] = \mu_x$ does not necessarily satisfy a linear model. It would be very pleasant to have an algorithm such that if $\mu_x = \shortinner{x, \theta}$ for all $x$, then
\begin{align*}
R_n(\cA, \mu) = \tilde O(d \sqrt{n})\,,
\end{align*}
while if there exists an $x \in \cA$ such that $\mu_x \neq \shortinner{x, \theta}$, then $R_n(\cA, \mu) = \tilde O(\sqrt{nK})$ recovers the UCB bound. That is, an algorithm that enjoys the bound of OFUL if the linear model is correct, but recovers the regret of UCB otherwise. Of course one could hope for something even stronger, for example that
\begin{align}
R_n(\cA, \mu) = \tilde O\left(\min\set{\sqrt{Kn},\, d\sqrt{n} + n\epsilon}\right)\,, \label{eq:hope}
\end{align}
where $\epsilon = \min_{\theta \in \R^d} \max_{x \in \cA} |\mu_x – \shortinner{x, \theta}|$ is called the approximation error of the class of linear models. Unfortunately it turns out that results of this kind are not achievable. To show this we will prove a generic bound for the classical finite-armed bandit problem, and then show how this implies a lower bound on the ability to be adaptive to a linear model if possible and have acceptable regret if not.

Theorem
Let $\cA = \set{e_1,\ldots,e_K}$ be the standard basis vectors. Now define sets $\Theta, \Theta’ \subset \R^{K}$ by
\begin{align*}
\Theta &= \set{\theta \in [0,1]^K : \theta_i = 0 \text{ for } i > 1} \\
\Theta’ &= \set{\theta \in [0,1]^K}\,.
\end{align*}
If $2(K-1) \leq V \leq \sqrt{n(K-1)\exp(-2)/8}$ and $\sup_{\theta \in \Theta} R_n(\cA, \theta) \leq V$, then
\begin{align*}
\sup_{\theta’ \in \Theta’} R_n(\cA, \theta’) \geq \frac{n(K-1)}{8V} \exp(-2)\,.
\end{align*}

Proof
Let $\theta \in \Theta$ be given by $\theta_1 = \Delta = (K-1)/V \leq 1/2$. Therefore
\begin{align*}
\sum_{i=2}^K \E[T_i(n)] \leq \frac{V}{\Delta}
\end{align*}
and so by the pigeonhole principle there exists an $i > 1$ such that
\begin{align*}
\E[T_i(n)] \leq \frac{V}{(K-1)\Delta} = \frac{1}{\Delta^2}\,.
\end{align*}
Then define $\theta’ \in \Theta’$ by
\begin{align*}
\theta’_j = \begin{cases}
\Delta & \text{if } j = 1 \\
2\Delta & \text{if } j = i \\
0 & \text{otherwise}\,.
\end{cases}
\end{align*}
Then by the usual argument for any event $A$ we have
\begin{align*}
\mathbb{P}_\theta(A) + \mathbb{P}_{\theta’}(A^c)
\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'})\right)
= \frac{1}{2} \exp\left(-2 \Delta^2 \E[T_i(n)]\right)
\geq \frac{1}{2} \exp\left(-2\right)\,.
\end{align*}
Therefore
\begin{align*}
R_n(\mathcal A, \theta) + R_n(\mathcal A, \theta’)
\geq \frac{n\Delta}{4} \exp(-2) = \frac{n(K-1)}{4V} \exp(-2)
\end{align*}
Therefore by the assumption that $R_n(\mathcal A, \theta) \leq V \leq \sqrt{n(K-1) \exp(-2)/8}$ we have
\begin{align*}
R_n(\mathcal A, \theta’) \geq \frac{n(K-1)}{8V} \exp(-2)\,.
\end{align*}
In particular, $V \cdot \sup_{\theta' \in \Theta'} R_n(\cA, \theta') \geq \frac{n(K-1)}{8} \exp(-2)$, as required.
QED

As promised we now relate this to the unrealizable linear bandits. Suppose that $d = 1$ (an absurd case) and that there are $K$ arms $\cA = \set{x_1, x_2,\ldots, x_{K}}$ where $x_1 = (1)$ and $x_i = (0)$ for $i > 1$. Clearly if the reward is really linear and $\theta > 0$, then the first arm is optimal, while otherwise any of the other arms have the same expected reward (of just $0$). Now simply add $K-1$ coordinates to each action so that the error in the 1-dimensional linear model can be modeled in a higher dimension and we have exactly the model used in the previous theorem. So $\cA = \set{e_1,e_2,\ldots,e_K}$. Then the theorem shows that (\ref{eq:hope}) is a pipe dream. If $R_n(\cA, \theta) = \tilde O(\sqrt{n})$ for all $\theta \in \Theta$ (the realizable case), then there exists a $\theta' \in \Theta'$ such that $R_n(\cA, \theta') = \tilde \Omega(K \sqrt{n})$. To our knowledge it is still an open question what is possible on this front. Our conjecture is that there is an algorithm for which
\begin{align*}
R_n(\cA, \theta) = \tilde O\left(\min\set{d\sqrt{n} + \epsilon n,\, \frac{K}{d}\sqrt{n}}\right)\,.
\end{align*}
In fact, it is not hard to design an algorithm that tries to achieve this bound by assuming the problem is realizable, but using some additional time to explore the remaining arms up to some accuracy to confirm the hypothesis. We hope to write a post on this in the future, but leave the claim as a conjecture for now.

Asymptotic lower bounds

Like in the finite-armed case, the asymptotic result is proven only for consistent strategies. Recall that a strategy is consistent in some class if the regret is sub-polynomial for any bandit in that class.

Theorem Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and suppose a strategy satisfies
\begin{align*}
\text{for all } \theta \in \R^d \text{ and } p > 0 \qquad R_n(\cA, \theta) = o(n^p)\,.
\end{align*}
Let $\theta \in \R^d$ be any parameter such that there is a unique optimal action and let $\bar G_n = \E_\theta \left[\sum_{t=1}^n A_t A_t^\top\right]$ be the expected Gram matrix when the strategy interacts with the bandit determined by $\theta$. Then $\liminf_{n\to\infty} \lambda_{\min}(\bar G_n) / \log(n) > 0$ (which implies that $\bar G_n$ is eventually non-singular). Furthermore, for any $x \in \cA$ it holds that:
\begin{align*}
\limsup_{n\to\infty} \log(n) \norm{x}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.
\end{align*}

The reader should recognize $\norm{x}_{\bar G_n^{-1}}^2$ as the key term in the width of the confidence interval for the least squares estimator. This is quite intuitive. The theorem is saying that any consistent algorithm must prove statistically that all sub-optimal arms are indeed sub-optimal. Before the proof of this result we give a corollary that characterizes the asymptotic regret that must be endured by any consistent strategy.

Corollary
Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and $\theta \in \R^d$ be such that there is a unique optimal action. Then for any consistent strategy
\begin{align*}
\liminf_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} \geq c(\cA, \theta)\,,
\end{align*}
where $c(\cA, \theta)$ is defined as
\begin{align*}
&c(\cA, \theta) = \inf_{\alpha \in [0,\infty)^{\cA}} \sum_{x \in \cA} \alpha(x) \Delta_x \\
&\quad\text{ subject to } \norm{x}_{H_\alpha^{-1}}^2 \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA \text{ with } \Delta_x > 0\,,
\end{align*}
where $H_\alpha = \sum_{x \in \cA} \alpha(x) x x^\top$.

Proof of Theorem
The proof of the first part is simply omitted (see the reference below for details). It follows along similar lines to what comes next: essentially, if $\bar G_n$ is not “sufficiently large” in every direction, then some alternative parameter is not sufficiently identifiable. Let $\theta' \in \R^d$ be an alternative parameter and let $\mathbb{P}$ and $\mathbb{P}'$ be the measures on the sequence of outcomes $A_1,X_1,\ldots,A_n,X_n$ induced by the interaction between the strategy and the bandit determined by $\theta$ and $\theta'$ respectively. Then for any event $E$ we have
\begin{align}
\Prob{E} + \mathbb{P}'(E^c)
&\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}, \mathbb{P}’)\right) \nonumber \\
&= \frac{1}{2} \exp\left(-\frac{1}{2} \E\left[\sum_{t=1}^n \inner{A_t, \theta – \theta’}^2\right]\right)
= \frac{1}{2} \exp\left(-\frac{1}{2} \norm{\theta – \theta’}_{\bar G_n}^2\right)\,. \label{eq:linear-asy-kl}
\end{align}
A simple re-arrangement shows that
\begin{align*}
\frac{1}{2} \norm{\theta – \theta’}_{\bar G_n}^2 \geq \log\left(\frac{1}{2 \Prob{E} + 2 \mathbb{P}'(E^c)}\right)
\end{align*}
Now we follow the usual plan of choosing $\theta’$ to be close to $\theta$, but so that the optimal action in the bandit determined by $\theta’$ is not $x^*$. Let $\epsilon \in (0, \Delta_{\min})$ and $H$ be a positive definite matrix to be chosen later such that $\norm{x – x^*}_H^2 > 0$. Then define
\begin{align*}
\theta’ = \theta + \frac{\Delta_x + \epsilon}{\norm{x – x^*}^2_H} H(x – x^*)\,,
\end{align*}
which is chosen so that
\begin{align*}
\shortinner{x – x^*, \theta’} = \shortinner{x – x^*, \theta} + \Delta_x + \epsilon = \epsilon\,.
\end{align*}
This means that $x^*$ is not the optimal action for bandit $\theta’$, and in fact is $\epsilon$-suboptimal. We abbreviate $R_n = R_n(\cA, \theta)$ and $R_n’ = R_n(\cA, \theta’)$. Then
\begin{align*}
R_n
&= \E_\theta\left[\sum_{x \in \cA} T_x(n) \Delta_x\right]
&\geq \frac{n\Delta_{\min}}{2} \Prob{T_{x^*}(n) < n/2} \geq \frac{n\epsilon}{2} \Prob{T_{x^*}(n) < n/2}\,.
\end{align*}
Similarly, $x^*$ is at least $\epsilon$-suboptimal in bandit $\theta'$ so that
\begin{align*}
R_n' \geq \frac{n\epsilon}{2} \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right)\,.
\end{align*}
Therefore
\begin{align}
\Prob{T_{x^*}(n) < n/2} + \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right) \leq \frac{2}{n\epsilon} \left(R_n + R_n'\right)\,.
\label{eq:regret-sum}
\end{align}
Note that this holds for practically any choice of $H$ as long as $\norm{x - x^*}_H > 0$. The logical next step is to select that $H$ (which determines $\theta'$) in such a way that (\ref{eq:linear-asy-kl}) is as large as possible. The main difficulty is that this depends on $n$, so instead we aim to choose an $H$ so the quantity is large enough infinitely often. We start by just re-arranging things:
\begin{align*}
\frac{1}{2} \norm{\theta – \theta’}_{\bar G_n}^2
= \frac{(\Delta_x + \epsilon)^2}{2} \cdot \frac{\norm{x – x^*}_{H \bar G_n H}^2}{\norm{x-x^*}_H^4}
= \frac{(\Delta_x + \epsilon)^2}{2 \norm{x – x^*}_{\bar G_n^{-1}}^2} \rho_n(H)\,,
\end{align*}
where we introduced
\begin{align*}
\rho_n(H) = \frac{\norm{x – x^*}_{\bar G_n^{-1}}^2 \norm{x – x^*}_{H \bar G_n H}^2}{\norm{x – x^*}_H^4}\,.
\end{align*}
Therefore by choosing $E$ to be the event that $T_{x^*}(n) < n/2$ and using (\ref{eq:regret-sum}) and (\ref{eq:linear-asy-kl}) we have
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq \log\left(\frac{n \epsilon}{4R_n + 4R_n'}\right)\,,
\end{align*}
which after re-arrangement leads to
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\log(n)\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq 1 - \frac{\log((4R_n + 4R_n')/\epsilon)}{\log(n)}\,.
\end{align*}
The definition of consistency means that $R_n$ and $R_n'$ are both sub-polynomial, which implies that the second term in the previous expression tends to zero for large $n$ and so by sending $\epsilon$ to zero we see that
\begin{align}
\label{eq:lin-lower-liminf}
\liminf_{n\to\infty} \frac{\rho_n(H)}{\log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2} \geq \frac{2}{\Delta_x^2}\,.
\end{align}
We complete the result using proof by contradiction. Suppose that
\begin{align}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 > \frac{\Delta_x^2}{2}\,.
\label{eq:linear-lower-ass}
\end{align}
Then there exists an $\epsilon > 0$ and infinite set $S \subseteq \N$ such that
\begin{align*}
\log(n) \norm{x – x^*}_{\bar G_n^{-1}}^2 \geq \frac{(\Delta_x + \epsilon)^2}{2} \quad \text{ for all } n \in S\,.
\end{align*}
Therefore by (\ref{eq:lin-lower-liminf}),
\begin{align*}
\liminf_{n \in S} \rho_n(H) > 1\,.
\end{align*}
We now choose $H$ to be a cluster point of the sequence $\{\bar G_n^{-1} / \norm{\bar G_n^{-1}}\}_{n \in S}$ where $\norm{\bar G_n^{-1}}$ is the spectral norm of the matrix $\bar G_n^{-1}$. Such a point must exist, since matrices in this sequence have unit spectral norm by definition, and the set of matrices with bounded spectral norm is compact. We let $S’ \subseteq S$ be a subset so that $\bar G_n^{-1} / \norm{\bar G_n^{-1}}$ converges to $H$ on $n \in S’$. We now check that $\norm{x – x^*}_H > 0$.
\begin{align*}
\norm{x - x^*}_H^2 = \lim_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}}}{\norm{\bar G_n^{-1}}}
> 0\,,
\end{align*}
where the last inequality follows from the assumption in (\ref{eq:linear-lower-ass}) and the first part of the theorem. Therefore
\begin{align*}
1 < \liminf_{n \in S} \rho_n(H) \leq \liminf_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}} \norm{x - x^*}^2_{H \bar G_n^{-1}H}}{\norm{x - x^*}_H^4} = 1\,,
\end{align*}
which is a contradiction, and so we conclude that (\ref{eq:linear-lower-ass}) does not hold and so
\begin{align*}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.
\end{align*}
QED

We leave the proof of the corollary as an exercise for the reader. Essentially though, any consistent algorithm must choose its actions so that in expectation
\begin{align*}
\norm{x – x^*}^2_{\bar G_n^{-1}} \leq (1 + o(1)) \frac{\Delta_x^2}{2 \log(n)}\,.
\end{align*}
Now since $x^*$ will be chosen linearly often it is easily shown for sub-optimal $x$ that $\norm{x - x^*}_{\bar G_n^{-1}} / \norm{x}_{\bar G_n^{-1}} \to 1$ as $n \to \infty$. This leads to the required constraint on the actions of the algorithm, and the optimization problem in the corollary is derived by minimizing the regret subject to this constraint.
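
For small finite action sets the optimization problem defining $c(\cA, \theta)$ can be approximated numerically. Below is a rough sketch (not from the post) using an off-the-shelf nonlinear solver; the function name, the ridge term used to keep $H_\alpha$ invertible, and the starting point are our own choices, and the solver is not guaranteed to find the global optimum.

```python
import numpy as np
from scipy.optimize import minimize

def lower_bound_constant(actions, theta, ridge=1e-6):
    """Approximate c(A, theta): minimize sum_x alpha(x) Delta_x subject to
    ||x||_{H_alpha^{-1}}^2 <= Delta_x^2 / 2 for every sub-optimal x,
    where H_alpha = sum_x alpha(x) x x^T. A rough local-solver sketch."""
    actions = np.asarray(actions, dtype=float)
    theta = np.asarray(theta, dtype=float)
    gaps = actions @ theta
    gaps = gaps.max() - gaps
    suboptimal = np.where(gaps > 1e-12)[0]

    def H(alpha):
        return (actions.T * alpha) @ actions + ridge * np.eye(actions.shape[1])

    def objective(alpha):
        return float(alpha @ gaps)

    def constraint(alpha, i):
        x = actions[i]
        return gaps[i] ** 2 / 2 - x @ np.linalg.solve(H(alpha), x)

    cons = [{"type": "ineq", "fun": constraint, "args": (i,)} for i in suboptimal]
    alpha0 = np.full(len(actions), 10.0)
    res = minimize(objective, alpha0, bounds=[(0, None)] * len(actions),
                   constraints=cons, method="SLSQP")
    return res.fun, res.x

# Example: standard basis actions (the finite-armed case). Here the constraints
# decouple, the optimum is alpha_i = 2/Delta_i^2, and c(A, theta) = sum_i 2/Delta_i.
c, alpha = lower_bound_constant(np.eye(3), np.array([1.0, 0.5, 0.0]))
print(c)   # should be close to 2/0.5 + 2/1.0 = 6.0
```

The standard-basis check recovers the familiar finite-armed asymptotic constant, which is a convenient sanity test before trying richer action sets such as the example in the next section.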

Clouds looming for optimism

(Figure: the clouds are closing in.)

The theorem and its corollary have disturbing implications for strategies based on the principle of optimism in the face of uncertainty: they can never be asymptotically optimal! The reason is that these strategies do not choose actions for which they have collected enough statistics to prove they are sub-optimal, but in the linear setting it can still be worthwhile playing these actions in case they are very informative about other actions for which the statistics are not yet so clear. A problematic example already appears in the simplest case in which there is any information sharing between the arms at all, namely when the dimension is $d = 2$ and there are $K = 3$ arms.

Specifically, let $\cA = \set{x_1, x_2, x_3}$ where $x_1 = e_1$ and $x_2 = e_2$ and $x_3 = (1-\epsilon, \gamma \epsilon)$ where $\gamma \geq 1$ and $\epsilon > 0$ is small. Let $\theta = (1, 0)$ so that the optimal action is $x^* = x_1$ and $\Delta_{x_2} = 1$ and $\Delta_{x_3} = \epsilon$. Clearly if $\epsilon$ is very small, then $x_1$ and $x_3$ point in nearly the same direction and so choosing only these arms does not quickly reveal which of $x_1$ or $x_3$ is optimal. On the other hand, $x_2$ and $x_1 - x_3$ point in very different directions and so choosing $x_2$ allows a learning agent to quickly identify that $x_1$ is in fact optimal.

(Figure: a troubling example.)

We now show how the theorem and corollary demonstrate this. First we calculate what is the optimal solution to the optimization problem. Recall we are trying to minimize
\begin{align*}
\sum_{x \in \cA} \alpha(x) \Delta_x \qquad \text{subject to } \norm{x}^2_{H(\alpha)^{-1}} \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA\,,
\end{align*}
where $H(\alpha) = \sum_{x \in \cA} \alpha(x) x x^\top$. Clearly we should choose $\alpha(x_1)$ arbitrarily large; a computation then shows that
\begin{align*}
\lim_{\alpha(x_1) \to \infty} H(\alpha)^{-1} =
\left[\begin{array}{cc}
0 & 0 \\
0 & \frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)}
\end{array}\right]\,.
\end{align*}
Then the constraints mean that
\begin{align*}
&\frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_2}^2_{H(\alpha)^{-1}} \leq \frac{1}{2} \\
&\frac{\gamma^2 \epsilon^2}{\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_3}^2_{H(\alpha)^{-1}} \leq \frac{\epsilon^2}{2}\,.
\end{align*}
Provided that $\gamma \geq 1$ this reduces simply to the constraint that
\begin{align*}
\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2) \geq 2\gamma^2\,.
\end{align*}
Since we are minimizing $\alpha(x_2) + \epsilon \alpha(x_3)$ we can easily see that $\alpha(x_2) = 2\gamma^2$ and $\alpha(x_3) = 0$ provided that $2\gamma^2 \leq 2/\epsilon$. Therefore if $\epsilon$ is chosen sufficiently small relative to $\gamma$, then the optimal rate of the regret is $c(\cA, \theta) = 2\gamma^2$ and so there exists a strategy such that
\begin{align*}
\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = 2\gamma^2\,.
\end{align*}
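
The hand computation above is easy to check numerically. The snippet below (ours, with arbitrary values of $\gamma$ and $\epsilon$) plugs the allocation $\alpha(x_2) = 2\gamma^2$, $\alpha(x_3) = 0$ and a very large $\alpha(x_1)$ into the constraints: the $x_2$ constraint is comfortably slack while the $x_3$ constraint is essentially tight, so the cost is $2\gamma^2$.

```python
import numpy as np

gamma, eps = 3.0, 0.01   # illustrative values (our choice)
x1, x2, x3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1 - eps, gamma * eps])

def H(a1, a2, a3):
    return a1 * np.outer(x1, x1) + a2 * np.outer(x2, x2) + a3 * np.outer(x3, x3)

# Candidate allocation: alpha(x1) huge, alpha(x2) = 2*gamma^2, alpha(x3) = 0.
Hinv = np.linalg.inv(H(1e9, 2 * gamma**2, 0.0))
print(x2 @ Hinv @ x2, "vs", 1 / 2)        # constraint for x2: slack
print(x3 @ Hinv @ x3, "vs", eps**2 / 2)   # constraint for x3: essentially tight
# The cost alpha(x2)*Delta_{x2} + alpha(x3)*Delta_{x3} equals 2*gamma^2, matching c(A, theta).
```
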
Now we argue that, for $\gamma$ sufficiently large and $\epsilon$ arbitrarily small, the regret of any consistent optimistic algorithm is at least
\begin{align*}
\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = \Omega(1/\epsilon)\,,
\end{align*}
which can be arbitrarily worse than the optimal rate! So why is this so? Recall that optimistic algorithms choose
\begin{align*}
A_t = \argmax_{x \in \cA} \max_{\tilde \theta \in \cC_t} \inner{x, \tilde \theta}\,,
\end{align*}
where $\cC_t \subset \R^d$ is a confidence set that we assume contains the true $\theta$ with high probability. So far this does not greatly restrict the class of algorithms that we might call optimistic. We now assume that there exists a constant $c > 0$ such that
\begin{align*}
\cC_t \subseteq \set{\tilde \theta : \norm{\hat \theta_t – \tilde \theta}_{G_t} \leq c \sqrt{\log(n)}}\,.
\end{align*}
So now we ask how often can we expect the optimistic algorithm to choose action $x_2 = e_2$ in the example described above? Since we have assumed $\theta \in \cC_t$ with high probability we have that
\begin{align*}
\max_{\tilde \theta \in \cC_t} \shortinner{x_1, \tilde \theta} \geq 1\,.
\end{align*}
On the other hand, if $T_{x_2}(t-1) > 4c^2 \log(n)$, then
\begin{align*}
\max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta}
&= \max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta – \theta} \\
&\leq 2 c \norm{x_2}_{G_t^{-1}} \sqrt{\log(n)} \\
&\leq 2 c \sqrt{\frac{\log(n)}{T_{x_2}(t-1)}} \\
&< 1\,,
\end{align*}
which means that $x_2$ will not be chosen more than $1 + 4c^2 \log(n)$ times. So if $\gamma = \Omega(c^2)$, then the optimistic algorithm will not choose $x_2$ sufficiently often, and a simple computation shows it must choose $x_3$ at least $\Omega(\log(n)/\epsilon^2)$ times, suffering regret of $\Omega(\log(n)/\epsilon)$. The key take-away from this is that optimistic algorithms do not choose actions that are statistically sub-optimal, but for linear bandits it can be optimal to choose these actions more often to gain information about other actions.
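
To see this effect empirically one can simulate a LinUCB-style optimistic strategy on the example above. The sketch below is ours: the confidence width $c\sqrt{\log(n)}$, the ridge parameter and the particular values of $\epsilon$ and $\gamma$ are assumptions for illustration, not the tuning of any specific paper.

```python
import numpy as np

def linucb_pull_counts(n, eps, gamma, c=2.0, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    actions = np.array([[1.0, 0.0], [0.0, 1.0], [1 - eps, gamma * eps]])  # x1, x2, x3
    theta = np.array([1.0, 0.0])
    V = lam * np.eye(2)            # regularised Gram matrix G_t
    b = np.zeros(2)                # running sum of A_s * X_s
    counts = np.zeros(3, dtype=int)
    beta = c * np.sqrt(np.log(n))  # assumed confidence width
    for _ in range(n):
        Vinv = np.linalg.inv(V)
        theta_hat = Vinv @ b
        widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, Vinv, actions))
        a = int(np.argmax(actions @ theta_hat + beta * widths))  # optimistic choice
        x = actions[a]
        reward = x @ theta + rng.standard_normal()
        V += np.outer(x, x)
        b += reward * x
        counts[a] += 1
    return counts

print(linucb_pull_counts(n=10_000, eps=0.05, gamma=10.0))   # pulls of [x1, x2, x3]
```

In typical runs $x_2$ is chosen only on the order of $c^2\log(n)$ times, far fewer than the roughly $2\gamma^2\log(n)$ plays the optimal allocation from the corollary would devote to it.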

Notes

Note 1: The worst-case bound demonstrates the near-optimality of the OFUL algorithm for a specific action-set. It is an open question to characterize the optimal regret for a wide range of action-sets. We will return to these issues soon when we discuss adversarial linear bandits.

Note 2: There is an algorithm that achieves the asymptotic lower bound (see references below), but so far there is no algorithm that is simultaneously asymptotically optimal and (near) minimax optimal.

Note 3: The assumption that $x^*$ was unique in the asymptotic results can be relaxed at the price of a little more work, and simple (natural) modifications to the theorem statements.

References

Worst-case lower bounds for stochastic bandits have appeared in a variety of places, all with roughly the same bound, but for different action sets. Our very simple proof is new, but takes inspiration mostly from the paper by Shamir.

The asymptotic lower bound (along with a strategy for which the upper bound matches) is by the authors.

The example used to show optimistic approaches cannot achieve the optimal rate has been used before in the pure exploration setting where the goal is to simply find the best action, without the constraint that the regret should be small.

We should also mention that examples have been constructed before demonstrating the need for carefully balancing the trade-off between information and regret. For many examples, and a candidate design principle for addressing these issues see

The results for the unrealizable case are inspired by the work of one of the authors on the pareto regret frontier for bandits, which characterizes what trade-offs are available when it is desirable to have a regret that is unusually small relative to some specific arms.

High probability lower bounds

In the post on adversarial bandits we proved two high probability upper bounds on the regret of Exp-IX. Specifically, we showed:

Theorem: There exists a policy $\pi$ such that for all $\delta \in (0,1)$ for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1-\delta$
\begin{align}
\label{eq:high-prob}
\hat R_n(\pi,\nu) = O\left(\sqrt{Kn \log(K)} + \sqrt{\frac{Kn}{\log(K)}} \log\left(\frac{1}{\delta}\right)\right)\,.
\end{align}

We also gave a version of the algorithm that depended on $\delta \in (0,1)$:

Theorem: For all $\delta \in (0,1)$ there exists a policy $\pi$ such that for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1- \delta$,
\begin{align}
\label{eq:high-prob2}
\hat R_n(\pi,\nu) = O\,\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,.
\end{align}

The key difference between these results is the order of quantifiers. In the first we have a single algorithm and a high-probability guarantee that holds simultaneously for any confidence level. For the second result the confidence level must be specified in advance. The price for using the generic algorithm appears to be $\sqrt{\log(1/\delta)}$, which is usually quite small but not totally insignificant. The purpose of this post is to show that both bounds are tight up to constant factors, which implies that algorithms knowing the confidence level in advance really do have an advantage in terms of the achievable regret.

One reason why choosing the confidence level in advance is not ideal is that the resulting high-probability bound cannot be integrated to prove a bound in expectation. For algorithms satisfying (\ref{eq:high-prob}) the expected regret can be bounded by
\begin{align}
R_n \leq \int^\infty_0 \Prob{\hat R_n \geq x} dx = O(\sqrt{Kn \log(K)})\,.
\label{eq:integratedbound}
\end{align}
On the other hand, if the high-probability bound only holds for a single $\delta$ as in \eqref{eq:high-prob2}, then it seems hard to do much better than
\begin{align*}
R_n \leq n \delta + O\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,,
\end{align*}
which with the best choice of $\delta$ leads to a bound where an extra $\log(n)$ factor appears compared to \eqref{eq:integratedbound}. In fact, it turns out that this argument cannot be strengthened and algorithms with the strong high-probability regret cannot be near-optimal in expectation.

Our approach for proving this fact is very similar to what we did for the minimax lower bounds for stochastic bandits in a previous post. There are two differences between the adversarial setting and the stochastic setting that force us to work a little harder. The first is that for the adversarial-bandit upper bounds we have assumed the rewards are bounded in $[0,1]$, which was necessary in order to say anything at all. This means that our lower bounds should also satisfy this requirement, while in the stochastic lower bounds we used Gaussian rewards with an unbounded range. The second difference is that the stochastic lower bounds were given not for the regret but for what is known in the adversarial framework as the pseudo-regret, which reads
\begin{align*}
\bar R_n \doteq \max_i \EE{ \sum_{t=1}^n X_{ti} – \sum_{t=1}^n X_t }\,.
\end{align*}
In the stochastic setting, we have $\bar R_n = \sum_{i=1}^K \E[T_i(n)] \Delta_i$ and thus bounds on the pseudo-regret are possible by lower bounding the number of times an algorithm chooses sub-optimal arms on expectation. This is not enough to bound the regret, which depends also on the actual samples.

A lower bound on tail probabilities of pseudo-regret in stochastic bandits

Before we overcome these technicalities we describe the simple intuition by returning to the stochastic setting, using the pseudo-regret and relaxing the assumption that the rewards are bounded. It is important to remember that $\bar R_n$ is not a random variable because all the randomness is integrated away by the expectation. This means that it does not make sense to talk about high-probability results for $\bar R_n$, so we introduce another quantity,
\begin{align*}
\tilde R_n = \sum_{i=1}^K T_i(n) \Delta_i\,,
\end{align*}
which is a random variable through the pull counts $T_i(n)$ and which, for lack of a better name, we call the random pseudo-regret. For the subsequent results we let $\cE$ denote the set of Gaussian bandits with sub-optimality gaps bounded by one.

Theorem: Fix the horizon $n>0$, number of arms $K>1$, a constant $C>0$ and a policy $\pi$. Assume that for any $\nu \in \cE$ bandit environment,
\begin{align*}
R_n(\pi,\nu) \leq C \sqrt{(K-1)n}\,,
\end{align*}
Let $\delta \in (0,1)$. Then, there exists a bandit $\nu$ in $\cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}

It follows that if the result can be transferred to the adversarial setting, there will be little or no room for improving \eqref{eq:high-prob}.
Proof: Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and define
\begin{align*}
\mu_i = \begin{cases}
\Delta, & \text{if } i = 1\,; \\
0, & \text{otherwise}\,,
\end{cases}
\end{align*}
As usual, let $R_n = R_n(\pi,\nu)$ for $\nu = (\cN(\mu_i,1))_{i\in [K]}$. Let $\PP=\PP_{\nu,\pi}$ and $\E=\E_{\nu,\pi}$. Let $i = \argmin_{j>1} \E[T_j(n)]$. Then, thanks to
\begin{align*}
C \sqrt{(K-1)n} \ge R_n = \Delta \sum_{j>1} \E[T_j(n)] \ge \Delta (K-1)\min_{j>1} \E[T_j(n)]
\end{align*}
we find that
\begin{align}
\E[T_i(n)] \leq \frac{C}{\Delta} \sqrt{\frac{n}{K-1}}\,.
\label{eq:hpproofarmusage}
\end{align}

Define $\mu’ \in \R^K$ by
\begin{align*}
\mu’_j = \begin{cases}
\mu_j\,, & \text{if } j \neq i; \\
2\Delta\,, & \text{otherwise}
\end{cases}
\end{align*}
and let $\nu'=(\cN(\mu_j',1))_{j\in [K]}$ be the Gaussian bandit with means $\mu'$ and abbreviate $\PP'=\PP_{\nu',\pi}$ and $\E'=\E_{\nu',\pi}$. Thus in $\nu'$ action $i$ is better than any other action by at least $\Delta$. Let $\tilde R_n = \sum_{j=1}^K T_j(n) \Delta_j$ and $\tilde R_n' = \sum_{j=1}^K T_j(n) \Delta_j'$ be the random pseudo-regret in $\nu$ and $\nu'$ respectively, where $\Delta_j = \max_k \mu_k-\mu_j = \one{j\ne 1}\Delta$ and $\Delta_j'=\max_k \mu_k'-\mu_j' \ge \one{i\ne j} \Delta$. Hence,
\begin{align*}
\tilde R_n & \ge T_i(n) \Delta_i \ge \one{T_i(n)\ge n/2} \frac{\Delta n}{2}\,, \qquad\text{ and }\\
\tilde R_n' & \ge \Delta \sum_{j\ne i} T_j(n) = \Delta (n-T_i(n)) \ge \one{T_i(n) < n/2} \frac{\Delta n}{2}\,.
\end{align*}
Hence, $T_i(n)\ge n/2 \Rightarrow \tilde R_n \ge \frac{\Delta n}{2}$ and $T_i(n) < n/2 \Rightarrow \tilde R_n' \ge \frac{\Delta n}{2}$, implying that
\begin{align}
\max\left(\Prob{ \tilde R_n \ge \frac{\Delta n}{2} }, \PP'\left(\tilde R_n' \ge \frac{\Delta n}{2} \right)\right) \ge \frac12 \left( \Prob{T_i(n) \ge n/2} +\PP'\left(T_i(n) < n/2 \right) \right)\,.
\label{eq:hprlb}
\end{align}
By the high probability Pinsker lemma, the divergence decomposition identity (earlier we called this the information processing lemma) and \eqref{eq:hpproofarmusage} we have
\begin{align}
\Prob{T_i(n) \geq n/2} + \mathbb P’\left(T_i(n) < n/2\right)
&\geq \frac{1}{2} \exp\left(-\KL(\mathbb P, \mathbb P’)\right) \nonumber \\
&= \frac{1}{2} \exp\left(-2\E[T_i(n)] \Delta^2\right) \nonumber \\
&\geq \frac{1}{2} \exp\left(-2C \Delta\sqrt{\frac{n}{K-1}}\right)\,.
\label{eq:hppinskerlb}
\end{align}
To enforce that the right-hand side of the above display is at least $2\delta$, we choose
\begin{align*}
\Delta = \min\set{\frac{1}{2},\,\frac{1}{2C} \sqrt{\frac{K-1}{n}} \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Putting \eqref{eq:hprlb} and \eqref{eq:hppinskerlb} together we find that either
\begin{align*}
&\Prob{\tilde R_n \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta \\
\text{or}\qquad &
\mathbb P’\left(\tilde R’_n \geq \frac{1}{4}\min\set{n,\,\frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}\right) \geq \delta\,.
\end{align*}
QED.

From this theorem we can derive two useful corollaries.

Corollary: Fix $n>0$, $K>1$. For any policy $\pi$ and $\delta \in (0,1)$ small enough that
\begin{align}
n\delta \leq \sqrt{n (K-1) \log\left(\frac{1}{4\delta}\right)} \label{eq:hplbdelta}
\end{align}
there exists a bandit problem $\nu\in\cE$ such that
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}}} \geq \delta\,.
\label{eq:hplbsmalldelta}
\end{align}

Proof: We prove the result by contradiction. Assume that the conclusion does not hold for $\pi$ and take $\delta$ that satisfies \eqref{eq:hplbdelta}. Then, for any bandit problem $\nu\in \cE$ the expected regret of $\pi$ is bounded by
\begin{align*}
R_n(\pi,\nu) \leq n\delta + \frac{1}{4}\sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}
\leq \sqrt{2n(K-1) \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Therefore $\pi$ satisfies the conditions of the previous theorem with $C =\sqrt{2 \log(\frac{1}{4\delta})}$, which implies that there exists some bandit problem $\nu\in\cE$ such that \eqref{eq:hplbsmalldelta} holds, contradicting our assumption.
QED

Corollary: Fix any $K>1$, $p \in (0,1)$ and $C > 0$. Then, for any policy $\pi$ there exists $n>0$ large enough, $\delta\in(0,1)$ small enough and a bandit environment $\nu\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} \geq \delta\,.
\end{align*}

Proof: Again, we proceed by contradiction. Suppose that a policy $\pi$ exists for which the conclusion does not hold. Then, for any $n>0$ and environment $\nu\in \cE$,
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} < \delta \label{eq:hpexplbp}
\end{align}
and therefore, for any $n>0$, the expected $n$-round regret of $\pi$ on $\nu$ is bounded by
\begin{align*}
R_n(\pi,\nu)
\leq \int^\infty_0 \Prob{ \tilde R_n(\pi,\nu) \geq x} dx
\leq C \sqrt{n(K-1)} \int^\infty_0 \exp\left(-x^{1/p}\right) dx
\leq C \sqrt{n(K-1)}\,.
\end{align*}
Therefore, by the previous theorem, for any $n>0$, $\delta\in (0,1)$ there exists a bandit $\nu_{n,\delta}\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu_{n,\delta}) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}
For $\delta$ small enough, $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right) \ge C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)$ and then choosing $n$ large enough so that $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)\le n$, we find that on the environment $\nu=\nu_{n,\delta}$,
\begin{align*}
\delta &\le \Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \\
& \le \Prob{ \tilde R_n(\pi,\nu) \geq C\sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right) }\,,
\end{align*}
contradicting \eqref{eq:hpexplbp}.
QED

A lower bound on tail probabilities of regret in adversarial bandits

So how do we transfer this argument to the case where the rewards are bounded and the regret is used, rather than the pseudo-regret? For the first we can simply shift the means to be close to $1/2$ and truncate or “clip” the rewards that (by misfortune) end up outside the allowed range. To deal with the regret there are two options. Either one adds an additional step to show that the regret and the pseudo-regret concentrate sufficiently fast (which they do), or one correlates the losses across the actions. The latter is the strategy that we will follow here.

We start with an observation. Our goal is to show that there exists a reward sequence $x=(x_1,\dots,x_n)\in [0,1]^{nK}$ such that the regret $\hat R_n=\max_i \sum_t x_{ti} - x_{t,A_t}$ is above some threshold $u>0$ with a probability exceeding a prespecified value $\delta\in (0,1)$. For this we want to argue that it suffices to show this when the rewards are randomly chosen. Similarly to the stochastic case we define the “extended” canonical bandit probability space. Since the regret in adversarial bandits depends on non-observed rewards, the outcome space of the extended canonical probability space is $\Omega_n = \R^{nK}\times [K]^n$ and now $X_t: \Omega_n \to \R^K$ and $A_t: \Omega_n \to [K]$ are $X_t(x,a) = x_t$ and $A_t(x,a) = a_t$, where we use the convention that $x= (x_1,\dots,x_n)$ and $a=(a_1,\dots,a_n)$. We also let $\hat R_n = \max_i \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_{t,A_t}$ and define $\PP_{Q,\pi}$ to be the joint distribution of $(X_1,\dots,X_n,A_1,\dots,A_n)$ arising from the interaction of $\pi$ with $X\sim Q$. Finally, as we often need it, for a fixed $\nu\in \R^{nK}$ we abbreviate $\PP_{\delta_\nu,\pi}$ to $\PP_{\nu,\pi}$ where $\delta_\nu$ is the Dirac (probability) measure over $\R^{nK}$ (i.e., $\delta_{\nu}(U) = \one{\nu \in U}$ for $U\subset \R^{nK}$ Borel).

Lemma (Randomization device): For any $Q$ probability measure over $\R^{nK}$, any policy $\pi$, $u\in \R$ and $\delta\in (0,1)$,
\begin{align}\label{eq:pqpidelta}
\PP_{Q,\pi}( \hat R_n \geq u ) \geq \delta \implies
\exists \nu \in\mathrm{support}(Q) \text{ such that } \PP_{\nu,\pi}(\hat R_n \geq u) \geq \delta\,.
\end{align}

The lemma is proved by noting that $\PP_{Q,\pi}$ can be disintegrated into the “product” of $Q$ and $\{\PP_{\nu,\pi}\}_{\nu}$. The proof is given at the end of the post.

Given this machinery, let us get into the proof. Fix a policy $\pi$, $n>0$, $K>1$ and a $\delta\in (0,1)$. Our goal is to find some $u>0$ and a reward sequence $x\in [0,1]^{nK}$ such that the random regret of $\pi$ while interacting with $x$ is above $u$ with probability exceeding $\delta$. For this, we will define two reward distributions $Q$ and $Q’$, and show for (at least) one of $\PP_{Q,\pi}$ or $\PP_{Q’,\pi}$ that the probability of $\hat R_n \ge u$ exceeds $\delta$.

Instead of the canonical probability models we will find it more convenient to work with two sequences $(X_t,A_t)_t$ and $(X_t’,A_t’)_t$ of reward-action pairs defined over a common probability space. These are constructed as follows: We let $(\eta_t)_t$ be an i.i.d. sequence of $\mathcal N(0,\sigma^2)$ Gaussian variables and then let
\begin{align*}
X_{tj} = \clip( \mu_j + \eta_t )\,, \qquad X_{tj}’ = \clip( \mu_j’ + \eta_t)\, \qquad (t\in [n],j\in [K])\,,
\end{align*}
where $\clip(x) = \max(\min(x,1),0)$ clips its argument to $[0,1]$, and for some $\Delta\in (0,1/4]$ to be chosen later,
\begin{align*}
\mu_j = \frac12 + \one{j=1} \Delta, \qquad (j\in [K])\,.
\end{align*}
The “means” $(\mu_j’)_j$ will also be chosen later. Note that apart from clipping, $(X_{ti})_t$ (and also $(X_{ti}’)_t$) is merely a shifted version of $(\eta_t)_t$. In particular, thanks to $\Delta>0$, $X_{t1}\ge X_{tj}$ for any $t,j$. Moreover, $X_{t1}$ exceeds $X_{tj}$ by $\Delta$ when none of them is clipped:
\begin{align}
X_{t1}\ge X_{tj} + \Delta\, \one{\eta_t\in [-1/2,1/2-\Delta]}\,, \quad t\in [n], j\ne 1\,.
\label{eq:hplb_rewardgapxt}
\end{align}

Now, define $(A_t)_t$ to be the random actions that arise from the interaction of $\pi$ and $(X_t)_t$ and let $i = \argmin_{j>1} \EE{ T_j(n) }$ where $T_i(n) = \sum_{t=1}^n \one{A_t=i}$. As before, $\EE{T_i(n)}\le n/(K-1)$. Choose
\begin{align*}
\mu_j’ = \mu_j + \one{j=i}\, 2\Delta\,, \quad j\in [K]\,
\end{align*}
so that $X_{ti}’\ge X_{tj}’$ for $j\in [K]$ and furthermore
\begin{align}
X_{ti}’\ge X_{tj}’ + \Delta\, \one{\eta_t\in [-1/2,1/2-2\Delta]}\,, \quad t\in [n], j\ne i\,.
\label{eq:hplb_rewardgapxtp}
\end{align}
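
The construction is easy to reproduce in code. The sketch below (ours, with arbitrary parameter values) generates the two clipped reward sequences with shared noise, checks that the best arm dominates in every single round, and reports the fraction of rounds on which no clipping occurs (the event behind $U_t'$).

```python
import numpy as np

n, K, sigma, Delta, i = 1000, 5, 0.1, 0.05, 3   # illustrative values (our choice)
rng = np.random.default_rng(0)
eta = sigma * rng.standard_normal(n)            # noise shared by all arms in a round

mu = np.full(K, 0.5); mu[0] += Delta            # arm 1 (index 0) is best in the first environment
mu_p = mu.copy(); mu_p[i] = 0.5 + 2 * Delta     # arm i is best in the second environment

X = np.clip(mu[None, :] + eta[:, None], 0.0, 1.0)     # X[t, j]:  rewards under mu
Xp = np.clip(mu_p[None, :] + eta[:, None], 0.0, 1.0)  # X'[t, j]: rewards under mu'

# Because the noise is shared and clipping is monotone, the best arm has the best
# reward in every single round, not just the best mean.
print(np.all(X[:, [0]] >= X), np.all(Xp[:, [i]] >= Xp))

# Fraction of rounds in which no reward is clipped (cf. U'_t in the text).
unclipped = (eta >= -0.5) & (eta <= 0.5 - 2 * Delta)
print(unclipped.mean())
```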

Denote by $\hat R_n = \max_j \sum_t X_{tj} – \sum_t X_{t,A_t}$ the random regret of $\pi$ when interacting with $X = (X_1,\dots,X_n)$ and let $\hat R_n’ = \max_j \sum_t X’_{tj} – \sum_t X’_{t,A_t’}$ the random regret of $\pi$ when interacting with $X’ = (X_1′,\dots,X_n’)$.
Thus, it suffices to prove that either $\Prob{ \hat R_n \ge u }\ge \delta$ or $\Prob{ \hat R_n’ \ge u} \ge \delta$.

Illustration of the idea of the lower bound proof. An elf randomly chooses a pair of environments from the bag of all possible clipped Gaussian environments and feeds the environment into the policy, which is also randomized. The policy spits out the random regret for both environments. For at least one of the two environments, many of the random regrets will be high with a high chance.

By our earlier remarks, $\hat R_n = \sum_t X_{t1} – \sum_t X_{t,A_t}$ and $\hat R_n’ = \sum_t X_{ti}’ – \sum_t X_{t,A_t}’$. Define $U_t =\one{\eta_t\in [-1/2,1/2-\Delta]}$, $U_t’=\one{\eta_t\in [-1/2,1/2-2\Delta]}$, $A_{tj} = \one{A_t=j}$ and $A_{tj}’ = \one{A_t’=j}$. From \eqref{eq:hplb_rewardgapxt} we see that
\begin{align*}
\hat R_n \ge \Delta\, \sum_t \one{A_t\ne 1} U_t =\Delta\, \sum_t (1-A_{t1}) U_t \ge \Delta\,(U – T_1(n)) \ge \Delta\,(U + T_i(n) – n)\,,
\end{align*}
where we also defined $U = \sum_t U_t$ and used that $T_1(n)+T_i(n)\le n$. Note that $U_t=1$ indicates that $(X_{tj})_j$ are “unclipped”. Similarly, from \eqref{eq:hplb_rewardgapxtp} we see that
\begin{align*}
\hat R_n’ \ge \Delta \, \sum_t \one{A_t’\ne i} U_t’ =\Delta\, \sum_t (1-A_{ti}’) U_t’ \ge \Delta\,( U’ – T_i'(n)) \,,
\end{align*}
where $T_i'(n)=\sum_t A_{ti}’$ and $U’ = \sum_t U_t’$. Based on the lower bounds on $\hat R_n$ and $\hat R_n’$ we thus see that if $T_i(n)\ge n/2$ and $U\ge 3n/4$ then $\hat R_n \ge u\doteq \frac{n\Delta}{4}$ and if $T_i'(n) < n/2$ and $U'\ge 3n/4$ then $\hat R_n' \ge u$ holds, too. Thus, from union bounds, \begin{align*} \Probng{ \hat R_n \ge u } &\ge \Prob{ T_i(n)\ge n/2, U\ge 3n/4 } \ge \Prob{ T_i(n)\ge n/2 } - \Prob{U < 3n/4}\,,\\ \Probng{ \hat R_n' \ge u } &\ge \Prob{ T_i'(n) < n/2, U'\ge 3n/4 } \ge \Prob{ T_i'(n) < n/2 } - \Prob{U' < 3n/4}\,. \end{align*} Noticing that $U'\le U$ and hence $\Prob{U<3n/4}\le \Prob{U' <3n/4}$, we get \begin{align*} \max(\Probng{ \hat R_n \ge u },\Probng{ \hat R_n' \ge u }) & \ge \frac12 \Bigl(\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }\Bigr) - \Prob{U' < 3n/4}\,. \end{align*} The sum $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ will be lower bounded with the help of the high-probability Pinsker inequality. For an upper bound on $\Prob{U’ < 3n/4}$, we have the following technical lemma:

Lemma (Control of the number of unclipped rounds): Assume that $\Delta\le 1/8$. If $n \ge 32 \log(2/\delta)$ and $\sigma\le 1/10$ then $\Prob{U’<3n/4} \le \delta/2$.

The proof is based on bounding the tail probability of the mean of the i.i.d. Bernoulli variables $(U_t')_t$ using Hoeffding's inequality and is given later. Intuitively, it is clear that by making $\sigma^2$ small, the number of times $\eta_t$ falls inside $[-1/2,1/4]\subset [-1/2,1/2-2\Delta]$ can be made arbitrarily high with arbitrarily high probability.
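
A quick Monte Carlo check of the lemma (ours, with illustrative values of $\delta$ and $\sigma$) estimates $\Prob{U' < 3n/4}$ at the smallest $n$ the lemma allows; the empirical frequency is essentially zero, comfortably below $\delta/2$.

```python
import numpy as np

delta, sigma, Delta = 0.05, 0.1, 1 / 8            # illustrative values (our choice)
n = int(np.ceil(32 * np.log(2 / delta)))          # smallest n with n >= 32 log(2/delta)
rng = np.random.default_rng(0)
reps = 20_000
eta = sigma * rng.standard_normal((reps, n))
U_prime = ((eta >= -0.5) & (eta <= 0.5 - 2 * Delta)).sum(axis=1)
print((U_prime < 3 * n / 4).mean(), "should be <=", delta / 2)
```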

Our goal now is to lower bound $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ by $3\delta$. As suggested before, we aim to use the high-probability Pinsker inequality. One difficulty that we face is that the events $\{T_i(n)\ge n/2\}$ and $\{T_i'(n) < n/2\}$ may not be complementary as they are defined in terms of a potentially distinct set of random variables. This will be overcome by rewriting the above probabilities using the canonical bandit probability spaces. In fact, we will use the non-extended version of these probability spaces (as defined earlier in the context of stochastic bandits). The reason for this is that after Pinsker, we plan to apply the divergence decomposition identity, which decomposes the divergence between distributions of action-reward sequences.

To properly write things let $Q_j$ denote the distribution of $X_{tj}$ and similarly let $Q_j'$ be the distribution of $X_{tj}'$. These are well defined (why?). Define the stochastic bandits $\beta=(Q_1,\dots,Q_K)$ and $\beta'=(Q'_1,\dots,Q'_K)$. Let $\Omega_n = ([K]\times \R)^n$ and let $\tilde Y_t,\tilde A_t:\Omega_n \to \R$ be the usual coordinate projection functions: $\tilde Y_t(a_1,y_1,\dots,a_n,y_n) = y_t$ and $\tilde A_t(a_1,y_1,\dots,a_n,y_n)=a_t$. Also, let $\tilde T_i(n) = \sum_{t=1}^n \one{\tilde A_t=i}$. Recall that $\PP_{\beta,\pi}$ denotes the probability measure over $\Omega_n$ that arises from the interaction of $\pi$ and $\beta$ (ditto for $\PP_{\beta',\pi}$). Now, since $T_i(n)$ is only a function of $(A_1,X_{1,A_1},\dots,A_n,X_{n,A_n})$ whose probability distribution is exactly $\PP_{\beta,\pi}$, we have
\begin{align*}
\Prob{ T_i(n) \ge n/2 } = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 )\,.
\end{align*}
Similarly,
\begin{align*}
\Prob{ T_i'(n) < n/2 } = \PP_{\beta',\pi}( \tilde T_i(n) < n/2 )\,.
\end{align*}
Now, by the high-probability Pinsker inequality and the divergence decomposition lemma,
\begin{align*}
\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }
& = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 ) + \PP_{\beta',\pi}( \tilde T_i(n) < n/2 ) \\
& \ge \frac12 \exp\left(- \KL( \PP_{\beta,\pi}, \PP_{\beta',\pi} ) \right) \\
& = \frac12 \exp\left(- \E_{\beta,\pi}[\tilde T_i(n)] \KL( Q_i, Q_i' ) \right) \\
& \ge \frac12 \exp\left(- \EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) ) \right)\,,
\end{align*}
where in the last equality we used that $Q_j=Q_j'$ unless $j=i$, while in the last step we used $\E_{\beta,\pi}[\tilde T_i(n)] = \EE{ T_i(n) }$ and also that $\KL( Q_i, Q_i' ) \le \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )$. Where does the last inequality come from, one might ask. The answer is the truncation, which always reduces information. More precisely, let $P$ and $Q$ be probability measures on the same probability space $(\Omega, \cF)$. Let $X:\Omega \to \R$ be a random variable and $P_X$ and $Q_X$ be the laws of $X$ under $P$ and $Q$ respectively. Then $\KL(P_X, Q_X) \leq \KL(P, Q)$.

Now, by the choice of $i$,
\begin{align*}
\EE{ T_i(n) } \KL\left( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) \right)
\le \frac{n}{K-1} \cdot \frac{2\Delta^2}{\sigma^2}\,.
\end{align*}
Plugging this into the previous display we get that if
\begin{align*}
\Delta = \sigma \sqrt{\frac{K-1}{2n} \log\frac{1}{6\delta}}
\end{align*}
then $\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }\ge 3\delta$ and thus $\max(\Prob{ \hat R_n \ge u },\Prob{ \hat R_n'\ge u} )\ge \delta$. Recalling the definition $u = n\Delta/4$ and choosing $\sigma=1/10$ gives the following result:

Theorem (High probability lower bound for adversarial bandits): Let $K>1$, $n>0$ and $\delta\in (0,1)$ such that
\begin{align*}
n\ge \max\left( 32 \log \frac2{\delta}, (0.8)^2 \frac{K-1}{2} \log \frac{1}{6\delta}\right)
\end{align*}
holds. Then, for any bandit policy $\pi$ there exists a reward sequence $\nu = (x_1,\dots,x_n)\in [0,1]^{nK}$ such that if $\hat R_n$ is the random regret of $\pi$ when interacting with the environment $\nu$ then
\begin{align*}
\Prob{ \hat R_n \ge 0.025\sqrt{\frac{n(K-1)}{2}\log \frac{1}{6\delta}} } \ge \delta\,.
\end{align*}

So what can one take away from this post? The main thing is to recognize that the upper bounds we proved in the previous post cannot be improved very much, at least in this worst case sense. This includes the important difference between the high-probability regret that is achievable when the confidence level $\delta$ is chosen in advance and what is possible if a single strategy must satisfy a high-probability regret guarantee for all confidence levels simultaneously.

Besides this result we also introduced some new techniques that will be revisited in the future, especially the randomization device lemma. The advantage of using clipped and correlated Gaussian rewards is that it ensures the same arm is always optimal, no matter how the noise behaves.

Technicalities

The purpose of this section is to lay to rest the two technical results required in the main body: first, a proof of the lemma which gives us the randomization technique or “device”, and afterwards a proof of the lemma that controls the number of unclipped rounds.

Proof of the randomization device lemma

The argument underlying this goes as follows: If $A=(A_1,\dots,A_n)\in [K]^n$ are the actions of a stochastic policy $\pi=(\pi_1,\dots,\pi_n)$ when interacting with the environment where the rewards $X=(X_1,\dots,X_n)\in \R^{nK}$ are drawn from $Q$ then for $t=1,\dots,n$, $A_t\sim \pi_t(\cdot|A_1,X_{1,A_1},\dots,A_{t-1},X_{t-1,A_{t-1}})$ and thus the distribution $\PP_{Q,\pi}$ of $(X,A)$ satisfies
\begin{align*}
d\PP_{Q,\pi}(x,a)
&= \pi(a|x) d\rho^{\otimes n}(a) dQ(x) \,,
\end{align*}
where $a=(a_1,\dots,a_n)\in [K]^n$, $x = (x_1,\dots,x_n)\in \R^{nK}$,
\begin{align*}
\pi(a|x)\doteq
\pi_1(a_1) \pi_2(a_2|a_1,x_{1,a_1}) \cdots \pi_n(a_n|a_1,x_{1,a_1}, \dots,a_{n-1},x_{n-1,a_{n-1}})
\end{align*}
and $\rho^{\otimes n}$ is the $n$-fold product of $\rho$ with itself, where $\rho$ is the counting measure on $[K]$. Letting $\delta_x$ be the Dirac (probability) measure on $\R^{nK}$ concentrated at $x$ (i.e., $\delta_x(U) = \one{x\in U}$), we have that $\PP_{Q,\pi}$ can be disintegrated into $Q$ and $\{\PP_{\delta_x,\pi}\}_x$. In particular, a direct calculation verifies that
\begin{align}
d\PP_{Q,\pi}(x,a) = \int_{y\in \R^{nK}} dQ(y) \, d\PP_{\delta_y,\pi}(x,a) \,.
\label{eq:disintegration}
\end{align}

Let $(X_t,A_t)$ be the reward and action of round $t$ in the extended canonical bandit probability space and $\hat R_n$ the random regret defined in terms of these random variables. For any Borel $U\subset \R$,
\begin{align*}
\PP_{Q,\pi}( \hat R_n \in U )
&= \int \one{\hat R_n(x,a)\in U} d\PP_{Q,\pi}(x,a) \\
&= \int_{\R^{nK}} dQ(y) \left( \int_{\R^{nK}\times [K]^n} \one{\hat R_n(x,a)\in U} d\PP_{\delta_y,\pi}(x,a) \right)\\
&= \int_{\R^{nK}} dQ(y) \PP_{\delta_y,\pi}( \hat R_n\in U )\,,
\end{align*}
where the second equality uses \eqref{eq:disintegration} and Fubini. From the above equality it is obvious that it is not possible that $\PP_{Q,\pi}( \hat R_n \in U )\ge \delta$ while for all $y\in \mathrm{support}(Q)$, $\PP_{\delta_y,\pi}( \hat R_n\in U )<\delta$, thus finishing the proof. QED

Proof of lemma controlling the number of unclipped rounds: First note that $U'\le U$ and thus $\Prob{U < 3n/4}\le \Prob{U' < 3n/4}$, hence it suffices to control the latter. Since $\Delta\le 1/8$ and $\eta_t$ is Gaussian with zero mean and variance $\sigma^2$, and in particular $\eta_t$ is $\sigma^2$-subgaussian, we have
\begin{align*}
1 - p = \Prob{U_t' = 0} & \leq \Prob{ |\eta_t| > 1/2-2\Delta }
\leq 2 \exp\left(-\frac{\left(1/2 – 2\Delta\right)^2}{2\sigma^2}\right)\\
& \leq 2 \exp\left(-\frac{1}{2 (4)^2 \sigma^2}\right)
\le \frac{1}{8}\,,
\end{align*}
where the last inequality follows whenever $\sigma^2 \le \frac{1}{32 \log 16}$ which is larger than $0.01$. Therefore $p \ge 7/8$ and
\begin{align*}
\Prob{\sum_{t=1}^n U_t' < \frac{3n}{4}} &= \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) < -(p-\frac{3}{4}) } \\
&\le \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) \le - \frac{1}{8}} \\
&\leq \exp\left(-\frac{n}{32}\right) \leq \frac{\delta}{2}\,,
\end{align*}
where the second last inequality uses Hoeffding's bound together with the fact that $U_t'-p$ is $1/4$-subgaussian, and the last holds by our assumption on $n$.
QED

Notes

Note 1: It so happens that the counter-example construction we used means that the same arm has the best reward in every round (not just the best mean). It is perhaps a little surprising that algorithms cannot exploit this fact, in contrast to the experts setting where this knowledge enormously improves the achievable regret.

Note 2: Adaptivity is all the rage right now. Can you design an adversarial bandit algorithm that exploits “easy data” when available? For example, if the rewards don’t change much over time, or lie in a small range. There are still a lot of open questions in this area. The paper referenced below gives lower bounds for some of these situations.

References

Sebastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. 2016