Lower Bounds for Stochastic Linear Bandits

Lower bounds for linear bandits turn out to be more nuanced than the finite-armed case. The big difference is that for linear bandits the shape of the action set plays a role in the form of the regret, not just the distribution of the noise. This should not come as a big surprise because the stochastic finite-armed bandit problem can be modeled as a linear bandit with actions being the standard basis vectors, $\cA = \set{e_1,\ldots,e_K}$. In this case the actions are orthogonal, which means that samples from one action do not give information about the rewards for other actions. Other action sets such as the sphere ($\cA = S^d = \{x \in \R^d : \norm{x}_2 = 1\}$) do not share this property. For example, if $d = 2$ and $\cA = S^d$ and an algorithm chooses actions $e_1 = (1,0)$ and $e_2 = (0,1)$ many times, then it can accurately estimate $\theta_1$ and $\theta_2$, and hence the expected reward of every other action.

We will prove a variety of lower bounds under different assumptions. The first three have a worst-case flavor, showing what is (not) achievable in general, or under a sparsity constraint, or if the realizability assumption is not satisfied. All of these are proven using the same information-theoretic tools that we have seen in previous results, combined with careful choices of action sets and environment classes. The difficulty is always in guessing the worst case; after that it is a matter of turning the crank on the usual machinery.

Besides the worst-case results we also give an optimal asymptotic lower bound for finite action sets that generalizes the asymptotic lower bound for finite-armed stochastic bandits given in a previous post. The proof of this result is somewhat more technical, but has the same general flavor as the previous asymptotic lower bounds.

We use a simple model with Gaussian noise. For action $A_t \in \cA \subseteq \R^d$ the reward is $X_t = \shortinner{A_t, \theta} + \eta_t$ where $\eta_t\sim \mathcal N(0,1)$ is the standard Gaussian noise term and $\theta \in \Theta \subset \R^d$. The regret of a strategy is:
\begin{align*}
R_n(\cA, \theta) = \E\left[\sum_{t=1}^n \shortinner{x^* - A_t, \theta}\right]\,,
\end{align*}
where $x^* = \argmax_{x \in \cA} \shortinner{x, \theta}$ is the optimal action. Note that the arguments to the regret function differ from those used in some previous posts. In general we include the quantities of interest, which in this case (with a fixed strategy and noise model) are the action set and the unknown parameter. As for finite-armed bandits we define the sub-optimality gap of arm $x \in \cA$ by $\Delta_x = \max_{y \in \cA} \shortinner{y - x, \theta}$ and $\Delta_{\min} = \inf \set{\Delta_x : x \in \cA \text{ and } \Delta_x > 0}$. Note that the latter quantity can be zero if $\cA$ is infinite, but is strictly positive if there are finitely many arms and the problem is non-trivial (there exists a sub-optimal arm). If $A$ is a matrix, then $\lambda_{\min}(A)$ denotes its smallest eigenvalue. We also recall the notation used for finite-armed bandits by defining $T_x(t) = \sum_{s=1}^t \one{A_s = x}$.
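To make the setup concrete, here is a minimal Python sketch of this environment. The class name, the uniformly random strategy and all constants are our own illustrative assumptions, not anything prescribed by the text:

```python
import numpy as np

class LinearBandit:
    """Stochastic linear bandit with unit-variance Gaussian noise."""

    def __init__(self, actions, theta, rng=None):
        self.actions = np.asarray(actions, dtype=float)  # shape (K, d)
        self.theta = np.asarray(theta, dtype=float)      # shape (d,)
        self.rng = rng or np.random.default_rng(0)
        self.means = self.actions @ self.theta
        self.gaps = self.means.max() - self.means        # Delta_x for each action

    def pull(self, k):
        """Play the action with index k and return the noisy reward X_t."""
        return self.means[k] + self.rng.normal()

# Example: the finite-armed case, A = {e_1, ..., e_K}
K, n = 5, 1000
env = LinearBandit(np.eye(K), theta=np.linspace(0.0, 0.5, K))
pseudo_regret = 0.0
for t in range(n):
    k = env.rng.integers(K)        # a uniformly random strategy, for illustration only
    _ = env.pull(k)
    pseudo_regret += env.gaps[k]
print(pseudo_regret)               # expected value is n * mean(gaps)
```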

Worst case bounds

Our worst-case bound relies on a specific action set and shows that the $\tilde O(d \sqrt{n})$ upper bound for the linear version of UCB cannot be improved in general, except (most likely) for the logarithmic factors.

Theorem Let the action set be $\mathcal A = \set{-1, 1}^d$. Then for any strategy there exists a $\theta \in \Theta = \set{-\sqrt{1/n}, \sqrt{1/n}}^d$ such that
\begin{align*}
R_n(\cA, \theta) \geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.
\end{align*}

Proof
For $\theta \in \Theta$ we let $\mathbb{P}_\theta$ denote the measure on outcomes $A_1, X_1,\ldots,A_n,X_n$ induced by the interaction of the strategy and the bandit determined by $\theta$. By the relative entropy identity we have for $\theta, \theta' \in \Theta$ that
\begin{align}
\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{t=1}^n \E_\theta\left[\shortinner{A_t, \theta - \theta'}^2\right]\,.
\label{eq:linear-kl}
\end{align}
For $1 \leq i \leq d$ and $\theta \in \Theta$ define
\begin{align*}
p_{\theta,i} = \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2\right)\,.
\end{align*}
Now let $1 \leq i \leq d$ and $\theta \in \Theta$ be fixed and let $\theta' = \theta$ except for $\theta'_i = -\theta_i$. Then by the high probability version of Pinsker's inequality and (\ref{eq:linear-kl}) we have
\begin{align}
\label{eq:sphere-kl}
p_{\theta,i} + p_{\theta',i}
&\geq \frac{1}{2} \exp\left(-\frac{1}{2}\sum_{t=1}^n \shortinner{A_t, \theta - \theta'}^2\right)
= \frac{1}{2} \exp\left(-2\right)\,.
\end{align}
Therefore using the notation $\sum_{\theta_{-i}}$ as an abbreviation for $\sum_{\theta_1,\ldots,\theta_{i-1},\theta_{i+1},\ldots,\theta_d \in \set{\pm \sqrt{1/n}}^{d-1}}$,
\begin{align*}
\sum_{\theta \in \Theta} 2^{-d} \sum_{i=1}^d p_{\theta,i}
&= \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \sum_{\theta_i \in \set{\pm \sqrt{1/n}}} p_{\theta,i} \\
&\geq \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \cdot \frac{1}{2}\exp\left(-2\right) \\
&= \frac{d}{4} \exp\left(-2\right)\,.
\end{align*}
Therefore there exists a $\theta \in \Theta$ such that
\begin{align*}
\sum_{i=1}^d p_{\theta,i} \geq \frac{d}{4} \exp\left(-2\right)\,.
\end{align*}
Let $x^* = \argmax_{x \in \cA} \shortinner{x, \theta}$, which satisfies $x^*_i = \sign(\theta_i)$ for each $i$. Then the regret for this choice of $\theta$ is at least
\begin{align*}
R_n(\cA, \theta)
&= \sum_{t=1}^n \E_\theta\left[\shortinner{x^* - A_t, \theta}\right] \\
&= 2\sqrt{\frac{1}{n}} \sum_{i=1}^d \sum_{t=1}^n \mathbb{P}_\theta\left(\sign(A_{t,i}) \neq \sign(\theta_i)\right) \\
&\geq \sqrt{n} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2 \right) \\
&= \sqrt{n} \sum_{i=1}^d p_{\theta,i}
\geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.
\end{align*}
QED
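The identity used in the second line of the last display, $\shortinner{x^* - A_t, \theta} = 2\sqrt{1/n}\sum_{i=1}^d \one{\sign(A_{t,i}) \neq \sign(\theta_i)}$ for hypercube actions, is easy to verify numerically. The following self-contained check is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 100
theta = rng.choice([-1.0, 1.0], size=d) * np.sqrt(1.0 / n)
x_star = np.sign(theta)                          # optimal action in {-1, 1}^d

actions = rng.choice([-1.0, 1.0], size=(n, d))   # an arbitrary sequence of plays
lhs = np.sum((x_star - actions) @ theta)         # sum_t <x* - A_t, theta>
mismatches = np.sum(np.sign(actions) != np.sign(theta))
rhs = 2.0 * np.sqrt(1.0 / n) * mismatches
print(lhs, rhs)                                  # the two numbers agree
```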

Sparse case

We now tackle the sparse case where the underlying parameter $\theta$ is assumed to satisfy $\norm{\theta}_0 = \sum_{i=1}^d \one{\theta_i \neq 0} \leq p$ for some $p$ that is usually much smaller than $d$. An extreme case is $p = 1$, which essentially reduces to the finite-armed bandit problem, for which the regret is at least $\Omega(\sqrt{dn})$ in the worst case. For this reason we cannot expect too much from sparsity. It turns out that the best one can hope for (in the worst case) is $\Omega(\sqrt{p d n})$, so again the lower bound nearly matches the upper bound of an existing algorithm.

Theorem
Let $2 \leq p\leq d$ be even and define the set of actions $\cA$ by
\begin{align*}
\mathcal A = \set{ x \in \set{0, 1}^d : \sum_{i=1}^d \one{x_i > 0} = \frac{p}{2}}.
\end{align*}
Then for any strategy there exists a $\theta$ with $\norm{\theta}_0 \leq p$ such that
\begin{align*}
R_n \geq \frac{\sqrt{2d p n}}{16} \exp(-1)\,.
\end{align*}

The assumption that $p$ is even is non-restrictive: if $p$ is odd the proof below goes through with $p$ replaced by $p - 1$ and the regret changes only by a small constant factor. The proof relies on a slightly different construction than the previous result, and is fractionally more complicated because of it.

Proof
Let $\epsilon = \sqrt{2d/(p n)}$ and $\theta$ be given by
\begin{align*}
\theta_i = \begin{cases}
\epsilon\,, & \text{if } i \leq p / 2\,; \\
0\,, & \text{otherwise}\,.
\end{cases}
\end{align*}
Given $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ define $\theta’$ by
\begin{align*}
\theta’_i = \begin{cases}
\epsilon\,, & \text{if } i \leq p / 2\,; \\
2\epsilon\,, & \text{if } i \in S \,; \\
0\,, & \text{otherwise}\,.
\end{cases}
\end{align*}
Let $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta’}$ be the measures on the sequence of observations when a fixed strategy interacts with the bandits induced by $\theta$ and $\theta’$ respectively. Then the usual computation shows that
\begin{align*}
\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta’}) = 2\epsilon^2 \E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right]\,.
\end{align*}
By the pigeonhole principle we can choose an $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ in such a way that
\begin{align*}
\E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right] \leq \frac{n p}{d}\,.
\end{align*}
Therefore, using this $S$ and the high-probability version of Pinsker's inequality, we have for any event $A$ that
\begin{align*}
\mathbb{P}_\theta(A) + \mathbb{P}_{\theta’}(A^c) \geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta’})\right)
\geq \frac{1}{2} \exp\left(-\frac{np\epsilon^2}{2d}\right)
\geq \frac{1}{2} \exp\left(-1\right)
\end{align*}
Choosing $A = \set{\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} > 0} \geq np/4}$ leads to
\begin{align*}
R_n(\mathcal A, \theta) &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta}(A) &
R_n(\mathcal A, \theta’) &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta’}(A^c)\,.
\end{align*}
Therefore
\begin{align*}
\max\set{R_n(\cA, \theta),\, R_n(\cA, \theta’)}
\geq \frac{n\epsilon p}{8} \left(\mathbb{P}_{\theta}(A) + \mathbb{P}_{\theta’}(A^c)\right)
\geq \frac{\sqrt{2ndp}}{16} \exp(-1)\,.
\end{align*}
QED

Unrealizable case

An important generalization of the linear model is the unrealizable case, where the mean rewards are not assumed to follow a linear model exactly. Suppose that $\cA \subset \R^d$ and that the mean reward $\mu_x = \E[X_t|A_t = x]$ does not necessarily satisfy a linear model. It would be very pleasant to have an algorithm such that if $\mu_x = \shortinner{x, \theta}$ for all $x$ and some $\theta \in \R^d$, then
\begin{align*}
R_n(\cA, \mu) = \tilde O(d \sqrt{n})\,,
\end{align*}
while if no $\theta \in \R^d$ satisfies $\mu_x = \shortinner{x, \theta}$ for all $x \in \cA$, then $R_n(\cA, \mu) = \tilde O(\sqrt{nK})$ recovers the bound of UCB. That is, an algorithm that enjoys the bound of OFUL if the linear model is correct, but recovers the regret of UCB otherwise. Of course one could hope for something even stronger, for example that
\begin{align}
R_n(\cA, \mu) = \tilde O\left(\min\set{\sqrt{Kn},\, d\sqrt{n} + n\epsilon}\right)\,, \label{eq:hope}
\end{align}
where $\epsilon = \min_{\theta \in \R^d} \max_{x \in \cA} |\mu_x – \shortinner{x, \theta}|$ is called the approximation error of the class of linear models. Unfortunately it turns out that results of this kind are not achievable. To show this we will prove a generic bound for the classical finite-armed bandit problem, and then show how this implies a lower bound on the ability to be adaptive to a linear model if possible and have acceptable regret if not.

Theorem
Let $\cA = \set{e_1,\ldots,e_K}$ be the standard basis vectors. Now define sets $\Theta, \Theta’ \subset \R^{K}$ by
\begin{align*}
\Theta &= \set{\theta \in [0,1]^K : \theta_i = 0 \text{ for } i > 1} \\
\Theta’ &= \set{\theta \in [0,1]^K}\,.
\end{align*}
If $2(K-1) \leq V \leq \sqrt{n(K-1)\exp(-2)/8}$ and $\sup_{\theta \in \Theta} R_n(\cA, \theta) \leq V$, then
\begin{align*}
\sup_{\theta’ \in \Theta’} R_n(\cA, \theta’) \geq \frac{n(K-1)}{8V} \exp(-2)\,.
\end{align*}

Proof
Let $\theta \in \Theta$ be given by $\theta_1 = \Delta = (K-1)/V \leq 1/2$. Since the regret in this bandit is $R_n(\cA, \theta) = \Delta \sum_{i=2}^K \E[T_i(n)] \leq V$ by assumption, we have
\begin{align*}
\sum_{i=2}^K \E[T_i(n)] \leq \frac{V}{\Delta}
\end{align*}
and so by the pigeonhole principle there exists an $i > 1$ such that
\begin{align*}
\E[T_i(n)] \leq \frac{V}{(K-1)\Delta} = \frac{1}{\Delta^2}\,.
\end{align*}
Then define $\theta’ \in \Theta’$ by
\begin{align*}
\theta’_j = \begin{cases}
\Delta & \text{if } j = 1 \\
2\Delta & \text{if } j = i \\
0 & \text{otherwise}\,.
\end{cases}
\end{align*}
Then by the usual argument for any event $A$ we have
\begin{align*}
\mathbb{P}_\theta(A) + \mathbb{P}_{\theta’}(A^c)
\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'})\right)
= \frac{1}{2} \exp\left(-2 \Delta^2 \E[T_i(n)]\right)
\geq \frac{1}{2} \exp\left(-2\right)\,.
\end{align*}
Choosing $A = \set{T_i(n) \geq n/2}$ and noting that under $\theta$ arm $i$ is $\Delta$-suboptimal, while under $\theta'$ every arm other than $i$ is at least $\Delta$-suboptimal, we get $R_n(\cA, \theta) \geq \frac{n\Delta}{2} \mathbb{P}_\theta(A)$ and $R_n(\cA, \theta') \geq \frac{n\Delta}{2} \mathbb{P}_{\theta'}(A^c)$, and therefore
\begin{align*}
R_n(\mathcal A, \theta) + R_n(\mathcal A, \theta’)
\geq \frac{n\Delta}{4} \exp(-2) = \frac{n(K-1)}{4V} \exp(-2)
\end{align*}
Therefore by the assumption that $R_n(\mathcal A, \theta) \leq V \leq \sqrt{n(K-1) \exp(-2)/8}$ we have
\begin{align*}
R_n(\mathcal A, \theta’) \geq \frac{n(K-1)}{8V} \exp(-2)\,.
\end{align*}
In other words, $V \cdot \sup_{\theta' \in \Theta'} R_n(\cA, \theta') \geq \frac{n(K-1)}{8} \exp(-2)$: a small worst-case regret on the realizable class $\Theta$ forces a large worst-case regret on the full class $\Theta'$.
QED

As promised, we now relate this to unrealizable linear bandits. Suppose that $d = 1$ (an absurd case) and that there are $K$ arms $\cA = \set{x_1, x_2,\ldots, x_{K}}$ where $x_1 = (1)$ and $x_i = (0)$ for $i > 1$. Clearly if the reward is really linear and $\theta > 0$, then the first arm is optimal, while otherwise all of the other arms have the same expected reward (of just $0$). Now simply add $K-1$ coordinates to each action so that the error in the 1-dimensional linear model can be modeled in a higher dimension, and we have exactly the model used in the previous theorem, with $\cA = \set{e_1,e_2,\ldots,e_K}$. Then the theorem shows that (\ref{eq:hope}) is a pipe dream. If $R_n(\cA, \theta) = \tilde O(\sqrt{n})$ for all $\theta \in \Theta$ (the realizable case), then there exists a $\theta' \in \Theta'$ such that $R_n(\cA, \theta') = \tilde \Omega(K \sqrt{n})$. To our knowledge it is still an open question what is possible on this front. Our conjecture is that there is an algorithm for which
\begin{align*}
R_n(\cA, \theta) = \tilde O\left(\min\set{d\sqrt{n} + \epsilon n,\, \frac{K}{d}\sqrt{n}}\right)\,.
\end{align*}
In fact, it is not hard to design an algorithm that tries to achieve this bound by assuming the problem is realizable, but using some additional time to explore the remaining arms up to some accuracy to confirm the hypothesis. We hope to write a post on this in the future, but leave the claim as a conjecture for now.

Asymptotic lower bounds

Like in the finite-armed case, the asymptotic result is proven only for consistent strategies. Recall that a strategy is consistent in some class if the regret is sub-polynomial for any bandit in that class.

Theorem Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and suppose a strategy satisfies
\begin{align*}
\text{for all } \theta \in \R^d \text{ and } p > 0 \qquad R_n(\cA, \theta) = o(n^p)\,.
\end{align*}
Let $\theta \in \R^d$ be any parameter such that there is a unique optimal action and let $\bar G_n = \E_\theta \left[\sum_{t=1}^n A_t A_t^\top\right]$ be the expected Gram matrix when the strategy interacts with the bandit determined by $\theta$. Then $\liminf_{n\to\infty} \lambda_{\min}(\bar G_n) / \log(n) > 0$ (which implies that $\bar G_n$ is eventually non-singular). Furthermore, for any $x \in \cA$ it holds that:
\begin{align*}
\limsup_{n\to\infty} \log(n) \norm{x}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.
\end{align*}

The reader should recognize $\norm{x}_{\bar G_n^{-1}}^2$ as the key term in the width of the confidence interval for the least squares estimator. This is quite intuitive. The theorem is saying that any consistent algorithm must prove statistically that all sub-optimal arms are indeed sub-optimal. Before the proof of this result we give a corollary that characterizes the asymptotic regret that must be endured by any consistent strategy.

Corollary
Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and $\theta \in \R^d$ be such that there is a unique optimal action. Then for any consistent strategy
\begin{align*}
\liminf_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} \geq c(\cA, \theta)\,,
\end{align*}
where $c(\cA, \theta)$ is defined as
\begin{align*}
&c(\cA, \theta) = \inf_{\alpha \in [0,\infty)^{\cA}} \sum_{x \in \cA} \alpha(x) \Delta_x \\
&\quad\text{ subject to } \norm{x}_{H_\alpha^{-1}}^2 \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA \text{ with } \Delta_x > 0\,,
\end{align*}
where $H_\alpha = \sum_{x \in \cA} \alpha(x) x x^\top$.
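The constant $c(\cA, \theta)$ is defined by an optimization problem that can be approximated numerically. The sketch below hands it to scipy's SLSQP solver; the function name, the tiny ridge added to keep $H_\alpha$ invertible, and the cap on $\alpha$ are our own choices, so the result should be treated as an illustration rather than a certified solution:

```python
import numpy as np
from scipy.optimize import minimize

def c_lower_bound(actions, theta, alpha_max=1e6):
    """Numerically approximate c(A, theta) from the corollary.

    actions: (K, d) array, theta: (d,) array.  This is an illustrative
    sketch, not a robust solver."""
    actions = np.asarray(actions, float)
    theta = np.asarray(theta, float)
    gaps = actions @ theta
    gaps = gaps.max() - gaps
    subopt = np.where(gaps > 1e-12)[0]

    def H(alpha):
        # H_alpha = sum_x alpha(x) x x^T (tiny ridge keeps it invertible)
        return (actions.T * alpha) @ actions + 1e-9 * np.eye(actions.shape[1])

    def objective(alpha):
        return float(np.dot(alpha, gaps))

    def constraint(alpha, i):
        # Delta_x^2 / 2 - ||x||^2_{H_alpha^{-1}} >= 0 for sub-optimal x
        x = actions[i]
        return gaps[i] ** 2 / 2.0 - x @ np.linalg.solve(H(alpha), x)

    cons = [{"type": "ineq", "fun": constraint, "args": (i,)} for i in subopt]
    alpha0 = np.full(len(actions), 10.0)
    res = minimize(objective, alpha0, constraints=cons,
                   bounds=[(0.0, alpha_max)] * len(actions), method="SLSQP")
    return res.fun, res.x
```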

Proof of Theorem
The proof of the first part is omitted (see the reference below for details). It follows along similar lines to what comes next, the essence being that if $\bar G_n$ is not “sufficiently large” in every direction, then some alternative parameter is not sufficiently identifiable. Let $\theta' \in \R^d$ be an alternative parameter and let $\mathbb{P}$ and $\mathbb{P}'$ be the measures on the sequence of outcomes $A_1,X_1,\ldots,A_n,X_n$ induced by the interaction between the strategy and the bandits determined by $\theta$ and $\theta'$ respectively. Then for any event $E$ we have
\begin{align}
\Prob{E} + \mathbb{P}'(E^c)
&\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}, \mathbb{P}’)\right) \nonumber \\
&= \frac{1}{2} \exp\left(-\frac{1}{2} \E\left[\sum_{t=1}^n \inner{A_t, \theta - \theta'}^2\right]\right)
= \frac{1}{2} \exp\left(-\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2\right)\,. \label{eq:linear-asy-kl}
\end{align}
A simple re-arrangement shows that
\begin{align*}
\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2 \geq \log\left(\frac{1}{2 \Prob{E} + 2 \mathbb{P}'(E^c)}\right)\,.
\end{align*}
Now we follow the usual plan of choosing $\theta'$ to be close to $\theta$, but so that the optimal action in the bandit determined by $\theta'$ is not $x^*$. Let $\epsilon \in (0, \Delta_{\min})$ and let $H$ be a positive semidefinite matrix to be chosen later such that $\norm{x - x^*}_H^2 > 0$. Then define
\begin{align*}
\theta' = \theta + \frac{\Delta_x + \epsilon}{\norm{x - x^*}^2_H} H(x - x^*)\,,
\end{align*}
which is chosen so that
\begin{align*}
\shortinner{x - x^*, \theta'} = \shortinner{x - x^*, \theta} + \Delta_x + \epsilon = \epsilon\,.
\end{align*}
This means that $x^*$ is not the optimal action for the bandit determined by $\theta'$, and in fact it is at least $\epsilon$-suboptimal. We abbreviate $R_n = R_n(\cA, \theta)$ and $R_n' = R_n(\cA, \theta')$. Then
\begin{align*}
R_n
&= \E_\theta\left[\sum_{x \in \cA} T_x(n) \Delta_x\right]
\geq \frac{n\Delta_{\min}}{2} \Prob{T_{x^*}(n) < n/2} \geq \frac{n\epsilon}{2} \Prob{T_{x^*}(n) < n/2}\,.
\end{align*}
Similarly, $x^*$ is at least $\epsilon$-suboptimal in bandit $\theta'$ so that
\begin{align*}
R_n' \geq \frac{n\epsilon}{2} \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right)\,.
\end{align*}
Therefore
\begin{align}
\Prob{T_{x^*}(n) < n/2} + \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right) \leq \frac{2}{n\epsilon} \left(R_n + R_n'\right)\,. \label{eq:regret-sum}
\end{align}
Note that this holds for practically any choice of $H$ as long as $\norm{x - x^*}_H > 0$. The logical next step is to select the $H$ (which determines $\theta'$) in such a way that (\ref{eq:linear-asy-kl}) is as large as possible. The main difficulty is that this depends on $n$, so instead we aim to choose an $H$ so that the quantity is large enough infinitely often. We start by re-arranging things:
\begin{align*}
\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2
= \frac{(\Delta_x + \epsilon)^2}{2} \cdot \frac{\norm{x - x^*}_{H \bar G_n H}^2}{\norm{x-x^*}_H^4}
= \frac{(\Delta_x + \epsilon)^2}{2 \norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H)\,,
\end{align*}
where we introduced
\begin{align*}
\rho_n(H) = \frac{\norm{x - x^*}_{\bar G_n^{-1}}^2 \norm{x - x^*}_{H \bar G_n H}^2}{\norm{x - x^*}_H^4}\,.
\end{align*}
Therefore by choosing $E$ to be the event that $T_{x^*}(n) < n/2$ and using (\ref{eq:regret-sum}) and (\ref{eq:linear-asy-kl}) we have
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq \log\left(\frac{n \epsilon}{4R_n + 4R_n'}\right)\,,
\end{align*}
which after re-arrangement leads to
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\log(n)\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq 1 - \frac{\log((4R_n + 4R_n')/\epsilon)}{\log(n)}\,.
\end{align*}
The definition of consistency means that $R_n$ and $R_n'$ are both sub-polynomial, which implies that the second term in the previous expression tends to zero for large $n$, and so by sending $\epsilon$ to zero we see that
\begin{align}
\label{eq:lin-lower-liminf}
\liminf_{n\to\infty} \frac{\rho_n(H)}{\log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2} \geq \frac{2}{\Delta_x^2}\,.
\end{align}
We complete the result using proof by contradiction. Suppose that
\begin{align}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 > \frac{\Delta_x^2}{2}\,. \label{eq:linear-lower-ass}
\end{align}
Then there exists an $\epsilon > 0$ and infinite set $S \subseteq \N$ such that
\begin{align*}
\log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 \geq \frac{(\Delta_x + \epsilon)^2}{2} \quad \text{ for all } n \in S\,.
\end{align*}
Therefore by (\ref{eq:lin-lower-liminf}),
\begin{align*}
\liminf_{n \in S} \rho_n(H) > 1\,.
\end{align*}
We now choose $H$ to be a cluster point of the sequence $\{\bar G_n^{-1} / \norm{\bar G_n^{-1}}\}_{n \in S}$, where $\norm{\bar G_n^{-1}}$ is the spectral norm of $\bar G_n^{-1}$. Such a point must exist, since the matrices in this sequence have unit spectral norm by definition and the set of matrices with spectral norm at most one is compact. We let $S' \subseteq S$ be a subset along which $\bar G_n^{-1} / \norm{\bar G_n^{-1}}$ converges to $H$. We now check that $\norm{x - x^*}_H > 0$:
\begin{align*}
\norm{x - x^*}_H^2 = \lim_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}}}{\norm{\bar G_n^{-1}}}
> 0\,,
\end{align*}
where the last inequality follows from the assumption in (\ref{eq:linear-lower-ass}) and the first part of the theorem. Therefore
\begin{align*}
1 < \liminf_{n \in S} \rho_n(H) \leq \liminf_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}} \norm{x - x^*}^2_{H \bar G_n H}}{\norm{x - x^*}_H^4} = 1\,,
\end{align*}
which is a contradiction, and so we conclude that (\ref{eq:linear-lower-ass}) does not hold and so
\begin{align*}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.
\end{align*}
QED

We leave the proof of the corollary as an exercise for the reader. Essentially though, any consistent algorithm must choose its actions so that in expectation
\begin{align*}
\norm{x - x^*}^2_{\bar G_n^{-1}} \leq (1 + o(1)) \frac{\Delta_x^2}{2 \log(n)}\,.
\end{align*}
Now since $x^*$ will be chosen linearly often, it is easily shown for sub-optimal $x$ that $\lim_{n\to\infty} \norm{x - x^*}_{\bar G_n^{-1}} / \norm{x}_{\bar G_n^{-1}} = 1$. This leads to the required constraint on the actions of the algorithm, and the optimization problem in the corollary is obtained by minimizing the regret subject to this constraint.

Clouds looming for optimism

The clouds are closing in

The theorem and its corollary have disturbing implications for strategies based on the principle of optimism in the face of uncertainty: they can never be asymptotically optimal! The reason is that optimistic strategies stop choosing actions for which they have collected enough statistics to prove they are sub-optimal, but in the linear setting it can still be worthwhile playing such actions when they are very informative about other actions for which the statistics are not yet so clear. A problematic example already appears in the simplest case in which any information is shared between the arms at all, namely dimension $d = 2$ with $K = 3$ arms.

Specifically, let $\cA = \set{x_1, x_2, x_3}$ where $x_1 = e_1$, $x_2 = e_2$ and $x_3 = (1-\epsilon, \gamma \epsilon)$, with $\gamma \geq 1$ and $\epsilon > 0$ small. Let $\theta = (1, 0)$ so that the optimal action is $x^* = x_1$, while $\Delta_{x_2} = 1$ and $\Delta_{x_3} = \epsilon$. If $\epsilon$ is very small, then $x_1$ and $x_3$ point in nearly the same direction, so choosing only these two arms reveals only very slowly which of $x_1$ or $x_3$ is optimal. On the other hand, choosing $x_2$ gives information about the second component of $\theta$, which is exactly what distinguishes $x_1$ from $x_3$, and so allows a learning agent to quickly identify that $x_1$ is in fact optimal.

A troubling example


We now show how the theorem and corollary demonstrate this. First we calculate the optimal solution of the optimization problem in the corollary. Recall that we are trying to minimize
\begin{align*}
\sum_{x \in \cA} \alpha(x) \Delta_x \qquad \text{subject to } \norm{x}^2_{H(\alpha)^{-1}} \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA \text{ with } \Delta_x > 0\,,
\end{align*}
where $H(\alpha) = \sum_{x \in \cA} \alpha(x) x x^\top$. Clearly we should choose $\alpha(x_1)$ arbitrarily large; a computation then shows that
\begin{align*}
\lim_{\alpha(x_1) \to \infty} H(\alpha)^{-1} =
\left[\begin{array}{cc}
0 & 0 \\
0 & \frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)}
\end{array}\right]\,.
\end{align*}
The constraints then mean that
\begin{align*}
&\frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_2}^2_{H(\alpha)^{-1}} \leq \frac{1}{2} \\
&\frac{\gamma^2 \epsilon^2}{\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_3}^2_{H(\alpha)^{-1}} \leq \frac{\epsilon^2}{2}\,.
\end{align*}
Provided that $\gamma \geq 1$ this reduces to the single constraint
\begin{align*}
\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2) \geq 2\gamma^2\,.
\end{align*}
Since we are minimizing $\alpha(x_2) + \epsilon \alpha(x_3)$ we can easily see that $\alpha(x_2) = 2\gamma^2$ and $\alpha(x_3) = 0$ provided that $2\gamma^2 \leq 2/\epsilon$. Therefore if $\epsilon$ is chosen sufficiently small relative to $\gamma$, then the optimal rate of the regret is $c(\cA, \theta) = 2\gamma^2$ and so there exists a strategy such that
\begin{align*}
\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = 2\gamma^2\,.
\end{align*}
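The claimed solution can be sanity-checked numerically by plugging the candidate allocation into the constraints and the objective. The following self-contained check (with arbitrarily chosen $\gamma = 3$ and $\epsilon = 0.01$) is our own illustration:

```python
import numpy as np

gamma, eps = 3.0, 0.01
x1 = np.array([1.0, 0.0])
x2 = np.array([0.0, 1.0])
x3 = np.array([1.0 - eps, gamma * eps])
gaps = {"x2": 1.0, "x3": eps}

# Candidate optimal allocation: alpha(x1) huge, alpha(x2) = 2*gamma^2, alpha(x3) = 0.
a1, a2, a3 = 1e9, 2 * gamma ** 2, 0.0
H = a1 * np.outer(x1, x1) + a2 * np.outer(x2, x2) + a3 * np.outer(x3, x3)
Hinv = np.linalg.inv(H)

print(x2 @ Hinv @ x2, gaps["x2"] ** 2 / 2)   # constraint for x2: ~0.056 <= 0.5
print(x3 @ Hinv @ x3, gaps["x3"] ** 2 / 2)   # constraint for x3: binding, both ~5e-5
print(a2 * gaps["x2"] + a3 * gaps["x3"])     # objective value = 2 * gamma^2 = 18
```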
Now we argue that, for $\gamma$ sufficiently large and $\epsilon$ arbitrarily small, the regret of any consistent optimistic algorithm is at least
\begin{align*}
\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = \Omega(1/\epsilon)\,,
\end{align*}
which can be arbitrarily worse than the optimal rate! So why is this so? Recall that optimistic algorithms choose
\begin{align*}
A_t = \argmax_{x \in \cA} \max_{\tilde \theta \in \cC_t} \inner{x, \tilde \theta}\,,
\end{align*}
where $\cC_t \subset \R^d$ is a confidence set that we assume contains the true $\theta$ with high probability. So far this does not greatly restrict the class of algorithms that we might call optimistic. We now assume that there exists a constant $c > 0$ such that
\begin{align*}
\cC_t \subseteq \set{\tilde \theta : \norm{\hat \theta_t - \tilde \theta}_{G_t} \leq c \sqrt{\log(n)}}\,,
\end{align*}
Here $\hat \theta_t$ is some estimate of $\theta$ (typically the least squares estimate) and $G_t = \sum_{s=1}^{t-1} A_s A_s^\top$ is the Gram matrix of the actions chosen before round $t$. So how often can we expect the optimistic algorithm to choose action $x_2 = e_2$ in the example described above? Since we have assumed that $\theta \in \cC_t$ with high probability, we have
\begin{align*}
\max_{\tilde \theta \in \cC_t} \shortinner{x_1, \tilde \theta} \geq 1\,.
\end{align*}
On the other hand, if $T_{x_2}(t-1) > 4c^2 \log(n)$, then
\begin{align*}
\max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta}
&= \max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta - \theta} \\
&\leq 2 c \norm{x_2}_{G_t^{-1}} \sqrt{\log(n)} \\
&\leq 2 c \sqrt{\frac{\log(n)}{T_{x_2}(t-1)}} \\
&< 1\,,
\end{align*}
which means that $x_2$ will not be chosen more than $1 + 4c^2 \log(n)$ times. So if $\gamma = \Omega(c^2)$, then the optimistic algorithm does not choose $x_2$ sufficiently often, and a simple computation shows that it must instead choose $x_3$ at least $\Omega(\log(n)/\epsilon^2)$ times, suffering regret of $\Omega(\log(n)/\epsilon)$. The key take-away is that optimistic algorithms never choose actions they can already prove are statistically sub-optimal, but for linear bandits it can be optimal to choose such actions more often in order to gain information about other actions.
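To see this effect empirically, here is a small simulation of an ellipsoidal-UCB strategy of the above form on the three-armed example. Everything here (the regularization $\lambda$, the radius $c\sqrt{\log n}$, the constants) is our own illustrative choice rather than a tuned algorithm, but the qualitative behaviour, with $x_2$ starved and $x_3$ over-played, is the point:

```python
import numpy as np

def linucb_counts(gamma, eps, n, c=1.0, lam=1.0, seed=0):
    """Run an ellipsoidal-UCB strategy on the three-armed example and
    return how often each arm was played.  Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    actions = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0 - eps, gamma * eps]])
    theta = np.array([1.0, 0.0])
    V = lam * np.eye(2)            # regularized Gram matrix
    b = np.zeros(2)                # sum of X_t * A_t
    counts = np.zeros(3, dtype=int)
    radius = c * np.sqrt(np.log(n))
    for t in range(n):
        theta_hat = np.linalg.solve(V, b)
        Vinv = np.linalg.inv(V)
        ucb = actions @ theta_hat + radius * np.sqrt(
            np.einsum("ij,jk,ik->i", actions, Vinv, actions))
        k = int(np.argmax(ucb))
        x = actions[k]
        reward = x @ theta + rng.normal()
        V += np.outer(x, x)
        b += reward * x
        counts[k] += 1
    return counts

print(linucb_counts(gamma=10.0, eps=0.02, n=20000))
# Typically x_2 is played only O(log n) times while x_3 is played far more,
# even though playing x_2 a little more would identify x_1 much faster.
```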

Notes

Note 1: The worst-case bound demonstrates the near-optimality of the OFUL algorithm for a specific action-set. It is an open question to characterize the optimal regret for a wide range of action-sets. We will return to these issues soon when we discuss adversarial linear bandits.

Note 2: There is an algorithm that achieves the asymptotic lower bound (see references below), but so far there is no algorithm that is simultaneously asymptotically optimal and (near) minimax optimal.

Note 3: The assumption that $x^*$ was unique in the asymptotic results can be relaxed at the price of a little more work, and simple (natural) modifications to the theorem statements.

References

Worst-case lower bounds for stochastic bandits have appeared in a variety of places, all with roughly the same bound, but for different action sets. Our very simple proof is new, but takes inspiration mostly from the paper by Shamir.

The asymptotic lower bound (along with a strategy for which the upper bound matches) is by the authors.

The example used to show optimistic approaches cannot achieve the optimal rate has been used before in the pure exploration setting where the goal is to simply find the best action, without the constraint that the regret should be small.

We should also mention that examples have been constructed before demonstrating the need for carefully balancing the trade-off between information and regret. For many examples, and a candidate design principle for addressing these issues see

The results for the unrealizable case are inspired by the work of one of the authors on the pareto regret frontier for bandits, which characterizes what trade-offs are available when it is desirable to have a regret that is unusually small relative to some specific arms.

Stochastic Linear Bandits and UCB

Recall that in the adversarial contextual $K$-action bandit problem, at the beginning of each round $t$ a context $c_t\in \Ctx$ is observed. The idea is that the context $c_t$ may help the learner to choose a better action. This led us to change the benchmark in the definition of regret. In this post we start by reviewing how contextual bandit problems can be defined in the stochastic setting. We use this setting to motivate the introduction of stochastic linear bandits, a fascinatingly rich model with much structure, which will be the topic of the next few posts. Besides defining stochastic linear bandits we also cover how UCB can be generalized to this setting.

Stochastic contextual bandits

In the standard $K$-action stochastic contextual bandit problem, at the beginning of round $t$ the learner observes a context $C_t\in \Ctx$. The context may or may not be random. Next, the learner chooses its action $A_t\in [K]$ based on the information available. So far there is no difference from the adversarial setting. The difference comes from the assumption that the reward $X_t$ incurred satisfies
\begin{align*}
X_t = r(C_t,A_t) + \eta_t\,,
\end{align*}
where $r:\Ctx\times [K]\to \R$ is the so-called reward function, which is unknown to the learner, while $\eta_t$ is random noise.

In particular, the assumption on the noise is as follows: Let
\begin{align*}
\cF_t = \sigma(C_1,A_1,X_1,\dots,C_{t-1},A_{t-1},X_{t-1},C_t,A_t)
\end{align*}
be the $\sigma$-field summarizing the information available just before $X_t$ is observed. Then, given the past, we assume that $\eta_t$ is conditionally $1$-subgaussian:
\begin{align*}
\EE{ \exp( \lambda \eta_t ) | \cF_t } \le \exp\left(\frac{\lambda^2}{2}\right) \quad \text{for all } \lambda \in \R\,.
\end{align*}
(The constant $1$ is chosen to minimize the number of symbols; there is no difficulty considering the more general case of conditionally $\sigma^2$-subgaussian noise.) As discussed beforehand, subgaussian random variables have zero mean, hence the above assumption also implies that $\EE{\eta_t|\cF_t}=0$, or $\EE{X_t|\cF_t}=r(C_t,A_t)$. In words, for any given $(c,a)\in \Ctx\times [K]$ context-action pair, $r(c,a)$ gives the mean reward of action $a$ in context $c$.

If $r$ were known, then the learner wishing to maximize the total expected reward would choose the action $A_t^* = \argmax_{a\in [K]} r(C_t,a)$ in round $t$ (if multiple maximizers exist, choose one of them). The loss due to the lack of knowledge of $r$ makes the learner incur the (expected) regret
\begin{align*}
R_n = \EE{ \sum_{t=1}^n \max_{a\in[K]} r(C_t,a) - \sum_{t=1}^n X_t }\,.
\end{align*}
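The interaction protocol and the regret accounting can be summarized in a few lines of Python; the toy reward table and the uniformly random choices below are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, K, n = 3, 4, 1000
r = rng.uniform(size=(n_contexts, K))      # unknown reward table r(c, a)

regret = 0.0
for t in range(n):
    c = rng.integers(n_contexts)           # context C_t (here: uniformly random)
    a = rng.integers(K)                    # learner's choice A_t (here: uniform, for illustration)
    x = r[c, a] + rng.normal()             # reward X_t = r(C_t, A_t) + eta_t
    regret += r[c].max() - r[c, a]         # pseudo-regret against the best action for C_t
print(regret)
```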

Towards linear bandits

To act eventually optimally, the learner may estimate $r(c,a)$ for each $(c,a)\in \Ctx\times [K]$ pair. Similarly to what happens in the adversarial setting, this is ineffective when the number of context-action pairs is large, because the regret can then be very large for a long time. In particular, just like in the adversarial case, the worst-case regret over all possible contextual problems with $M$ contexts and mean reward in $[0,1]$ is at least $\Omega( \sqrt{nMK} )$. One refinement of the bound is to replace $M$ by the number of effective contexts, i.e., contexts that appear frequently. But if the context contains a lot of detailed information, this may still be very high. However, if the reward function enjoys additional structure, this worst-case argument fails. The additional structure may come in many different forms. “Smoothness”, which was also mentioned previously when we discussed adversarial contextual bandits, is one example.

An alternative (but related) assumption uses the linear structure of the set $\R^{\Ctx\times [K]}$ of all possible reward functions (recall that real-valued functions form a vector space over the reals). The so-called linearity assumption postulates that $r$ belongs to a low-dimensional linear subspace $\cS$ of $\cV \doteq \R^{\Ctx\times [K]}$.

A somewhat finer condition is to assume a specific “parameterization” of the subspace $\cS$. This works as follows: It is assumed that the learner is given access to a map $\psi: \Ctx \times [K] \to \R^d$ and that with some unknown parameter vector $\theta_*\in \R^d$,
\begin{align*}
r(c,a) = \theta_*^\top \psi(c,a)\,, \qquad \forall (c,a)\in \Ctx\times [K]\,.
\end{align*}
The map $\psi$, in line with the machine learning literature, is called a feature-map. Assuming the knowledge of $\psi$ is equivalent to knowing the linear subspace $\cS$. The refinement of the subspace condition comes from extra assumptions on $\theta_*$, such as that the magnitude of $\theta_*$ as measured in a certain norm $\norm{\cdot}$ is “small”, or even that $\norm{\theta_*}\le B$ with $B$ known, or, more generally, that $\theta_*\in \Theta$ for some known (small) set $\Theta\subset \R^d$, or a mix of these.

The idea of feature maps is best illustrated with an example: If the context describes the visitor of a website selling books, the actions are books to recommend, and the reward is the revenue on a book sold, then the features could indicate the interests of the visitors as well as the domain and topic of the books. If the visitors and books are assigned to finitely many categories, indicator variables of all possible combinations of these categories could be used to create the feature map. Of course, many other possibilities exist. One such possibility is to train a neural network (deep or not) on historical data to predict the revenue and use the nonlinear map obtained by removing the last layer of the neural network. The subspace $\Psi$ spanned by the feature vectors $\{\psi(c,a)\}_{c,a}$ in $\R^d$ is called the feature-space.
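As a concrete (and entirely made-up) instance of such an indicator feature map, here is a small sketch; the category names and the value of $\theta_*$ are hypothetical:

```python
import numpy as np

# Hypothetical categorical descriptions of contexts (visitors) and actions (books).
visitor_categories = ["fiction-lover", "scientist", "casual"]
book_topics = ["novel", "math", "cooking"]
d = len(visitor_categories) * len(book_topics)

def psi(visitor, book):
    """Indicator feature map over (visitor category, book topic) pairs."""
    phi = np.zeros(d)
    i = visitor_categories.index(visitor)
    j = book_topics.index(book)
    phi[i * len(book_topics) + j] = 1.0
    return phi

# r(c, a) = <theta_*, psi(c, a)>: each entry of theta_* is the expected
# revenue for one (visitor category, book topic) combination.
theta_star = np.arange(d) / d
print(psi("scientist", "math") @ theta_star)
```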

An assumption on $\norm{\theta_*}$ encodes smoothness of $r$. In particular, from Hölder’s inequality,
\begin{align*}
|r(c,a)-r(c',a')| \le \norm{\theta_*} \norm{\psi(c,a)-\psi(c',a')}_*\,,
\end{align*}
where $\norm{\cdot}_*$ denotes the dual of $\norm{\cdot}$. This is how $\psi$ implicitly encodes “smoothness”. Restrictions on $\norm{\theta_*}$ have a similar effect to assuming that the dimensionality $d$ of the subspace $\cS$ is small. In fact, one may push this to the extreme and allow $d$ to be infinite, which can buy tremendous flexibility. With this much flexibility the linearity assumption perhaps feels less limiting.

Another assumption, similar to yet different from the previously mentioned ones, is that $\theta_*$ is sparse. By this we mean that many entries of $\theta_*$ are zero. Sometimes this is expressed by saying that the $0$-norm of $\theta_*$ is small. Here, the zero-norm, $\norm{\theta_*}_0$, just counts the number of nonzero entries of $\theta_*$: $\norm{\theta_*}_0 = |\{ i \,:\, \theta_{*,i}\ne 0 \}|$. The point of the sparsity assumption is to remove from the designer of the feature map the burden of leaving out components that are unnecessary to get a good approximation of $r$: the designer may then include any component that may be relevant in predicting the reward, enforcing the selection of the “relevant” features by imposing a constraint on the number of nonzero entries of $\theta_*$.

Stochastic linear bandits

Stochastic linear bandits arise from realizing that when the reward is linear in the feature vectors, the identity of the actions becomes secondary and we may as well let the algorithm choose the feature vectors directly: the identity of the actions adds no information or structure to the problem. This results in the following model: In round $t$, the learner is given the decision set $\cD_t\subset \R^d$ from which it has to choose an action. If the learner chooses $A_t\in \cD_t$, it incurs the reward
\begin{align*}
X_t = \ip{A_t,\theta_*} + \eta_t\,,
\end{align*}
where $\eta_t$ is $1$-subgaussian given $\cD_1,A_1,X_1,\dots,\cD_{t-1},A_{t-1},X_{t-1},\cD_t$ and $A_t$. The regret suffered is
\begin{align*}
R_n = \EE{ \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \sum_{t=1}^n X_t }\,.
\end{align*}

With $\cD_t = \{ \psi(C_t,a) \,:\, a\in [K] \}$ the model reproduces contextual bandits. When $\cD_t = \{e_1,\dots,e_d\}$ (where $e_1,\dots,e_d$ are the unit vectors of the standard Euclidean basis), stochastic linear bandits reproduce finite-action stochastic bandits. Stochastic linear bandits also arise naturally with combinatorial action sets, i.e., when $\cD \subset \{0,1\}^d$: Many combinatorial problems (such as matching, least-cost problems in directed graphs, choosing spanning trees, etc.) can be written as linear optimization over some combinatorial set $\cD$ obtained from considering incidence vectors, often associated with some graph. We hope to cover some of these fun problems later.
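The two special cases mentioned above are easy to write down explicitly; the feature map in the sketch below is a made-up illustration, not a prescribed construction:

```python
import numpy as np

d, K = 4, 4

# Finite-armed bandit as a linear bandit: D_t = {e_1, ..., e_d} every round.
finite_armed_D = np.eye(d)

# Contextual bandit as a linear bandit: D_t = {psi(C_t, a) : a in [K]},
# here with a made-up feature map psi for illustration.
def psi(context, a):
    v = np.zeros(d)
    v[a % d] = 1.0 + 0.1 * context
    return v

def decision_set(context):
    return np.stack([psi(context, a) for a in range(K)])

theta_star = np.array([0.3, 0.1, 0.4, 0.2])
D_t = decision_set(context=2)
print(D_t @ theta_star)     # mean rewards of the K available actions this round
```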

UCB for linear bandits

The UCB algorithm is a very attractive algorithm for finite-action stochastic bandits: It is near-minimax optimal and is also almost instance optimal for any finite horizon and even asymptotically. It is thus quite natural to attempt to generalize UCB to the linear settings.

The generalization is based on the view that UCB implements the optimism in the face of uncertainty principle, according to which one should choose the actions as if the environment (in our case the linear bandit environment) were as nice as plausibly possible. In finite-action stochastic bandit problems the principle dictates choosing the action with the largest upper confidence bound. In the case of linear bandit problems this still holds, but now to calculate the upper confidence bounds one should take into account the information conveyed by all the rewards observed, because all the data $(A_1,X_1,\dots,A_{t-1},X_{t-1})$ is now connected through the unknown parameter vector.

One idea is to construct a “confidence set” $\cC_{t}$ based on $(A_1,X_1,\dots,A_{t-1},X_{t-1})$ that contains the unknown parameter vector $\theta_*$ with high probability. Leaving details of how the confidence set is constructed aside for a moment but assuming that the confidence set indeed contains $\theta_*$, for any given action $a\in \R^d$,
\begin{align}
\mathrm{UCB}_t(a) = \max_{\theta\in \cC_t} \ip{a,\theta}
\label{eq:UCBCt}
\end{align}
will be an upper bound on the mean payoff of $a$. The UCB algorithm that uses the confidence set $\cC_t$ at time $t$ then selects
\begin{align}
A_t = \argmax_{a\in \cD_t} \mathrm{UCB}_t(a)\,.
\label{eq:UCBgencDt}
\end{align}
In fact, the last equation is what we will take as the definition of UCB regardless of how the $\UCB_t(\cdot)$ values are defined. Of course the naming is justified only when the UCB values are indeed upper bounds on the mean payoffs of the actions.

Depending on the authors, UCB applied to linear bandits is known by many names, including but not limited to LinRel (after perhaps Linear Reinforcement Learning), LinUCB (an obvious choice), and OFUL (Optimism in the Face of Uncertainty for Linear bandits), just to mention a few.

Computation

Note that as long as $\cD_t$ contains only a few vectors and the linear optimization problem \eqref{eq:UCBCt} can be efficiently solved (such as when $\cC_t$ is convex), the computation is efficient. To discuss the computational cost further, note that the computation of $A_t$ can also be written as
\begin{align}
(A_t,\tilde \theta_t) = \argmax_{(a,\theta)\in \cD_t\times \cC_t} \ip{a,\theta}\,.
\label{eq:UCBgenjoint}
\end{align}
This is a bilinear optimization problem over the set $\cD_t \times \cC_t$. In general, nothing much can be said about the computational efficiency of solving this problem. One special case when a solution can be found efficiently is when (i) the linear optimization problem $\max_{a\in \cD} \ip{a,\theta}$ can be efficiently solved for any $\theta$ (this holds in many combinatorial settings) and (ii) $\cC_t$ is the convex hull of a handful of vertices: $\cC_t = \mathrm{co}(c_{t1},\dots,c_{tp})$. Indeed, in this case for any $a\in \cD_t$, $\argmax_{\theta\in \cC_t} \ip{a,\theta}$ is one of $c_{t1},\dots,c_{tp}$. Hence, the solution to \eqref{eq:UCBgenjoint} can be obtained by solving $\max( \max_{a\in \cD_t} \ip{a,c_{t1}}, \dots, \max_{a\in \cD_t} \ip{a,c_{tp}} )$. A special case of this is when $\cC_t$ is the skewed and shifted $\ell^1$-ball, i.e. when for some nonsingular matrix $A$ and vector $\theta_0\in \R^d$, $\cC_t = \{ \theta\,:\, \norm{A(\theta – \theta_0)}_1 \le \beta_t \}$. Note that in this case $p=2d$.
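Here is a sketch of this decomposition. The helper names and the toy sets are our own illustration; `solve_lin_opt` stands in for whatever efficient linear optimizer over $\cD_t$ is available:

```python
import numpy as np

def ucb_action(vertices, solve_lin_opt):
    """Pick (A_t, theta_t) maximizing <a, theta> over D_t x co(vertices).

    vertices: (p, d) array of the extreme points c_{t1}, ..., c_{tp};
    solve_lin_opt(theta) must return the action in D_t maximizing <a, theta>."""
    best_val, best_action = -np.inf, None
    for v in vertices:
        a = solve_lin_opt(v)               # inner linear optimization over D_t
        val = a @ v
        if val > best_val:
            best_val, best_action = val, a
    return best_action, best_val

# Toy usage: D_t is a small finite set, C_t is the convex hull of two vertices.
D = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
vertices = np.array([[0.9, 0.2], [0.1, 1.0]])
action, value = ucb_action(vertices, lambda th: D[np.argmax(D @ th)])
print(action, value)
```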

Another notable case is when $\cC_t$ is an ellipsoid. To minimize clutter in writing the definition of an ellipsoid, let us introduce the notation $\norm{x}^2_V$, which is defined for a $d\times d$ positive definite matrix $V$ and takes the value $x^\top V x$. The notation is justified since $\norm{\cdot}_{V}$ is indeed a norm. We will call $\norm{x}_V$ the $V$-norm of $x$. With this, choosing some center $\hat \theta\in \R^d$, a positive definite matrix $V\succ 0$ and “radius” $\beta$, an ellipsoidal confidence set takes the form $\cC_t = \{\theta\in \R^d \,:\, \norm{\theta-\hat \theta}^2_{V}\le \beta \}$ (in general, $\hat \theta$, $V$ and $\beta$ will depend on past observations, and the reason for not absorbing $\beta$ into the definition of $V$ will become clear later).

When $\cC_t$ is of this ellipsoidal form, the UCB values are particularly simple. To see this, first we rewrite $\cC_t$ into an alternate, equivalent form. Defining $B_2 = \{x\in \R^d\,:\, \norm{x}_2\le 1\}$ to be the unit ball with respect to the Euclidean norm, it is easy to see that $\cC_t = \hat \theta + \beta^{1/2} V^{-1/2} B_2$. Using this, a short direct calculation gives
\begin{align}\label{eq:linucb}
\mathrm{UCB}_t(a) = \ip{a,\hat \theta} + \beta^{1/2} \norm{a}_{V^{-1}}\,.
\end{align}
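In code, the ellipsoidal UCB value of \eqref{eq:linucb} is one line, and it can be checked against the definition \eqref{eq:UCBCt} by maximizing over the ellipsoid explicitly; the following sketch is our own illustration:

```python
import numpy as np

def ucb_value(a, theta_hat, V, beta):
    """UCB_t(a) = <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}} for an
    ellipsoidal confidence set {theta : ||theta - theta_hat||_V^2 <= beta}."""
    width = np.sqrt(a @ np.linalg.solve(V, a))   # ||a||_{V^{-1}}
    return a @ theta_hat + np.sqrt(beta) * width

# Sanity check against the definition as a maximum over the ellipsoid:
rng = np.random.default_rng(0)
d = 3
a = rng.normal(size=d)
theta_hat = rng.normal(size=d)
M = rng.normal(size=(d, d))
V = M @ M.T + np.eye(d)
beta = 2.0
# The maximizer over the ellipsoid is theta_hat + sqrt(beta) V^{-1} a / ||a||_{V^{-1}}.
direction = np.linalg.solve(V, a)
theta_max = theta_hat + np.sqrt(beta) * direction / np.sqrt(a @ direction)
print(ucb_value(a, theta_hat, V, beta), a @ theta_max)   # equal
```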

Note the similarity to the standard finite-action UCB algorithm: Interpreting $\hat \theta$ as the estimate of $\theta_*$, $\ip{a,\hat \theta}$ can be seen as the estimate of the mean reward of $a$, while $\beta^{1/2} \norm{a}_{V^{-1}}$ is a bonus term. As we shall see later the confidence values could be defined by setting $\hat \theta$ to be the least–squares estimator (LSE) of $\theta_*$ with perhaps an appropriate regularization, while $V$ could be set to $V_t$, the “regularized Grammian” matrix defined using
\begin{align}
V_t=V_0 + \sum_{s=1}^{t} A_s A_s^\top, \qquad t\in [n]\,,
\label{eq:linbanditreggrammian}
\end{align}
where $V_0\succ 0$ is a fixed positive definite matrix, which is often set to $\lambda I$ with some tuning parameter $\lambda>0$.

When $\cD_t = \{e_1,\dots,e_d\}$ for all $t\in [n]$, $\hat \theta_i$ becomes the empirical mean of action $e_i$ (which could in this case be meaningfully called the $i$th action), while with $V_0=0$ the matrix $V_t$ is diagonal with its $i$th diagonal entry being the number of times action $e_i$ is used up to and including round $t$.

Soon we will see why the choice of $(V_t)_t$ is natural. The impatient reader may check that in the special case when $\eta_t$ is an i.i.d. zero-mean Gaussian sequence and the vectors $(A_t)_t$ are deterministically chosen (which is clearly far from what happens in UCB), $V_t$ with $V_0=0$ comes up naturally when defining a confidence set for $\theta_*$. The behavior of UCB is illustrated on the figure below.

Illustration of the behavior of UCB. Three actions, labeled $a_1$, $a_2$ and $a_3$ live in the positive quadrant of $\R^2$. The plane, which also hosts $\theta_*$ is colored based on the identity of the optimal action as a function of where $\theta_*$ is. A confidence ellipsoid with center $\hat \theta$ is also shown as are the locations of the parameter vectors that give rise to the highest value for each of the actions. Action $a_1$ happens to have the highest UCB value.

A Generic Bound on the Regret of UCB

In this section we bound the regret of UCB as a function of the width of the confidence intervals used by UCB without explicitly specifying how the confidence bounds are constructed, though we make a specific assumption about the form of the width. Later we will specialize this result to two different settings. In particular, our assumptions are as follows:

  • Bounded scalar mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$;
  • Bounded actions: for any $a\in \cup_t \cD_t$, $\norm{a}_2 \le L$;
  • Honest confidence intervals: With probability $1-\delta$, for all $t\in [n]$, $a\in \cD_t$, $\ip{a,\theta_*}\in [\UCB_t(a)-2\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}},\UCB_t(a)]$ where $(V_{t})_t$ are given by \eqref{eq:linbanditreggrammian}.

Note that the half-width $\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}}$ used in the assumption is the same as one gets when a confidence set $\cC_t$ is used which satisfies
\begin{align}
\cC_t \subset
\cE_t \doteq \{ \theta \in \R^d \,:\,
\norm{\theta-\hat \theta_{t-1}}_{ V_{t-1}}^2 \le \beta_{t-1} \}\,
\label{eq:ellconf}
\end{align}
with some $\hat \theta_{t-1}\in \R^d$.

Our first main result is as follows:

Theorem (Regret of UCB for Linear Bandits): Let the conditions listed above hold. Further, assume that $(\beta_t)_t$ is nondecreasing and $\beta_{n-1}\ge 1$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ of UCB as defined by \eqref{eq:UCBgenjoint} satisfies
\begin{align*}
\hat R_n
\le \sqrt{ 8 n \beta_{n-1} \, \log \frac{\det V_{n}}{ \det V_0 } }
\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{\trace(V_0)+n L^2}{ d\det^{1/d} V_0 } }\,.
\end{align*}

Note that the condition on $(\beta_t)_t$ is not really restrictive because we can always arrange for it to hold at the price of increasing $\beta_1,\dots,\beta_n$. We also see from the result that to get an $\tilde O(\sqrt{n})$ regret bound one needs to show that $\beta_n$ has a polylogarithmic growth (here, $\tilde O(f(n))$ means $O( \log^p(n) f(n) )$ with some $p>0$). We can also get a bound on the (expected) regret $R_n$ if $\delta\le c/\sqrt{n}$ by combining the bound of the theorem with $\hat R_n \le 2n$, which follows from our assumption that the magnitude of the immediate reward is bounded by one.

Proof: It suffices to prove the bound on the event when the confidence intervals contain the expected rewards of all the actions available in all the respective rounds. Hence, in the remainder of the proof we will assume that this holds. Let $r_t = \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ be the immediate pseudo-regret suffered in round $t\in [n]$.

Let $A_t^* = \argmax_{a\in \cD_t} \ip{a,\theta_*}$ be an optimal action for round $t$. Thanks to the choice of $A_t$, $\ip{A_t^*,\theta_*}\le \UCB_t(A_t^*) \le \UCB_t(A_t)$. By assumption $\ip{A_t,\theta_*}\ge \UCB_t(A_t)-2\beta_{t-1}^{1/2} \norm{A_t}_{V_{t-1}^{-1}}$. Combining these inequalities we get
\begin{align*}
r_t \le 2 \sqrt{\beta_{t-1}} \norm{A_t}_{V_{t-1}^{-1}}\,.
\end{align*}

By the assumption that the mean absolute reward is bounded by one we also get that $r_t\le 2$. This combined with $\beta_{n-1} \ge \max(\beta_{t-1},1)$ gives
\begin{align*}
r_t
& \le 2 \wedge 2 \sqrt{\beta_{t-1}} \norm{A_t}_{V_{t-1}^{-1}}
\le 2 \sqrt{\beta_{n-1}} (1 \wedge \norm{A_t}_{V_{t-1}^{-1}})\,,
\end{align*}
where, as before, $a\wedge b = \min(a,b)$.

Jensen's inequality (or Cauchy-Schwarz) gives $\hat R_n \le \sqrt{n \sum_t r_t^2 }$. So it remains to bound $\sum_t r_t^2$. This is done by applying the following technical lemma:

Lemma (Elliptical Potential): Let $x_1,\dots,x_n\in \R^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$, $t\in [n]$, $v_0 = \trace(V_0)$ and $L\ge \max_t \norm{x_t}_2$. Then,
\begin{align*}
\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \log \frac{\det V_{n}}{\det V_0} \le 2 d \log \frac{v_0+n L^2}{d\det^{1/d} V_0}\,.
\end{align*}

The idea underlying the proof of this lemma is to first use that $u\wedge 1 \le 2 \ln(1+u)$ and then use an algebraic identity to prove that the sum of the logarithmic terms is equal to $\log \frac{\det V_n}{\det V_0}$. Then one can bound $\log \det V_{n}$ by noting that $\det V_{n}$ is the product of the eigenvalues of $V_{n}$, while $\trace V_{n}$ is the sum of the eigenvalues and then using the AM-GM inequality. The full proof is given at the end of the post.

Applying the bound and putting everything together gives the bound stated in the theorem.

QED.

In the next post we will look into how to construct tight ellipsoidal confidence sets. The theorem just proven will then be used to derive a regret bound.

Proof of the Elliptical Potential Lemma

Recall that we want to prove that
\begin{align*}
\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \log \frac{\det V_{n}}{\det V_0} \le 2 d \log \frac{v_0+n L^2}{d\det^{1/d} V_0}\,.
\end{align*}
where $x_1,\dots,x_n\in \R^d$, $V_t = V_0+\sum_{s=1}^t x_s x_s^\top$, $t\in [n]$, $v_0=\trace V_0$, and $\max_t \norm{x_t}_2\le L$ (cf. here).

Using that for any $u\ge 0$, $u \wedge 1 \le 2 \ln(1+u)$, we get
\begin{align*}
\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \sum_t \log(1+\norm{x_t}_{V_{t-1}^{-1}}^2)\,.
\end{align*}
Now, we argue that $\sum_t \log(1+\norm{x_t}_{V_{t-1}^{-1}}^2) = \log \frac{\det V_{n}}{\det V_0}$. For $t \ge 1$ we have $V_{t} = V_{t-1} + x_t x_t^\top = V_{t-1}^{1/2}(I+V_{t-1}^{-1/2} x_t x_t^\top V_{t-1}^{-1/2}) V_{t-1}^{1/2}$. Hence, $\det V_t = \det(V_{t-1}) \det (I+V_{t-1}^{-1/2} x_t x_t^\top V_{t-1}^{-1/2})$. Now, it is easy to check that the only eigenvalues of $I + y y^\top$ are $1+\norm{y}_2^2$ and $1$ (this is because the matrix is symmetric, hence its eigenvectors are orthogonal to each other, and $y$ is an eigenvector with eigenvalue $1+\norm{y}_2^2$). Putting things together we see that $\det V_{n} = \det (V_0) \prod_{t=1}^{n} (1+ \norm{x_t}^2_{V_{t-1}^{-1}})$, which is equivalent to the claimed identity and hence gives the first inequality. To get the second inequality note that by the AM-GM inequality, $\det V_n = \prod_{i=1}^d \lambda_i \le (\frac1d \trace V_n)^d \le ((v_0+n L^2)/d)^d$, where $\lambda_1,\dots,\lambda_d$ are the eigenvalues of $V_n$.
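The lemma is easy to test numerically; the following self-contained check of both inequalities with random vectors is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 200, 2.0
V0 = np.eye(d)
xs = rng.uniform(-1.0, 1.0, size=(n, d))
xs *= L / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1e-12)  # ||x_t||_2 = L

V = V0.copy()
lhs = 0.0
for x in xs:
    lhs += min(1.0, x @ np.linalg.solve(V, x))   # 1 ^ ||x_t||^2_{V_{t-1}^{-1}}
    V += np.outer(x, x)

sign, logdet_Vn = np.linalg.slogdet(V)
_, logdet_V0 = np.linalg.slogdet(V0)
middle = 2.0 * (logdet_Vn - logdet_V0)
right = 2.0 * d * np.log((np.trace(V0) + n * L ** 2)
                         / (d * np.exp(logdet_V0 / d)))
print(lhs <= middle <= right, lhs, middle, right)
```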

References

The literature of linear bandits is rather large. Here we restrict ourselves to the works most relevant to this post.

Stochastic linear bandits were introduced into the machine learning literature by Abe and Long:

(An earlier version of this paper from 1999 can be found here.)

The first paper to consider UCB-style algorithms is by Peter Auer:

This paper considered the case when the number of actions is finite. The core ideas of the analysis of optimistic algorithms (and more) is already present in this paper.

An algorithm that is based on a confidence ellipsoid is described in the paper by Varsha Dani, Thomas Hayes and Sham Kakade:

The regret analysis presented here, just like our discussion of the computational questions (e.g., the use of $\ell^1$-confidence balls) is largely based on this paper. The paper also stresses that an expected regret of $\tilde O( d \sqrt{n} )$ can be achieved regardless of the shape of the decision sets $\cD_t$ as long as the immediate reward belongs to a bounded interval (this also holds for the result presented here). They also give a lower bound for the expected regret which shows that in the worst case the regret must scale with $\Omega(\min(n,d \sqrt{n}))$. In a later post we will give an alternate construction.

A variant of SupLinRel that is based on ridge regression (as opposed to LinRel, which is based on truncating the smallest eigenvalue of the Grammian) is described in

The authors of this paper call the UCB algorithm described in this post LinUCB, while the previous paper calls an essentially identical algorithm OFUL (after optimism in the face of uncertainty for linear bandits).

The paper by Paat Rusmevichientong and John N. Tsitsiklis considers both optimistic and explore-then-commit strategies (which they call PEGE):

PEGE is shown to be optimal up to logarithmic factors for the unit ball.

The observation that explore-then-commit works for the unit ball and in general for sets with a smooth boundary was independently made in

and was also included in the Masters thesis of Yasin Abbasi-Yadkori.

Notes

Note: There is entire line of research for studying bandits with smooth reward functions, contextual or not. See, e.g., here, here, here, here, here or here.

Note: Nonlinear structured bandits where the payoff function belongs to a known set is studied e.g. here, here and here.

Note: It was mentioned that the feature map $\psi$ may map its arguments to an infinite dimensional space. The question then is whether computationally efficient methods exist at all. The answer is yes when $\Psi$ is equipped with an inner product $\ip{\cdot,\cdot}$ such that for any $(c,a),(c’,a’)$ context-action pairs $\ip{\psi(c,a),\psi(c’,a’)} = \kappa( (c,a), (c’,a’) )$ for some kernel function $\kappa: (\cC\times [K]) \times (\cC\times [K]) \to \R$ that can quickly be evaluated for any arguments of interest. Learning algorithms that use such an implicit calculation of inner products by means of evaluating some “kernel function” are said to use the “kernel trick”. Such methods in fact never calculate $\psi$, which is only implicitly defined in terms of the kernel function $\kappa$. The choice of the kernel function is in this way analogous to choosing a map $\psi$.