

Sparse linear bandits

In the last two posts we considered stochastic linear bandits, where the actions are vectors in the $d$-dimensional Euclidean space. According to our previous calculations, under the condition that the expected rewards of all the actions lie in a fixed bounded range (say, $[-1,1]$), the regret of Ellipsoidal-UCB is $\tilde O(d \sqrt{n} )$ ($\tilde O(\cdot)$ hides logarithmic factors). This is definitely an advance when the number of actions $K$ is much larger than $d$, but $d$ itself can be quite large in some applications. In particular, in contextual bandits $d$ would be the dimension of the feature space that the actions are mapped into. The previous result indicates that there is a high price to be paid for having more features. This is not what we want. Ideally, one should be able to “add features” without suffering much additional regret if the features added do not contribute in a significant way. This can be captured by the notion of sparsity, which is the central theme of this post.

Sparse linear stochastic bandits

The sparse linear stochastic bandit problem is the same as the stochastic linear bandit problem with a small difference. Just like in the standard setting, at the beginning of a round with index $t\in \N$ the learner receives a decision set $\cD_t\subset \R^d$. Let $A_t\in \cD_t$ be the action chosen by the learner based on the information available to it. The learner then receives the reward
\begin{align}
X_t = \ip{A_t,\theta_*} + \eta_t\,,
\label{eq:splinmodel}
\end{align}
where $(\eta_t)_t$ is zero-mean noise (subject to the usual subgaussianity assumption) and $\theta_*\in \R^d$ is an unknown vector. The specialty of the sparse setting is that the parameter vector $\theta_*$ is assumed to have many zero entries. Introducing the “$0$-norm”
\begin{align*}
\norm{\theta_*}_0 = \sum_{i=1}^d \one{ \theta_{*,i}\ne 0 }\,,
\end{align*}
the assumption is that
\begin{align*}
p \doteq \norm{\theta_*}_0 \ll d\,.
\end{align*}

Much ink has been spilled on what can be said about the speed of learning in linear models like \eqref{eq:splinmodel} when $\{A_t\}_t$ are passively generated and the parameter vector is known to be sparse. Most results are phrased in terms of recovering $\theta_*$, but there also exist a few results that quantify the speed at which good predictions can be made. The ideal outcome would be that the learning speed depends mostly on $p$, while the dependence on $d$ becomes less severe (e.g., the learning speed is a function of $p \log(d)$). Almost all the results come under the assumption that the vectors $\{A_t\}_t$ are “incoherent”, which means that the Grammian underlying $\{A_t\}_t$ should be well conditioned. The details are a bit more complicated, but the main point is that these conditions are exactly what the action vectors resulting from a good bandit algorithm will not satisfy: Bandit algorithms want to choose the optimal action as often as possible, while the standard theory on fast learning under sparsity needs all directions of the space to be thoroughly explored. We need some approach that does not rely on such strong assumptions.

Elimination on the hypercube

As a warmup problem consider linear bandits where the decision set (of every round) is the $d$-dimensional hypercube: $\cD = [-1,1]^d$. To reduce clutter we will denote the “true” parameter vector by $\theta$. Thus, in every round $t$, $X_t = \ip{A_t,\theta}+\eta_t$. We assume that $\eta_t$ is conditionally $1$-subgaussian given the past:
\begin{align*}
\EE{ \exp(\lambda \eta_t ) | \cF_{t-1} } \le \exp( \frac{\lambda^2}{2} ) \quad \text{for all } \lambda \in \R\,.
\end{align*}
Here $\cF_{t-1} = \sigma( A_1,X_1,\dots,A_{t-1},X_{t-1}, A_t )$. Since conditional subgaussianity comes up frequently, we introduce a notation for it: When $X$ is $\sigma^2$-subgaussian given some $\sigma$-field $\cF$ we will write $X|\cF \sim \subG(\sigma^2)$. Our earlier statement that the sum of independent subgaussian random variables is subgaussian with a subgaussianity factor that is the sum of the individual factors also holds for conditionally subgaussian random variables.

We shall also assume that the mean rewards lie in the range $[-1,1]$. That is, $|\ip{a,\theta}|\le 1$ for all $a\in \cD$. Note that this is equivalent to $\norm{\theta}_1\le 1$.

Stochastic linear bandits on the hypercube is notable as it enjoys perfect “separability”: For each dimension $i\in [d]$, $A_{ti}$ can be selected regardless of the choice of $A_{tj}$ for the other dimensions $j\ne i$. It follows among other things that the optimal action can be computed dimension-wise. In particular, the optimal action for dimension $i$ is simply the sign of $\theta_i$. This is good and bad news. In the worst case, this structure means that each sign has to be learned independently. As we shall see later this implies that in the worst case the regret will be as large as $\Omega(d\sqrt{n})$ (our next post will expand on this). However, the separable structure is also good news: An algorithm can estimate $\theta_i$ for each dimension separately while paying absolutely no price for this experimentation when $\theta_i=0$ (because the choice of $A_{ti}$ does not restrict the choice of $A_{tj}$ with $j\ne i$). As we shall see, because of this we can design an algorithm whose regret scales with $O(\norm{\theta}_0\sqrt{n})$ even without knowing the value of $\norm{\theta}_0$.

The algorithm that achieves this is a randomized variant of Explore then Commit. For reasons that will become clear, we will call it “Selective Explore then Commit” or SETC. The algorithm works as follows: SETC keeps an interval estimate $U_{ti}\subset \R$ for each component $\theta_i$. These will be constructed such that $\theta_i$ is contained in $\cap_{t\in [n]} U_{ti}$ with high probability (for simplicity we assume that a horizon $n$ is given to the algorithm; lifting this assumption is left as an exercise to the reader).

Assume for a moment that we are given intervals $(U_{ti})_{t}$ that contain $\theta_i$ with high probability. Consider the event when the intervals actually do contain $\theta_i$. If at some point $t$ it holds that $0\not \in U_{ti}$, then the sign of $\theta_i$ can be determined by examining on which side of $0$ the interval $U_{ti}$ falls, which allows us to set $A_{ti}$ optimally! In particular, if $U_{ti}\subset (0,\infty)$ then $\theta_i>0$ must hold and we will set $A_{ti}$ to $1$, while if $U_{ti}\subset (-\infty,0)$ then $\theta_i<0$ must hold and we set $A_{ti}$ to $-1$. On the other hand, if at some round $t$, $0\in U_{ti}$ then the sign of $\theta_i$ is uncertain and more information needs to be collected on $\theta_i$, i.e., one needs more exploration. In this case, to maximize the information gained, $A_{ti}$ can be chosen at random to be either $1$ or $-1$ with equal probabilities (a random variable that takes on values of $+1$ and $-1$ with equal probabilities is called a Rademacher random variable). This leads to new information from which $U_{ti}$ can be refined (when no exploration is needed, $U_{ti}$ is not updated: $U_{t+1,i} = U_{ti}$). The algorithm, as far as the choice of $A_{ti}$ is concerned, is summarized on the figure shown below.

SETC

The action choices in the SETC algorithm. In each dimension $i$ an interval is kept that holds $\theta_i$ with high probability. If the interval for a given dimension does not contain $0$, the action for that dimension is chosen as $+1$ or $-1$ depending on which side of zero the interval is. When the interval contains zero, the action is selected at random to be $+1$ or $-1$ with equal probabilities. In this, and only in this case the interval for $\theta_i$ is updated. For a detailed explanation, see the text.

To derive the form of $U_{t+1,i}$ consider $A_{ti} X_t$, the correlation of $A_{ti}$ and the reward observed. Expanding on the definition of $X_t$, using that $A_{ti}^2=1$, we see that
\begin{align*}
A_{ti} X_t = \theta_i + \underbrace{A_{ti} \sum_{j\ne i} A_{tj} \theta_j + A_{ti} \eta_t}_{Z_{ti}}\,.
\end{align*}
This suggests setting
\begin{align*}
\hat \theta_{ti} = \frac1t \sum_{s=1}^t A_{si} X_s\,,\quad
c_{ti} = 2\sqrt{ \frac{ \log(2n^2)}{t} }\, \text { and }\,\,
U_{t+1,i} = [\hat \theta_{ti} - c_{ti}, \hat \theta_{ti} + c_{ti} ]\,.
\end{align*}
(As noted above, the update is only carried out in rounds where dimension $i$ is explored; the general form of the estimate, indexed by the number of exploration rounds, is spelled out in the proof at the end of the section.)
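To make the algorithm concrete, the following minimal Python simulation sketch may help (the function name, the standard normal noise and the known horizon are our own assumptions; as described above, the interval of a dimension is only updated in rounds where that dimension is explored):

```python
import numpy as np

def setc(theta, n, rng=None):
    """Simulation sketch of Selective Explore then Commit (SETC) on [-1,1]^d.

    theta : true parameter vector (unknown to the learner), ||theta||_1 <= 1
    n     : known horizon
    Returns the realized pseudo-regret, sum_t (||theta||_1 - <A_t, theta>).
    """
    rng = rng or np.random.default_rng()
    d = len(theta)
    sums = np.zeros(d)       # running sums of A_{si} X_s over exploration rounds
    counts = np.zeros(d)     # number of exploration rounds so far, per dimension
    committed = np.zeros(d)  # 0 while exploring, otherwise the committed sign
    regret = 0.0
    for _ in range(n):
        explore = committed == 0
        a = np.where(explore, rng.choice([-1.0, 1.0], size=d), committed)
        x = a @ theta + rng.normal()  # reward with standard normal noise
        counts[explore] += 1
        sums[explore] += a[explore] * x
        est = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
        width = 2.0 * np.sqrt(np.log(2 * n**2) / np.maximum(counts, 1))
        # commit once 0 falls outside U_{t+1,i} = [est_i - width_i, est_i + width_i]
        decide = explore & (counts > 0) & (np.abs(est) > width)
        committed[decide] = np.sign(est[decide])
        regret += np.abs(theta).sum() - a @ theta
    return regret
```

Note that for a coordinate with $\theta_i=0$ the interval never separates from zero, so such coordinates keep being explored until the end; as discussed above, this exploration costs nothing.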

To explain the choice of $c_{ti}$ we need a version of the subgaussian concentration inequality, which was stated earlier for independent sequences of subgaussian random variables. In our case $(Z_{ti})_t$, as we shall soon see, are conditionally subgaussian. Using Chernoff’s method it is not hard to prove the following result:

Lemma (Hoeffding-Azuma): Let $\cF\doteq (\cF_t)_{0\le t\le n}$ be a filtration, $(Z_t)_{t\in [n]}$ be an $\cF$-adapted martingale difference sequence such that $Z_t$ is conditionally $\sigma^2$-subgaussian given $\cF_{t-1}$: $Z_t|\cF_{t-1} \sim \subG(\sigma^2)$. Then, $\sum_{t=1}^n Z_t$ is $n\sigma^2$ subgaussian and for any $\eps>0$,
\begin{align*}
\Prob{ \sum_{t=1}^n Z_t \ge \eps } \le \exp( - \frac{ \eps^2}{2n\sigma^2} )\,.
\end{align*}

Proof: Let $S_t = \sum_{s=1}^t Z_s$ with $S_0 = 0$. It suffices to show that $S_n$ is $n\sigma^2$-subgaussian. To show this, note that for $t\ge 1$, $\lambda\in \R$,
\begin{align*}
\EE{ \exp(\lambda S_t) } = \EE{\exp(\lambda S_{t-1}) \EE{ \exp(\lambda Z_t) | \cF_{t-1}} } \le \exp(\frac{\lambda^2\sigma^2}{2}) \EE{ \exp(\lambda S_{t-1}) }\,.
\end{align*}
Since $\exp(\lambda S_0)=1$, chaining the inequalities obtained we get the desired bound.
QED.

Sometimes, it is more convenient to use the “deviation” form of the above inequality which states that for any $\delta\in (0,1)$, with probability $1-\delta$,
\begin{align*}
\sum_{t=1}^n Z_t \le \sqrt{ 2 n \sigma^2 \log(1/\delta) }\,.
\end{align*}
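As a quick sanity check of this deviation form, one can simulate the simplest special case, i.i.d. standard normal noise (an illustration of ours, not part of the argument):

```python
import numpy as np

# i.i.d. N(0, sigma^2) variables are a special case of a conditionally
# sigma^2-subgaussian martingale difference sequence.
rng = np.random.default_rng(0)
n, sigma, delta, trials = 100, 1.0, 0.05, 100_000
sums = rng.normal(0.0, sigma, size=(trials, n)).sum(axis=1)
bound = np.sqrt(2 * n * sigma**2 * np.log(1 / delta))
print((sums > bound).mean(), "should be at most", delta)
```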

The Hoeffding-Azuma inequality together with some union bounds leads to the following lemma:

Lemma (Reliable Intervals): Let $F_i = \{ \exists t\in [n] \text{ s.t. } \theta_i \not \in U_{ti} \}$ be the “failure” event that some of the intervals $U_{ti}$ fail to include $\theta_i$. Then,
\begin{align*}
\Prob{F_i} \le \frac{1}{n}\,.
\end{align*}

The proof will be provided at the end of the section.

Let $R_{ni} = n|\theta_i|-\EE{\sum_{t=1}^n A_{ti}\theta_i }$ and let $R_n = \max_{a\in \cD} n \ip{a,\theta} - \EE{ \sum_{t=1}^n \ip{A_t,\theta}}$ be the regret of SETC. Then, $R_n = \sum_{i:\theta_i\ne 0} R_{ni}$. Hence, it suffices to bound $R_{ni}$ for $i$ such that $\theta_i\ne 0$. Let $E_{si}\in\{0,1\}$ be the indicator of whether dimension $i$ is explored in round $s$ and let $\tau_i = \sum_{s=1}^n E_{si}$ be the number of exploration steps for dimension $i$. When the confidence intervals for dimension $i$ do not fail, i.e., on $F_i^c$, regret is suffered in dimension $i$ only in the exploration rounds: in all other rounds $A_{ti} = \sgn(\theta_i)$. On the same event, we claim that $\tau_i\le 1 + 16 \log(2n^2)/|\theta_i|^2 \doteq 1 + t_0$ also holds. To see this it is enough to show that $\tau_i \le t$ for any $t>t_0 = 16 \log(2n^2) / |\theta_i|^2$. Pick such a $t$ and observe that $t>t_0$ is equivalent to $4 \sqrt{ \log(2n^2)/t} < | \theta_i|$. By definition, $c_{ti} = 2 \sqrt{ \log(2n^2)/(\tau_i\wedge t) }$. If $\tau_i\le t$ there is nothing to be proven. Hence, assume that $\tau_i>t$. Then, $\tau_i \wedge t = t$ and thus we get that $2c_{ti}<|\theta_i|$. Then, $|\hat \theta_{ti} | - c_{ti} \ge |\theta_i| - 2c_{ti}>0$, where the first inequality follows from $|\theta_i-\hat\theta_{ti}|\le c_{ti}$. Since $|\hat \theta_{ti}|-c_{ti}>0$, $0\not\in U_{ti}$ and hence $E_{t+1,i}=\dots = E_{ni}=0$, which implies that $\tau_i\le t$, contradicting $\tau_i>t$ and finishing the proof. It remains to account for the exploration rounds: each contributes $(1-\sgn(\theta_i)A_{ti})|\theta_i|$, which is at most $2|\theta_i|$, but since the decision to explore is made before the Rademacher variable $A_{ti}$ is drawn, each exploration round contributes only $|\theta_i|$ in expectation. As on $F_i^c$ there are at most $1+t_0$ such rounds, putting things together we get
\begin{align*}
R_{ni}
& = \EE{ \one{F_i} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }
+ \EE{ \one{F_i^c} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }\\
& \le 2n |\theta_i| \Prob{F_i} + |\theta_i| \left(1+ \frac{16 \log(2n^2) }{|\theta_i|^2}\right) \\
& \le 3|\theta_i| + \frac{16 \log(2n^2) }{|\theta_i|}\,,
\end{align*}
yielding the following theorem:

Theorem (Regret of SETC): Let $\norm{\theta}_1\le 1$. Then, the regret $R_n$ of SETC satisfies
\begin{align*}
R_n \le 3 \norm{\theta}_1 + 16 \sum_{i:\theta_i\ne 0} \frac{\log(2n^2) }{|\theta_i|}\,.
\end{align*}

Since $R_{ni} \le 2 n |\theta_i|$ also holds, we immediately get a bound on the worst-case regret of SETC for $p$-sparse vectors:

Corollary (Worst-case regret of SETC): Let $n\ge 2$. The minimax regret $R_n^*$ of SETC over $\norm{\theta}_1\le 1$, $\norm{\theta}_0\le p$ satisfies
\begin{align*}
R_n^* \le 3p + 8 p \sqrt{ n \log(2n^2)} \,.
\end{align*}

Thus we see that SETC successfully trades the dimension $d$ for the sparsity index $p$ without ever needing the knowledge of $p$. This should be encouraging! Can this result be generalized beyond the hypercube? This is the question we investigate next.

However, we first finish the proof of the theorem by providing the proof of the lemma that claimed that the event $F_i$ has a small probability.

Proof of $\Prob{F_i}\le 1/n$.

Proof: Setting $\cF_t = \sigma(A_1,X_1,\dots,A_t,X_t)$, we see that $(Z_{ti})_t$ is $(\cF_t)_t$-adapted (note that $\cF_{t-1}$ now excludes $A_t$!). Letting $S_{ti} = \sum_{j\ne i} A_{tj} \theta_j$, we can write $Z_{ti} = A_{ti} S_{ti} + A_{ti}\eta_t$. We check that $Z_{ti}|\cF_{t-1}\sim \subG(2)$. With repeated conditioning we calculate
\begin{align*}
\EE{ \exp(\lambda Z_{ti} )|\cF_{t-1} }
&= \EE{ \EE{ \exp(\lambda Z_{ti} )|\cF_{t-1},A_t} |\cF_{t-1} } \\
&= \EE{ \exp(\lambda A_{ti} S_{ti} ) \EE{ \exp(\lambda A_{ti} \eta_t )|\cF_{t-1},A_t} |\cF_{t-1} } \\
&\le \EE{ \exp(\lambda A_{ti}S_{ti} ) \exp(\frac{\lambda^2}{2}) |\cF_{t-1} } \\
&= \exp(\frac{\lambda^2}{2})
\EE{ \EE{\exp(\lambda A_{ti} S_{ti} )|\cF_{t-1},S_{ti}} |\cF_{t-1} } \\
&\le \exp(\frac{\lambda^2}{2})
\EE{ \exp( \frac{\lambda^2 S_{ti}^2}{2} ) |\cF_{t-1} } \\
&\le \exp(\lambda^2)\,,
\end{align*}
where the first inequality used that $\eta_t$ and $A_t$ are conditionally independent given $\cF_{t-1}$ and that $\eta_t|\cF_{t-1}\sim \subG(1)$, the second to last inequality used that $S_{ti}$ and $A_{ti}$ are conditionally independent given $\cF_{t-1}$ and that $A_{ti}|\cF_{t-1}\sim \subG(1)$, and the last step used that $|S_{ti}|\le 1$ (which follows from $\norm{\theta}_1\le 1$). From this, we conclude that $Z_{ti}|\cF_{t-1} \sim \subG(2)$. Recall that $E_{si}\in \{0,1\}$ is the indicator of whether dimension $i$ is explored in round $s$. Note that for any $t\in [n]$,
\begin{align*}
\hat \theta_{ti} = \frac{\sum_{s=1}^t E_{si} A_{si} X_s}{\sum_{s=1}^t E_{si}}\,,
\qquad
c_{ti} = 2\sqrt{ \frac{ \log(2n^2) }{ \sum_{s=1}^t E_{si} } }\,.
\end{align*}
Then, since the pair $(\hat\theta_{ti}, c_{ti})$ defining the interval changes only in rounds where dimension $i$ is explored, the sequence $(U_{ti})_t$ takes at most $n$ distinct values, one for each value $k\in[n]$ of the exploration counter $\sum_{s=1}^t E_{si}$. Letting $Z'_{ji}$ denote the value of $Z_{si}$ in the round where dimension $i$ is explored for the $j$th time,
\begin{align*}
\Prob{ \exists t\in [n] \,:\, \theta_i \not\in U_{ti} }
& = \Prob{ \exists t\in [n] \,:\, |\hat\theta_{ti} - \theta_i| > c_{ti} } \\
& \le \sum_{k=1}^n \Prob{ \left| \sum_{j=1}^k Z'_{ji} \right| > 2\sqrt{ k \log(2n^2) } } \\
& \le \sum_{k=1}^n 2\exp\left( - \frac{4 k \log(2n^2)} { 2 (2k) } \right)
= \sum_{k=1}^n \frac{1}{n^2} = \frac1n\,,
\end{align*}
where the last inequality follows from the Hoeffding-Azuma inequality: $(Z'_{ji})_j$ is a martingale difference sequence whose elements are conditionally $\subG(2)$, hence $\sum_{j=1}^k Z'_{ji}$ is $\subG(2k)$.
QED.

UCB with sparsity

Let us now tackle the question of how to exploit sparsity when there is no restriction on the action set $\cD_t$. The plan is to use UCB where the ellipsoidal confidence set used earlier is replaced by another confidence set, which is also ellipsoidal but has a smaller radius when the parameter vector is sparse.

Our starting point is the generic regret bound for UCB, which we replicate here for the reader’s convenience.

Let $A_t$ be the action chosen by UCB in round $t$: $A_t = \argmax_{a\in \cD_t} \UCB_t(a)$ where $\UCB_t(a)$ is the UCB index of action $a$ and $\cD_t\subset \R^d$ is the set of actions available in round $t$. Define
\begin{align}
V_t=V_0 + \sum_{s=1}^{t} A_s A_s^\top, \qquad t\in [n]\,,
\label{eq:sparselinbanditreggrammian}
\end{align}
where $V_0\succ 0$ is a fixed positive definite matrix, which is often set to $\lambda I$ with some tuning parameter $\lambda>0$. Let the following conditions hold:

  • Bounded scalar mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$;
  • Bounded actions: for any $a\in \cup_t \cD_t$, $\norm{a}_2 \le L$;
  • Honest confidence intervals: With probability $1-\delta$, for all $t\in [n]$, $a\in \cD_t$, $\ip{a,\theta_*}\in [\UCB_t(a)-2\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}},\UCB_t(a)]$ where $(V_{t})_t$ are given by \eqref{eq:sparselinbanditreggrammian}.

Our earlier generic result for UCB for linear bandits was as follows:

Theorem (Regret of UCB for Linear Bandits): Let the conditions listed above hold. Further, assume that $(\beta_t)_t$ is nondecreasing and $\beta_n\ge 1$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ of UCB satisfies
\begin{align*}
\hat R_n
\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{\trace(V_0)+n L^2}{ d \det (V_0)^{1/d} } }\,.
\end{align*}

As noted earlier the half-width $\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}}$ used in the assumption is the same as the one that we get when a confidence set $\cC_t$ is used which satisfies
\begin{align}
\cC_t \subset
\cE_t \doteq \{ \theta \in \R^d \,:\,
\norm{\theta-\hat \theta_{t-1}}_{ V_{t-1}}^2 \le \beta_{t-1} \}\,
\label{eq:ellconf}
\end{align}
with some $\hat \theta_{t-1}\in \R^d$. Our approach will be to identify some $\hat \theta_{t-1}$ and $\beta_{t-1}$ such that $\beta_{t-1}$ will scale with the sparsity of $\theta_*$.

In particular, we will prove the following result:

Theorem (Sparse Bandit UCB): There exists a strategy to compute the $\UCB_t(\cdot)$ indices such that the expected regret $R_n$ of the resulting policy satisfies $R_n = \tilde{O}(\sqrt{dpn})$.

Before presenting how this is done, note that the best regret that we can hope to get is $\tilde{O}(\sqrt{ d n })$ with some additional terms depending on $p = \norm{\theta_*}_0$. This may be a bit disappointing. However, it is easy to see that the appearance of $d$ is the price one must pay for the increased generality of the result. In particular, we know that the regret for the standard $d$-action stochastic bandit problem is $\Omega(\sqrt{dn})$. Recall that $d$-action bandits can be represented as linear bandits with $\cD_t = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ is the standard Euclidean basis. Furthermore, checking the proof of the lower bound reveals that it uses parameter vectors that are $2$-sparse when represented this way. Thus the appearance of $\sqrt{d}$, however unfortunate, is unavoidable in the general case. Later we will return to the question of lower bounds for sparse linear bandits.

Online to confidence set conversion

The idea of the construction that we present here is to build a confidence set around a prediction method that enjoys strong learning guarantees in the sparse case. The construction is actually generic and can be applied beyond sparse problems.

The prediction problem considered is online linear prediction where the prediction error is measured in the squared loss. This is also known as the online linear regression problem. In this problem setting a learner interacts with a strategic environment in a sequential manner in discrete rounds. In round $t\in \N$ the learner is first presented with a vector $A_t\in \R^d$ that is chosen by the environment ($A_t$ is allowed to depend on past choices of the learner) and the learner needs to produce a real-valued prediction $\hat X_t$, which is then compared to the target $X_t\in \R$, also chosen by the environment. The learner’s goal is to produce predictions whose total loss is not much worse than the loss suffered by any of the linear predictors in some set. The loss in a given round is the squared difference. The regret of the learner against a linear predictor that uses the “weights” $\theta\in \R^d$ is
\begin{align}
\rho_n(\theta) \doteq \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \ip{A_t,\theta})^2
\label{eq:olrregretdef}
\end{align}
and we say that the learner enjoys a regret guarantee $B_n$ against some $\Theta \subset \R^d$ if no matter the environment’s strategy,
\begin{align*}
\sup_{\theta\in \Theta} \rho_n(\theta)\le B_n\,.
\end{align*}

The online learning literature in machine learning has a number of powerful algorithms for this learning problem with equally powerful regret guarantees. Later we will give a specific result for the sparse case.
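To make the protocol concrete, here is a sketch of one such algorithm for the non-sparse case, the Vovk-Azoury-Warmuth forecaster, which is known to enjoy $O(d\log n)$ regret when the targets are bounded (the class name and the predict/update interface are our own choices):

```python
import numpy as np

class VAWForecaster:
    """Vovk-Azoury-Warmuth forecaster for online linear regression (sketch).

    Shown only to make the protocol concrete; the sparse strategy quoted
    later in the post is a different (exponential weights based) method.
    """

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)  # regularized Grammian (includes current A_t)
        self.b = np.zeros(d)      # sum of X_s A_s over past rounds
        self._a = None

    def predict(self, a):
        # VAW peculiarity: the current input is added to the Grammian
        # before the prediction is made
        self.V += np.outer(a, a)
        self._a = a
        return a @ np.linalg.solve(self.V, self.b)

    def update(self, x):
        self.b += x * self._a
```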

Now take any learner for online linear regression and assume that the environment generates $X_t$ in a stochastic manner like in linear bandits:
\begin{align}
X_t = \ip{A_t,\theta_*} + \eta_t\,,
\label{eq:sparselbmodel234}
\end{align}
where $\eta_t|\cF_{t-1} \sim \subG(1)$ and $\cF_{t} = \sigma(A_1,X_1,\dots,A_t,X_t,A_{t+1})$. Observing that a confidence set for $\theta_*$ is nothing but a constraint on $\theta_*$ in terms of known quantities, combining \eqref{eq:olrregretdef} and \eqref{eq:sparselbmodel234} by elementary algebra we derive
\begin{align}
\sum_t (\hat X_t - \ip{A_t,\theta_*})^2 = \rho_n(\theta_*) + 2 \sum_t \eta_t (\hat X_t - \ip{A_t,\theta_*})\,.
\label{eq:spb_regret_identity}
\end{align}
Let $Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \ip{A_s,\theta_*})$ with $Z_0=0$. Since $\hat X_t$ is chosen based on information available at the beginning of round $t$, $\hat X_t$ is $\cF_{t-1}$-measurable. Hence, $(Z_t - Z_{t-1})| \cF_{t-1} \sim \subG( \sigma_t^2 )$ where $\sigma_t^2 = (\hat X_t - \ip{A_t,\theta_*})^2$. Now define $Q_t = \sum_{s=1}^t \sigma_s^2$. The uniform self-normalized tail bound at the end of our previous post implies that for any $\delta\in(0,1)$ and any $\lambda>0$,
\begin{align*}
\Prob{ \exists t\ge 0 \text{ such that }
|Z_t| \ge \sqrt{ (\lambda+Q_t) \log \frac{(\lambda+Q_t)}{\delta^2\lambda} }
} \le \delta\,.
\end{align*}
Choose $\lambda=1$. On the event $\cE$ when $|Z_t|$ is bounded by $\sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }$, from \eqref{eq:spb_regret_identity} and $\rho_t(\theta_*)\le B_t$ we get
\begin{align*}
Q_t \le B_t + 2 \sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }\,.
\end{align*}
While both sides depend on $Q_t$, the left-hand side grows linearly, while the right-hand side grows sublinearly in $Q_t$. This means that the largest value of $Q_t$ that satisfies the above inequality is finite. A tedious calculation then shows this value must be less than
\begin{align}
\beta_t(\delta) \doteq 1 + 2 B_t + 32 \log\left( \frac{\sqrt{8}+\sqrt{1+B_t}}{\delta} \right)\,.
\label{eq:betadef}
\end{align}
As a result on $\cE$, $Q_t \le \beta_t(\delta)$, implying the following result:

Theorem (Sparse confidence set): Assume that for some strategy for online linear regression with $\theta\in \Theta$, $\rho_t(\theta)\le B_t$. Let $(A_t,X_t,\hat X_t)_t$, $(\cF_t)_t$ be such that $A_t,\hat X_t$ are $\cF_{t-1}$ measurable, $X_t$ is $\cF_{t}$-measurable and for some $\theta_*\in \Theta$, $X_t = \ip{A_t,\theta_*}+\eta_t$ where $\eta_t|\cF_{t-1}\sim \subG(1)$. Fix $\delta\in (0,1)$ and define
\begin{align*}
C_{t+1} = \{ \theta\in \R^d \,:\, \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2 \le \beta_t(\delta) \}\,.
\end{align*}
Then, $\Prob{ \exists t\in \N \text{ s.t. } \theta_* \not\in C_{t+1} }\le \delta$.

To recap, our online linear regression based UCB (OLR-UCB) in round $t$ works as follows. For specificity, let us call $\pi$ the online linear regression method.

  1. Receive the action set $\cD_t\subset \R^d$.
  2. Choose $A_t = \argmax_{a\in \cD_t} \max_{\theta\in C_t} \ip{a,\theta}$.
  3. Feed $A_t$ to $\pi$ and get its prediction $\hat X_t$.
  4. Construct $C_{t+1}$ with the help of $A_t$ and $\hat X_t$.
  5. Receive the reward $X_t = \ip{A_t,\theta_*} + \eta_t$.
  6. Feed $X_t$ to $\pi$ as feedback.
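In code, a round of OLR-UCB might look as follows (a sketch with our own naming; the confidence set $C_t$ is passed in through its ellipsoidal outer approximation derived below, and step 4, the construction of $C_{t+1}$, happens outside this function):

```python
import numpy as np

def olr_ucb_round(pi, center, gram, radius, D_t, theta_star, rng):
    """One round of OLR-UCB against the action set D_t (a (k, d) array).

    pi is any online linear regression strategy exposing predict/update;
    (center, gram, radius) describe an ellipsoid
    { theta : ||theta - center||_gram^2 <= radius } containing C_t.
    """
    gram_inv = np.linalg.inv(gram)
    # steps 1-2: optimistic action, using that the maximum of <a, theta>
    # over the ellipsoid is <a, center> + sqrt(radius) * ||a||_{V^{-1}}
    ucb = D_t @ center + np.sqrt(radius * np.einsum('ij,jk,ik->i', D_t, gram_inv, D_t))
    A_t = D_t[int(np.argmax(ucb))]
    # step 3: the regression strategy predicts the reward of the chosen action
    x_hat = pi.predict(A_t)
    # step 5: observe the noisy reward
    X_t = A_t @ theta_star + rng.normal()
    # step 6: feed the observation back to the strategy
    pi.update(X_t)
    return A_t, x_hat, X_t
```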

The set $C_{t+1}$ is an ellipsoid underlying the Grammian $V_t = I + \sum_{s=1}^t A_s A_s^\top$ and center
\begin{align*}
\hat \theta_t = \argmin_{\theta\in \R^d} \norm{\theta}_2^2 + \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2\,.
\end{align*}
On the event when $\theta_*\in C_{t+1}$, $C_{t+1}\not=\emptyset$ and hence $\hat\theta_t\in C_{t+1}$. Thus, we can write
\begin{align*}
C_{t+1} =
\{ \theta\in \R^d \,:\,
\norm{\theta-\hat\theta_t}_{V_t}^2 + \norm{\hat\theta_t}_2^2
+ \sum_{s=1}^t (\hat X_s - \ip{A_s,\hat \theta_t})^2 \le \beta_t(\delta) \}\,.
\end{align*}
Thus,
\begin{align*}
C_{t+1} \subset
\{ \theta\in \R^d \,:\, \norm{\theta-\hat\theta_t}_{V_t}^2 \le \beta_t(\delta) \}\,.
\end{align*}
Hence, our general theorem applies and gives that the UCB algorithm which uses $C_{t+1}$ as its confidence set enjoys the following regret bound:

Theorem (UCB for sparse linear bandits): Fix some $\Theta\subset \R^d$ and assume that the strategy $\pi$ enjoys the regret bounds $(B_t)_t$ against $\Theta$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ of OLR-UCB satisfies
\begin{align*}
\hat R_n
\le \sqrt{ 8 d n \beta_{n-1}(\delta) \, \log\left( 1+ \tfrac{n L^2}{ d }\right) }\,,
\end{align*}
where $(\beta_{t})_t$ is given by \eqref{eq:betadef}.

Note that $\beta_n = \tilde{O}(B_n)$ and hence $\hat R_n = \tilde{O}( \sqrt{ d B_n n } )$.

To finish, let us quote a result for online sparse linear regression:

Theorem (Online sparse linear regression): There exists a strategy $\pi$ for the learner such that for any $\theta\in \R^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment such that $\max_{t\in [n]}\norm{A_t}_2\le L$ and $\max_{t\in [n]}|X_t|\le X$ satisfies
\begin{align*}
\rho_n(\theta) \le c X^2 \norm{\theta}_0
\left\{\log(e+n^{1/2}L) + C_n \log(1+\tfrac{\norm{\theta}_1}{\norm{\theta}_0})\right\}
+ (1+X^2)C_n\,,
\end{align*}
where $c>0$ is some universal constant and $C_n = 2+ \log_2 \log(e+n^{1/2}L)$.

The strategy is a variant of the exponential weights method, adjusted so that the set of experts is $\R^d$. An appropriate sparsity prior is used and, when predicting, an appropriate truncation strategy is applied. The details of the procedure are less important at this stage for our purposes and are thus left out. A reference to the work containing the missing details is given at the end of the post.

Note that $C_n = O(\log \log(n))$. Hence, dropping the dependence on $X$ and $L$, for $p>0$, $\sup_{\theta: \norm{\theta}_0\le p, \norm{\theta}_2\le L} \rho_n(\theta) = O(p \log(n))$. Note how strong this is: The guarantee holds no matter what strategy the environment uses!

Now, in sparse linear bandits with subgaussian noise, the noise $(\eta_t)_t$ is not necessarily bounded, and as a consequence the rewards $(X_t)_t$ are also not necessarily bounded. However, the subgaussian property implies that with probability $1-\delta$, $|\eta_t| \le \sqrt{2\log(2/\delta)}$. Now, choosing $\delta = 1/n^2$ and taking a union bound over the rounds, we thus see that for problems with bounded mean reward, $\max_{t\in [n]} |X_t| \le X \doteq 1+\sqrt{2\log(2n^2)}$ with probability at least $1-1/n$. Putting things together then yields the announced result:
The expected regret of OLR-UCB when using the strategy $\pi$ from above satisfies
\begin{align*}
R_n = \tilde{O}( \sqrt{ d p n } )\,.
\end{align*}

Two important notes are in order: Firstly, while here we focused on the sparse case the results and techniques apply to other settings. For example, we can also get alternative confidence sets from results in online learning even for the standard non-sparse case. Or one may consider additional or different structural assumptions on $\theta$ (e.g., $\theta$, when reshaped into a matrix, could have a low spectral norm). Secondly, when the online linear regression results are applied it is important to use the tightest possible, data-dependent regret bounds $B_n$. In online learning most regret bounds start as tight, data-dependent bounds, which are then loosened to get further insight into the structure of problems. For our application, naturally one should use the tightest available regret bounds (or one should attempt to modify the existing proofs to get tighter data-dependent bounds). The gains from using data-dependent bounds can be very significant.

Finally, we need to emphasize that the sparsity parameter $p$ must be known in advance and that no algorithm can enjoy a regret of $O(\sqrt{d p n})$ for all $p$ simultaneously. This will be seen shortly in a post focusing exclusively on lower bounds for stochastic linear bandits.

Summary

In this post we considered sparse linear bandits in the stochastic setting. When the action set is the hypercube we have shown that a simple variant of the explore-then-commit strategy, which selectively stops exploration in each dimension independently of the others, is able to adapt to the unknown sparsity of the parameter vector defining the linear bandit environment.

In the second part of the post we considered the more general setting when the action sets can change in time and they may lack the nice separable structure of the hypercube which made the first result possible. In this case we first argued that the dependence on the full dimensionality of the parameter vector is unavoidable. Then we constructed a method, OLR-UCB, which is a variant of UCB and which uses as a subroutine an online linear regression procedure. We showed that in the case of sparse linear bandits this gives a regret bound of $\tilde{O}(\sqrt{d p n })$. In our next post we will consider lower bounds for linear bandits and we will actually show that this bound is unimprovable in general.

References

The SETC algorithm is from a paper of the authors:

Tor Lattimore, Koby Crammer, and Csaba Szepesvári: Linear multi-resource allocation with semi-bandit feedback, NIPS 2015.

Look at Appendix E if you want to see how things are done in this paper.

The construction for the sparse case is from another paper co-authored by one of the authors:

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári: Online-to-confidence-set conversions and application to sparse stochastic bandits, AISTATS 2012.

This paper has the few details that we have skipped in this blog.

Independently of and simultaneously with this paper, Alexandra Carpentier and Rémi Munos have also published a paper on sparse linear stochastic bandits:

Alexandra Carpentier and Rémi Munos: Bandit theory meets compressed sensing for high dimensional stochastic linear bandit, AISTATS 2012.

Their setting is considerably different from the one studied here. In particular, they rely on the action set being nicely rounded and containing zero, while, perhaps most importantly, the noise in their model affects the parameter vector: $X_t = \ip{A_t, \theta_*+\eta_t}$. Just like in the case of the hypercube, this makes it possible to avoid the poor dependence on the dimension $d$: Their regret bounds take the form $R_n = O( p\sqrt{\log(d)n})$. We hope to return to discuss the differences and similarities between these two noise models in some later post.

The theorem on online sparse linear regression that we cited is due to Sébastien Gerchinovitz. The reference is

Sébastien Gerchinovitz: Sparsity regret bounds for individual sequences in online linear regression, Journal of Machine Learning Research, 2013.

The theorem cited here is Theorem 10 in the above paper.

A very recent paper by Alexander Rakhlin and Karthik Sridharan also discusses the relationship between online learning regret bounds and self-normalized tail bounds of the type given here:

Alexander Rakhlin and Karthik Sridharan: On equivalence of martingale tail bounds and deterministic regret inequalities, 2015.

Interestingly, what they show is that the relationship goes in both directions: Tail inequalities imply regret bounds and regret bounds imply tail inequalities.

I am told by Francesco Orabona that techniques similar to the ones used here for constructing confidence bounds have been used earlier in a series of papers by Claudio Gentile and friends. For completeness, here is the list for further exploration:

Ellipsoidal Confidence Sets for Least-Squares Estimators

Continuing the previous post, here we give a construction for confidence bounds based on ellipsoidal confidence sets. We also put things together and show a bound on the regret of the UCB strategy that uses the constructed confidence bounds.

Constructing the confidence bounds


To construct the confidence bounds we will construct appropriate confidence sets $\cC_t$, which will be based on least-squares, more precisely on penalized least-squares estimates. In a later post we will show a different construction that improves the regret when the parameter vector is sparse. But first things first, let’s see how to construct those confidence bounds in the absence of additional knowledge.

Assume that we are at the end of round $t$, when a bandit algorithm has chosen $A_1,\dots,A_t\in \R^d$ and received the respective payoffs $X_1,\dots,X_t$. The penalized least-squares estimate of $\theta_*$, also known as the ridge-regression estimate, is defined as the minimizer of the penalized squared empirical loss,
\begin{align*}
L_{t}(\theta) = \sum_{s=1}^{t} (X_s - \ip{A_s,\theta})^2 + \lambda \norm{\theta}_2^2\,,
\end{align*}
where $\lambda\ge 0$ is the “penalty factor”. Choosing $\lambda>0$ helps because it ensures that the loss function has a unique minimizer even when $A_1,\dots,A_t$ do not span $\R^d$, which simplifies the math. By solving for $L_t'(\theta)=0$, the optimizer $\hat \theta_t \doteq \argmin_{\theta\in \R^d} L_t(\theta)$ of $L_t$ can be easily seen to satisfy
\begin{align*}
\hat \theta_t = V_t(\lambda)^{-1} \sum_{s=1}^t X_s A_s\,,
\end{align*}
where
\begin{align*}
V_t(\lambda) = \lambda I + \sum_{s=1}^t A_s A_s^\top\,.
\end{align*}
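In code, computing the estimate and the Grammian takes only a few lines (a sketch; the function name and interface are ours):

```python
import numpy as np

def ridge_estimate(actions, rewards, lam=1.0):
    """Penalized least-squares (ridge) estimate and the Grammian V_t(lambda).

    actions : (t, d) array whose rows are A_1, ..., A_t
    rewards : (t,) array holding X_1, ..., X_t
    """
    t, d = actions.shape
    V = lam * np.eye(d) + actions.T @ actions
    theta_hat = np.linalg.solve(V, actions.T @ rewards)
    return theta_hat, V
```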
The matrix $V_t(0)$ is known as the Grammian underlying $\{A_s\}_{s\le t}$ and we will also refer to $V_t(\lambda)$ as the Grammian. When the sequence $\{A_s\}_s$ is fixed and the noise is independent Gaussian, a confidence set is very easy to get just by looking at the definition of the least-squares estimate. To build some intuition, this is exactly what we will do first.

Building intuition: Fixed design regression and independent Gaussian noise

To get a sense of what a confidence set $\cC_{t+1}$ should look like we start with a simplified setting, where we make the following extra assumptions.

  • Gaussian noise: $(\eta_s)_s$ is an i.i.d. sequence and in particular $\eta_s\sim \mathcal N(0,1)$;
  • Nonsingular Grammian: $V \doteq V_t(0)$ is invertible;
  • “Fixed design”: $A_1,\dots,A_t$ are deterministically chosen without the knowledge of $X_1,\dots,X_t$.

The distributional assumption on $(\eta_s)_s$ and the second assumption are for convenience. In particular, the second assumption lets us set $\lambda=0$, which we will indeed do. The independence of $(\eta_s)_s$ and the third assumption, on the other hand, are anything but innocent. In their absence we will be forced to use specific techniques.

A bit about notation: To emphasize that $A_1,\dots,A_t$ are chosen deterministically, we will use $a_s$ in place of $A_s$ (recall our convention that lowercase letters denote nonrandom, deterministic values). With this, we have $V = \sum_s a_s a_s^\top$ and $\hat \theta_t = V^{-1} \sum_s X_s a_s$.

Plugging in $X_s = \ip{a_s,\theta_*}+\eta_s$, $s=1,\dots,t$, into the expression of $\hat \theta_t$, we get
\begin{align}
\hat \theta_t -\theta_*
= V^{-1} Z\,,
\label{eq:lserror0}
\end{align}
where
\begin{align*}
Z = \sum_{s=1}^t \eta_s a_s\,.
\end{align*}
Noting that a linear combination of Gaussian random variables is also Gaussian, we see that $Z$ is normally distributed. In particular, from $\EE{Z}= 0$ and $\EE{ Z Z^\top } = V$ we immediately see that $Z \sim \mathcal N(0, V )$ (a Gaussian distribution is fully determined by its mean and covariance). From this it follows that
\begin{align}
V^{1/2} (\hat \theta_t -\theta_*) = V^{-1/2} Z \sim \mathcal N(0,I)\,,
\label{eq:standardnormal}
\end{align}
where $V^{1/2}$ is a square root of the symmetric matrix $V$ (i.e., $V^{1/2} V^{1/2} = V$).

To get a confidence set for $\theta_*$ we can then choose any $S\subset \R^d$ such that
\begin{align}
\frac{1}{\sqrt{(2\pi)^d}}\int_S \exp\left(-\frac{1}{2}\norm{x}_2^2\right) \,dx \ge 1-\delta\,.
\label{eq:lbanditsregion}
\end{align}
Indeed, for such a subset $S$, defining $\cC_{t+1} = \{\theta\in \R^d\,:\, V^{1/2} (\hat \theta_t -\theta) \in S \}$, we see that $\cC_{t+1}$ is a $(1-\delta)$-level confidence set:
\begin{align*}
\Prob{\theta_*\in \cC_{t+1}} = \Prob{ V^{1/2} (\hat \theta_t -\theta_*) \in S } \ge 1-\delta\,.
\end{align*}
(In particular, if \eqref{eq:lbanditsregion} holds with an equality, we also have an equality in the last display.)

How should the set $S$ be chosen? One natural choice is based on constraining the $2$-norm of $V^{1/2} (\hat \theta_t -\theta)$. This has the appeal that $S$ will be a Euclidean ball, which makes $\cC_{t+1}$ an ellipsoid. The details are as follows: Recalling that the distribution of the sum of the squares of $d$ independent $\mathcal N(0,1)$ random variables is the $\chi^2$ distribution with $d$ degrees of freedom (in short, $\chi^2_d$), from \eqref{eq:standardnormal} we get
\begin{align}
\norm{\hat \theta_t - \theta_*}_{V}^2 = \norm{ Z }_{V^{-1}}^2 \sim \chi^2_d\,.
\label{eq:lschi2}
\end{align}
Now, if $F(u)$ denotes the tail probability of the $\chi^2_d$ distribution: $F(u) = \Prob{ U> u}$ for $U\sim \chi_d^2$, it is easy to verify that
\begin{align*}
\cC_{t+1} = \left\{ \theta\in \R^d \,:\, \norm{\hat \theta_t - \theta}_{V}^2 \le u \right\}
\end{align*}
contains $\theta_*$ with probability $1-F(u)$. Hence, $\cC_{t+1}$ is a $(1-F(u))$-level confidence set for $\theta_*$. To get the value of $u$ for which $F(u) = \delta$, we can either resort to numerical calculations, or use Chernoff’s method. After some calculation, this latter approach gives $u \le d + 2 \sqrt{ d \log(1/\delta) } + 2 \log(1/\delta)$, which implies that
\begin{align}
\cC_{t+1} = \left\{ \theta\in \R^d\,:\, \norm{ \hat \theta_t-\theta }_{V}^2 \le d + 2 \sqrt{ d \log(1/\delta) } + 2 \log(1/\delta) \right\}\,
\label{eq:confchi2}
\end{align}
is a $(1-\delta)$-level confidence set for $\theta_*$ (see Lemma 1 on page 1355 of a paper by Laurent and Massart).
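The radius is easy to check numerically (our own illustration):

```python
import numpy as np

# The normalized error ||hat_theta_t - theta_*||_V^2 is chi-square with d
# degrees of freedom; the Chernoff-based radius from the text should be
# exceeded with probability at most delta.
rng = np.random.default_rng(0)
d, delta, trials = 5, 0.05, 200_000
radius = d + 2 * np.sqrt(d * np.log(1 / delta)) + 2 * np.log(1 / delta)
u = rng.chisquare(d, size=trials)
print("empirical miss rate:", (u > radius).mean(), "<= delta =", delta)
```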

Martingale noise and Laplace’s method

We now start working towards gradually removing the extra assumptions. In particular, we first ask what happens when we only know that $\eta_1,\dots,\eta_t$ are conditionally $1$-subgaussian:
\begin{align}
\EE{ \exp(\lambda \eta_s) | \eta_1,\dots,\eta_{s-1} } \le \exp( \frac{\lambda^2}{2} )\,, \qquad s = 1,\dots, t\,.
\label{eq:condsgnoise}
\end{align}
Can we still get a confidence set, say, of the form \eqref{eq:confchi2}? Recall that previously to get this confidence set all we had to do was to upper bound the tail probabilities of the “normalized error” $\norm{Z}_{V^{-1}}^2$ (cf. \eqref{eq:lschi2}). How can we get this when we only know that $(\eta_s)_s$ are conditionally $1$-subgaussian?

Before diving into this let us briefly mention that \eqref{eq:condsgnoise} implies that $(\eta_s)_s$ is a $(\sigma(\eta_1,\dots,\eta_s))_s$-adapted martingale difference process:

Definition (martingale difference process): Let $(\cF_s)_s$ be a filtration over a probability space $(\Omega,\cF,\PP)$ (i.e., $\cF_{s} \subset \cF_{s+1}$ for all $s$ and also $\cF_s \subset \cF$). The sequence of random variables $(U_s)_s$ is an $(\cF_s)_s$-adapted martingale difference process if for all $s$, $\EE{U_s}$ exists and $U_s$ is $\cF_s$-measurable and $\EE{ U_s|\cF_{s-1}}=0$.

A collection of random variables is in general called a “random process”. Somewhat informally, a martingale difference process is also called “martingale noise”. We see that in the linear bandit model, the noise process $(\eta_s)_s$ is necessarily martingale noise with the filtration given by $\cF_s = \sigma(A_1,X_1,\dots,A_{s-1},X_{s-1},A_s)$. Note the inclusion of $A_s$ in the definition of $\cF_s$. The martingale noise assumption allows the noise $\eta_s$ impacting the feedback in round $s$ to depend on past choices, including the most recent action. This is actually essential if we have, for example, Bernoulli payoffs. If $(U_s)_s$ is an $(\cF_s)_s$ martingale difference process, the partial sums $M_t = \sum_{s=1}^t U_s$ define an $(\cF_s)_s$-adapted martingale. When the filtration is clear from the context, the reference to it is often dropped.

Let us return to the construction of confidence sets. Since we want exponentially decaying tail probabilities, one is tempted to try Chernoff’s method, which given a random variable $U$ yields $\Prob{U\ge u}\le \EE{\exp(\lambda U)}\exp(-\lambda u)$, which holds for any $\lambda\ge 0$. When $U$ is $1$-subgaussian, $\EE{\exp(\lambda U)}$ can be conveniently bounded by $\exp(\lambda^2/2)$, after which we can choose $\lambda$ to get the tightest bound, minimizing the quadratic $\lambda^2/2-\lambda u$ over nonnegative values.

To make this work with $U= \norm{Z}_{V^{-1}}^2$, we need to bound $\EE{\exp(\lambda \norm{Z}_{V^{-1}}^2 )}$. Unfortunately, this turns out to be a daunting task! Can we still somehow use Chernoff’s method? Let us start with what we know: We know that there are (conditionally) $1$-subgaussian random variables that make up $Z$, namely $\eta_1,\dots,\eta_t$. Hence, we may try to see how $\EE{ \exp( \lambda^\top Z ) }$ behaves for some $\lambda\in \R^d$. Note that we had to switch to a vector $\lambda$ since $Z$ is vector-valued. An easy calculation (using \eqref{eq:condsgnoise} first with $s=t$, then with $s=t-1$, etc.) gives
\begin{align*}
\EE{ \exp(\lambda^\top Z) } = \EE{ \exp( \sum_{s=1}^t (\lambda^\top a_s) \eta_s ) } \le \exp( \frac12 \sum_{s=1}^t (\lambda^\top a_s)^2) = \exp( \frac12 \lambda^\top V \lambda )\,.
\end{align*}
How convenient that $V$ appears on the right-hand side of this inequality! But does this have anything to do with $U=\norm{Z}_{V^{-1}}^2$?

Rewriting the above inequality as
\begin{align*}
\EE{ \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda) } \le 1\,,
\end{align*}
we may notice that
\begin{align*}
\max_\lambda \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda)
= \exp( \max_\lambda \lambda^\top Z -\frac12 \lambda^\top V \lambda ) = \exp(\frac12 \norm{Z}_{V^{-1}}^2)\,,
\end{align*}
where the last equality comes from solving $f'(\lambda)=0$ for $\lambda$ where $f(\lambda)=\lambda^\top Z -\frac12 \lambda^\top V \lambda$. It will be useful to explicitly write the expression of the optimal $\lambda$:
\begin{align*}
\lambda_* = V^{-1} Z\,.
\end{align*}
It is worthwhile to point out that this argument uses a “linearization trick” of ratios which can be applied in all kinds of situations. Abstractly, the trick is to write a ratio as an expression that depends on the square root of the numerator and on the denominator in a linear fashion: For $a\in \R$, $b> 0$, $\max_{x\in \R} ax-\frac12 bx^2 = \frac{a^2}{2b}$.

Let us summarize what we have so far. For this, introduce
\begin{align*}
M_\lambda = \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda)\,.
\end{align*}
Then, on the one hand, we have that $\EE{ M_{\lambda} }\le 1$, while on the other hand we have that $\max_{\lambda} M_{\lambda} = \exp(\frac12 \norm{Z}_{V^{-1}}^2)$. Combining this with Chernoff’s method we get
\begin{align*}
\Prob{ \frac12 \norm{Z}_{V^{-1}}^2 > u } = \Prob{ \max_\lambda M_{\lambda} > \exp(u) } \le \exp(-u) \EE{ \max_\lambda M_{\lambda} }\,.
\end{align*}
Thus, we are left with bounding $\EE{ \max_\lambda M_\lambda}$. Unfortunately, $\EE{ \max_\lambda M_{\lambda} }>\max_{\lambda}\EE{ M_{\lambda} }$, so the knowledge that the inequality $\EE{ M_\lambda } \le 1$ holds for any fixed $\lambda$ is not useful on its own. We need to somehow argue that the expectation of the maximum is still not too large.

There are at least two possibilities, both having their own virtues. The first one is to replace the continuous maximum with a maximum over an appropriately selected finite subset of $\R^d$ and argue that the error introduced this way is small. This is known as the “covering argument” as we need to cover the “parameter space” sufficiently finely to argue that the approximation error is small. An alternative, perhaps lesser known but quite powerful approach, is based on Laplace’s method of approximating the maximum value of a function using an integral. The power of this is that it removes the need for bounding $\EE{ \max_\lambda M_\lambda }$. We will base our construction on this approach.

Illustration of Laplace’s method with $f(x) = \sin(x)/x$.

To understand how the integral approximation of a maximum works, let us briefly review Laplace’s method. It is best to do this in a simple case. Thus, assume that we are given a smooth function $f:[a,b]\to \R$ which has a unique maximum at $x_0\in (a,b)$. Laplace’s method to approximate $f(x_0)$ is to compute the integral
\begin{align*}
I_s \doteq \int_a^b \exp( s f(x) ) dx
\end{align*}
for some large value of $s>0$. The idea is that this behaves like a Gaussian integral. Indeed, writing $f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac12 f''(x_0) (x-x_0)^2 + R_3(x)$, since $x_0$ is a maximizer of $f$, $f'(x_0)=0$ and $-q\doteq f''(x_0)<0$. Under appropriate technical assumptions,
\begin{align*}
I_s \sim \int_a^b \exp( s f(x_0) ) \exp\left( -\frac{(x-x_0)^2}{2/(sq)} \right) \,dx
\end{align*}
as $s\to \infty$. Now, as $s$ gets large, $\int_a^b \exp( -\frac{(x-x_0)^2}{2/(sq)} ) \,dx \sim \int_{-\infty}^{\infty} \exp( -\frac{(x-x_0)^2}{2/(sq)} ) \,dx = \sqrt{ \frac{2\pi}{sq} }$ and hence
\begin{align*}
I_s \sim \exp( s f(x_0) ) \sqrt{ \frac{2\pi}{sq} }\,.
\end{align*}
Intuitively, the dominant term in the integral $I_s$ is $\exp( s f(x_0) )$. It should also be clear that the fact that we integrate with respect to the Lebesgue measure does not matter much: We could have integrated with respect to any other measure as long as that measure puts a positive mass on the neighborhood of the maximizer. The method is illustrated on the figure shown above. The take home message is that if we integrate the exponential of a function that has a pronounced maximum then we can expect the integral to be close to the exponential of the maximum.

Since $M_\lambda$ is already the exponential function of the expression to be maximized, this gives us the idea to replace $\max_\lambda M_\lambda$ with $\bar M = \int M_\lambda h(\lambda) d\lambda$ where $h$ will be conveniently chosen so that the integral can be calculated in closed form (this is not really a requirement of the method, but it makes the argument shorter). The main benefit of replacing the maximum with an integral is of course that (from Fubini's theorem) we easily get
\begin{align}
\EE{ \bar M } = \int \EE{ M_\lambda } h(\lambda) d\lambda \le 1
\label{eq:barmintegral}
\end{align}
and thus
\begin{align}
\Prob{ \log(\bar M) \ge u } \le e^{-u}\,.
\label{eq:barmbound}
\end{align}
Thus, it remains to choose $h$ and calculate $\bar M$. When choosing $h$ we want two things: $h$ should put a large mass at the maximizer of $M_\lambda$ (which is $V^{-1}Z$), and either $\bar M$ should be available in closed form (with $\bar M \approx \max_\lambda M_\lambda$ in some sense), or a lower bound on $\bar M$ should be easy to obtain which is still close to $\max_\lambda M_\lambda$. Recalling the form of $M_\lambda$, we realize that if we choose $h$ to be a Gaussian density then the calculation of $\bar M$ reduces to the calculation of a Gaussian integral, a convenient outcome since Gaussian integrals can be evaluated in closed form. In particular, setting $h$ to be the density of $\mathcal N(0,H^{-1})$, we find that
\begin{align*}
\bar M = \frac{1}{\sqrt{(2\pi)^d \det H^{-1}}} \int \exp\left( \lambda^\top Z - \frac12 \norm{\lambda}_{V}^2 - \frac12 \norm{\lambda}_{H}^2 \right) d\lambda\,.
\end{align*}
Completing the square we get
\begin{align*}
\lambda^\top Z - \frac12 \norm{\lambda}_{V}^2 - \frac12 \norm{\lambda}_{H}^2
=
\frac12 \norm{Z}_{(H+V)^{-1}}^2 -\frac12 \norm{\lambda-(H+V)^{-1}Z}_{H+V}^2\,.
\end{align*}
Hence, a short calculation (evaluating the resulting Gaussian integral) gives
\begin{align*}
\bar M
= \left(\frac{\det (H)}{\det (H+V)}\right)^{1/2} \exp\left( \frac12 \norm{Z}_{(H+V)^{-1}}^2 \right) \,,
\end{align*}
which, combined with \eqref{eq:barmbound}, gives
\begin{align}
\label{eq:selfnormalizedbound}
\Prob{ \frac12 \norm{Z}_{(H+V)^{-1}}^2 \ge u+ \frac12 \log \frac{\det (H+V)}{\det(H)} } \le e^{-u}\,.
\end{align}
Now choosing $H=V$, we have $\det(H+V) = 2^d \det(V)$ and $\norm{Z}_{(H+V)^{-1}}^2 = \norm{Z}_{(2V)^{-1}}^2 = Z^\top (2V)^{-1} Z = \frac12 \norm{Z}_{V^{-1}}^2$, giving
\begin{align*}
\Prob{ \norm{Z}_{V^{-1}}^2 \ge 2\log(2)d+ 4u } \le e^{-u}\,.
\end{align*}
Using the identity $\norm{\hat \theta_t - \theta_*}_{V}^2 = \norm{Z}_{V^{-1}}^2$ we get that
\begin{align*}
\cC_{t+1} = \left\{ \theta\in \R^d \,:\, \norm{\hat\theta_t-\theta}_V^2 \le 2\log(2)d+4 \log(\tfrac1\delta) \right\}
\end{align*}
is a $(1-\delta)$-level confidence set for $\theta_*$. Compared with \eqref{eq:confchi2} (the confidence set that is based on approximating the tail of the chi-square distribution using Chernoff's method), we see that the two radii scale similarly as a function of $d$ and $\delta$, with the new confidence set losing a bit (though only by a constant factor) as $\delta\to 0$. In general, the radii are incomparable. This is quite remarkable given the generality gained.
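The resulting bound is also easy to test numerically; here is a small Monte Carlo check of ours for the $H=V$ inequality in the fixed-design, standard normal noise case:

```python
import numpy as np

# Check: P( ||Z||_{V^{-1}}^2 >= 2 log(2) d + 4 u ) <= exp(-u).
rng = np.random.default_rng(0)
d, t, u, trials = 3, 50, 2.0, 100_000
A = rng.normal(size=(t, d))                  # a fixed design a_1, ..., a_t
Vinv = np.linalg.inv(A.T @ A)                # V = sum_s a_s a_s^T
Z = rng.normal(size=(trials, t)) @ A         # each row is Z = sum_s eta_s a_s
stat = np.einsum('ij,jk,ik->i', Z, Vinv, Z)  # ||Z||_{V^{-1}}^2, one per trial
print((stat >= 2 * np.log(2) * d + 4 * u).mean(), "<=", np.exp(-u))
```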

Confidence sets for sequential designs

The approach just described generalizes almost without any changes to the case when $A_1,\dots,A_t$ are not fixed, but are sequentially chosen as is done by UCB (this is what is known as a “sequential design” in statistics). The main difference is that in this case it is not possible to choose $H=V$ in the last step. We will also drop the assumption that $V_t(0)$ is invertible and hence use $V_t(\lambda)$ with $\lambda>0$ in place of $V = V_t(0)$. Because of this, we need to replace the identity \eqref{eq:lserror0} with
\begin{align}
\hat \theta_t -\theta_* = V_t^{-1}(\lambda) Z - \lambda V_t^{-1}(\lambda)\theta_*\,,
\label{eq:hthdeviation}
\end{align}
and thus
\begin{align*}
V_t^{1/2}(\lambda) (\hat \theta_t -\theta_*) = V_t^{-1/2}(\lambda) Z - \lambda V_t^{-1/2}(\lambda)\theta_*\,.
\end{align*}
Inequality \eqref{eq:selfnormalizedbound} still holds, though, as already noted, in this case the choice $H=V$ is not available because this would make the density $h$ random, which would undermine the equality in \eqref{eq:barmintegral} (one may try to condition on $V$ to bound $\EE{\bar M}$ but this introduces other problems). Hence, we will simply set $H=\lambda I$, which gives a high-probability bound on $\norm{ Z }_{V_t^{-1}(\lambda)}^2$, eventually giving rise to the following theorem:

Theorem: Assuming that $(\eta_s)_s$ are conditionally $1$-subgaussian, for any $u\ge 0$,
\begin{align}
\Prob{ \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} \norm{\theta_*} + \sqrt{ 2 u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } } \le e^{-u}\,
\label{eq:ellipsoidbasic}
\end{align}
and in particular, assuming $\norm{\theta_*}\le S$,
\begin{align}
C_{t+1} = \left\{ \theta\in \R^d\,:\,
\norm{\hat \theta_t - \theta}_{V_t(\lambda)} \le \sqrt{\lambda} S + \sqrt{ 2 \log(\frac1\delta) + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}
\label{eq:ellipsoidconfset}
\end{align}
is a $(1-\delta)$-level confidence set: $\Prob{\theta_*\in C_{t+1}}\ge 1-\delta$.
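For later use, the radius appearing in \eqref{eq:ellipsoidconfset} is straightforward to compute (a sketch with our own naming):

```python
import numpy as np

def ellipsoid_radius(V, lam, S, delta):
    """Radius sqrt(beta_t) of the confidence set of the theorem above:
    sqrt(lam) * S + sqrt(2 log(1/delta) + log det V_t(lam) - log det(lam I))."""
    d = V.shape[0]
    logdet_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + logdet_ratio)
```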

Avoiding union bounds

For our results we need to ensure that $\theta_*$ is included in all of $C_1,C_2,\dots$. To ensure this one can use the union bound: In particular, we can replace the $\delta$ used in the definition of $C_{t+1}$ by $\delta/(t(t+1))$. Then, the probability that some of $C_1,C_2,\dots$ fails to contain $\theta_*$ is at most $\delta \sum_{t=1}^\infty \frac{1}{t(t+1)} = \delta$. The effect of this is that the radius of the confidence ellipsoid used in round $t$ is increased by an additive $O(\log(t))$ term (under the square root), which results in looser bounds and a larger regret. Fortunately, this is actually easy to avoid by resorting to a stopping time argument due to Freedman.

For developing the argument fix some positive integer $n$. To explain the technique we need to make the time dependence of $Z$ explicit. Thus, for $t\in [n]$ we let $Z_t = \sum_{s=1}^t \eta_s A_s$. Define also $M_\lambda(t)\doteq\exp( \lambda^\top Z_t - \frac12 \lambda^\top V_t(0) \lambda )$. In constructing $C_{t+1}$, the core inequality in the previous proof was that $\EE{ M_\lambda(t) } \le 1$, which allowed us to conclude the same for $\bar M(t) \doteq \int h(\lambda) M_\lambda(t) d\lambda$ and thus via Chernoff’s method led to $\Prob{\bar M(t) \ge e^{u}}\le e^{-u}$. As briefly mentioned earlier, the proof of $\EE{ M_\lambda(t) } \le 1$ is based on chaining the inequalities
\begin{align}
\EE{ M_\lambda(s) | \cF_{s-1} } \le M_\lambda(s-1)
\label{eq:supermartingale}
\end{align}
for $s=t,t-1,\dots,1$, where we can define $M_\lambda(0)=1$. That $(M_\lambda(s))_s$ satisfies \eqref{eq:supermartingale} makes this sequence what is called a supermartingale adapted to the filtration $(\cF_s)_s \doteq (\sigma(A_1,X_1,\dots,A_s,X_s))_s$:

Definition (supermartingale): Let $(\cF_s)_{s\ge 0}$ be a filtration. A sequence of random variables, $(X_s)_{s\ge 0}$, is called an $(\cF_s)_s$-adapted supermartingale if $(X_s)_s$ is $(\cF_s)_s$ adapted (i.e., $X_s$ is $\cF_s$-measurable), the expectation of all the random variables is defined, and $\EE{X_s|\cF_{s-1}}\le X_{s-1}$ holds for $s=1,2,\dots$.

Integrating the inequalities in \eqref{eq:supermartingale} with respect to $h(\lambda) d\lambda$ we immediately see that $(\bar M(s))_s$ is also a supermartingale with the same filtration $(\cF_s)_s$ as before. A supermartingale process $(X_s)_s$ has the advantageous property that if we “stop it” at a random time $\tau\in [n]$ “without peeking into the future” then its mean still cannot increase: $\EE{X_\tau} \le \EE{X_0}$. When $\tau$ is a random time with this property, it is called a stopping time:

Definition (stopping time): Let $(\cF_s)_{s\in [n]}$ be a filtration. Then a random variable $\tau$ with values in $[n]$ is a stopping time given $(\cF_s)_s$ if for any $s\in [n]$, $\{\tau=s\}$ is an event of $\cF_s$.

Stopping times are often also defined when $n=\infty$, but we will not need this generality here.

Let $\tau$ thus be an arbitrary stopping time given the filtration $(\cF_s)_s \doteq (\sigma(A_1,X_1,\dots,A_s,X_s))_s$. In accordance with our discussion, $\EE{ \bar M(\tau) } \le \EE{ \bar M(0) } = 1$. From this it immediately follows that \eqref{eq:ellipsoidbasic} holds even when $t$ is replaced by $\tau$.

To see how this can be used to our advantage it will be convenient to introduce the events
\begin{align*}
\cE_t = \left\{ \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} S + \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}\,, \qquad t=1,\dots,n\,.
\end{align*}
With this, \eqref{eq:ellipsoidbasic} takes the form $\Prob{\cE_t}\le e^{-u}$ and by our discussion, for any random index $\tau\in [n]$ which is a stopping time with respect to $(\cF_t)_t$, we also have $\Prob{\cE_{\tau}} \le e^{-u}$. Now, choose $\tau$ to be the smallest round index $t\in [n]$ such that $\cE_t$ holds, or $n$ when none of $\cE_1,\dots,\cE_n$ hold. Formally, if the probability space holding the random variables is $(\Omega,\cF,\PP)$, we define $\tau(\omega) = t$ if $\omega\not \in \cE_1,\dots,\cE_{t-1}$ and $\omega \in \cE_t$ for some $t\in [n]$ and we let $\tau(\omega)=n$ otherwise. Since $\cE_1,\dots,\cE_t$ are $\cF_t$ measurable, $\{\tau=t\}$ is also $\cF_t$ measurable. Thus $\tau$ is a stopping time with respect to $(\cF_t)_t$. Now, consider the event
\begin{align*}
\cE= \left\{ \exists t\in [n] \text{ s.t. } \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} S + \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}\,.
\end{align*}
Clearly, if $\omega\in \cE$ then $\omega \in \cE_{\tau}$. Hence, $\cE\subset \cE_{\tau}$ and $\Prob{\cE}\le \Prob{\cE_{\tau}}\le e^{-u}$. Finally, since $n$ was arbitrary we also get that the upper limit on $t$ in the definition of $\cE$ can be removed. This shows that the bad event that any of the confidence sets $C_{t+1}$ of the previous theorem fail to hold the parameter vector $\theta_*$ is also bounded by $\delta$:

Corollary (Uniform bound):
\begin{align*}
\Prob{ \exists t\ge 0 \text{ such that } \theta_*\not\in C_{t+1} } \le \delta\,.
\end{align*}

Recalling \eqref{eq:hthdeviation}, for future reference we now restate the conclusion of our calculations in a form concerning the tail of the process $(Z_t)_t \doteq (\sum_{s=1}^t \eta_s A_s)_t$:

Corollary (Uniform self-normalized tail bound on $(Z_t)_t$): For any $u\ge 0$,
\begin{align*}
\Prob{ \exists t\ge 0 \text{ such that }
\norm{Z_t}_{V_t^{-1}(\lambda)} \ge \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} }
} \le e^{-u}\,.
\end{align*}

Putting things together: The regret of Ellipsoidal-UCB

We will call the version of UCB that uses the confidence set of the previous section the “Ellipsoidal-UCB”. To state a bound on the regret of Ellipsoidal-UCB, let us summarize the conditions we will need: Recall that $\cF_t = \sigma(A_1,X_1,\dots,A_{t-1},X_{t-1},A_t)$, $X_t = \ip{A_t,\theta_*}+\eta_t$ and $\cD_t\subset \R^d$ is the action set available at the beginning of round $t$. We will assume that the following hold true:

  • $1$-subgaussian martingale noise: $\forall \lambda\in \R$, $s\in \N$, $\EE{\exp(\lambda \eta_s)|\cF_{s} } \le \exp(\frac{\lambda^2}{2})$.
  • Bounded parameter vector: $\norm{\theta_*}_2\le S$.
  • Bounded actions: $\sup_{t} \sup_{a\in \cD_t} \norm{a}_2\le L$.
  • Bounded mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$.

Combining our previous results gives the following corollary:

Theorem (Regret of Ellipsoidal-UCB): Assume that the conditions listed above hold. Let $\hat R_n = \sum_{t=1}^n \left( \max_{a\in \cD_t} \ip{a,\theta_*} - \ip{A_t,\theta_*} \right)$ be the pseudo-regret of the Ellipsoidal-UCB algorithm that uses the confidence set \eqref{eq:ellipsoidconfset} in round $t+1$. With probability $1-\delta$, simultaneously for all $n\ge 1$,
\begin{align*}
\hat R_n
\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{d\lambda+n L^2}{ d\lambda } }\,,
\end{align*}
where
\begin{align*}
\beta_{n-1}^{1/2}
& = \sqrt{\lambda} S + \sqrt{ 2\log(\frac1\delta) + \log \frac{\det V_{n-1}(\lambda)}{\det (\lambda I)}} \\
& \le \sqrt{\lambda} S + \sqrt{ 2\log(\frac1\delta) \,+\,\frac{d}{2}\ \log\left( 1+n \frac{L^2}{ d\lambda }\right)} \,.
\end{align*}

Choosing $\delta = 1/n$ and $\lambda$ constant, we get that $\beta_n^{1/2} = O(d^{1/2} \log^{1/2}(n/d))$, and thus the expected regret of Ellipsoidal-UCB, as a function of $d$ and $n$, satisfies
\begin{align*}
R_n
& = O( \beta_n^{1/2} \sqrt{ dn \log(n/d) } )
= O( d \log(n/d) \sqrt{ n } )\,.
\end{align*}
Note that this holds simultaneously for all $n\in \N$. We also see that (apart from logarithmic factors) the regret scales linearly with the dimension $d$, while it is completely free of the cardinality of the action set $\cD_t$. Later we will see that this is indeed unavoidable in general.
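
To make the algorithm concrete, here is a minimal sketch of Ellipsoidal-UCB for finite decision sets. This is an illustration under our own naming, assuming numpy; `reward_fn` stands for the (hypothetical) environment, and the sketch ignores efficiency concerns (e.g., one would normally maintain $V_t^{-1}$ via rank-one updates).

```python
import numpy as np

def ellipsoidal_ucb(action_sets, reward_fn, lam=1.0, S=1.0, delta=0.01):
    # action_sets: list of (K_t, d) arrays, the decision set of each round
    # reward_fn(a): returns the noisy reward X_t of the chosen action a
    d = action_sets[0].shape[1]
    V = lam * np.eye(d)                     # regularized Grammian V_t(lam)
    b = np.zeros(d)                         # sum_s X_s A_s
    chosen = []
    for D in action_sets:
        theta_hat = np.linalg.solve(V, b)   # ridge (penalized least-squares) estimate
        # beta_t^{1/2}: radius of the ellipsoidal confidence set, with
        # log(det V_t(lam) / det(lam I)) computed via slogdet for stability
        beta_sqrt = np.sqrt(lam) * S + np.sqrt(
            2 * np.log(1 / delta)
            + np.linalg.slogdet(V)[1] - d * np.log(lam))
        V_inv = np.linalg.inv(V)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', D, V_inv, D))  # ||a||_{V^{-1}}
        a = D[np.argmax(D @ theta_hat + beta_sqrt * widths)]     # optimistic action
        x = reward_fn(a)
        V += np.outer(a, a)                 # update Grammian and data vector
        b += x * a
        chosen.append(a)
    return chosen
```

For instance, feeding $\cD_t \equiv \{e_1,\dots,e_d\}$ together with Gaussian rewards around $\ip{a,\theta_*}$ recovers a (rather conservative) UCB algorithm for finite-armed bandits; as discussed in the next section, the ellipsoidal construction is loose in this case.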

Fixed action set

When the action set is fixed and small, it is better to construct the upper confidence bounds for the payoffs of the actions directly. This is best illustrated for the fixed-design case when $A_1,\dots,A_t$ are deterministic, the noise is i.i.d. standard normal and the Grammian is invertible even with $\lambda=0$. In this case, the confidence set for $\theta_*$ was given by \eqref{eq:confchi2}, i.e., the radius is $\beta_t = 2d+8\log(1/\delta)$. By our earlier observation (see here),
\begin{align*}
\UCB_{t+1}(a) = \ip{a,\hat \theta_t} + (2d+8\log(1/\delta))^{1/2} \norm{a}_{V_t^{-1}}\,.
\end{align*}
Notice that the “radius” $\beta_t$ scales linearly with $d$, which then propagates into the UCB values. By our main theorem, this then propagates into the regret bound, making the regret scale linearly with $d$. It is easy to see that the linear dependence of $\beta_t$ on $d$ is an unavoidable consequence of using a confidence set construction that relies on the properties of the chi-square distribution with $d$ degrees of freedom. Unfortunately, this means that even when $\cD_t = \{e_1,\dots,e_d\}$, corresponding to the standard finite-action stochastic bandit case, the regret will scale linearly with $d$ (the number of actions), whereas we have seen earlier that the optimal scaling is $\sqrt{d}$. To get this scaling we thus need to avoid confidence set constructions based on ellipsoids.

A simple construction which avoids this problem is as follows: Staying with the fixed design and independent Gaussian noise, recall that $\hat \theta_t - \theta_* = V^{-1} Z$ where $Z \sim \mathcal N(0,V)$, so that $V^{1/2}(\hat \theta_t - \theta_*) \sim \mathcal N(0,I)$ (cf. \eqref{eq:standardnormal}). Fix $a\in \R^d$. Then, $\ip{a, \hat \theta_t - \theta_*} = a^\top V^{-1} Z \sim \mathcal N(0, \norm{a}_{V^{-1}}^2)$. Hence, by the subgaussian property of Gaussians, defining
\begin{align}\label{eq:lingaussianperarmucb}
\mathrm{UCB}_{t+1}(a) = \ip{a,\hat \theta_t} + \sqrt{ 2 \log(1/\delta) }\, \norm{a}_{V_t^{-1}} \,
\end{align}
we see that $\Prob{ \ip{a,\theta_*}>\mathrm{UCB}_{t+1}(a) } \le \delta$. Note that this bound indeed removed the extra $\sqrt{d}$ factor.
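
As a quick numerical sanity check of \eqref{eq:lingaussianperarmucb} (the snippet and all names in it are ours, under the fixed-design, standard normal noise assumptions of this section), one can estimate the probability that $\ip{a,\theta_*}$ exceeds $\mathrm{UCB}_{t+1}(a)$. By the above it must be at most $\delta$; since the exact Gaussian tail is smaller than the subgaussian bound, the observed rate is in fact noticeably smaller:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, delta, trials = 5, 200, 0.1, 2000
A = rng.normal(size=(t, d))        # fixed design, shared across trials
theta = rng.normal(size=d)         # plays the role of theta_*
a = rng.normal(size=d)             # the fixed direction of interest
V = A.T @ A                        # Grammian (lambda = 0; invertible here)
w = np.sqrt(2 * np.log(1 / delta) * (a @ np.linalg.solve(V, a)))  # width term

failures = 0
for _ in range(trials):
    X = A @ theta + rng.normal(size=t)       # standard normal noise
    theta_hat = np.linalg.solve(V, A.T @ X)  # least-squares estimate
    failures += (a @ theta > a @ theta_hat + w)
print(failures / trials)  # at most delta = 0.1 (in fact around 0.016)
```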

Unfortunately, the generalization of this method to the sequential design case is not obvious. The difficulty comes from controlling $\ip{a, V^{-1} Z}$ without relying on the exact distributional properties of $Z$.

Notes

An alternative to the $2$-norm based construction is to use $1$-norms. In the fixed design setting, under the independent Gaussian noise assumption, using Chernoff’s method this leads to
\begin{align}
\cC_{t+1} = \left\{ \theta\in \R^d\,:\, \norm{ V^{1/2}(\hat \theta_t-\theta) }_1 \le
\sqrt{2 \log(2) d^2 +2d \log(1/\delta) }
\right\}\,.
\label{eq:confl1}
\end{align}
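
To sketch where this radius comes from: with $\xi \doteq V^{1/2}(\hat \theta_t - \theta_*) \sim \mathcal N(0,I)$, Chernoff’s method combined with $\EE{ e^{\lambda |\xi_i|} } \le 2 e^{\lambda^2/2}$ gives, for any $\lambda>0$,
\begin{align*}
\Prob{ \norm{\xi}_1 \ge r }
\le e^{-\lambda r} \prod_{i=1}^d \EE{ e^{\lambda |\xi_i|} }
\le \exp\left( -\lambda r + \frac{d\lambda^2}{2} + d\log 2 \right)\,.
\end{align*}
Choosing $\lambda = r/d$ and solving $\exp( -\frac{r^2}{2d} + d\log 2 ) = \delta$ for $r$ gives exactly the radius $\sqrt{2\log(2)\, d^2 + 2d \log(1/\delta)}$ appearing in \eqref{eq:confl1}.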

Figure: Illustration of the $2$-norm and $1$-norm based confidence sets in $2$ dimensions with $\delta=0.1$, $\hat \theta_t=0$, $V=I$.

This set, together with the one based on the $2$-norm (cf. \eqref{eq:confchi2}), is illustrated in the figure above, which shows the $1-\delta=0.9$-level confidence sets. For low dimensions, and not too small values of $\delta$, the $1$-norm based confidence set is fully included in the $2$-norm based one. This happens, of course, due to the approximations used in deriving the radii of these sets. In general, the two confidence sets are incomparable.

References

As mentioned in the previous post, the first paper to consider UCB-style algorithms is by Peter Auer:

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

The setting of the paper is one where in every round the number of actions is bounded by a constant $K>0$. For this setting an algorithm (SupLinRel) is given and it is shown that its expected regret is at most $O(\sqrt{ d n \log^3(K n \log(n) ) })$, which is better by a factor of $\sqrt{d}$ than the bound derived here, but it also depends on $K$ (although only logarithmically), and its dependence on $n$ is slightly worse than that of the bound shown here, too. The dependence on $K$ can be traded for a dependence on $d$ by considering an $L/\sqrt{n}$-cover of the ball $\{ a\,:\, \norm{a}_2 \le L \}$, which gives $K = \Theta( (\sqrt{n}/L)^d )$ and a regret of $O(d^2 \sqrt{n \log^3(n)})$ for $n$ large, which is larger than the bound given here. Note that SupLinRel can also be run without actually discretizing the action set; only its confidence intervals have to be set based on the cardinality of the discretization (in particular, inflated by a factor of $\sqrt{d}$).

SupLinRel builds on LinRel, which, as we noted, is UCB with a specific upper confidence value. LinRel uses confidence bounds of the form \eqref{eq:lingaussianperarmucb} with a confidence parameter roughly of the size $\delta = 1/(n \log(n) K)$. This is possible because SupLinRel uses LinRel in a nontrivial way, “filtering” the data that LinRel works on.

In particular, SupLinRel uses a version of the “doubling trick”. The main idea is to keep at most $\log_2(n^{1/2})$ lists that hold mutually exclusive data. In every round, SupLinRel starts with the list of index $s=1$ and feeds the data of this list to LinRel, which calculates UCB values and confidence widths based on the data it received. If all the calculated widths are small (below $n^{-1/2}$) then the action with the highest UCB value is selected and the data generated is thrown away. Otherwise, if any width is above $2^{-s}$ then the corresponding action is chosen and the observed data is added to the list with index $s+1$. If all the widths are below $2^{-s}$ then all actions which, based on the current confidence intervals calculated by LinRel, cannot be optimal are eliminated, $s$ is incremented, and the process is repeated until an action is chosen (see the sketch below). Overall, the effect of this is that the lists grow, with lists of smaller index growing first, until they are sufficiently rich for their desired target accuracy. Furthermore, the contents of a list are determined not by the data on the list itself but by the data on lists with smaller index. Because of this, the fixed-design confidence interval construction as described here can be used, which ultimately saves the $O(\sqrt{d})$ factor. While SupLinRel is unimprovable apart from log-factors, in practice it is excessively wasteful.
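
To illustrate, here is a rough sketch of this per-round control flow (names and the elimination threshold are ours and only indicative; `linrel` stands for any subroutine that returns UCB values and confidence widths computed from the data of a single list, and the exact constants are those of Auer’s paper):

```python
import numpy as np

def suplinrel_round(lists, D, n, linrel):
    # lists[s]: the data held by list s (lists are 1-indexed here)
    # D: (K, d) array of the round's actions; linrel(data, actions) -> (ucb, width)
    s, cand = 1, np.arange(len(D))
    while True:
        ucb, width = linrel(lists[s], D[cand])
        if np.all(width <= n ** -0.5):
            # all widths small: play the highest UCB, discard the observation
            return cand[np.argmax(ucb)], None
        if np.any(width > 2.0 ** -s):
            # a width is too large: explore it, record the data in list s + 1
            return cand[np.argmax(width)], s + 1
        # all widths <= 2^-s: eliminate actions that cannot be optimal, refine
        cand = cand[ucb >= np.max(ucb) - 2.0 ** (1 - s)]  # indicative threshold
        s += 1
```

The caller plays the returned action and, when the second return value is a list index, appends the new observation to that list; otherwise the observation is thrown away, exactly as described above.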

A confidence ellipsoid construction based on covering arguments can be found in the paper by Varsha Dani, Thomas Hayes and Sham Kakade:

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

An analogous construction is given by

P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

The confidence ellipsoid construction described in this post is based on

Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2011.

Laplace’s method is also called the “method of mixtures” in the probability literature, and its use goes back to the work of Robbins and Siegmund from around 1970. In practice, the improvement that results from using Laplace’s method as compared to the previous ellipsoidal constructions based on covering arguments is quite enormous.

As mentioned earlier, a variant of SupLinRel that is based on ridge regression (as opposed to LinRel, which is based on truncating the smallest eigenvalue of the Grammian) is described in

W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.

The algorithm, which is called SupLinUCB, uses the same construction as SupLinRel and enjoys the same regret.

For a fixed action set (i.e., when $\cD_t=\cD$), one can use an elimination based algorithm, which in every phase collects data by using a “spanning set” of the remaining actions. Since the actions used in a phase depend only on data collected in previous phases, at the end of the phase one can use Hoeffding’s bounds to construct UCB values for the actions. This is the idea underlying the “SpectralEliminator” algorithm in the paper

M. Valko, R. Munos, B. Kveton, and T. Kocák. Spectral bandits for smooth graph functions. In ICML, 2014.
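
Schematically, the phase structure might look as follows (purely illustrative and under our own naming: the “spanning set” below is a crude linearly independent stand-in, the widths are generic fixed-design ones, and SpectralEliminator itself uses spectral features and differently tuned constants):

```python
import numpy as np

def phased_elimination(D, reward_fn, phases=8, m0=50, delta=0.01):
    # D: (K, d) array, the fixed action set; reward_fn(a): noisy reward of a
    D = np.asarray(D, dtype=float)
    K, d = D.shape
    active = np.arange(K)
    for p in range(phases):
        span = D[active[:min(d, len(active))]]  # crude stand-in for a spanning set
        m = m0 * 2 ** p                         # phase lengths grow geometrically
        A = np.repeat(span, m, axis=0)          # this phase's fixed design
        X = np.array([reward_fn(a) for a in A])
        theta_hat = np.linalg.lstsq(A, X, rcond=None)[0]  # per-phase least squares
        # fixed-design widths are valid: the design was fixed before the phase began
        V_pinv = np.linalg.pinv(A.T @ A)
        w = np.sqrt(2 * np.log(2 * K * phases / delta)
                    * np.einsum('ij,jk,ik->i', D[active], V_pinv, D[active]))
        est = D[active] @ theta_hat
        active = active[est + w >= np.max(est - w)]       # keep plausibly optimal
    return active
```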