## Category Archives: Finite-armed bandits

Continuing the previous post, we prove the claimed minimax lower bound.

We start with a useful result that quantifies the difficulty of identifying whether or not an observation is drawn from similar distributions $P$ and $Q$ defined over the same measurable space $(\Omega,\cF)$. In particular, the difficulty will be shown to increase as a function of the relative entropy of $P$ with respect to $Q$. In the result below, as usual in probability theory, for $A\subset \Omega$, $A^c$ denotes the complement of $A$ with respect to $\Omega$: $A^c = \Omega \setminus A$.

Theorem (High-probability Pinsker): Let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \cF)$ and let $A \in \cF$ be an arbitrary event. Then,
\begin{align}\label{eq:pinskerhp}
P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)\,.
\end{align}

The logic of the result is as follows: If $P$ and $Q$ are similar to each other (e.g., in the sense that $\KL(P,Q)$ is small) then for any event $A\in \cF$, $P(A)\approx Q(A)$ and $P(A^c) \approx Q(A^c)$. Since $P(A)+P(A^c)=1$, we thus also expect $P(A)+Q(A^c)$ to be “large”. How large is “large” is the contents of the result. In particular, using $2 \max(a,b)\ge a+b$, it follows that for any $0<\delta\le 1$, to guarantee $\max(P(A),Q(A^c))\le \delta$, it is necessary that $\KL(P,Q)\ge \log(\frac{1}{4\delta})$, that is relative entropy of $P$ with respect to $Q$ must be larger than the right-hand side. Connecting the dots, this means that the information coming from an observation drawn from $P$ relative to expecting an observation from $Q$ must be just large enough.

Note that if $P$ is not absolutely continuous with respect to $Q$ then $\KL(P,Q)=\infty$ and the result is vacuous. Also note that the result is symmetric. We could replace $\KL(P, Q)$ with $\KL(Q, P)$, which sometimes leads to a stronger result because the relative entropy is not symmetric.

This theorem is useful for proving lower bounds because it implies that at least one of $P(A)$ and $Q(A^c)$ is large. Then, in an application where after some observations a decision needs to be made, $P$ and $Q$ are selected such that the a correct decision under $P$ is incorrect under $Q$ and vice versa. Then one lets $A$ be the event when the decision is incorrect under $P$ and $A^c$ is the event that the decision is incorrect under $Q$. The result then tells us that it is impossible to design a decision rule such that the probability of an incorrect decision is small under both $P$ and $Q$.

To illustrate this, consider the normal means problem when $X\sim N(\mu,1)$ is observed with $\mu \in \{0,\Delta\}$. In this case, we let $P$ the distribution of $X$ under $\mu=0$, $Q$ the distribution of $X$ under $\mu=\Delta$ and if $f: \R \to \{0,\Delta\}$ is the measurable function that is proposed as the decision rule, $A = \{\, x\in \R \,: f(x) \ne 0 \}$. Then, using e.g. $\Delta=1$, the theorem tells us that
\begin{align*}
P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)
= \frac{1}{2}\, \exp\left(-\frac{\Delta^2}{2\sigma^2}\right)
= \frac{1}{2}\, \exp\left(-1/2\right) \geq 3/10\,.
\end{align*}
This means that no matter how we chose our decision rule $f$, we simply do not have enough data to make a decision for which the probability of error on either $P$ or $Q$ is smaller than $3/20$. All that remains is to apply this idea to bandits and we will have our lower bound. The proof of the above result is postponed to the end of the post.

After this short excursion into information theory, let us return to the world of $K$-action stochastic bandits. In what follows we fix $n>0$, the number of rounds and $K>0$, the number of actions. Recall that given a $K$-action bandit environment $\nu = (P_1,\dots,P_K)$ ($P_i$ is the distribution of rewards for action $i$) and a policy $\pi$, the outcome of connecting $\pi$ and $\nu$ was defined as the random variables $(A_1,X_1,\dots,A_n,X_n)$ such that for $t=1,2,\dots,n$, the distribution of $A_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1}$ is uniquely determined by the policy $\pi$, while the distribution of $X_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1},A_t$ was required to be $P_{A_t}$. While the probability space $(\Omega,\cF,\PP)$ that carries these random variables can be chosen in many different ways, you should convince yourself that the constraints mentioned uniquely determine the distribution of the random element $H_n = (A_1,X_1,\dots,A_n,X_n)$ (or wait two paragraphs). As such, despite that $(\Omega,\cF,\PP)$ is non-unique, we can legitimately define $\PP_{\nu,\pi}$ to be the distribution of $H_n$ (suppressing dependence on $n$).

In fact, we will find it useful to write this distribution with a density. To do this, recall from the Radon-Nykodim theorem that if $P,Q$ are measures over a common measurable space $(\Omega,\cF)$ such that $Q$ is $\sigma$-finite and $P\ll Q$ (i.e., $P$ is absolutely continuous w.r.t. $Q$, which is sometimes written as $Q$ dominates $P$) then there exists a random variable $f:\Omega \to \R$ such that
\begin{align}\label{eq:rnagain}
\int_A f dQ = \int_A dP, \qquad \forall A\in \cF\,.
\end{align}
The function $f$ is said to be the density of $P$ with respect to $Q$, also denoted by $\frac{dP}{dQ}$. Since writing $\forall A\in \cF$, $\int_A \dots = \int_A \dots$ is a bit redundant, to save typing, following standard practice, in the future we will write $f dQ = dP$ (or $f(\omega) dQ(\omega) = dP(\omega)$ to detail the dependence on $\omega\in \Omega$) as a shorthand for $\eqref{eq:rnagain}$. This also explains why we write $f = \frac{dP}{dQ}$.

With this, it is not hard to verify that for any $(a_1,x_1,\dots,a_n,x_n)\in \cH_n \doteq ([K]\times \R)^n$ we have
\begin{align}
\begin{split}
d\PP_\nu(a_1,x_1,\dots,a_n,x_n)
& = \pi_1(a_1) p_{a_1}(x_1) \, \pi_2(a_2|a_1,x_1) p_{a_2}(x_2)\cdot \\
& \quad \cdots \pi_n(a_n|a_1,x_1,\dots,a_{n-1},x_{n-1} )\,p_{a_n}(x_n) \\
& \qquad \times d\lambda(x_1) \dots d\lambda(x_n) d\rho(a_1) \dots d\rho(a_n)\, ,
\end{split}
\label{eq:jointdensity}
\end{align}
where

• $\pi = (\pi_t)_{1\le t \le n}$ with $\pi_t(a_t|a_1,x_1,\dots,a_{t-1},x_{t-1})$ being the probability that policy $\pi$ in round $t$ chooses action $a_t\in [K]$ when the sequence of actions and rewards in the previous rounds are $(a_1,x_1,\dots,a_{t-1},x_{t-1})\in \cH_{t-1}$;
• $p_i$ is defined by $p_i d\lambda = d P_i$, where $\lambda$ is a common dominating measure for $P_1,\dots,P_K$, and
• $\rho$ is the counting measure on $[K]$: For $A\subset [K]$, $\rho(A) = |A \cap [K]|$.

There is no loss of generality in assuming that $\lambda$ exist: For example, we can always use $\lambda = P_1+\dots+P_K$. In a note at the end of the post, for full mathematical rigor, we add another condition to the definition of $\pi$. (Can you guess what this condition should be?)

Note that $\PP_{\nu,\pi}$, as given by the right-hand side of the above displayed equation, is a distribution over $(\cH_n,\cF_n)$ where $\cF_n=\mathcal L(\R^{2n})|_{\cH_n}$. Here, $\mathcal L(\R^{2n})$ is the Lebesgue $\sigma$-algebra over $\R^{2n}$ and $\mathcal L(\R^{2n})|_{\cH_n}$ is its restriction to $\cH_n$.

As mentioned in the previous post, given any two bandit environments, $\nu=(P_1,\dots,P_K)$ and $\nu’=(P’_1,\dots,P’_K)$, we expect $\PP_{\nu,\pi}$ and $\PP_{\nu’,\pi}$ to be close to each other if all of $P_i$ and $P_i’$ are close to each other, regardless of the policy $\pi$ (a sort of continuity result). Our next result will make this precise; in fact we will show an identity that expresses $\KL( \PP_{\nu,\pi}, \PP_{\nu’,\pi})$ as a function $\KL(P_i,P’_i)$, $i=1,\dots,K$.

Before stating the result, we need to introduce the concept of ($n$-round $K$-action) canonical bandit models. The idea is to fix the measurable space $(\Omega,\cF)$ (that holds the variables $(X_t,A_t)_t$) in a special way. In particular, we set $\Omega=\cH_n$ and $\cF=\cF_n$. Then, we define $A_t,X_t$, $t=1,\dots,n$ to be the coordinate projections:
\begin{align}
\begin{split}
A_t( a_1,x_1,\dots,a_n,x_n ) &= a_t\,, \text{ and }\\
X_t( a_1,x_1,\dots,a_n,x_n) &= x_t,\, \quad \forall t\in [n]\,\text{ and } \forall (a_1,x_1,\dots,a_n,x_n)\in \cH_n\,.
\end{split}
\label{eq:coordproj}
\end{align}
Now, given some policy $\pi$ and bandit environment $\nu$, if we set $\PP=\PP_{\nu,\pi}$ then, trivially, the distribution of $H_n=(A_1,X_1,\dots,A_n,X_n)$ is indeed $\PP_{\nu,\pi}$ as expected. Thus, $A_t,X_t$ as defined by $\eqref{eq:coordproj}$, satisfies the requirements we posed against the random variables that describe the outcome of the interaction of a policy and an environment. However, $A_t,X_t$ are fixed, no matter the choice of $\nu,\pi$. This is in fact the reason why it is convenient to use the canonical bandit model! Furthermore, since all our past and future statements concern only the properties of $\PP_{\nu,\pi}$ (and not the minute details of the definition of the random variables $(A_1,X_1,\dots,A_n,X_n)$), there is no loss of generality in assuming that in all our expressions $(A_1,X_1,\dots,A_n,X_n)$ is in fact defined over the canonical bandit model using $\eqref{eq:coordproj}$. (We avoided initially the canonical model because up to now there was no advantage to introducing and using it.)

Before we state our promised result, we introduce one last notation that we need: Since we use the same measurable space $(\cH_n,\cF_n)$ with multiple environments $\nu$ and policies $\pi$, which give rise to various probability measures $\PP = \PP_{\nu,\pi}$, when writing expectations we will index $\E$ the same way as $\PP$ is indexed. Thus, $\E_{\nu,\pi}$ denotes expectation underlying $\PP_{\nu,\pi}$ (i.e., $\E_{\nu,\pi}[X] = \int X d\PP_{\nu,\pi}$. When $\pi$ is fixed, we will also drop $\pi$ from the indices and just write $\PP_{\nu}$ (and $\E_{\nu}$) for $\PP_{\nu,\pi}$ (respectively, $\E_{\nu,\pi}$).

With this, we are ready to state the promised result:

Lemma (Divergence decomposition): Let $\nu=(P_1,\ldots,P_K)$ be the reward distributions associated with one $K$-armed bandit, and let $\nu’=(P’_1,\ldots,P’_K)$ be the reward distributions associated with the another $K$-armed bandit. Fix some policy $\pi$ and let $\PP_\nu = \PP_{\nu,\pi}$ and $\PP_{\nu’} = \PP_{\nu’,\pi}$ be the distributions induced by the interconnection of $\pi$ and $\nu$ (respectively, $\pi$ and $\nu’$). Then
\begin{align}
\KL(\PP_{\nu}, \PP_{\nu’}) = \sum_{i=1}^K \E_{\nu}[T_i(n)] \, \KL(P_i, P’_i)\,.
\label{eq:infoproc}
\end{align}

In particular the lemma implies that $\PP_{\nu}$ and $\PP_{\nu’}$ will be close if for all $i$, $P_i$ and $P’_i$ were close. A nice feature of the lemma is that it provides not only a bound on the divergence of $\PP_{\nu}$ from $\PP_{\nu’}$, but it actually expresses this with an identity.

Proof
First, assume that $\KL(P_i,P’_i)<\infty$ for all $i\in [K]$. From this, it follows that $P_i\ll P’_i$. Define $\lambda = \sum_i P_i + P’_i$ and let $p_i = \frac{dP_i}{d\lambda}$ and let $p’_i = \frac{dP’_i}{d\lambda}$. By the chain rule of Radon-Nikodym derivatives, $\KL(P_i,P’_i) = \int p_i \log( \frac{p_i}{p’_i} ) d\lambda$.

Let us now evaluate $\KL( \PP_{\nu}, \PP_{\nu’} )$. By the fundamental identity for the relative entropy, this is equal to $\E_\nu[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} ]$, where $\frac{d\PP_\nu}{d\PP_{\nu’}}$ is the Radon-Nikodym derivative of $\PP_\nu$ with respect to $\PP_{\nu’}$. It is easy to see that both $\PP_\nu$ and $\PP_{\nu’}$ can be written in the form given by $\eqref{eq:jointdensity}$ (this is where we use that $\lambda$ dominates all of $P_i$ and $P_i’$). Then, using the chain rule we get
\begin{align*}%\label{eq:logrn}
\log \frac{d\PP_\nu}{d\PP_{\nu’}} = \sum_{t=1}^n \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)}\,
\end{align*}
(the terms involving the policy cancel!). Now, taking expectations of both sides, we get
\begin{align*}
\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} \right]
= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right]\,,
\end{align*}
and
\begin{align*}
\E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right]
& = \E_{\nu}\left[ \E_{\nu}\left[\log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \, \big|\, A_t \right] \, \right]
= \E_{\nu}\left[ \KL(P_{A_t},P’_{A_t}) \right]\,,
\end{align*}
where in the second equality we used that under $\PP_{\nu}$, conditionally on $A_t$, the distribution of $X_t$ is $dP_{A_t} = p_{A_t} d\lambda$. Plugging back into the previous display,
\begin{align*}
\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} \right]
&= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right] \\
&= \sum_{t=1}^n \E_{\nu}\left[ \KL(P_{A_t},P’_{A_t}) \right] \\
&= \sum_{i=1}^K \E_{\nu}\left[ \sum_{t=1}^n \one{A_t=i} \KL(P_{A_t},P’_{A_t}) \right] \\
&= \sum_{i=1}^K \E_{\nu}\left[ T_i(n)\right]\KL(P_{i},P’_{i})\,.
\end{align*}

When the right-hand side of $\eqref{eq:infoproc}$ is infinite, by our previous calculation, it is not hard to see that the left-hand side will also be infinite.

QED

Using the previous theorem and lemma we are now in a position to present and prove the main result of the post, a worst-case lower bound on the regret of any algorithm for finite-armed stochastic bandits with Gaussian noise. Recall that previously we used $\cE_K$ to denote the class of $K$ action bandit environments where the reward distributions have means in $[0,1]^K$ and the noise in the rewards have 1-subgaussian tails. Denote by $G_\mu\in \cE_K$ the bandit environment where all distributions are Gaussian with unit variance and the vector of means is $\mu\in [0,1]^K$. To make the dependence of the regret on the policy and the environment explicit, in accordance with the notation of the previous post, we will use $R_n(\pi,E)$ to denote the regret of policy $\pi$ on a bandit instance $E$.

Theorem (Worst-case lower bound): Fix $K>1$ and $n\ge K-1$. Then, for any policy $\pi$ there exists a mean vector $\mu \in [0,1]^K$ such that
\begin{align*}
R_n(\pi,G_\mu) \geq \frac{1}{27}\sqrt{(K-1) n}\,.
\end{align*}

Since $G_\mu \in \cE_K$, it follows that the minimax regret for $\cE_K$ is lower bounded by the right-hand side of the above display as soon as $n\ge K-1$:
\begin{align*}
R_n^*(\cE_K) \geq \frac{1}{27}\sqrt{(K-1) n}\,.
\end{align*}
The idea of the proof is illustrated on the following figure:

The idea of the minimax lower bound. Given the policy and one environment, the evil antagonist picks another environment so that the policy will suffer a large regret in at least one of the environments.

Proof
Fix a policy $\pi$. Let $\Delta\in [0,1/2]$ be some constant to be chosen later. As described in the previous post, we choose two Gaussian environments with unit variance. In the first environment, the vector of means of the rewards per arm is
\begin{align*}
\mu = (\Delta, 0, 0, \dots,0)\,.
\end{align*}
This environment and $\pi$ gives rise to the distribution $\PP_{G_\mu,\pi}$ on the canonical bandit model $(\cH_n,\cF_n)$. For brevity, we will use $\PP_{\mu}$ in place of $\PP_{G_\mu,\pi}$. The expectation under $\PP_{\mu}$ will be denoted by $\E_{\mu}$. Recall that on measurable space $(\cH_n,\cF_n)$, $A_t,X_t$ are just the coordinate projections defined using $\eqref{eq:coordproj}$.

To choose the second environment, pick
\begin{align*}
i = \argmin_{j\ne 1} \E_{\mu}[ T_j(n) ]\,.
\end{align*}
Note that $\E_{\mu}[ T_i(n) ] \le n/(K-1)$ because otherwise we would have $\sum_{j\ne 1} \E_{\mu}[T_j(n)]>n$, which is impossible. Note also that $i$ depends on $\pi$. The second environment has means
\begin{align*}
\mu’ = (\Delta,0,0, \dots, 0, 2\Delta, 0, \dots, 0 )\,,
\end{align*}
where specifically $\mu’_i = 2\Delta$. In particular, $\mu_j = \mu’_j$ except at index $i$. Note that the optimal action in $G_{\mu}$ is action one, while the optimal action in $G_{\mu’}$ is action $i\ne 1$. We again use the shorthand $\PP_{\mu’} = \PP_{G_{\mu’},\pi}$.

Intuitively, if $\pi$ chooses action one infrequently while interacting with $\mu$ (i.e., if $T_1(n)$ is small with high probability), then $\pi$ should do poorly in $G_{\mu}$, while if it chooses action one frequently while interacting with $G_{\mu’}$, then it will do poorly in $G_{\mu’}$. In particular, by the basic regret decomposition identity and a simple calculation,
\begin{align*}
R_n(\pi,G_\mu) %= \E_\mu[ (n-T_1(n)) \Delta ] \ge \E_\mu[ \one{T_1(n)\le n/2} ]\, \left(n-\frac{n}{2}\right)\, \Delta =
R_n(\pi,G_{\mu’}) > \PP_{\mu’}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.
%R_n(\pi,G_{\mu’}) %\ge \E_{\mu’}[ T_1(n) \Delta ] > \E_{\mu’}[ \one{T_1(n)> n/2} ]\, \frac{n\Delta}{2} =
%\ge \PP_{\mu’}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.
\end{align*}
Chaining this with the high-probability Pinsker inequality,
\begin{align*}
R_n(\pi,G_\mu) + R_n(\pi,G_{\mu’})
& > \frac{n\Delta}{2}\,\left( \PP_{\mu}(T_1(n)\le n/2) + \PP_{\mu’}(T_1(n)> n/2) \right) \\
& \ge \frac{n\Delta}{4}\, \exp( – \KL(\PP_{\mu},\PP_{\mu’}) )\,.
\end{align*}
It remains to upper bound $\KL(\PP_{\mu},\PP_{\mu’})$, i.e., to show that $\PP_\mu$ and $\PP_{\mu’}$ are indeed not far. For this, we use the divergence decomposition lemma, the definitions of $\mu$ and $\mu’$ and then exploit the choice of $i$ to get
\begin{align*}
\KL(\PP_\mu, \PP_{\mu’}) = \E_{\mu}[T_i(n)] \KL( \mathcal N(0,1), \mathcal N(2\Delta,1) ) =
\E_{\mu}[T_i(n)]\, \frac{(2\Delta)^2}{2} \leq \frac{2n\Delta^2}{K-1} \,.
\end{align*}
Plugging this into the previous display, we find that
\begin{align*}
R_n(\pi,G_\mu) + R_n(\pi,G_{\mu’}) \ge \frac{n\Delta}{4}\, \exp\left( – \frac{2n\Delta^2}{K-1} \right)\,.
\end{align*}
The result is completed by tuning $\Delta$. In particular, the optimal choice for $\Delta$ happens to be $\Delta = \sqrt{(K-1)/4n}$, which is below $1/2$ when $n$ is large compared to $K$, as postulated in the conditions of the theorem. The final steps are lower bounding $\exp(-1/2)$ and using $2\max(a,b) \ge a+b$.

QED

This concludes the proof of the worst-case lower bound, showing that no algorithm can do better than $O(\sqrt{nK})$ over all problems simultaneously. Coming in the next post we will show that the asymptotic logarithmic regret of UCB is essentially not improvable when the noise is Gaussian.

In the remainder of the post, we give the proof of the “high probability” Pinsker inequality, followed by miscellaneous thoughts.

# Proof of the Pinsker-type inequality

It remains to show the proof of the Pinsker-type inequality stated above. In the proof, to save space, we use the convention of denoting the minimum of $a$ and $b$ by $a\wedge b$, while we denote their maximum by $a \vee b$.

If $\KL(P,Q)=\infty$, there is nothing to be proven. On the other hand, by our previous theorem on relative entropy, $\KL(P,Q)<\infty$ implies that $P \ll Q$. Take $\nu = P+Q$. Note that $P,Q\ll \nu$. Hence, we can define $p = \frac{dP}{d\nu}$ and $q = \frac{dQ}{d\nu}$ and use the chain rule for Radon-Nikodym derivatives to derive that $\frac{dP}{dQ} \frac{dQ}{d\nu} = \frac{dP}{d\nu}$, or $\frac{dP}{dQ} = \frac{\frac{dP}{d\nu}}{\frac{dQ}{d\nu}}$. Thus,
\begin{align*}
\KL(P,Q) = \int p \log(\frac{p}{q}) d\nu\,.
\end{align*}
For brevity, when writing integrals with respect to $\nu$, in this proof, we will drop $d\nu$. Thus, we will write, for example $\int p \log(p/q)$ for the above integral.

Instead of $\eqref{eq:pinskerhp}$, we prove the stronger result that
\begin{align}\label{eq:pinskerhp2}
\int p \wedge q \ge \frac12 \, \exp( – \KL(P,Q) )\,.
\end{align}
This indeed is sufficient since $\int p \wedge q = \int_A p \wedge q + \int_{A^c} p \wedge q \le \int_A p + \int_{A^c} q = P(A) + Q(A^c)$.

We start with an inequality attributed to Lucien Le Cam, which lower bounds the left-hand side of $\eqref{eq:pinskerhp2}$. The inequality states that
\begin{align}\label{eq:lecam}
\int p \wedge q \ge \frac12 \left( \int \sqrt{pq} \right)^2\,.
\end{align}
Starting from the right-hand side above, using $pq = (p\wedge q) (p \vee q)$ and then Cauchy-Schwartz, we get
\begin{align*}
\left( \int \sqrt{pq} \right)^2
= \left(\int \sqrt{(p\wedge q)(p\vee q)} \right)^2
\le \left(\int p \wedge q \right)\, \left(\int p \vee q\right)\,.
\end{align*}
Now, using $p\wedge q + p \vee q = p+q$, the proof is finished by substituting $\int p \vee q = 2-\int p \wedge q \le 2$ and dividing both sides by two.

Thus, it remains to lower bound the right-hand side of $\eqref{eq:lecam}$. For this, we use Jensen’s inequality. First, we write $(\cdot)^2$ as $\exp (2 \log (\cdot))$ and then move the $\log$ inside the integral:
\begin{align*}
\left(\int \sqrt{pq} \right)^2 &= \exp\left( 2 \log \int \sqrt{pq} \right)
= \exp\left( 2 \log \int p \sqrt{\frac{q}{p}} \right) \\
&\ge \exp\left( 2 \int p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\
&= \exp\left( 2 \int_{pq>0} p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\
&= \exp\left( – \int_{pq>0} p \log \left(\frac{p}{q}\right)\, \right) \\
&= \exp\left( – \int p \log \left(\frac{p}{q}\right)\, \right)\,.
\end{align*}
Here, in the fourth and the last step we used that since $P\ll Q$, $q=0$ implies $p=0$, hence $p>0$ implies $q>0$, and eventually $pq>0$. Putting together the inequalities finishes the proof.
QED

# Notes

Note 1: We used the Gaussian noise model because the KL divergences are so easily calculated in this case, but all that we actually used was that $\KL(P_i, P_i’) = O((\mu_i – \mu_i’)^2)$ when the gap between the means $\Delta = \mu_i – \mu_i’$ is small. While this is certainly not true for all distributions, it very often is. Why is that? Let $\{P_\mu : \mu \in \R\}$ be some parametric family of distributions on $\Omega$ and assume that distribution $P_\mu$ has mean $\mu$. Then assuming the densities are twice differentiable and that everything is sufficiently “nice” that integrals and derivatives can be exchanged (as is almost always the case), we can use a Taylor expansion about $\mu$ to show that
\begin{align*}
\KL(P_\mu, P_{\mu+\Delta})
&\approx \KL(P_\mu, P_\mu) + \left.\frac{\partial}{\partial \Delta} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta + \frac{1}{2}\left.\frac{\partial^2}{\partial\Delta^2} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta^2 \\
&= \left.\frac{\partial}{\partial \Delta} \int_{\Omega} \log\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) dP_\mu \right|_{\Delta = 0}\Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \log\left(\frac{dP_{\mu+\Delta}}{dP_{\mu}}\right) \right|_{\Delta=0} dP_\mu \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \frac{dP_{\mu+\Delta}}{dP_{\mu}} \right|_{\Delta=0} dP_\mu \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= – \frac{\partial}{\partial \Delta} \left.\int_{\Omega} \frac{dP_{\mu+\Delta}}{dP_{\mu}} dP_\mu \right|_{\Delta=0} \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\frac{\partial}{\partial \Delta} \int_{\Omega} dP_{\mu+\Delta} \right|_{\Delta=0}\Delta
+ \frac12 I(\mu) \Delta^2 \\
&= \frac12 I(\mu)\Delta^2\,,
\end{align*}
where $I(\mu)$, introduced in the second line, is called the Fisher information of the family $(P_\mu)_{\mu}$ at $\mu$. Note that if $\lambda$ is a common dominating measure for $(P_{\mu+\Delta})$ for $\Delta$ small, $dP_{\mu+\Delta} = p_{\mu+\Delta} d\lambda$ and we can write
\begin{align*}
I(\mu) = -\int \left.\frac{\partial^2}{\partial \Delta^2} \log\, p_{\mu+\Delta}\right|_{\Delta=0}\,p_\mu d\lambda\,,
\end{align*}
which is the form that is usually given in elementary texts. The upshot of all this is that $\KL(P_\mu, P_{\mu+\Delta})$ for $\Delta$ small is indeed quadratic in $\Delta$, with the scaling provided by $I(\mu)$, and as a result the worst-case regret is always $O(\sqrt{nK})$, provided the class of distributions considered is sufficiently rich and not too bizarre.

Note 2: In the previous post we showed that the regret of UCB is at most $O(\sqrt{nK \log(n)})$, which does not match the lower bound that we just proved. It is possible to show two things. First, that UCB does not achieve $O(\sqrt{nK})$ regret. The logarithm is real, and not merely a consequence of lazy analysis. Second, there is a modification of UCB (called MOSS) that matches the lower bound up to constant factors. We hope to cover this algorithm in future posts, but the general approach is to use a less conservative confidence level.

Note 3: We have now shown a lower bound that is $\Omega(\sqrt{nK})$, while in the previous post we showed an upper bound for UCB that was $O(\log(n))$. What is going on? Do we have a contradiction? Of course, the answer is no and the reason is that the logarithmic regret guarantees depended on the inverse gaps between the arms, which may be very large.

Note 4: Our lower bounding theorem was only proved when $n$ is at least linear in $K$. What happens when $n$ is small compared to $K$ (in particular, $n\le (K-1)$) above? A careful investigation of the proof of the theorem shows that in this case a linear lower bound of size $n/8\exp(-1/2)$ also holds for the regret (choose $\Delta=1/2$). Note that for small values of $n$ compared to $K$, the square-root lower bound gives larger than linear regret owning to that for $0\le x\le 1$, $\sqrt{x}\ge x$ and if the mean rewards are restricted the $[0,1]$ interval than the regret can not be larger than $n$, i.e., for $n$ small, the lower bound is vacuous (in particular, this happens when $n\le (K-1)/27$). Hence, the square-root form is simply too strong to hold for small values of $n$. A lower bound, which joins the two cases and thus holds for any value of $n$ regardless of $K$ is $\frac{\exp(-1/2)}{16}\, \min( n, \sqrt{(K-1)n} )$.

Note 5: To guarantee that the right-hand side of $\eqref{eq:jointdensity}$ defines a measure, we also must have that for any $t\ge 1$, $\pi_t(i|\cdot): \cH_{t-1} \to [0,1]$ is a measurable map. In fact, $\pi_t$ is a simple version of a so-called probability kernel. For the interested reader, we add the general definition for a probability kernel $K$: Let $(\cX,\cF)$, $(\cY,\cG)$ be a measurable spaces. $K:\cX \times \cG \to [0,1]$ is a probability kernel if it satisfies the following two properties: (i) For any $x\in \cX$, $K(x,\cdot)$ as the $U \mapsto K(x,U)$ ($U\in \cG$) map is a probability measure; (ii) For any $U\in \cG$, $K(\cdot,U)$ as the $x\mapsto K(x,U)$ ($x\in \cX$) map is measurable. When $\cY$ is finite and $\cG= 2^{\cY}$, instead of $K(x,U)$ for $U\subset \cY$, it is common that we specify the values of $K(x,\{y\})$ for $x\in \cX$ and $y\in \cY$ and in these cases we also often write $K(x,y)$. It is also common to write $K(U|x)$ (or in the discrete case $K(y|x)$).

Note 6: There is another subtlety in connection to the definition of $\PP_{\nu,\pi}$ is that we emphasized that $\lambda$ should be a measure over the Lebesgue $\sigma$-algebra $\mathcal L(\R)$. This will be typically easy to satisfy and holds for example when $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. This requirement would not be needed though if we did not insist on using the canonical bandit model. In particular, by their original definition, $X_1,\dots,X_n$ (as maps from $\Omega$ to $\R$) need only be $\cF/\mathcal B(\R)$ measurable (where $\mathcal B(\R)$ is the Borel $\sigma$-algebra over $\R$) and thus at a first sight it may seem strange to require that $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. The reason we require this is because in the canonical model we wish to use the Lebesgue $\sigma$-algebra of $\R^{2n}$ (in particular its restriction to $\cH_n$), which is in fact demanded by our wish to work with complete measure-spaces. This latter, as explained at the link, is useful because it essentially allows us not to worry about (and drop) sets of zero measure. Now, $\PP_{\nu,\pi}$, as the joint distribution of $(A_1,X_1,\dots,A_n,X_n)$ would only be defined for Borel measurable sets. Then, $\eqref{eq:jointdensity}$ could be first derived for the restriction of $\lambda$ to Borel measurable sets and then $\eqref{eq:jointdensity}$ with $\lambda$ not restricted to Borel measurable sets gives the required extension of $\PP_{\nu,\pi}$ to Lebesgue measurable sets. We can also take $\eqref{eq:jointdensity}$ as our definition of $\PP_{\nu,\pi}$, in which case the issue does not arise.

Note 7: (A simple version of) Jensen’s inequality states that if for any $U\subset \R$ convex, with non-empty interior and for any $f: U \to \R$ convex function and random variable $X\in U$ such that $\EE{X}$ exists, $f(\EE{X})\le \EE{f(X)}$. The proof is simple if one notes that for such a convex $f$ function, at every point $m\in R$ in the interior of $U$, there exists a “slope” $a\in \R$ such that $a(x-m)+f(m)\le f(x)$ for all $x\in \R$ (if $f$ is differentiable at $m$, take $a = f'(m)$). Indeed, is such a slope exists, taking $m = \EE{X}$ and replacing $x$ by $X$ we get $a(X-\EE{X}) + f(\EE{X}) \le f(X)$. Then, taking the expectation of both sides we arrive at Jensen’s inequality. The idea can be generalized into multiple directions, i.e., the domain of $f$ could be a convex set in a vector space, etc.

Note 8: We can’t help ourselves not to point out that the quantity $\int \sqrt{p q}$ that appeared in the previous proof is called the Bhattacharyya coefficient and that $\sqrt{1 – \int \sqrt{pq}}$ is the Hellinger distance, which is also a measure of the distance between probability measures $P$ and $Q$ and is frequently useful (perhaps especially in mathematical statistics) as a more well-behaved distance than the relative entropy (it is bounded, and symmetric and satisfies the triangle inequality!).

# References

The first minimax lower bound that we know of was given by Auer et al. below. Our proof uses a slightly different technique, but otherwise the idea is the same.

Note 9: Earlier, what in the last version of the post is called the divergence decomposition lemma was called the information processing lemma. This previous choice of name however is not very fortunate as it collides with the information processing inequality, which is a related inequality, but still quite different. Perhaps we should call this the bandit divergence decomposition lemma, or its claim the bandit divergence decomposition identity, though there is very little used of the specifics of bandits problems here. In particular, the identity holds whenever a sequential policy interacts with a stochastic environment where the feedback to the stochastic policy is based on a fixed set of distributions (swap the reward distributions with feedback distributions in the more general setting).

The high-probability Pinsker inequality is appears as as Lemma 2.6 in Sasha Tsybakov‘s book on nonparametric estimation. According to him, the result is originally due to Bretagnolle and Huber and the original reference is:

• Bretagnolle, J. and Huber, C. (1979) Estimation des densitiés: risque minimax. Z. für Wahrscheinlichkeitstheorie und verw. Geb., 47, 199-137.

One of the authors learned this inequality from the following paper:

• Sébastien Bubeck, Vianney Perchet, Philippe Rigollet: Bounded regret in stochastic multi-armed bandits. COLT 2013: 122-134

## The Upper Confidence Bound Algorithm

We now describe the celebrated Upper Confidence Bound (UCB) algorithm that overcomes all of the limitations of strategies based on exploration followed by commitment, including the need to know the horizon and sub-optimality gaps. The algorithm has many different forms, depending on the distributional assumptions on the noise.

Optimism in the face of uncertainty but on overdose: Not recommended!

The algorithm is based on the principle of optimism in the face of uncertainty, which is to choose your actions as if the environment (in this case bandit) is as nice as is plausibly possible. By this we mean that the unknown mean payoffs of each arm is as large as plausibly possible based on the data that has been observed (unfounded optimism will not work — see the illustration on the right!). The intuitive reason that this works is that when acting optimistically one of two things happens. Either the optimism was justified, in which case the learner is acting optimally, or the optimism was not justified. In the latter case the agent takes some action that they believed might give a large reward when in fact it does not. If this happens sufficiently often, then the learner will learn what is the true payoff of this action and not choose it in the future. The careful reader may notice that this explains why this rule will eventually get things right (it will be “consistent” in some sense), but the argument does not quite explain why an optimistic algorithm should actually be a good algorithm among all consistent ones. However, before getting to this, let us clarify what we mean by plausible.

Recall that if $X_1, X_2,\ldots, X_n$ are independent and $1$-subgaussian (which means that $\E[X_i] = 0$) and $\hat \mu = \sum_{t=1}^n X_t / n$, then
\begin{align*}
\Prob{\hat \mu \geq \epsilon} \leq \exp\left(-n\epsilon^2 / 2\right)\,.
\end{align*}
Equating the right-hand side with $\delta$ and solving for $\epsilon$ leads to
\begin{align}
\label{eq:simple-conc}
\Prob{\hat \mu \geq \sqrt{\frac{2}{n} \log\left(\frac{1}{\delta}\right)}} \leq \delta\,.
\end{align}
This analysis immediately suggests a definition of “as large as plausibly possible”. Using the notation of the previous post, we can say that when the learner is deciding what to do in round $t$ it has observed $T_i(t-1)$ samples from arm $i$ and observed rewards with an empirical mean of $\hat \mu_i(t-1)$ for it. Then a good candidate for the largest plausible estimate of the mean for arm $i$ is
\begin{align*}
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)}\,.
\end{align*}
Then the algorithm chooses the action $i$ that maximizes the above quantity. If $\delta$ is chosen very small, then the algorithm will be more optimistic and if $\delta$ is large, then the optimism is less certain. We have to be very careful when comparing the above display to \eqref{eq:simple-conc} because in one the number of samples is the constant $n$ and in the other it is a random variable $T_i(t-1)$. Nevertheless, this is in some sense a technical issue (that needs to be taken care of properly, of course) and the intuition remains that $\delta$ is approximately an upper bound on the probability of the event that the above quantity is an underestimate of the true mean.

The value of $1-\delta$ is called the confidence level and different choices lead to different algorithms, each with their pros and cons, and sometimes different analysis. For now we will choose $1/\delta = f(t)= 1 + t \log^2(t)$, $t=1,2,\dots$. That is, $\delta$ is time-dependent, and is decreasing to zero slightly faster than $1/t$. Readers are not (yet) expected to understand this choice whose pros and cons we will discuss later. In summary, in round $t$ the UCB algorithm will choose arm $A_t$ given by
\begin{align}
A_t = \begin{cases}
\argmax_i \left(\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}}\right)\,, & \text{if } t > K\,; \\
t\,, & \text{otherwise}\,.
\end{cases}
\label{eq:ucb}
\end{align}
The reason for the cases is that the term inside the square root is undefined if $T_i(t-1) = 0$ (as it is when $t = 1$), so we will simply have the algorithm spend the first $K$ rounds choosing each arm once. The value inside the argmax is called the index of arm $i$. Generally speaking, an index algorithm chooses the arm in each round that maximizes some value (the index), which usually only depends on current time-step and the samples from that arm. In the case of UCB, the index is the sum of the empirical mean of rewards experienced and the so-called exploration bonus, also known as the confidence width.

Illustration of UCB with 2 actions. The true means are shown in red ink. The observations are shown by crosses. Action 2 received fewer observations than Action 1. Hence, although its empirical mean is about the same as that of Action 1, Action 2 will be chosen in the next round.

Besides the slightly vague “optimism guarantees optimality or learning” intuition we gave before, it is worth exploring other intuitions for this choice of index. At a very basic level, we should explore arms more often if they are (a) promising (in that $\hat \mu_i(t-1)$ is large) or (b) not well explored ($T_i(t-1)$ is small). As one can plainly see from the definition, the UCB index above exhibits this behaviour. This explanation is unsatisfying because it does not explain why the form of the functions is just so.

An alternative explanation comes from thinking of what we expect from any reasonable algorithm. Suppose in some round we have played some arm (let’s say arm $1$) much more frequently than the others. If we did a good job designing our algorithm we would hope this is the optimal arm. Since we played it so much we can expect that $\hat \mu_1(t-1) \approx \mu_1$. To confirm the hypothesis that arm $1$ is indeed optimal the algorithm better be highly confident about that other arms are indeed worse. This leads very naturally to confidence intervals and the requirement that $T_i(t-1)$ for other arms $i\ne 1$ better be so large that
\begin{align}\label{eq:ucbconstraint}
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_1\,,
\end{align}
because, at a confidence level of $1-\delta$ this guarantees that $\mu_i$ is smaller than $\mu_1$ and if the above inequality did not hold, the algorithm would not be justified in choosing arm $1$ much more often than arm $i$. Then, planning for $\eqref{eq:ucbconstraint}$ to hold makes it reasonable to follow the UCB rule as this will eventually guarantee that this inequality holds when arm $1$ is indeed optimal and arm $i$ is suboptimal. But how to choose $\delta$? If the confidence interval fails, by which we mean, if actually it turns out that arm $i$ is optimal and by unlucky chance it holds that
\begin{align*}
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_i\,,
\end{align*}
then arm $i$ can be disregarded even though it is optimal. In this case the algorithm may pay linear regret (in $n$), so it better be the case that the failure occurs with about $1/n$ probability to fix the upper bound on the expected regret to be constant for the case when the confidence interval fails. Approximating $n \approx t$ leads then (after a few technicalities) to the choice of $f(t)$ in the definition of UCB given in \eqref{eq:ucb}. With this much introduction, we state the main result of this post:

Theorem (UCB Regret): The regret of UCB is bounded by
\begin{align} \label{eq:ucbbound}
R_n \leq \sum_{i:\Delta_i > 0} \inf_{\epsilon \in (0, \Delta_i)} \Delta_i\left(1 + \frac{5}{\epsilon^2} + \frac{2}{(\Delta_i – \epsilon)^2} \left( \log f(n) + \sqrt{\pi \log f(n)} + 1\right)\right)\,.
\end{align}
Furthermore,
\begin{align} \label{eq:asucbbound}
\displaystyle \limsup_{n\to\infty} R_n / \log(n) \leq \sum_{i:\Delta_i > 0} \frac{2}{\Delta_i}\,.
\end{align}

Note that in the first display, $\log f(n) \approx \log(n) + 2\log\log(n)$. We thus see that this bound scales logarithmically with the length of the horizon and is able to essentially reproduce the bound that we obtained for the unfeasible version of ETC with $K=2$ (when we tuned the exploration time based on the knowledge of $\Delta_2$). We shall discuss further properties of this bound later, but now let us present a simpler version of the above bound, avoiding all these epsilons and infimums that make for a confusing theorem statement. By choosing $\epsilon = \Delta_i/2$ inside the sum leads to the following corollary:

Corollary (UCB Simplified Regret): The regret of UCB is bounded by
\begin{align*}
R_n \leq \sum_{i:\Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i}\left(8 \log f(n) + 8\sqrt{\pi \log f(n)} + 28\right)\right)\,.
\end{align*}
and in particular there exists some universal constant $C>0$ such that for all $n\ge 2$, $R_n \le \sum_{i:\Delta_i>0} \left(\Delta_i + \frac{C \log n}{\Delta_i}\right)$.

Note that taking the limit of the ratio of the bound above and $\log(n)$ does not result in the same rate as in the theorem, which is the main justification for introducing the epsilons in the first place. In fact, as we shall see the asymptotic bound on the regret given in \eqref{eq:asucbbound}, which is derived from~\eqref{eq:ucbbound} by choosing $\epsilon = \log^{-1/4}(n)$, is unimprovable in a strong sense.

The proof of the theorem relies on the basic regret decomposition identity that expresses the expected regret as the weighted sum of the expected number of times the suboptimal actions are chosen. So why will $\EE{T_i(n)}$ be small for a suboptimal action $i$? This is based on a couple of simple observations: First, (disregarding the initial period when all arms are chosen once) the suboptimal action $i$ can only be chosen if its UCB index is higher than that of an optimal arm. Now, this can only happen if the UCB index of action $i$ is “too high”, i.e., higher than $\mu^*-\epsilon>\mu_i$ or the UCB index of that optimal arm is “too low”, i.e., if it is below $\mu^*-\epsilon<\mu^*$. Since the UCB index of any arm is with reasonably high probability an upper bound on the arm’s mean, we don’t expect the index of any arm to be below its mean. Hence, the total number of times when the optimal arm’s index is “too low” (as defined above) is expected to be negligibly small. Furthermore, if the sub-optimal arm $i$ is played sufficiently often, then its exploration bonus becomes small and simultaneously the empirical estimate of its mean converges to the true value, making the expected total number of times when its index stays above $\mu^*-\epsilon$ small.

We start with a useful lemma that will help us quantify the last argument.

Lemma Let $X_1,X_2,\ldots$ be a sequence of independent $1$-subgaussian random variables, $\hat \mu_t = \sum_{s=1}^t X_s / t$, $\epsilon > 0$ and
\begin{align*}
\kappa = \sum_{t=1}^n \one{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon}\,.
\end{align*}
Then, $\displaystyle \E[\kappa] \leq 1 + \frac{2}{\epsilon^2} (a + \sqrt{\pi a} + 1)$.

Because the $X_i$ are $1$-subgaussian and independent we have $\E[\hat \mu_t] = 0$, so we cannot expect $\hat \mu_t + \sqrt{2a/t}$ to be smaller than $\epsilon$ until $t$ is at least $2a/\epsilon^2$. The lemma confirms that this is indeed of the right order as an estimate for $\EE{\kappa}$.

Proof
Let $u = 2a \epsilon^{-2}$. Then, by the concentration theorem for subgaussian variables,
\begin{align*}
\E[\kappa]
&\leq u + \sum_{t=\ceil{u}}^n \Prob{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon} \\
&\leq u + \sum_{t=\ceil{u}}^n \exp\left(-\frac{t\left(\epsilon – \sqrt{\frac{2a}{t}}\right)^2}{2}\right) \\
&\leq 1 + u + \int^\infty_u \exp\left(-\frac{t\left(\epsilon – \sqrt{\frac{2a}{t}}\right)^2}{2}\right) dt \\
&= 1 + \frac{2}{\epsilon^2}(a + \sqrt{\pi a} + 1)\,.
\end{align*}
QED

Before the proof of the UCB regret theorem we need a brief diversion back to the bandit model. We have defined $\hat \mu_i(t)$ as the empirical mean of the $i$th arm after the $t$th round, which served us well enough for the analysis of the explore-then-commit strategy where the actions were chosen following a deterministic rule. For UCB it is very useful also to have $\hat \mu_{i,s}$, the empirical average of the $i$th arm after $s$ observations from that arm, which occurs at a random time (or maybe not at all). To define $\hat \mu_{i,s}$ rigorously, we argue that without the loss of generality one may assume that the reward $X_t$ received in round $t$ comes from choosing the $T_i(t)$th element from the reward sequence $(Z_{i,s})_{1\le s \le n}$ associated with arm $i$, where $(Z_{i,s})_s$ is an i.i.d. sequence with $Z_{i,s}\sim P_i$. Formally,
\begin{align}\label{eq:rewardindepmodel}
X_t = Z_{A_t,T_{A_t}(t)}\,.
\end{align}
The advantage of introducing $(Z_{i,s})_s$ is that it allows a clean definition (without $Z_{i,s}$, how does one even define $\hat \mu_{i,s}$ if $T_i(n) \leq s$?). In particular, we let
\begin{align*}
\hat \mu_{i,s} &= \frac{1}{s} \sum_{u=1}^s Z_{i,u}\,.
\end{align*}
Note that $\hat \mu_{i,s} = \hat \mu_i(t)$ when $T_i(t)=s$ (formally: $\hat \mu_{i,T_i(t)} = \hat \mu_i(t)$).

Proof of Theorem
As in the analysis of the explore-then-commit strategy we start by writing the regret decomposition.
\begin{align*}
R_n = \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]\,.
\end{align*}
The rest of the proof revolves around bounding $\E[T_i(n)]$. Let $i$ be some sub-optimal arm (so that $\Delta_i > 0$). Following the suggested intuition we decompose $T_i(n)$ into two terms. The first measures the number of times the index of the optimal arm is less than $\mu_1 – \epsilon$. The second term measures the number of times that $A_t = i$ and its index is larger than $\mu_1 – \epsilon$.
\begin{align}
T_i(n)
&= \sum_{t=1}^n \one{A_t = i} \nonumber \\
&\leq \sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon} + \nonumber \\
&\qquad \sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}\,. \label{eq:ucb1}
\end{align}
The proof of the first part of the theorem is completed by bounding the expectation of each of these two sums. Starting with the first, we again use the concentration guarantee.
\begin{align*}
\EE{\sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon}}
&= \sum_{t=1}^n \Prob{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon} \\
&\leq \sum_{t=1}^n \sum_{s=1}^n \Prob{\hat \mu_{1,s} + \sqrt{\frac{2 \log f(t)}{s}} \leq \mu_1 – \epsilon} \\
&\leq \sum_{t=1}^n \sum_{s=1}^n \exp\left(-\frac{s\left(\sqrt{\frac{2 \log f(t)}{s}} + \epsilon\right)^2}{2}\right) \\
&\leq \sum_{t=1}^n \frac{1}{f(t)} \sum_{s=1}^n \exp\left(-\frac{s\epsilon^2}{2}\right) \\
&\leq \frac{5}{\epsilon^2}\,.
\end{align*}
The first inequality follows from the union bound over all possible values of $T_1(t-1)$. This is an important point. The concentration guarantee cannot be applied directly because $T_1(t-1)$ is a random variable and not a constant. The last inequality is an algebraic exercise. The function $f(t)$ was chosen precisely so this bound would hold. If $f(t) = t$ instead, then the sum would diverge. Since $f(n)$ appears in the numerator below we would like $f$ to be large enough that its reciprocal is summable and otherwise as small as possible. For the second term in \eqref{eq:ucb1} we use the previous lemma.
\begin{align*}
&\EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}} \\
&\qquad\leq \EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(n)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}} \\
&\qquad\leq \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} + \sqrt{\frac{2 \log f(n)}{s}} \geq \mu_1 – \epsilon}} \\
&\qquad= \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} – \mu_i + \sqrt{\frac{2 \log f(n)}{s}} \geq \Delta_i – \epsilon}} \\
&\qquad\leq 1 + \frac{2}{(\Delta_i – \epsilon)^2} \left(\log f(n) + \sqrt{\pi \log f(n)} + 1\right)\,.
\end{align*}
The first part of the theorem follows by substituting the results of the previous two displays into \eqref{eq:ucb1}. The second part follows by choosing $\epsilon = \log^{-1/4}(n)$ and taking the limit as $n$ tends to infinity.
QED

Next week we will see that UCB is close to optimal in several ways. As with the explore-then-commit strategy, the bound given in the previous theorem is not meaningful when the gaps $\Delta_i$ are small. Like that algorithm it is possible to prove a distribution-free bound for UCB by treating the arms $i$ with small $\Delta_i$ differently. Fix $\Delta>0$ to be chosen later. Then, from the proof of the bound on the regret of UCB we can derive that $\EE{T_i(n)} \le \frac{C \log(n)}{\Delta_i^2}$ holds for all $n\ge 2$ with some universal constant $C>0$. Hence, the regret can be bounded without dependence on the sub-optimality gaps by
\begin{align*}
R_n
&= \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]
= \sum_{i:\Delta_i < \Delta} \Delta_i \E[T_i(n)] + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)] \\
&< n \Delta + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)]
\leq n \Delta + \sum_{i:\Delta_i \geq \Delta} \frac{C \log n}{\Delta_i} \\
&\leq n \Delta + K\frac{C \log n}{\Delta}
= \sqrt{C K n \log(n)}\,,
\end{align*}
where in the last step we chose $\Delta = \sqrt{K C \log(n) / n}$, which optimizes the upper bound.

There are many directions to improve or generalize this result. For example, if more is known about the noise model besides that it is subgaussian, then this can often be exploited to improve the regret. The main example is the Bernoulli case, where one should make use of the fact that the variance is small when the mean is close to zero or one. Another direction is improving the worst-case regret to match the lower bound of $\Omega(\sqrt{Kn})$ that we will see next week. This requires a modification of the confidence level and a more complicated analysis.

# Notes

Note 1: Here we argue that there is no loss in generality in assuming that the rewards experienced satisfy $\eqref{eq:rewardindepmodel}$. Indeed, let $T’ = (A’_1,X’_1,\dots,A’_n,X’_n)$ be any sequence of random variables satisfying that $A_t’ = f_t(A’_1,X’_1,\dots,A’_{t-1},X’_{t-1})$ and that for any $U\subset \R$ open interval
\begin{align*}
\Prob{X_t’\in U\,|\,A’_1,X’_1,\dots,A’_{t-1},X’_{t-1},A’_t} = P_{A’_t}(U)\,,
\end{align*}
where $1\le t\le n$. Then, choosing $(Z_{i,s})_s$ as described in the paragraph before $\eqref{eq:rewardindepmodel}$, we let $T=(A_1,X_1,\dots,A_n,X_n)$ be such that $A_t = f_t(A_1,X_1,\dots,A_{t-1},X_{t-1})$ and $X_t$ be so that it satisfies $\eqref{eq:rewardindepmodel}$. It is not hard to see then that the distributions of $T$ and $T’$ agree. Hence, there is indeed no loss of generality by assuming that the rewards are indeed generated by $\eqref{eq:rewardindepmodel}$.

Note 2: The view that $n$ rewards are generated ahead of time for each arm and the algorithm consumes these rewards as it chooses an action was helpful in the proof as it reduced the argument to the study of averages of independent random variables. The analysis could also have been done directly without relying on the “virtual” rewards $(Z_{i,s})_s$ with the help of martingales, which we will meet later.
A third model of how $X_t$ is generated could have been that $X_t = Z_{A_t,t}$. We will meet this “skipping model” later when studying adversarial bandits. For the stochastic bandit models we study here, all these models coincide (they are indistinguishable in the sense described in the first note above).

Note 3: So is the optimism principle universal? Does it always give good algorithms, even in more complicated settings? Unfortunately, the answer is no. The optimism principle leads to reasonable algorithms when using an action gives feedback that informs the learner about how much the action is worth. If this is not true (i.e., in models where you have to choose action $B$ to learn about the rewards of action $A$, and choosing action $A$ would not give you information about the reward of action $A$), the principle fails! (Why?) Furthermore, even if all actions give information about their own value, the optimistic principle may give rise to algorithms whose regret is overly large compared to what could be achieved with more clever algorithms. Thus, in a way, finite-armed stochastic bandits is a perfect fit for optimistic algorithms. While the more complex feedback models may not make much sense at the moment, we will talk about them later.

# References

The idea of using upper confidence bounds appeared in ’85 in the landmark paper of Lai and Robbins. In this paper they introduced a strategy which plays the leader of the “often sampled” actions except that for any action $j$ in every $K$th round the strategy is checking whether the UCB index of arm $j$ is higher than the estimated reward of the leader. They proved that this strategy, when appropriately tuned, is asymptotically unimprovable the same way UCB as we defined it is asymptotically unimprovable (we still owe the definition of this and a proof, which will come soon). The cleaner UCB idea must have been ready to be found in ’95 because Agrawal and Katehakis & Robbins discovered this idea independently in that year. Auer et al. later modified the strategy slightly and proved a finite-time analysis.