Instance dependent lower bounds

In the last post we showed that under mild assumptions ($n = \Omega(K)$ and Gaussian noise), the regret in the worst case is at least $\Omega(\sqrt{Kn})$. More precisely, we showed that for every policy $\pi$ and $n\ge K-1$ there exists a $K$-armed stochastic bandit problem $\nu\in \mathcal E_K$ with unit variance Gaussian rewards and means in $[0,1]$ such that the regret $R_n(\pi,\nu)$ of $\pi$ on $\nu$ satisfies $R_n(\pi,\nu) \geq c\sqrt{nK}$ for some universal constant $c > 0$, or
R_n^*(\cE_K)\doteq\inf_\pi \sup_{\nu \in \cE_K} R_n(\pi,\nu)\ge c \sqrt{nK}\,.
Earlier we have also seen that the regret of UCB on such problems is at most $C \sqrt{nK \log(n)}$ with some universal constant $C>0$: $R_n(\mathrm{UCB},\cE_K)\le C \sqrt{nK\log(n)}$. Thus, UCB is near minimax-optimal, in the sense that except for a logarithmic factor its upper bound matches the lower bound for $\cE_K$.

Does this mean that UCB is necessarily a good algorithm for class $\cE_K$? Not quite! Consider the following modification of UCB, which we will call by the descriptive, but not particularly flattering name UCBWaste. Choose $0 < C' \ll C$, where $C$ is the universal constant in the upper bound on the regret of UCB (if we actually calculated the value of $C$, we could see that e.g. $C'=0.001$ would do). Then, in the first $m=C'\sqrt{Kn}$ rounds, UCBWaste will explore each of the $K$ actions equally often, after which it switches to the UCB strategy. It is easy to see that from the perspective of worst-case regret, UCBWaste is almost as good as UCB (being at most $C'\sqrt{Kn}$ worse). Would you accept UCBWaste as a good policy?

Perhaps not. To see why, consider for example a problem $\nu\in \cE_K$ such that on $\nu$ the immediate regret for choosing any suboptimal action is close to one. From the regret analysis of UCB we can show that after seeing a few rewards from some suboptimal action, with high probability UCB will stop using that action. As a result, UCB on $\nu$ would achieve a very small regret. In fact, it follows from our previous analysis that for sufficiently large $n$, UCB will achieve a regret of $\sum_{i:\Delta_i(\nu)>0} \frac{C\log(n)}{\Delta_i(\nu)} \approx C K \log(n)$, where $\Delta_i(\nu)$ is the suboptimality gap of action $i$ in $\nu$. However, UCBWaste, true to its name, will suffer a regret of at least $C’\sqrt{Kn/2}$ on $\nu$, a quantity that for $n$ large is much larger than the logarithmic bound on the regret of UCB.

Thus, on the “easy”, or “nice” instances, UCBWaste’s regret seems to be unnecessarily large at least as $n\to\infty$, even though UCBWaste is a near worst-case optimal algorithm for any $n$. If we care about what happens for $n$ large (and why should not we — after all, having higher standards should not hurt), it may be worthwhile to look beyond finite-time worst-case optimality.

Algorithms that are nearly worst-case optimal may fail to take advantage of environments that are not the worst case. What is more desirable is to have algorithms which are near worst-case optimal, while their performance gets better on “nicer” instances.

But what makes an instance “nice”? For a fixed $K$-action stochastic bandit environment $\nu$ with $1$-subgaussian reward-noise, the regret of UCB was seen to be logarithmic and in particular
\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq c^*(\nu) \doteq \sum_{i:\Delta_i(\nu) > 0} \frac{2}{\Delta_i(\nu)}\,.
Then perhaps $c^*(\nu)$ could be used as an intrinsic measure of the difficulty of learning on $\nu$. Or perhaps there exist some constant $c'(\nu)$ that is even smaller than $c^*(\nu)$ such that some strategy suffers at most $c'(\nu) \log(n)$ regret asymptotically? If, for example, $c'(\nu)$ is smaller by a factor of two than $c^*(\nu)$, then this could be a big difference!

In this post we will show that, in some sense, $c^*(\nu)$ is the best possible. In particular, we will show that no reasonable strategy can outperform UCB in the asymptotic sense above on any problem instance $\nu$. This is a big difference from the previous approach because the “order of quantifiers” is different: In the minimax result proven in the last post we showed that for any policy $\pi$ there exists a bandit instance $\nu \in \cE_K$ on which $\pi$ suffers large regret. Here we show that for all reasonable policies $\pi$, the policy’s asymptotic regret is always at least as large as that of UCB.

We have not yet said what we mean by reasonable. Clearly for any $\nu$, there are policies for which $\limsup_{n\to\infty} R_n / \log(n) = 0$. For example, the policy that chooses $A_t = \argmax_i \mu_i(\nu)$. But this is not a reasonable policy because it suffers linear regret for any $\nu’$ such that the optimal action of $\nu$ is not optimal in $\nu’$. At least, we should require that the policy achieves sublinear regret over all problems. This may still look a bit wasteful because UCB achieves logarithmic asymptotic regret for any $1$-subgaussian problem. A compromise between the undemanding sublinear and the perhaps overly restrictive logarithmic growth is to require that the policy has subpolynomial regret growth, a choice that has historically been the most used, and which we will also take. The resulting policies are said to be consistent.

Asymptotic lower bound

To firm up the ideas, define $\cE_K’$ as the class of $K$-action stochastic bandits where each action has a reward distribution with $1$-subgaussian noise. Further, let $\cE_K'(\mathcal N)\subset \cE_K’$ be those environments in $\cE_K’$ where the reward distribution is Gaussian. Note that for minimax lower bounds it is important to assume the sub-optimality gaps are bounded from above, this is not required in the asymptotic setting. The reason is that any reasonable (that word again) strategy should choose each arm at least once, which does not affect the asymptotic regret at all, but does affect the finite-time minimax regret.

Definition (consistency): A policy $\pi$ is consistent over $\cE$ if for all bandits $\nu \in \mathcal \cE$ and for all $p>0$ it holds that
R_n(\pi,\nu) = O(n^p)\, \quad \text{ as } n \to\infty\,.
We denote the class of consistent policies over $\cE$ by $\Pi_{\text{cons}}(\cE)$.

Note that $\eqref{eq:subpoly}$ trivially holds for $p\ge 1$. Note also that because $n \mapsto n^p$ is positive-valued, the above is equivalent to saying that for each $\nu \in \cE$ there exists a constant $C_\nu>0$ such that $R_n(\pi,\nu) \le C_\nu n^p$ for all $n\in \N$. In fact, one can also show that above we can also replace $O(\cdot)$ with $o(\cdot)$ with no effect on whether a policy is called consistent or not. Just like in statistics, consistency is a purely asymptotic notion.

We can see that UCB is consistent (its regret is logarithmic) over $\cE_K’$. On the other hand, the strategy that always chooses a fixed action $i\in K$ is not consistent (if $K > 1$) over any reasonable $\cE$, since its regret is linear in any $\nu$ where $i$ is suboptimal.

The main theorem of this post is that no consistent strategy can outperform UCB asymptotically in the Gaussian case:

Theorem (Asymptotic lower bound for Gaussian environments): For any policy $\pi$ consistent over the $K$-action unit-variance Gaussian environments $\cE_K'(\mathcal N)$ and any $\nu\in \cE_K'(\mathcal N)$, it holds that
\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \geq c^*(\nu)\,,
where recall that $c^*(\nu)= \sum_{i:\Delta_i(\nu)>0} \frac{2}{\Delta_i(\nu)}$.

Since UCB is consistent for $\cE_K’\supset \cE_K'(\mathcal N)$,
c^*(\nu) = \inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)}\le c^*(n)\,,
that is
\inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}
\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)} = c^*(\nu)\,.
Thus, we see that UCB is asymptotically optimal over the class of unit variance Gaussian environments $\cE_K'(\mathcal N)$.

Pick a $\cE_K'(\mathcal N)$-consistent policy $\pi$ and a unit variance Gaussian environment $\nu =(P_1,\dots,P_K)\in \cE_K'(\mathcal N)$ with mean rewards $\mu \in \R^K$. We need to lower bound $R_n \doteq R_n(\pi,\nu)$. Let $\Delta_i \doteq \Delta_i(\nu)$. Based on the basic regret decomposition identity $R_n = \sum_i \Delta_i \E_{\nu,\pi}[T_i(n)]$, it is not hard to see that the result will follow if for any action $i\in [K]$ that is suboptimal in $\nu$ we prove that
\liminf_{n\to\infty} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge \frac{2}{\Delta_i^2}\,.
As in the previous lower bound proof, we assume that all random variables are defined over the canonical bandit measurable space $(\cH_n,\cF_n)$, where $\cH_n = ([K]\times \R)^n$, $\cF_n$ is the restriction of $\mathcal L(R^{2n})$ to $\cH_n$, and $\PP_{\nu,\pi}$ (which induces $\E_{\nu,\pi}$) is the distribution of $n$-round action-reward histories induced by the interconnection of $\nu$ and $\pi$.

The intuitive logic of the lower bound is as follows: $\E_{\nu,\pi}[T_i(n)]$ must be (relatively) large because the regret in all environments $\nu’$ where (as opposed to what happens in $\nu$) action $i$ is optimal must be small by our assumption that $\pi$ is a “reasonable” policy and for this $\E_{\nu’,\pi}[T_i(n)]$ must be large. However, if the gap $\Delta_i$ is small and $\nu’$ is close to $\nu$, $\E_{\nu,\pi}[T_i(n)]$ will necessarily be large when $\E_{\nu’,\pi}[T_i(n)]$ is large because the distributions $\PP_{\nu,\pi}$ and $\PP_{\nu’,\pi}$ will also be close.

As in the previous lower bound proof, the divergence decomposition identity and the high-probability Pinsker inequality will help us to carry out this argument. In particular, from the divergence decomposition lemma, we get that for any $\nu’=(P’_1,\dots,P_K’)$ such that $P_j=P_j’$ except for $j=i$,
\KL(\PP_{\nu,\pi}, \PP_{\nu’,\pi}) = \E_{\nu,\pi}[T_i(n)] \, \KL(P_i, P’_i)\,,
while the high-probability Pinsker inequality gives that for any event $A$,
\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c)\ge \frac12 \exp(-\KL(\PP_{\nu,\pi}, \PP_{\nu’,\pi}) )\,,
or equivalently,
\KL(\PP_{\nu,\pi},\PP_{\nu’,\pi}) \ge \log \frac{1}{2 (\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c))}\,.
We now upper bound $\PP_{\nu,\pi}(A)+\PP_{\nu’,\pi}(A^c)$ in terms of $R_n+R_n’$ where $R_n’ = R_n(\nu’,\pi)$. For this choose $P_i’ = \mathcal N(\mu_i+\lambda,1)$ with $\lambda>\Delta_i$ to be selected later. Further, we choose $A= \{ T_i(n)\ge n/2 \}$ and thus $A^c = \{ T_i(n) < n/2 \}$. These choices make $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ small if $R_n$ and $R_n'$ are both small. Indeed, since $i$ is suboptimal in $\nu$ and $i$ is optimal in $\nu'$ and for any $j\ne i$, $\Delta_j'(\nu) \ge \lambda-\Delta_i$, \begin{align*} R_n \ge \frac{n\Delta_i}{2} \PP_{\nu,\pi}(T_i(n)\ge n/2 ) \, \text{ and } R_n' \ge \frac{n (\lambda-\Delta_i) }{2} \PP_{\nu',\pi}( T_i(n) < n/2 )\,. \end{align*}
Summing these up, lower bounding $\Delta_i/2$ and $(\lambda-\Delta_i)/2$ by $\kappa(\Delta_i,\lambda) \doteq \frac{\min( \Delta_i, \lambda-\Delta_i )}{2}$, and then reordering gives
\PP_{\nu,\pi}( T_i(n) \ge n/2 ) + \PP_{\nu’,\pi}( T_i(n) < n/2 ) \le \frac{R_n+R_n'}{ \kappa(\Delta_i,\lambda) \,n} \end{align*} as promised. Combining this inequality with $\eqref{eq:infoplem}$ and $\eqref{eq:pinsker}$ and using $\KL(P_i,P'_i) = \lambda^2/2$ we get \begin{align} \frac{\lambda^2}{2}\E_{\nu,\pi}[T_i(n)] \ge \log\left( \frac{\kappa(\Delta_i,\lambda)}{2} \frac{n}{R_n+R_n'} \right) = \log\left( \frac{\kappa(\Delta_i,\lambda)}{2}\right) + \log(n) - \log(R_n+R_n')\,. \end{align} Dividing both sides by $\log(n)$, we see that it remains to upper bound $\frac{\log(R_n+R_n')}{\log(n)}$. For this, note that for any $p>0$ $R_n+ R_n’ \le C n^p$ for all $n$ and some constant $C>0$. Hence, $\log(R_n+R_n’) \le \log(C) + p\log(n)$, from which we get that $\limsup_{n\to\infty} \frac{\log(R_n+R_n’)}{\log(n)} \le p$. Since $p>0$ was arbitrary, it follows that this also holds for $p=0$. Hence,
\frac{\lambda^2}{2} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge
1 – \limsup_{n\to\infty} \frac{\log(R_n+R_n’)}{\log(n)} \ge 1\,.
Taking the infimum of both sides over $\lambda>\Delta_i$ and reordering gives $\eqref{eq:armlowerbound}$, thus finishing the proof.


Instance-dependent finite-time lower bounds

Is it possible to go beyond asymptotic instance optimality? Yes, of course! All we have to do is to redefine what is meant by a “reasonable” algorithm. One idea is to call an algorithm reasonable when its finite-time(!) regret is close to minimax optimal. The situation is illustrated in the graph below:

Illustration of various optimality concepts.

Illustration of optimality concepts. The $x$ axis lists instances, ordered so that the “green” graph looks nice. Shown are the regret of various policies in different colors. While a near minimax algorithm can perform equally over all instances (and never much worse than the largest of the curves’ minimum values), a near instance optimal algorithm is required to come close to what is best on every instance using reasonable algorithms. The reasonable class can be the algorithms which are near minimax optimal.

In this part we will show instance-optimal finite-time lower bounds building on the ideas of the previous proof. This proof by and large uses the same ideas as the proof of the minimax result in the last post. This suggests that for future reference it may be useful to extract the common part, which summarizes what can be obtained by chaining the high-probability Pinsker inequality with the divergence decomposition lemma:

Lemma (Instance-dependent lower bound): Let $\nu = (P_i)_i,\nu’=(P’_i)$ be $K$-action stochastic bandit environments that differ only the distribution of the rewards for action $i\in [K]$. Further, assume that $i$ is suboptimal in $\nu$ and is optimal in $\nu’$, and in particular $i$ is the unique optimal action in $\nu’$. Let $\lambda = \mu_i(\nu’)-\mu_i(\nu)$. Then, for any policy $\pi$,
D(P_i,P_i’) \E_{\nu,\pi} [T_i(n)] \ge \log\left( \frac{\min\set{\lambda – \Delta_i, \Delta_i}}{4}\right) + \log(n) – \log(R_n+R_n’)\,.

Note that the lemma holds for finite $n$ and any $\nu$ and can be used to derive finite-time instance-dependent lower bounds for any environment class $\cE$ that is rich enough to include a pair $\nu’$ for any $\nu \in \cE$ and for policies that are uniformly good over $\cE$. For illustration consider the Gaussian unit variance environments $\cE\doteq \cE_K(\mathcal N)$ and policies $\pi$ whose regret, on any $\nu\in \cE$, is bounded by $C n^p$ for some $C>0$ and $p>0$. Call this class $\Pi(\cE,C,n,p)$, the class of $p$-order policies. Note that UCB is in this class with $C = C’\sqrt{K \log(n)}$ and $p=1/2$ with some $C’>0$. Thus, for $p\ge 1/2$ and suitable $C$ this class is not empty.

As an immediate corollary of the previous lemma we get the following result:

Theorem (Finite-time, instance-dependent lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$ and $\nu \in \cE$, the regret of $\pi$ on $\nu$ satisfies
R_n(\pi,\nu) \ge
\frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{(1-p)\log(n) + \log(\frac{\Delta_i}{8C})}{\Delta_i}\right)^+\,,
where $(x)^+ = \max(x,0)$ is the positive part of $x\in \R$.

Proof: Given $\nu=(\mathcal N(\mu_i,1))_i$, $i$ such that action $i$ is suboptimal in $\nu$, choose $\nu_i’ = (\mathcal N(\mu_j’,1))_j$ such that $\mu_j’=\mu_j$ unless $j=i$, in which make $\mu_i’ = \mu_i+\lambda$, $\lambda\le 2\Delta_i(\nu)$. Then, $\nu’\in \cE$ and from $\log(R_n+R_n’) \le \log(2C)+p\log(n)$ and $\eqref{eq:idalloclb}$, we get
\E_{\nu,\pi} [T_i(n)]
\ge \frac{2}{\lambda^2}\left(\log\left(\frac{n}{2Cn^p}\right) + \log\left(\frac{\min\set{\lambda – \Delta_i, \Delta_i}}{4}\right)\right)
Choosing $\lambda = 2\Delta_i$ and plugging this into the basic regret decomposition identity gives the result.


With $p=1/2$, $C = C’\sqrt{K}$ with $C’>0$ suitable so that $\Pi(\cE,C’\sqrt{K},n,1/2)\ne \emptyset$ we get for any policy $\pi$ that is “near-minimax optimal” for $\cE$ that
R_n(\pi,\nu) \ge \frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{\frac12\log(n) + \log(\frac{\Delta_i}{8C’ \sqrt{K}})}{\Delta_i}\right)^+\,.
The main difference to the asymptotic lower bound is twofold: First, when $\Delta_i$ is very small, the corresponding term is eliminated in the lower bound: For small $\Delta_i$, the impact of choosing for $n$ small action $i$ is negligible. Second, even when $\Delta_i$ are all relatively large, the leading term in this lower bound is approximately half that of $c^*(\nu)$. This effect may be real: The class of policies considered is larger than in the asymptotic lower bound, hence there is the possibility that the policy that is best tuned for a given environment achieves a smaller regret. At the same time, we only see a constant factor difference.

The lower bound $\eqref{eq:indnearminimaxlb}$ should be compared to the upper bound derived for UCB. In particular, recall that the simplified form of an upper bound was $\sum_{i:\Delta_i>0} \frac{C \log(n)}{\Delta_i}$. We see that the main difference to the above bound is the lack of the second term in $\eqref{eq:indnearminimaxlb}$. With some extra work, this gap can also be closed.

We now argue that the previous lower bound is not too weak in that in some sense it allows us to recover our previous minimax lower bound. In particular, choosing $\Delta_i= 8 \rho C n^{p-1}$ uniformly for all but one optimal action, we get $(1-p)\log(n)+\log(\frac{\Delta_i}{8C}) = \log(\rho)$. Plugging this into $\eqref{eq:finiteinstancebound}$ and lower bounding $\sup_{\rho>0}\log(\rho)/\rho = \exp(-1)$ by $0.35$ gives the following corollary:

Corollary (Finite-time lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$,
\sup_{\nu\in \cE} R_n(\pi,\nu) \ge \frac{0.35}{16} \frac{K-1}{C} n^{1-p} \,.

When $C = C’\sqrt{K}$ and $p=1/2$, we get the familiar $\Omega( \sqrt{Kn} )$ lower bound. However, note the difference: Whereas the previous lower bound was true for any policy, this lower bound holds only for policies in $\Pi(\cE,C’\sqrt{K},n,1/2)$. Nevertheless, it is reassuring that the instance-dependent lower bound is able to recover the minimax lower bound for the “near-minimax” policies. And to emphasize the distinction even further. In our minimax lower bound we only showed there exists an environment for which the lower bound holds, while the result above holds no matter which arm we choose to be the optimal one. This is only possible because we restricted the class of algorithms.


So UCB is close to minimax optimal (last post) and now we have seen that at least among the class of consistent strategies it is also asymptotically optimal and also almost optimal for any instance amongst the near-minimax optimal algorithm even in a non-asymptotic sense. One might ask if there is anything left to do, and like always the answer is: Yes, lots!

For one, we have only stated the results for the Gaussian case. Similar results are easily shown for alternative classes. A good exercise is to repeat the derivations in this post for the class of Bernoulli bandits, where the rewards of all arms are sampled from a Bernoulli distribution.

There is also the question of showing bounds that hold with high probability rather than in expectation. So far we have not discussed any properties of the distribution of the regret other than its mean, which we have upper bounded for UCB and lower bounded here and in the last post. In the next few posts we will develop algorithms for which the regret is bounded with high probability and so the lower bounds of this nature will be delayed for a little while.

Notes, references

There are now a wide range of papers with lower bounds for bandits. For asymptotic results the first article below was the earliest that we know of, while others are generalizations.

For non-minimax results that hold in finite time there are very recently a variety of approaches (and one older one). They are:

Finally, for the sake of completeness we note that minimax bounds are available in the following articles. The second of which deals with the high-probability case mentioned above.

Note 1. All of the above results treat all arms symmetrically. Sometimes one might want an algorithm that treats the arms differently and for which the regret guarantee depends on which arm is optimal. We hope eventually to discuss the most natural approach to this problem, which is to use a Bayesian strategy (though we will be interested in it for other reasons). Meanwhile, there are modifications of UCB that treat the actions asymmetrically so that its regret is small if some arms are optimal and large otherwise.

We simply refer the interested reader to a paper on the Pareto regret frontier for bandits, which fully characterizes the trade-offs available to asymmetric algorithms in terms of the worst-case expected regret when comparing to a fixed arm.

Note 2. The leading constant in the instance dependent bound is sub-optimal. This can be improved by a more careful choice of $\lambda$ that is closer to $\Delta_i$ than $2\Delta_i$ and dealing with the lower-order terms that this introduces.

More information theory and minimax lower bounds

Continuing the previous post, we prove the claimed minimax lower bound.

We start with a useful result that quantifies the difficulty of identifying whether or not an observation is drawn from similar distributions $P$ and $Q$ defined over the same measurable space $(\Omega,\cF)$. In particular, the difficulty will be shown to increase as a function of the relative entropy of $P$ with respect to $Q$. In the result below, as usual in probability theory, for $A\subset \Omega$, $A^c$ denotes the complement of $A$ with respect to $\Omega$: $A^c = \Omega \setminus A$.

Theorem (High-probability Pinsker): Let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \cF)$ and let $A \in \cF$ be an arbitrary event. Then,
P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)\,.

The logic of the result is as follows: If $P$ and $Q$ are similar to each other (e.g., in the sense that $\KL(P,Q)$ is small) then for any event $A\in \cF$, $P(A)\approx Q(A)$ and $P(A^c) \approx Q(A^c)$. Since $P(A)+P(A^c)=1$, we thus also expect $P(A)+Q(A^c)$ to be “large”. How large is “large” is the contents of the result. In particular, using $2 \max(a,b)\ge a+b$, it follows that for any $0<\delta\le 1$, to guarantee $\max(P(A),Q(A^c))\le \delta$, it is necessary that $\KL(P,Q)\ge \log(\frac{1}{4\delta})$, that is relative entropy of $P$ with respect to $Q$ must be larger than the right-hand side. Connecting the dots, this means that the information coming from an observation drawn from $P$ relative to expecting an observation from $Q$ must be just large enough.

Note that if $P$ is not absolutely continuous with respect to $Q$ then $\KL(P,Q)=\infty$ and the result is vacuous. Also note that the result is symmetric. We could replace $\KL(P, Q)$ with $\KL(Q, P)$, which sometimes leads to a stronger result because the relative entropy is not symmetric.

This theorem is useful for proving lower bounds because it implies that at least one of $P(A)$ and $Q(A^c)$ is large. Then, in an application where after some observations a decision needs to be made, $P$ and $Q$ are selected such that the a correct decision under $P$ is incorrect under $Q$ and vice versa. Then one lets $A$ be the event when the decision is incorrect under $P$ and $A^c$ is the event that the decision is incorrect under $Q$. The result then tells us that it is impossible to design a decision rule such that the probability of an incorrect decision is small under both $P$ and $Q$.

To illustrate this, consider the normal means problem when $X\sim N(\mu,1)$ is observed with $\mu \in \{0,\Delta\}$. In this case, we let $P$ the distribution of $X$ under $\mu=0$, $Q$ the distribution of $X$ under $\mu=\Delta$ and if $f: \R \to \{0,\Delta\}$ is the measurable function that is proposed as the decision rule, $A = \{\, x\in \R \,: f(x) \ne 0 \}$. Then, using e.g. $\Delta=1$, the theorem tells us that
P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)
= \frac{1}{2}\, \exp\left(-\frac{\Delta^2}{2\sigma^2}\right)
= \frac{1}{2}\, \exp\left(-1/2\right) \geq 3/10\,.
This means that no matter how we chose our decision rule $f$, we simply do not have enough data to make a decision for which the probability of error on either $P$ or $Q$ is smaller than $3/20$. All that remains is to apply this idea to bandits and we will have our lower bound. The proof of the above result is postponed to the end of the post.

After this short excursion into information theory, let us return to the world of $K$-action stochastic bandits. In what follows we fix $n>0$, the number of rounds and $K>0$, the number of actions. Recall that given a $K$-action bandit environment $\nu = (P_1,\dots,P_K)$ ($P_i$ is the distribution of rewards for action $i$) and a policy $\pi$, the outcome of connecting $\pi$ and $\nu$ was defined as the random variables $(A_1,X_1,\dots,A_n,X_n)$ such that for $t=1,2,\dots,n$, the distribution of $A_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1}$ is uniquely determined by the policy $\pi$, while the distribution of $X_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1},A_t$ was required to be $P_{A_t}$. While the probability space $(\Omega,\cF,\PP)$ that carries these random variables can be chosen in many different ways, you should convince yourself that the constraints mentioned uniquely determine the distribution of the random element $H_n = (A_1,X_1,\dots,A_n,X_n)$ (or wait two paragraphs). As such, despite that $(\Omega,\cF,\PP)$ is non-unique, we can legitimately define $\PP_{\nu,\pi}$ to be the distribution of $H_n$ (suppressing dependence on $n$).

In fact, we will find it useful to write this distribution with a density. To do this, recall from the Radon-Nykodim theorem that if $P,Q$ are measures over a common measurable space $(\Omega,\cF)$ such that $Q$ is $\sigma$-finite and $P\ll Q$ (i.e., $P$ is absolutely continuous w.r.t. $Q$, which is sometimes written as $Q$ dominates $P$) then there exists a random variable $f:\Omega \to \R$ such that
\int_A f dQ = \int_A dP, \qquad \forall A\in \cF\,.
The function $f$ is said to be the density of $P$ with respect to $Q$, also denoted by $\frac{dP}{dQ}$. Since writing $\forall A\in \cF$, $\int_A \dots = \int_A \dots $ is a bit redundant, to save typing, following standard practice, in the future we will write $f dQ = dP$ (or $f(\omega) dQ(\omega) = dP(\omega)$ to detail the dependence on $\omega\in \Omega$) as a shorthand for $\eqref{eq:rnagain}$. This also explains why we write $f = \frac{dP}{dQ}$.

With this, it is not hard to verify that for any $(a_1,x_1,\dots,a_n,x_n)\in \cH_n \doteq ([K]\times \R)^n$ we have
& = \pi_1(a_1) p_{a_1}(x_1) \, \pi_2(a_2|a_1,x_1) p_{a_2}(x_2)\cdot \\
& \quad \cdots \pi_n(a_n|a_1,x_1,\dots,a_{n-1},x_{n-1} )\,p_{a_n}(x_n) \\
& \qquad \times d\lambda(x_1) \dots d\lambda(x_n) d\rho(a_1) \dots d\rho(a_n)\, ,

  • $\pi = (\pi_t)_{1\le t \le n}$ with $\pi_t(a_t|a_1,x_1,\dots,a_{t-1},x_{t-1})$ being the probability that policy $\pi$ in round $t$ chooses action $a_t\in [K]$ when the sequence of actions and rewards in the previous rounds are $(a_1,x_1,\dots,a_{t-1},x_{t-1})\in \cH_{t-1}$;
  • $p_i$ is defined by $p_i d\lambda = d P_i$, where $\lambda$ is a common dominating measure for $P_1,\dots,P_K$, and
  • $\rho$ is the counting measure on $[K]$: For $A\subset [K]$, $\rho(A) = |A \cap [K]|$.

There is no loss of generality in assuming that $\lambda$ exist: For example, we can always use $\lambda = P_1+\dots+P_K$. In a note at the end of the post, for full mathematical rigor, we add another condition to the definition of $\pi$. (Can you guess what this condition should be?)

Note that $\PP_{\nu,\pi}$, as given by the right-hand side of the above displayed equation, is a distribution over $(\cH_n,\cF_n)$ where $\cF_n=\mathcal L(\R^{2n})|_{\cH_n}$. Here, $\mathcal L(\R^{2n})$ is the Lebesgue $\sigma$-algebra over $\R^{2n}$ and $\mathcal L(\R^{2n})|_{\cH_n}$ is its restriction to $\cH_n$.

As mentioned in the previous post, given any two bandit environments, $\nu=(P_1,\dots,P_K)$ and $\nu’=(P’_1,\dots,P’_K)$, we expect $\PP_{\nu,\pi}$ and $\PP_{\nu’,\pi}$ to be close to each other if all of $P_i$ and $P_i’$ are close to each other, regardless of the policy $\pi$ (a sort of continuity result). Our next result will make this precise; in fact we will show an identity that expresses $\KL( \PP_{\nu,\pi}, \PP_{\nu’,\pi})$ as a function $\KL(P_i,P’_i)$, $i=1,\dots,K$.

Before stating the result, we need to introduce the concept of ($n$-round $K$-action) canonical bandit models. The idea is to fix the measurable space $(\Omega,\cF)$ (that holds the variables $(X_t,A_t)_t$) in a special way. In particular, we set $\Omega=\cH_n$ and $\cF=\cF_n$. Then, we define $A_t,X_t$, $t=1,\dots,n$ to be the coordinate projections:
A_t( a_1,x_1,\dots,a_n,x_n ) &= a_t\,, \text{ and }\\
X_t( a_1,x_1,\dots,a_n,x_n) &= x_t,\, \quad \forall t\in [n]\,\text{ and } \forall (a_1,x_1,\dots,a_n,x_n)\in \cH_n\,.
Now, given some policy $\pi$ and bandit environment $\nu$, if we set $\PP=\PP_{\nu,\pi}$ then, trivially, the distribution of $H_n=(A_1,X_1,\dots,A_n,X_n)$ is indeed $\PP_{\nu,\pi}$ as expected. Thus, $A_t,X_t$ as defined by $\eqref{eq:coordproj}$, satisfies the requirements we posed against the random variables that describe the outcome of the interaction of a policy and an environment. However, $A_t,X_t$ are fixed, no matter the choice of $\nu,\pi$. This is in fact the reason why it is convenient to use the canonical bandit model! Furthermore, since all our past and future statements concern only the properties of $\PP_{\nu,\pi}$ (and not the minute details of the definition of the random variables $(A_1,X_1,\dots,A_n,X_n)$), there is no loss of generality in assuming that in all our expressions $(A_1,X_1,\dots,A_n,X_n)$ is in fact defined over the canonical bandit model using $\eqref{eq:coordproj}$. (We avoided initially the canonical model because up to now there was no advantage to introducing and using it.)

Before we state our promised result, we introduce one last notation that we need: Since we use the same measurable space $(\cH_n,\cF_n)$ with multiple environments $\nu$ and policies $\pi$, which give rise to various probability measures $\PP = \PP_{\nu,\pi}$, when writing expectations we will index $\E$ the same way as $\PP$ is indexed. Thus, $\E_{\nu,\pi}$ denotes expectation underlying $\PP_{\nu,\pi}$ (i.e., $\E_{\nu,\pi}[X] = \int X d\PP_{\nu,\pi}$. When $\pi$ is fixed, we will also drop $\pi$ from the indices and just write $\PP_{\nu}$ (and $\E_{\nu}$) for $\PP_{\nu,\pi}$ (respectively, $\E_{\nu,\pi}$).

With this, we are ready to state the promised result:

Lemma (Divergence decomposition): Let $\nu=(P_1,\ldots,P_K)$ be the reward distributions associated with one $K$-armed bandit, and let $\nu’=(P’_1,\ldots,P’_K)$ be the reward distributions associated with the another $K$-armed bandit. Fix some policy $\pi$ and let $\PP_\nu = \PP_{\nu,\pi}$ and $\PP_{\nu’} = \PP_{\nu’,\pi}$ be the distributions induced by the interconnection of $\pi$ and $\nu$ (respectively, $\pi$ and $\nu’$). Then
\KL(\PP_{\nu}, \PP_{\nu’}) = \sum_{i=1}^K \E_{\nu}[T_i(n)] \, \KL(P_i, P’_i)\,.

In particular the lemma implies that $\PP_{\nu}$ and $\PP_{\nu’}$ will be close if for all $i$, $P_i$ and $P’_i$ were close. A nice feature of the lemma is that it provides not only a bound on the divergence of $\PP_{\nu}$ from $\PP_{\nu’}$, but it actually expresses this with an identity.

First, assume that $\KL(P_i,P’_i)<\infty$ for all $i\in [K]$. From this, it follows that $P_i\ll P’_i$. Define $\lambda = \sum_i P_i + P’_i$ and let $p_i = \frac{dP_i}{d\lambda}$ and let $p’_i = \frac{dP’_i}{d\lambda}$. By the chain rule of Radon-Nikodym derivatives, $\KL(P_i,P’_i) = \int p_i \log( \frac{p_i}{p’_i} ) d\lambda$.

Let us now evaluate $\KL( \PP_{\nu}, \PP_{\nu’} )$. By the fundamental identity for the relative entropy, this is equal to $\E_\nu[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} ]$, where $\frac{d\PP_\nu}{d\PP_{\nu’}}$ is the Radon-Nikodym derivative of $\PP_\nu$ with respect to $\PP_{\nu’}$. It is easy to see that both $\PP_\nu$ and $\PP_{\nu’}$ can be written in the form given by $\eqref{eq:jointdensity}$ (this is where we use that $\lambda$ dominates all of $P_i$ and $P_i’$). Then, using the chain rule we get
\log \frac{d\PP_\nu}{d\PP_{\nu’}} = \sum_{t=1}^n \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)}\,
(the terms involving the policy cancel!). Now, taking expectations of both sides, we get
\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} \right]
= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right]\,,
\E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right]
& = \E_{\nu}\left[ \E_{\nu}\left[\log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \, \big|\, A_t \right] \, \right]
= \E_{\nu}\left[ \KL(P_{A_t},P’_{A_t}) \right]\,,
where in the second equality we used that under $\PP_{\nu}$, conditionally on $A_t$, the distribution of $X_t$ is $dP_{A_t} = p_{A_t} d\lambda$. Plugging back into the previous display,
\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu’}} \right]
&= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right] \\
&= \sum_{t=1}^n \E_{\nu}\left[ \KL(P_{A_t},P’_{A_t}) \right] \\
&= \sum_{i=1}^K \E_{\nu}\left[ \sum_{t=1}^n \one{A_t=i} \KL(P_{A_t},P’_{A_t}) \right] \\
&= \sum_{i=1}^K \E_{\nu}\left[ T_i(n)\right]\KL(P_{i},P’_{i})\,.

When the right-hand side of $\eqref{eq:infoproc}$ is infinite, by our previous calculation, it is not hard to see that the left-hand side will also be infinite.


Using the previous theorem and lemma we are now in a position to present and prove the main result of the post, a worst-case lower bound on the regret of any algorithm for finite-armed stochastic bandits with Gaussian noise. Recall that previously we used $\cE_K$ to denote the class of $K$ action bandit environments where the reward distributions have means in $[0,1]^K$ and the noise in the rewards have 1-subgaussian tails. Denote by $G_\mu\in \cE_K$ the bandit environment where all distributions are Gaussian with unit variance and the vector of means is $\mu\in [0,1]^K$. To make the dependence of the regret on the policy and the environment explicit, in accordance with the notation of the previous post, we will use $R_n(\pi,E)$ to denote the regret of policy $\pi$ on a bandit instance $E$.

Theorem (Worst-case lower bound): Fix $K>1$ and $n\ge K-1$. Then, for any policy $\pi$ there exists a mean vector $\mu \in [0,1]^K$ such that
R_n(\pi,G_\mu) \geq \frac{1}{27}\sqrt{(K-1) n}\,.

Since $G_\mu \in \cE_K$, it follows that the minimax regret for $\cE_K$ is lower bounded by the right-hand side of the above display as soon as $n\ge K-1$:
R_n^*(\cE_K) \geq \frac{1}{27}\sqrt{(K-1) n}\,.
The idea of the proof is illustrated on the following figure:

The idea of the minimax lower bound.

The idea of the minimax lower bound. Given the policy and one environment, the evil antagonist picks another environment so that the policy will suffer a large regret in at least one of the environments.

Fix a policy $\pi$. Let $\Delta\in [0,1/2]$ be some constant to be chosen later. As described in the previous post, we choose two Gaussian environments with unit variance. In the first environment, the vector of means of the rewards per arm is
\mu = (\Delta, 0, 0, \dots,0)\,.
This environment and $\pi$ gives rise to the distribution $\PP_{G_\mu,\pi}$ on the canonical bandit model $(\cH_n,\cF_n)$. For brevity, we will use $\PP_{\mu}$ in place of $\PP_{G_\mu,\pi}$. The expectation under $\PP_{\mu}$ will be denoted by $\E_{\mu}$. Recall that on measurable space $(\cH_n,\cF_n)$, $A_t,X_t$ are just the coordinate projections defined using $\eqref{eq:coordproj}$.

To choose the second environment, pick
i = \argmin_{j\ne 1} \E_{\mu}[ T_j(n) ]\,.
Note that $\E_{\mu}[ T_i(n) ] \le n/(K-1)$ because otherwise we would have $\sum_{j\ne 1} \E_{\mu}[T_j(n)]>n$, which is impossible. Note also that $i$ depends on $\pi$. The second environment has means
\mu’ = (\Delta,0,0, \dots, 0, 2\Delta, 0, \dots, 0 )\,,
where specifically $\mu’_i = 2\Delta$. In particular, $\mu_j = \mu’_j$ except at index $i$. Note that the optimal action in $G_{\mu}$ is action one, while the optimal action in $G_{\mu’}$ is action $i\ne 1$. We again use the shorthand $\PP_{\mu’} = \PP_{G_{\mu’},\pi}$.

Intuitively, if $\pi$ chooses action one infrequently while interacting with $\mu$ (i.e., if $T_1(n)$ is small with high probability), then $\pi$ should do poorly in $G_{\mu}$, while if it chooses action one frequently while interacting with $G_{\mu’}$, then it will do poorly in $G_{\mu’}$. In particular, by the basic regret decomposition identity and a simple calculation,
R_n(\pi,G_\mu) %= \E_\mu[ (n-T_1(n)) \Delta ] \ge \E_\mu[ \one{T_1(n)\le n/2} ]\, \left(n-\frac{n}{2}\right)\, \Delta =
\ge \PP_{\mu}(T_1(n)\le n/2) \frac{n\Delta}{2}\, \quad \text{and} \quad
R_n(\pi,G_{\mu’}) > \PP_{\mu’}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.
%R_n(\pi,G_{\mu’}) %\ge \E_{\mu’}[ T_1(n) \Delta ] > \E_{\mu’}[ \one{T_1(n)> n/2} ]\, \frac{n\Delta}{2} =
%\ge \PP_{\mu’}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.
Chaining this with the high-probability Pinsker inequality,
R_n(\pi,G_\mu) + R_n(\pi,G_{\mu’})
& > \frac{n\Delta}{2}\,\left( \PP_{\mu}(T_1(n)\le n/2) + \PP_{\mu’}(T_1(n)> n/2) \right) \\
& \ge \frac{n\Delta}{4}\, \exp( – \KL(\PP_{\mu},\PP_{\mu’}) )\,.
It remains to upper bound $\KL(\PP_{\mu},\PP_{\mu’})$, i.e., to show that $\PP_\mu$ and $\PP_{\mu’}$ are indeed not far. For this, we use the divergence decomposition lemma, the definitions of $\mu$ and $\mu’$ and then exploit the choice of $i$ to get
\KL(\PP_\mu, \PP_{\mu’}) = \E_{\mu}[T_i(n)] \KL( \mathcal N(0,1), \mathcal N(2\Delta,1) ) =
\E_{\mu}[T_i(n)]\, \frac{(2\Delta)^2}{2} \leq \frac{2n\Delta^2}{K-1} \,.
Plugging this into the previous display, we find that
R_n(\pi,G_\mu) + R_n(\pi,G_{\mu’}) \ge \frac{n\Delta}{4}\, \exp\left( – \frac{2n\Delta^2}{K-1} \right)\,.
The result is completed by tuning $\Delta$. In particular, the optimal choice for $\Delta$ happens to be $\Delta = \sqrt{(K-1)/4n}$, which is below $1/2$ when $n$ is large compared to $K$, as postulated in the conditions of the theorem. The final steps are lower bounding $\exp(-1/2)$ and using $2\max(a,b) \ge a+b$.


This concludes the proof of the worst-case lower bound, showing that no algorithm can do better than $O(\sqrt{nK})$ over all problems simultaneously. Coming in the next post we will show that the asymptotic logarithmic regret of UCB is essentially not improvable when the noise is Gaussian.

In the remainder of the post, we give the proof of the “high probability” Pinsker inequality, followed by miscellaneous thoughts.

Proof of the Pinsker-type inequality

It remains to show the proof of the Pinsker-type inequality stated above. In the proof, to save space, we use the convention of denoting the minimum of $a$ and $b$ by $a\wedge b$, while we denote their maximum by $a \vee b$.

If $\KL(P,Q)=\infty$, there is nothing to be proven. On the other hand, by our previous theorem on relative entropy, $\KL(P,Q)<\infty$ implies that $P \ll Q$. Take $\nu = P+Q$. Note that $P,Q\ll \nu$. Hence, we can define $p = \frac{dP}{d\nu}$ and $q = \frac{dQ}{d\nu}$ and use the chain rule for Radon-Nikodym derivatives to derive that $\frac{dP}{dQ} \frac{dQ}{d\nu} = \frac{dP}{d\nu}$, or $\frac{dP}{dQ} = \frac{\frac{dP}{d\nu}}{\frac{dQ}{d\nu}}$. Thus,
\KL(P,Q) = \int p \log(\frac{p}{q}) d\nu\,.
For brevity, when writing integrals with respect to $\nu$, in this proof, we will drop $d\nu$. Thus, we will write, for example $\int p \log(p/q)$ for the above integral.

Instead of $\eqref{eq:pinskerhp}$, we prove the stronger result that
\int p \wedge q \ge \frac12 \, \exp( – \KL(P,Q) )\,.
This indeed is sufficient since $\int p \wedge q = \int_A p \wedge q + \int_{A^c} p \wedge q \le \int_A p + \int_{A^c} q = P(A) + Q(A^c)$.

We start with an inequality attributed to Lucien Le Cam, which lower bounds the left-hand side of $\eqref{eq:pinskerhp2}$. The inequality states that
\int p \wedge q \ge \frac12 \left( \int \sqrt{pq} \right)^2\,.
Starting from the right-hand side above, using $pq = (p\wedge q) (p \vee q)$ and then Cauchy-Schwartz, we get
\left( \int \sqrt{pq} \right)^2
= \left(\int \sqrt{(p\wedge q)(p\vee q)} \right)^2
\le \left(\int p \wedge q \right)\, \left(\int p \vee q\right)\,.
Now, using $p\wedge q + p \vee q = p+q$, the proof is finished by substituting $\int p \vee q = 2-\int p \wedge q \le 2$ and dividing both sides by two.

Thus, it remains to lower bound the right-hand side of $\eqref{eq:lecam}$. For this, we use Jensen’s inequality. First, we write $(\cdot)^2$ as $\exp (2 \log (\cdot))$ and then move the $\log$ inside the integral:
\left(\int \sqrt{pq} \right)^2 &= \exp\left( 2 \log \int \sqrt{pq} \right)
= \exp\left( 2 \log \int p \sqrt{\frac{q}{p}} \right) \\
&\ge \exp\left( 2 \int p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\
&= \exp\left( 2 \int_{pq>0} p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\
&= \exp\left( – \int_{pq>0} p \log \left(\frac{p}{q}\right)\, \right) \\
&= \exp\left( – \int p \log \left(\frac{p}{q}\right)\, \right)\,.
Here, in the fourth and the last step we used that since $P\ll Q$, $q=0$ implies $p=0$, hence $p>0$ implies $q>0$, and eventually $pq>0$. Putting together the inequalities finishes the proof.


Note 1: We used the Gaussian noise model because the KL divergences are so easily calculated in this case, but all that we actually used was that $\KL(P_i, P_i’) = O((\mu_i – \mu_i’)^2)$ when the gap between the means $\Delta = \mu_i – \mu_i’$ is small. While this is certainly not true for all distributions, it very often is. Why is that? Let $\{P_\mu : \mu \in \R\}$ be some parametric family of distributions on $\Omega$ and assume that distribution $P_\mu$ has mean $\mu$. Then assuming the densities are twice differentiable and that everything is sufficiently “nice” that integrals and derivatives can be exchanged (as is almost always the case), we can use a Taylor expansion about $\mu$ to show that
\KL(P_\mu, P_{\mu+\Delta})
&\approx \KL(P_\mu, P_\mu) + \left.\frac{\partial}{\partial \Delta} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta + \frac{1}{2}\left.\frac{\partial^2}{\partial\Delta^2} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta^2 \\
&= \left.\frac{\partial}{\partial \Delta} \int_{\Omega} \log\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) dP_\mu \right|_{\Delta = 0}\Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \log\left(\frac{dP_{\mu+\Delta}}{dP_{\mu}}\right) \right|_{\Delta=0} dP_\mu \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \frac{dP_{\mu+\Delta}}{dP_{\mu}} \right|_{\Delta=0} dP_\mu \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= – \frac{\partial}{\partial \Delta} \left.\int_{\Omega} \frac{dP_{\mu+\Delta}}{dP_{\mu}} dP_\mu \right|_{\Delta=0} \Delta
+ \frac12 I(\mu) \Delta^2 \\
&= -\left.\frac{\partial}{\partial \Delta} \int_{\Omega} dP_{\mu+\Delta} \right|_{\Delta=0}\Delta
+ \frac12 I(\mu) \Delta^2 \\
&= \frac12 I(\mu)\Delta^2\,,
where $I(\mu)$, introduced in the second line, is called the Fisher information of the family $(P_\mu)_{\mu}$ at $\mu$. Note that if $\lambda$ is a common dominating measure for $(P_{\mu+\Delta})$ for $\Delta$ small, $dP_{\mu+\Delta} = p_{\mu+\Delta} d\lambda$ and we can write
I(\mu) = -\int \left.\frac{\partial^2}{\partial \Delta^2} \log\, p_{\mu+\Delta}\right|_{\Delta=0}\,p_\mu d\lambda\,,
which is the form that is usually given in elementary texts. The upshot of all this is that $\KL(P_\mu, P_{\mu+\Delta})$ for $\Delta$ small is indeed quadratic in $\Delta$, with the scaling provided by $I(\mu)$, and as a result the worst-case regret is always $O(\sqrt{nK})$, provided the class of distributions considered is sufficiently rich and not too bizarre.

Note 2: In the previous post we showed that the regret of UCB is at most $O(\sqrt{nK \log(n)})$, which does not match the lower bound that we just proved. It is possible to show two things. First, that UCB does not achieve $O(\sqrt{nK})$ regret. The logarithm is real, and not merely a consequence of lazy analysis. Second, there is a modification of UCB (called MOSS) that matches the lower bound up to constant factors. We hope to cover this algorithm in future posts, but the general approach is to use a less conservative confidence level.

Note 3: We have now shown a lower bound that is $\Omega(\sqrt{nK})$, while in the previous post we showed an upper bound for UCB that was $O(\log(n))$. What is going on? Do we have a contradiction? Of course, the answer is no and the reason is that the logarithmic regret guarantees depended on the inverse gaps between the arms, which may be very large.

Note 4: Our lower bounding theorem was only proved when $n$ is at least linear in $K$. What happens when $n$ is small compared to $K$ (in particular, $n\le (K-1)$) above? A careful investigation of the proof of the theorem shows that in this case a linear lower bound of size $n/8\exp(-1/2)$ also holds for the regret (choose $\Delta=1/2$). Note that for small values of $n$ compared to $K$, the square-root lower bound gives larger than linear regret owning to that for $0\le x\le 1$, $\sqrt{x}\ge x$ and if the mean rewards are restricted the $[0,1]$ interval than the regret can not be larger than $n$, i.e., for $n$ small, the lower bound is vacuous (in particular, this happens when $n\le (K-1)/27$). Hence, the square-root form is simply too strong to hold for small values of $n$. A lower bound, which joins the two cases and thus holds for any value of $n$ regardless of $K$ is $\frac{\exp(-1/2)}{16}\, \min( n, \sqrt{(K-1)n} )$.

Note 5: To guarantee that the right-hand side of $\eqref{eq:jointdensity}$ defines a measure, we also must have that for any $t\ge 1$, $\pi_t(i|\cdot): \cH_{t-1} \to [0,1]$ is a measurable map. In fact, $\pi_t$ is a simple version of a so-called probability kernel. For the interested reader, we add the general definition for a probability kernel $K$: Let $(\cX,\cF)$, $(\cY,\cG)$ be a measurable spaces. $K:\cX \times \cG \to [0,1]$ is a probability kernel if it satisfies the following two properties: (i) For any $x\in \cX$, $K(x,\cdot)$ as the $U \mapsto K(x,U)$ ($U\in \cG$) map is a probability measure; (ii) For any $U\in \cG$, $K(\cdot,U)$ as the $x\mapsto K(x,U)$ ($x\in \cX$) map is measurable. When $\cY$ is finite and $\cG= 2^{\cY}$, instead of $K(x,U)$ for $U\subset \cY$, it is common that we specify the values of $K(x,\{y\})$ for $x\in \cX$ and $y\in \cY$ and in these cases we also often write $K(x,y)$. It is also common to write $K(U|x)$ (or in the discrete case $K(y|x)$).

Note 6: There is another subtlety in connection to the definition of $\PP_{\nu,\pi}$ is that we emphasized that $\lambda$ should be a measure over the Lebesgue $\sigma$-algebra $\mathcal L(\R)$. This will be typically easy to satisfy and holds for example when $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. This requirement would not be needed though if we did not insist on using the canonical bandit model. In particular, by their original definition, $X_1,\dots,X_n$ (as maps from $\Omega$ to $\R$) need only be $\cF/\mathcal B(\R)$ measurable (where $\mathcal B(\R)$ is the Borel $\sigma$-algebra over $\R$) and thus at a first sight it may seem strange to require that $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. The reason we require this is because in the canonical model we wish to use the Lebesgue $\sigma$-algebra of $\R^{2n}$ (in particular its restriction to $\cH_n$), which is in fact demanded by our wish to work with complete measure-spaces. This latter, as explained at the link, is useful because it essentially allows us not to worry about (and drop) sets of zero measure. Now, $\PP_{\nu,\pi}$, as the joint distribution of $(A_1,X_1,\dots,A_n,X_n)$ would only be defined for Borel measurable sets. Then, $\eqref{eq:jointdensity}$ could be first derived for the restriction of $\lambda$ to Borel measurable sets and then $\eqref{eq:jointdensity}$ with $\lambda$ not restricted to Borel measurable sets gives the required extension of $\PP_{\nu,\pi}$ to Lebesgue measurable sets. We can also take $\eqref{eq:jointdensity}$ as our definition of $\PP_{\nu,\pi}$, in which case the issue does not arise.

Note 7: (A simple version of) Jensen’s inequality states that if for any $U\subset \R$ convex, with non-empty interior and for any $f: U \to \R$ convex function and random variable $X\in U$ such that $\EE{X}$ exists, $f(\EE{X})\le \EE{f(X)}$. The proof is simple if one notes that for such a convex $f$ function, at every point $m\in R$ in the interior of $U$, there exists a “slope” $a\in \R$ such that $a(x-m)+f(m)\le f(x)$ for all $x\in \R$ (if $f$ is differentiable at $m$, take $a = f'(m)$). Indeed, is such a slope exists, taking $m = \EE{X}$ and replacing $x$ by $X$ we get $a(X-\EE{X}) + f(\EE{X}) \le f(X)$. Then, taking the expectation of both sides we arrive at Jensen’s inequality. The idea can be generalized into multiple directions, i.e., the domain of $f$ could be a convex set in a vector space, etc.

Note 8: We can’t help ourselves not to point out that the quantity $\int \sqrt{p q}$ that appeared in the previous proof is called the Bhattacharyya coefficient and that $\sqrt{1 – \int \sqrt{pq}}$ is the Hellinger distance, which is also a measure of the distance between probability measures $P$ and $Q$ and is frequently useful (perhaps especially in mathematical statistics) as a more well-behaved distance than the relative entropy (it is bounded, and symmetric and satisfies the triangle inequality!).


The first minimax lower bound that we know of was given by Auer et al. below. Our proof uses a slightly different technique, but otherwise the idea is the same.

Note 9: Earlier, what in the last version of the post is called the divergence decomposition lemma was called the information processing lemma. This previous choice of name however is not very fortunate as it collides with the information processing inequality, which is a related inequality, but still quite different. Perhaps we should call this the bandit divergence decomposition lemma, or its claim the bandit divergence decomposition identity, though there is very little used of the specifics of bandits problems here. In particular, the identity holds whenever a sequential policy interacts with a stochastic environment where the feedback to the stochastic policy is based on a fixed set of distributions (swap the reward distributions with feedback distributions in the more general setting).

The high-probability Pinsker inequality is appears as as Lemma 2.6 in Sasha Tsybakov‘s book on nonparametric estimation. According to him, the result is originally due to Bretagnolle and Huber and the original reference is:

  • Bretagnolle, J. and Huber, C. (1979) Estimation des densitiés: risque minimax. Z. für Wahrscheinlichkeitstheorie und verw. Geb., 47, 199-137.

One of the authors learned this inequality from the following paper:

  • Sébastien Bubeck, Vianney Perchet, Philippe Rigollet: Bounded regret in stochastic multi-armed bandits. COLT 2013: 122-134