## Contextual Bandits and the Exp4 Algorithm

In most bandit problems there is likely to be some additional information available at the beginning of rounds and often this information can potentially help with the action choices. For example, in a web article recommendation system, where the goal is to keep the visitors engaged with the website, contextual information about the visitor of the website, the time of day, information on what is trendy, etc., can likely improve the choice of the article to be put on the “front-page”. For example, a science-oriented article is more likely to grab the attention of a science geek, and a baseball fan may care little about European soccer.

If we used a standard bandit algorithm (like Exp3, Exp3-IX, or UCB), the one-size-fits-all approach implicitly taken by these algorithms, which aim at finding just the single most eye-catching article, is likely to disappoint an unnecessarily large portion of the site’s visitors. In situations like this, since the benchmark that the bandit algorithms aim to approach performs poorly by omitting available information, it is better to change the problem and redefine the benchmark! It is important to realize though that there is an inherent difficulty here:

Competing with a poor benchmark does not make sense since even an algorithm that perfectly matches the benchmark will perform poorly. At the same time, competing with a better benchmark can be harder from a learning point of view and in a specific scenario the gain from a better benchmark may very well be offset by the fact that algorithms that compete with a stronger benchmark have to search in a larger space of possibilities.

The tradeoff just described is fundamental to all machine learning problems. In statistical estimation, the analogous tradeoff is known as the bias-variance tradeoff.

We will not attempt to answer the question of how to resolve this tradeoff in this post because first we need to see ways of effectively competing with improved benchmarks. First, let’s talk about possible improvements to the benchmarks.

# Contextual bandits: One bandit per context

In a contextual bandit problem everything works the same as in a bandit problem, except that the learner receives a context at the beginning of the round, before it needs to select its action. The promise, as discussed, is that perhaps specializing the action taken to the context can help to collect more reward.

Assuming that the set $\cC$ of all possible contexts is finite, one simple approach then is to set up a separate bandit algorithm, such as Exp3, for each context. Indeed, if we do this then the collection of bandit algorithms should be able to compete with the best context-dependent action. In particular, the best total reward that we can achieve in $n$ rounds if we are allowed to adjust the action to the context is
\begin{align*}
S_n = \sum_{c\in \cC} \max_{k\in [K]} \sum_{t: c_t=c} x_{t,k}\,,
\end{align*}
where $c_t\in \cC$ is the context received at the beginning of round $t$. For future reference note that we can also write
\begin{align}
S_n = \max_{\phi: \cC \to [K]} \sum_{t=1}^n x_{t,\phi(c_t)}\,.
\label{eq:maxrewardunrestricted}
\end{align}
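
The equality of the two expressions for $S_n$ can be checked by brute force on a small instance. The sketch below is only an illustration: the function names and the toy reward table are made up for this example, contexts are assumed to be encoded as integers $0,\dots,|\cC|-1$, and the enumeration over all maps $\phi$ is exponential in $|\cC|$, so it is only meant for tiny problems.

```python
from itertools import product

def best_contextual_reward(rewards, contexts, num_contexts):
    """S_n as a sum over contexts of the best per-context action total."""
    K = len(rewards[0])
    totals = [[0.0] * K for _ in range(num_contexts)]
    for x_t, c_t in zip(rewards, contexts):
        for k in range(K):
            totals[c_t][k] += x_t[k]
    return sum(max(row) for row in totals)

def best_policy_reward(rewards, contexts, num_contexts):
    """S_n as a maximum over all maps phi from contexts to actions (brute force)."""
    K = len(rewards[0])
    return max(
        sum(x_t[phi[c_t]] for x_t, c_t in zip(rewards, contexts))
        for phi in product(range(K), repeat=num_contexts)
    )

# Hypothetical toy data: 5 rounds, 2 actions, 2 contexts.
rewards = [[1, 0], [0, 1], [1, 0], [0, 1], [0, 1]]
contexts = [0, 1, 0, 1, 0]
print(best_contextual_reward(rewards, contexts, 2))
print(best_policy_reward(rewards, contexts, 2))
```

On this toy data both functions return 4, while the best single action collects only 3, which is the kind of gain the contextual benchmark is meant to capture.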

Then, the regret of a learner who incurs a reward $X_t$ in round $t$ is $R_n=\EE{S_n - \sum_t X_t}$, which satisfies
\begin{align*}
R_n = \sum_{c\in \cC} \EE{\max_{k\in [K]} \sum_{t: c_t=c} (x_{t,k}-X_t)}\,.
\end{align*}
That is, $R_n$ is just the sum of the regrets suffered by the bandits assigned to the individual contexts. Let $T^c(n)$ be the number of times context $c\in \cC$ is seen during the first $n$ rounds and let $R^c(s)$ be the regret of the instance of Exp3 associated with $c$ at the end of the round when this instance is used $s$ times. With this notation we can thus write
\begin{align*}
R_n = \sum_{c\in \cC} \EE{ R^c(T^c(n)) }\,.
\end{align*}

Since $T^c(n)$ may vary from context to context and the distribution of $T^c(n)$ may be uneven, it would be wasteful to use a version of Exp3 that is tuned to achieve a small regret after a fixed number of rounds, as such a version of Exp3 may suffer a larger regret when the actual number of rounds $T^c(n)$ is vastly different from the anticipated horizon. Luckily, the single parameter $\eta$ of Exp3 can be chosen in a way that does not depend on the anticipated horizon without losing much on the regret. In particular, as we hinted before, such an anytime version of Exp3 can be created if we let the $\eta$ parameter of Exp3 depend on the round index. Specifically, when an Exp3 instance is used for the $s$th time, if we set $\eta$ to $\eta_s = \sqrt{\log(K)/(sK)}$, then one can show that $R^c(s) \le 2\sqrt{sK \log(K)}$ will hold for any $s\ge 1$.

Plugging this into the previous display, we get
\begin{align}
R_n \le 2 \sqrt{K \log(K)} \, \sum_{c\in \cC} \sqrt{ T_c(n) }\,.
\label{eq:exp3perc}
\end{align}
How big the right-hand side is depends on how skewed the context distribution $(T_c(n)/n)_{c\in \cC}$ is.

The best case is when only one context is seen, in which case $\sum_{c\in \cC} \sqrt{ T_c(n) }=\sqrt{n}$. Note that in this case the benchmark won’t improve either, but nevertheless the algorithm that keeps one Exp3 instance for each context does not lose anything compared to the algorithm that would run a single Exp3 instance, trying to compete with the single best action. This is good. The worst case is when all contexts are seen equally often, in which case $T_c(n) = n/|\cC|$ (assume for simplicity that this is an integer). In this case, $\eqref{eq:exp3perc}$ becomes
\begin{align}
R_n \le 2 \sqrt{ K \log(K) |\cC| n }\,.
\label{eq:exp3perc2}
\end{align}

It is instructive to consider what this means for the total reward:
\begin{align*}
\EE{\sum_{t=1}^n X_t} \ge S_n - 2 \sqrt{ K \log(K) |\cC| n }\,.
\end{align*}

While the benchmark $S_n$ may have improved, the second term will always exceed the first one for the first $4 K \log(K) |\cC|$ rounds! Thus, the guarantee on the total reward will be vacuous for all earlier time steps! When $|\cC|$ is large, we conclude that for a long time, the per-instance Exp3 algorithm may have a much worse total reward than if we just ran a single instance of Exp3, demonstrating the tradeoff mentioned above.
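
To get a feeling for the numbers, the bound $2\sqrt{K\log(K)}\sum_c \sqrt{T_c(n)}$ can be evaluated for the two extreme context distributions discussed above. This is a small sketch under assumed values ($n=10{,}000$, $K=10$, $|\cC|=100$); the function names are made up for the illustration.

```python
import math

def per_context_exp3_bound(counts, K):
    """Right-hand side of the per-context bound: 2 sqrt(K log K) * sum_c sqrt(T_c(n))."""
    return 2 * math.sqrt(K * math.log(K)) * sum(math.sqrt(T) for T in counts)

def vacuous_horizon(K, num_contexts):
    """The value 4 K log(K) |C|; for fewer rounds the uniform-case bound exceeds n."""
    return 4 * K * math.log(K) * num_contexts

# Assumed example values, not taken from the text.
n, K, num_contexts = 10_000, 10, 100
one_context = per_context_exp3_bound([n], K)
uniform = per_context_exp3_bound([n // num_contexts] * num_contexts, K)
print(one_context, uniform, vacuous_horizon(K, num_contexts))
```

With these values the uniform case is worse than the single-context case by a factor of $\sqrt{|\cC|}=10$, and the guarantee only becomes non-vacuous after about $9{,}210$ rounds.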

For large context sets, using one bandit algorithm per context will almost always be a poor choice because the additional precision is wasted unless the amount of data is enormous. Very often the contexts themselves have some kind of internal structure that we may hope to exploit. There are many different kinds of structure. For example, we expect that a person who is deeply into computer science may share common interests with people who are deeply into math, and people who are into sport acrobatics may enjoy gymnastics and vice versa. This gives the idea to group the contexts in some way, to reduce their number, and then assign a bandit to the individual groups. Assume that we chose a suitable partitioning of contexts, which we denote by $\cP \subset 2^{\cC}$. Thus, the elements of $\cP$ are disjoint subsets of $\cC$ such that jointly they cover $\cC$: $\cup \cP = \cC$. Then, the maximum total reward that can be achieved if for every element $P$ of the partitioning $\cP$ we can select a single action is
\begin{align*}
S_n = \sum_{P\in \cP} \max_{k\in [K]} \sum_{t: c_t\in P} x_{t,k}\,.
\end{align*}

It may be worthwhile to put this into an alternate form. Defining $\Phi(\cP) = \{\phi:\cC\to[K]\,;\, \forall c,c'\in \cC \text{ s.t. } c,c'\in P \text{ for some } P\in \cP, \phi(c) = \phi(c')\}$ to be the set of functions that map contexts in the same part of the partition to the same action (the astute reader may recognize $\Phi(\cP)$ as the set of all $\sigma(\cP)$-measurable functions from $\cC$ to $[K]$), we can rewrite $S_n$ as
\begin{align*}
S_n = \max_{\phi\in \Phi(\cP)} \sum_{t=1}^n x_{t,\phi(c_t)}\,.
\end{align*}

Compared to $\eqref{eq:maxrewardunrestricted}$ we see that what changed is that the set of functions that we are maximizing over became smaller. The reader may readily verify that the regret of the composite algorithm where an Exp3 with a varying parameter sequence as described above is used for each element of $\cP$ is exactly of the form $\eqref{eq:exp3perc}$ except that $c\in \cC$ has to be changed to $P\in \cP$ and the definition of $T_c$ also needs to be adjusted accordingly.

Of course, $\Phi(\cP)$ is not the only set of functions that comes to mind. We may, of course, consider different partitions. Or we may bring in extra structure of $\cC$. Why not, for example, use a similarity function on $\cC$ and then consider all functions that tend to assign identical actions to contexts that are more similar? For example, if $s:\cC\times \cC \to [0,1]$ is a similarity function, we may consider all functions $\phi: \cC\to[K]$ such that the average dissimilarity,
\begin{align*}
\frac{1}{|\cC|^2} \sum_{c,d\in \cC} (1-s(c,d)) \one{ \phi(c)\ne \phi(d) },
\end{align*}
is below a user-tuned threshold $\theta\in (0,1)$.

Another option is to run your favorite supervised learning method, training on some batch data to find a few predictors $\phi_1,\dots,\phi_M: \cC \to [K]$ (in machine learning terms, these would be called classifiers since the range space is finite). Then we could use a bandit algorithm to compete with the “best” of these in an online fashion. This has the advantage that the offline training procedure can bring in the power of batch data and the whole army of supervised learning, without relying on potentially inaccurate evaluation methods that aim to pick the best of the pack. And why pick if one does not need to?

The possibilities are endless, but in any case, we would end up with a set of functions $\Phi$ with the goal of competing with the best of them. This gives the idea that perhaps we should think more generally about some subset $\Phi$ of functions without necessarily considering the internal structure of $\Phi$. This is the viewpoint that we will take. In fact, we will bring this one or two steps further, leading to what is called bandits with expert advice.

In this model, there are $M$ experts that we wish to compete with. At the beginning of each round, the experts announce their predictions of which actions are the most promising. For the sake of generality, we allow the experts to report not only a single prediction, but a probability distribution over the actions. The interpretation of this probability distribution is that the expert, if the decision was left to it, would choose the action for the round at random from the probability distribution that it reported. As discussed before, in an adversarial setting, it is natural to consider randomized algorithms, hence one should not be too surprised that the experts are also allowed to randomize. In any case, this can only increase generality. For reasons that will become clear later, it will be useful to collect the advice of the $M$ experts for round $t$ into an $M\times K$ matrix $E^{(t)}$ such that the $m$th row of $E^{(t)}$, $E_m^{(t)}$, is the probability distribution that expert $m$ recommends for round $t$. Denoting by $E_{mi}^{(t)}$ the $i$th entry of the row vector $E_m^{(t)}$, we thus have $\sum_i E_{mi}^{(t)}=1$ and $E_{mi}^{(t)}\ge 0$ for every $(m,t,i)\in [M]\times \N \times [K]$. After receiving $E^{(t)}$, the learner then chooses $A_t\in [K]$ and as before observes the reward $X_t = x_{t,A_t}$, where $x_t=(x_{ti})_i$ is the $K$-dimensional vector of rewards of the individual actions. For a real-world application, see the figure below.

Prediction with expert advice. The experts, upon seeing a foot, give expert advice on what socks should fit it best. If the owner of the foot is happy, the recommendation system earns a cookie!

The regret of the learner is with respect to the total expected reward of the best expert:
\begin{align}
R_n = \EE{ \max_m \sum_{t=1}^n E_{m}^{(t)} x_{t} - \sum_{t=1}^n X_t }\,.
\label{eq:expertregret}
\end{align}

Below, we may allow the expert advice $E^{(t)}$ of round $t$ to depend on all the information up to the beginning of round $t$. While this does allow “learning” experts, the regret definition above is not really meaningful if the experts learn from the feedback $(A_t,X_t)$. For dealing with learning experts, it is more appropriate to measure regret as done in reinforcement learning where an agent controls the state of the environment, but the agent’s reward is compared to the best total reward that any other (simple) policy would incur, regardless of the “trajectory” of the agent. We will discuss reinforcement learning in some later post.

# Can it go higher? Exp4

Exp4 is actually not just an increased version number, but it stands for Exponential weighting for Exploration and Exploitation with Experts. The idea of the algorithm is very simple: Since exponential weighting worked so well in the standard bandit problem, we should adapt it to the problem at hand. However, since now the goal is to compete with the best expert in hindsight, it is not the actions that we should score, but the experts. Hence, the algorithm will keep a probability distribution, which we will denote by $Q_t$, over the experts and use this to come up with the next action. Once the action is chosen, we can use our favorite reward estimation procedure to estimate the rewards for all the actions, which can then be used to estimate how much total reward the individual experts would have made so far, which in the end can be used to update $Q_t$.

Formally, the procedure is as follows: First, the distribution $Q_1$ is initialized to the uniform distribution $(1/M,\dots,1/M)$ (the $Q_t$ are treated as row vectors). Then, some values of $\eta,\gamma\ge 0$ are selected.

In round $t=1,2,\dots$, the following things happen:

1. The advice $E^{(t)}$ is received
2. Choose the action $A_t\sim P_{t,\cdot}$ at random, where $P_t = Q_t E^{(t)}$
3. The reward $X_t = x_{t,A_t}$ is received
4. The rewards of all the actions are estimated; say: $\hat X_{ti} = 1- \frac{\one{A_t=i}}{P_{ti}+\gamma} (1-X_t)$
5. Propagate the rewards to the experts: $\tilde X_t = E^{(t)} \hat X_t$
6. The distribution $Q_t$ is updated using exponential weighting:
$Q_{t+1,i} = \frac{\exp( \eta \tilde X_{ti} ) Q_{ti}}{\sum_j \exp( \eta \tilde X_{tj}) Q_{tj} }$, $i\in [M]$

Note that $A_t$ can be chosen in two steps, first sampling $M_t$ from $Q_{t}$ and then choosing $A_t$ from $E^{(t)}_{M_t,\cdot}$. The reader can verify that (given the past) the probability distribution of the so-selected action is also $P_{t}$.
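
The six steps above translate almost line by line into code. What follows is a minimal sketch rather than a reference implementation: the function name `exp4`, the callables `advice_fn` and `reward_fn` and the fixed seed are assumptions made for the illustration. Subtracting the largest estimate before exponentiating is a numerical safeguard that is not part of the description above; it leaves $Q_{t+1}$ unchanged because the common factor cancels in the normalization.

```python
import math
import random

def exp4(advice_fn, reward_fn, n, M, K, eta, gamma=0.0, seed=0):
    """Run Exp4 (gamma = 0) or its IX variant (gamma > 0) for n rounds.

    advice_fn(t) returns an M x K list of rows, each a distribution over actions.
    reward_fn(t, a) returns the reward in [0, 1] of action a in round t.
    """
    rng = random.Random(seed)
    Q = [1.0 / M] * M                      # Q_1 is uniform over the experts
    total_reward = 0.0
    for t in range(n):
        E = advice_fn(t)                   # step 1: receive the advice
        P = [sum(Q[m] * E[m][i] for m in range(M)) for i in range(K)]  # P_t = Q_t E
        A = rng.choices(range(K), weights=P)[0]   # step 2: sample the action
        X = reward_fn(t, A)                # step 3: observe the reward
        total_reward += X
        X_hat = [1.0] * K                  # step 4: estimate all rewards
        X_hat[A] = 1.0 - (1.0 - X) / (P[A] + gamma)
        # step 5: propagate the estimates to the experts
        X_tilde = [sum(E[m][i] * X_hat[i] for i in range(K)) for m in range(M)]
        # step 6: exponential weighting (shifted by the max for stability)
        top = max(X_tilde)
        w = [Q[m] * math.exp(eta * (X_tilde[m] - top)) for m in range(M)]
        Z = sum(w)
        Q = [w_m / Z for w_m in w]
    return Q, total_reward

# Hypothetical toy problem: expert 0 always recommends the good action 0.
advice = lambda t: [[1.0, 0.0], [0.0, 1.0]]
reward = lambda t, a: 1.0 if a == 0 else 0.0
Q, total = exp4(advice, reward, n=200, M=2, K=2, eta=0.5)
print(Q, total)
```

In this toy problem the weight of the second expert can only decrease, so the returned distribution concentrates on the first expert.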

# A bound on the expected regret of Exp4

The algorithm when we set $\gamma=0$ is actually what is most commonly known as Exp4 and the algorithm when $\gamma>0$ is the “IX” version of Exp4. As in the case of Exp3, setting $\gamma>0$ helps concentrating the regret. Here, for the sake of brevity we only consider the expected regret of Exp4; the analysis of the tail properties of Exp4-IX with appropriately tuned $\eta,\gamma$ is left as an exercise for the reader.

To bound the expected regret of Exp4, we apply the analysis of Exp3. In particular, the following lemma can be extracted from our earlier analysis of Exp3:

Lemma (regret of exponential weighting): Let $(\hat X_{ti})_{ti}$ and $(P_{ti})_{ti}$ satisfy for all $t\in [n]$ and $i\in [K]$ the relations $\hat X_{ti}\le 1$ and
\begin{align*}
P_{ti} = \frac{\exp( \eta \sum_{s=1}^{t-1} \hat X_{si} )}{\sum_j\exp( \eta \sum_{s=1}^{t-1} \hat X_{sj} )}\,.
\end{align*}
Then, for any $i\in [K]$,
\begin{align*}
\sum_{t=1}^n \hat X_{ti} - \sum_{t=1}^n \sum_{j=1}^K P_{tj} \hat X_{tj} \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t,j} P_{tj} (1-\hat X_{tj})^2\,.
\end{align*}

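Applying this lemma with the $M$ experts playing the role of the actions (so that $\tilde X_t$ and $Q_t$ take the place of $\hat X_t$ and $P_t$, and $\tilde X_{tm}\le 1$ holds because the rows of $E^{(t)}$ are probability distributions), we immediately get that for any $m\in [M]$,
\begin{align*}
\sum_{t=1}^n \tilde X_{tm} - \sum_{t=1}^n \sum_{m'} Q_{t,m'} \tilde X_{tm'} \le \frac{\log(M)}{\eta}
+ \frac{\eta}{2}\, \sum_{t,m'} Q_{t,m'} (1-\tilde X_{tm'})^2\,.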
\end{align*}

Note that by our earlier argument $\EE{\hat X_{ti}} = x_{ti}$ and hence the expected value of the left-hand side of the above display is the regret $R_n$ defined in $\eqref{eq:expertregret}$. Hence, to derive a regret bound it remains to bound the expectation of the right-hand side. For this, as before, we will find it useful to introduce the loss estimates $\hat Y_{ti} = 1-\hat X_{ti}$, the losses, $y_{ti} = 1-x_{ti}$, and also the loss estimates of experts: $\tilde Y_{tm} = 1-\tilde X_{tm}$, $m\in [M]$. Note that $\tilde Y_t = E^{(t)} \hat Y_t$, thanks to $E^{(t)} \mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is the vector whose components are all equal to one. Defining $A_{ti} = \one{A_t=i}$, we have $\hat Y_{ti} = \frac{A_{ti}}{P_{ti}} y_{ti}$.

Hence, using $\E_{t-1}[\cdot]$ to denote expectation conditioned on the past $(E^{(1)},A_1,X_1,\dots,E^{(t-1)},A_{t-1},X_{t-1},E^{(t)})$, we have
\begin{align*}
\E_{t-1}[ \tilde Y_{tm}^2 ] =\E_{t-1}\left[ \left(\frac{E^{(t)}_{m,A_t} y_{t,A_t}}{P_{t,A_t}}\right)^2\right]
= \sum_i \frac{(E^{(t)}_{m,i})^2 y_{t,i}^2}{P_{ti}} \le \sum_i \frac{(E^{(t)}_{m,i})^2 }{P_{ti}}\,.
\end{align*}
Therefore, $\require{cancel}$
\begin{align*}
\E_{t-1}\left[ \sum_m Q_{tm} \tilde Y_{tm}^2 \right]
&\le \sum_m Q_{tm} \sum_i \frac{(E^{(t)}_{m,i})^2 }{P_{ti}}
\le \sum_{i} \left(\max_{m'} E_{m',i}^{(t)}\right) \frac{ \cancel{\sum_m Q_{tm} E^{(t)}_{m,i}} }{\cancel{P_{ti}}} \,.
\end{align*}

Defining
\begin{align*}
E^*_n = \sum_{t,i} \left(\max_{m'} E_{m',i}^{(t)}\right)\,,
\end{align*}
we get the following result:

Theorem (Regret of Exp4): If $\eta$ is chosen appropriately, the regret of Exp4 defined in $\eqref{eq:expertregret}$ satisfies
\begin{align}
R_n \le \EE{\sqrt{ 2 \log(M) E^*_n }}\,.
\label{eq:exp4bound}
\end{align}

Note that to get this result, one should set $\eta = \sqrt{(2\log M)/E^*_n}$, which is an infeasible choice when $E^{(1)},\dots,E^{(n)}$ are not available in advance, as is typically the case. Hence, in a practical implementation, one sets $\eta_t = \sqrt{\log M/E^*_t}$, which is a decreasing sequence (the factor of $2$ inside the square root will be lost due to some technical reasons). Then, as discussed before, a regret bound of the above form with a slightly larger constant will hold.

To assess the quality of the bound in the above theorem we consider a few upper bounds on $\eqref{eq:exp4bound}$. First, the expression in $\eqref{eq:exp4bound}$ is the smallest when all experts agree: $E_m^{(t)} = E_{m'}^{(t)}$ for all $t,m,m'$. In this case $E_n^*= n$ and the right-hand side of $\eqref{eq:exp4bound}$ becomes $\sqrt{ 2 \log(M) n }$. We see that the price we pay for not knowing in advance that the experts will agree is $O(\sqrt{\log(M)})$. This case also highlights that in a way $E_n^*$ measures the amount of agreement between the experts.

In general, we cannot know whether the experts will agree. In any case, from $E_{m,i}^{(t)}\le 1$, we get $E_n^*\le Kn$, leading to
\begin{align}
R_n \leq \sqrt{ 2n K \log(M) }\,.
\label{eq:fewactions}
\end{align}
This shows that the price paid for adding experts is relatively low, as long as we have few actions.

How about the opposite case when the number of experts is low? Using $\sum_i \max_m E_{mi}^{(t)} \le \sum_i \sum_m E_{mi}^{(t)} = M$ (each of the $M$ rows of $E^{(t)}$ sums to one), we get $E_n^* \le M n$ and thus
\begin{align*}
R_n \leq \sqrt{ 2n M \log(M) }\,.
\end{align*}
This shows that the number of actions can be very high, yet the regret will be low if we have only a few experts. This should make intuitive sense.

The above two bounds can also be summarized as (as they hold simultaneously):
\begin{align*}
R_n \leq \sqrt{ 2n (M\wedge K) \log(M) }\,.
\end{align*}
In a way, Exp4 adapts to whether the number of experts, or the number of actions is small (and in fact adapts to the alignment between the experts).
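
The role of $E_n^*$ can be made concrete by computing it for a few advice sequences. The sketch below assumes the advice is stored as a list of $M\times K$ nested lists, one per round; the function names and the two toy advice sequences are made up for the illustration.

```python
import math

def agreement_term(advice):
    """E*_n: sum over rounds t and actions i of max_m E^{(t)}_{m,i}."""
    return sum(max(row[i] for row in E) for E in advice for i in range(len(E[0])))

def exp4_regret_bound(advice):
    """sqrt(2 log(M) E*_n) for a fixed (non-random) advice sequence."""
    M = len(advice[0])
    return math.sqrt(2 * math.log(M) * agreement_term(advice))

# Assumed toy sizes: n rounds, K actions, M experts.
n, K, M = 100, 4, 3
# All experts recommend the uniform distribution: full agreement.
agree = [[[0.25] * K for _ in range(M)] for _ in range(n)]
# Expert m deterministically recommends action m: no agreement at all.
disagree = [[[1.0 if i == m else 0.0 for i in range(K)] for m in range(M)] for _ in range(n)]
print(agreement_term(agree), agreement_term(disagree))
print(exp4_regret_bound(agree), exp4_regret_bound(disagree))
```

Full agreement gives $E_n^* = n$, while the deterministic, pairwise different experts give $E_n^* = Mn$, which here is still below $Kn$ because $M<K$.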

How does this bound compare to our earlier bounds, e.g., $\eqref{eq:exp3perc2}$? The number $M$ of all possible maps of $\cC$ to $[K]$ is $|[K]^{\cC}| = K^{|\cC|}$. Taking $\log$, we see that $\log(M) =|\cC| \log(K)$. From $\eqref{eq:fewactions}$, we conclude that $R_n \le \sqrt{ 2n K |\cC| \log(K) }$, which is the same as $\eqref{eq:exp3perc2}$, up to a constant factor, which would be lost anyway if we properly bounded the regret of Exp4 that uses changing parameters. While we get back the earlier bound, we won’t be able to implement the Exp4 algorithm for even moderate context and action sizes. In particular, the memory requirement of Exp4 is $O(K^{|\cC|})$, while the memory requirement of the specialized method where a separate Exp3 instance is kept for every context is $O(|\cC| K)$. The setting when Exp4 is practically useful is when $M$ is small and $\cC$ is large.
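
A quick count shows how large the gap between the two memory requirements is. The values $K=10$ and $|\cC|=20$ below are assumed for the illustration, as are the function names.

```python
import math

def exp4_num_weights(K, num_contexts):
    """One weight per map from contexts to actions: K ** |C|."""
    return K ** num_contexts

def per_context_exp3_num_weights(K, num_contexts):
    """One weight per (context, action) pair: |C| * K."""
    return num_contexts * K

K, num_contexts = 10, 20
print(exp4_num_weights(K, num_contexts))              # number of experts M
print(per_context_exp3_num_weights(K, num_contexts))  # weights kept by per-context Exp3
print(math.log(exp4_num_weights(K, num_contexts)), num_contexts * math.log(K))
```

Already for these moderate sizes Exp4 over all maps would need $10^{20}$ weights against $200$ for the per-context approach, even though $\log(M)$, the quantity that enters the regret bound, stays small.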

# Conclusions

In this post we have introduced the contextual bandit setting and the Exp4 category of strategies. Perhaps the most important point of this post beyond the algorithms is to understand that there are trade-offs between having a larger comparison class and a more meaningful definition of the regret that this entails.

There are many points we have not developed in detail. One is high probability bounds, which we saw in the previous post for Exp3-IX and can also be derived here. We also have not mentioned lower bounds. As it turns out, the results we have given are essentially tight, at least in some instances, and we hope to return to this topic in future posts. Finally there is the “hot-topic” of adaptation. For example, what is gained in terms of the regret if the experts are often in agreement?

## High probability lower bounds

In the post on adversarial bandits we proved two high probability upper bounds on the regret of Exp3-IX. Specifically, we showed:

Theorem: There exists a policy $\pi$ such that for all $\delta \in (0,1)$ for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1-\delta$
\begin{align}
\label{eq:high-prob}
\hat R_n(\pi,\nu) = O\left(\sqrt{Kn \log(K)} + \sqrt{\frac{Kn}{\log(K)}} \log\left(\frac{1}{\delta}\right)\right)\,.
\end{align}

We also gave a version of the algorithm that depended on $\delta \in (0,1)$:

Theorem: For all $\delta \in (0,1)$ there exists a policy $\pi$ such that for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1- \delta$,
\begin{align}
\label{eq:high-prob2}
\hat R_n(\pi,\nu) = O\,\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,.
\end{align}

The key difference between these results is the order of quantifiers. In the first we have a single algorithm and a high-probability guarantee that holds simultaneously for any confidence level. For the second result the confidence level must be specified in advance. The price for using the generic algorithm appears to be $\sqrt{\log(1/\delta)}$, which is usually quite small but not totally insignificant. The purpose of this post is to show that both bounds are tight up to constant factors, which implies that algorithms knowing the confidence level in advance really do have an advantage in terms of the achievable regret.

One reason why choosing the confidence level in advance is not ideal is that the resulting high-probability bound cannot be integrated to prove a bound in expectation. For algorithms satisfying (\ref{eq:high-prob}) the expected regret can be bounded by
\begin{align}
R_n \leq \int^\infty_0 \Prob{\hat R_n \geq x} dx = O(\sqrt{Kn \log(K)})\,.
\label{eq:integratedbound}
\end{align}
On the other hand, if the high-probability bound only holds for a single $\delta$ as in \eqref{eq:high-prob2}, then it seems hard to do much better than
\begin{align*}
R_n \leq n \delta + O\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,,
\end{align*}
which with the best choice of $\delta$ leads to a bound where an extra $\log(n)$ factor appears compared to \eqref{eq:integratedbound}. In fact, it turns out that this argument cannot be strengthened and algorithms with the strong high-probability regret cannot be near-optimal in expectation.
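
To see where the extra logarithmic factor comes from, one can plug in a concrete confidence level. Taking $\delta = 1/n$ (a convenient choice, assumed here for illustration rather than claimed to be exactly optimal), the first term becomes a constant:
\begin{align*}
R_n \leq n \cdot \frac{1}{n} + O\left(\sqrt{Kn \log(Kn)}\right) = O\left(\sqrt{Kn \left(\log(K) + \log(n)\right)}\right)\,,
\end{align*}
so $\log(n)$ enters under the square root next to $\log(K)$, whereas \eqref{eq:integratedbound} has only $\log(K)$ there.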

Our approach for proving this fact is very similar to what we did for the minimax lower bounds for stochastic bandits in a previous post. There are two differences between the adversarial setting and the stochastic setting that force us to work a little harder. The first is that for the adversarial-bandit upper bounds we have assumed the rewards are bounded in $[0,1]$, which was necessary in order to say anything at all. This means that our lower bounds should also satisfy this requirement, while in the stochastic lower bounds we used Gaussian rewards with an unbounded range. The second difference comes from the fact that rather than for the regret, the stochastic lower bounds were given for what is known as the pseudo-regret in the adversarial framework and which reads
\begin{align*}
\bar R_n \doteq \max_i \EE{ \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_t }\,.
\end{align*}
In the stochastic setting, we have $\bar R_n = \sum_{i=1}^K \E[T_i(n)] \Delta_i$ and thus bounds on the pseudo-regret are possible by lower bounding the number of times an algorithm chooses sub-optimal arms in expectation. This is not enough to bound the regret, which depends also on the actual samples.

# A lower bound on tail probabilities of pseudo-regret in stochastic bandits

Before we overcome these technicalities we describe the simple intuition by returning to the stochastic setting, using the pseudo-regret and relaxing the assumption that the rewards are bounded. It is important to remember that $\bar R_n$ is not a random variable because all the randomness is integrated away by the expectation. This means that it does not make sense to talk about high-probability results for $\bar R_n$, so we introduce another quantity,
\begin{align*}
\tilde R_n = \sum_{i=1}^K T_i(n) \Delta_i\,,
\end{align*}
which is a random variable through the pull counts $T_i(n)$ and which, for lack of a better name, we call the random pseudo-regret. For the subsequent results we let $\cE$ denote the set of Gaussian bandits with sub-optimality gaps bounded by one.

Theorem: Fix the horizon $n>0$, the number of arms $K>1$, a constant $C>0$ and a policy $\pi$. Assume that for any bandit environment $\nu \in \cE$,
\begin{align*}
R_n(\pi,\nu) \leq C \sqrt{(K-1)n}\,.
\end{align*}
Let $\delta \in (0,1)$. Then, there exists a bandit $\nu$ in $\cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}

It follows that if the result can be transferred to the adversarial setting, there will be little or no room for improving \eqref{eq:high-prob}.

Proof: Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and define
\begin{align*}
\mu_i = \begin{cases}
\Delta, & \text{if } i = 1\,; \\
0, & \text{otherwise}\,.
\end{cases}
\end{align*}
As usual, let $R_n = R_n(\pi,\nu)$ for $\nu = (\cN(\mu_j,1))_{j\in [K]}$. Let $\PP=\PP_{\nu,\pi}$ and $\E=\E_{\nu,\pi}$. Let $i = \argmin_{j>1} \E[T_j(n)]$. Then, thanks to
\begin{align*}
C \sqrt{(K-1)n} \ge R_n = \Delta \sum_{j>1} \E[T_j(n)] \ge \Delta (K-1)\min_{j>1} \E[T_j(n)]
\end{align*}
we find that
\begin{align}
\E[T_i(n)] \leq \frac{C}{\Delta} \sqrt{\frac{n}{K-1}}\,.
\label{eq:hpproofarmusage}
\end{align}

Define $\mu' \in \R^K$ by
\begin{align*}
\mu'_j = \begin{cases}
\mu_j\,, & \text{if } j \neq i; \\
2\Delta\,, & \text{otherwise}
\end{cases}
\end{align*}
and let $\nu'=(\cN(\mu_j',1))_{j\in [K]}$ be the Gaussian bandit with means $\mu'$ and abbreviate $\PP'=\PP_{\nu',\pi}$ and $\E'=\E_{\nu',\pi}$. Thus in $\nu'$ action $i$ is better than any other action by at least $\Delta$. Let $\tilde R_n = \sum_{j=1}^K T_j(n) \Delta_j$ and $\tilde R_n' = \sum_{j=1}^K T_j(n) \Delta_j'$ be the random pseudo-regret in $\nu$ and $\nu'$ respectively, where $\Delta_j = \max_k \mu_k-\mu_j = \one{j\ne 1}\Delta$ and $\Delta_j'=\max_k \mu_k'-\mu_j' \ge \one{i\ne j} \Delta$. Hence,
\begin{align*}
\tilde R_n & \ge T_i(n) \Delta_i \ge \one{T_i(n)\ge n/2} \frac{\Delta n}{2}\,, \qquad\text{ and }\\
\tilde R_n' & \ge \Delta \sum_{j\ne i} T_j(n) = \Delta (n-T_i(n)) \ge \one{T_i(n) < n/2} \frac{\Delta n}{2}\,.
\end{align*}
Hence, $T_i(n)\ge n/2 \Rightarrow \tilde R_n \ge \frac{\Delta n}{2}$ and $T_i(n) < n/2 \Rightarrow \tilde R_n' \ge \frac{\Delta n}{2}$, implying that
\begin{align}
\max\left(\Prob{ \tilde R_n \ge \frac{\Delta n}{2} }, \PP'\left(\tilde R_n' \ge \frac{\Delta n}{2} \right)\right) \ge \frac12 \left( \Prob{T_i(n) \ge n/2} +\PP'\left(T_i(n) < n/2 \right) \right)\,.
\label{eq:hprlb}
\end{align}
By the high probability Pinsker lemma, the divergence decomposition identity (earlier we called this the information processing lemma) and \eqref{eq:hpproofarmusage} we have
\begin{align}
\Prob{T_i(n) \geq n/2} + \mathbb P'\left(T_i(n) < n/2\right)
&\geq \frac{1}{2} \exp\left(-\KL(\mathbb P, \mathbb P')\right) \nonumber \\
&= \frac{1}{2} \exp\left(-2\E[T_i(n)] \Delta^2\right) \nonumber \\
&\geq \frac{1}{2} \exp\left(-2C \Delta\sqrt{\frac{n}{K-1}}\right)\,.
\label{eq:hppinskerlb}
\end{align}
To enforce that the right-hand side of the above display is at least $2\delta$, we choose
\begin{align*}
\Delta = \min\set{\frac{1}{2},\,\frac{1}{2C} \sqrt{\frac{K-1}{n}} \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Putting \eqref{eq:hprlb} and \eqref{eq:hppinskerlb} together we find that either
\begin{align*}
&\Prob{\tilde R_n \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta \qquad \text{or} \\
&\mathbb P'\left(\tilde R'_n \geq \frac{1}{4}\min\set{n,\,\frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}\right) \geq \delta\,.
\end{align*}
QED.
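
The tuning of $\Delta$ in the proof can be double-checked numerically. The sketch below uses made-up function names and assumed example values for $n$, $K$, $C$ and $\delta$ (with $\delta<1/4$ so that the logarithm is positive); it verifies that the chosen gap makes the Pinsker-type lower bound at least $2\delta$ and that $\Delta n/2$ coincides with the threshold in the theorem.

```python
import math

def tuned_gap(n, K, C, delta):
    """The choice of Delta from the proof: min{1/2, sqrt((K-1)/n) log(1/(4 delta)) / (2C)}."""
    return min(0.5, math.sqrt((K - 1) / n) * math.log(1 / (4 * delta)) / (2 * C))

def pinsker_lower_bound(n, K, C, gap):
    """(1/2) exp(-2 C Delta sqrt(n/(K-1))), the last line of the Pinsker display."""
    return 0.5 * math.exp(-2 * C * gap * math.sqrt(n / (K - 1)))

def regret_threshold(n, K, C, delta):
    """(1/4) min{n, sqrt((K-1) n) log(1/(4 delta)) / C}, the threshold in the theorem."""
    return 0.25 * min(n, math.sqrt((K - 1) * n) * math.log(1 / (4 * delta)) / C)

# Assumed example values; the second one triggers the cap Delta = 1/2.
for n, K, C, delta in [(10_000, 10, 2.0, 0.01), (100, 10, 2.0, 1e-30)]:
    gap = tuned_gap(n, K, C, delta)
    print(gap, pinsker_lower_bound(n, K, C, gap), 2 * delta,
          gap * n / 2, regret_threshold(n, K, C, delta))
```

In the first case the lower bound equals $2\delta$ up to rounding, in the second the cap at $1/2$ is active and the bound holds with room to spare.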

From this theorem we can derive two useful corollaries.

Corollary: Fix $n>0$, $K>1$. For any policy $\pi$ and $\delta \in (0,1)$ small enough that
\begin{align}
n\delta \leq \sqrt{n (K-1) \log\left(\frac{1}{4\delta}\right)} \label{eq:hplbdelta}
\end{align}
there exists a bandit problem $\nu\in\cE$ such that
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}}} \geq \delta\,.
\label{eq:hplbsmalldelta}
\end{align}

Proof: We prove the result by contradiction. Assume that the conclusion does not hold for $\pi$ and take $\delta$ that satisfies \eqref{eq:hplbdelta}. Then, for any bandit problem $\nu\in \cE$ the expected regret of $\pi$ is bounded by
\begin{align*}
R_n(\pi,\nu) \leq n\delta + \frac{1}{4}\sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}
\leq \sqrt{2n(K-1) \log\left(\frac{1}{4\delta}\right)}\,.
\end{align*}
Therefore $\pi$ satisfies the conditions of the previous theorem with $C =\sqrt{2 \log(\frac{1}{4\delta})}$, which implies that there exists some bandit problem $\nu\in\cE$ such that \eqref{eq:hplbsmalldelta} holds, contradicting our assumption.
QED

Corollary: Fix any $K>1$, $p \in (0,1)$ and $C > 0$. Then, for any policy $\pi$ there exists $n>0$ large enough, $\delta\in(0,1)$ small enough and a bandit environment $\nu\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} \geq \delta\,.
\end{align*}

Proof: Again, we proceed by contradiction. Suppose that a policy $\pi$ exists for which the conclusion does not hold. Then, for any $n>0$, $\delta\in(0,1)$ and environment $\nu\in \cE$,
\begin{align}
\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} < \delta
\label{eq:hpexplbp}
\end{align}
and therefore, for any $n>0$, the expected $n$-round regret of $\pi$ on $\nu$ is bounded by
\begin{align*}
R_n(\pi,\nu)
\leq \int^\infty_0 \Prob{ \tilde R_n(\pi,\nu) \geq x} dx
\leq C \sqrt{n(K-1)} \int^\infty_0 \exp\left(-x^{1/p}\right) dx
\leq C \sqrt{n(K-1)}\,.
\end{align*}
Therefore, by the previous theorem, for any $n>0$, $\delta\in (0,1)$ there exists a bandit $\nu_{n,\delta}\in \cE$ such that
\begin{align*}
\Prob{\tilde R_n(\pi,\nu_{n,\delta}) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.
\end{align*}
For $\delta$ small enough, $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right) \ge C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)$ and then choosing $n$ large enough so that $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)\le n$, we find that on the environment $\nu=\nu_{n,\delta}$,
\begin{align*}
\delta &\le \Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \\
& \le \Prob{ \tilde R_n(\pi,\nu) \geq C\sqrt{(K-1)n} \log^p\left(\frac{1}{\delta} \right) }\,,
\end{align*}
which contradicts \eqref{eq:hpexplbp}.
QED
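The last inequality in the bound on $R_n(\pi,\nu)$ in the proof above relies on $\int_0^\infty \exp(-x^{1/p})\,dx \le 1$ for $p\in(0,1)$. Substituting $u = x^{1/p}$ gives $\int_0^\infty \exp(-x^{1/p})\,dx = p\,\Gamma(p) = \Gamma(1+p)$, and the Gamma function is at most one on $[1,2]$. The following Python sketch is only a sanity check; the function names, the integration range and the step count are our own choices and not part of the post.

```python
import math


def tail_integral_closed_form(p: float) -> float:
    """Closed form of int_0^inf exp(-x**(1/p)) dx, which equals Gamma(1 + p)."""
    return math.gamma(1.0 + p)


def tail_integral_numeric(p: float, upper: float = 40.0, steps: int = 100_000) -> float:
    """Crude midpoint-rule approximation of the same integral on [0, upper].

    The integrand is negligible beyond `upper` for p in (0, 1), so the
    truncation error is far below the discretization error.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += math.exp(-(x ** (1.0 / p)))
    return total * h


for p in (0.25, 0.5, 0.75):
    print(p, tail_integral_closed_form(p), tail_integral_numeric(p))
```

For every $p$ tried, both columns agree to several digits and stay below one, which is what the proof needs.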

# A lower bound on tail probabilities of regret in adversarial bandits

So how do we transfer this argument to the case where the rewards are bounded and the regret is used, rather than the pseudo-regret? For the first we can simply shift the means to be close to $1/2$ and truncate or “clip” the rewards that (by misfortune) end up outside the allowed range. To deal with the regret there are two options. Either one adds an additional step to show that the regret and the pseudo-regret concentrate sufficiently fast (which they do), or one correlates the losses across the actions. The latter is the strategy that we will follow here.

We start with an observation. Our goal is to show that there exists a reward sequence $x=(x_1,\dots,x_n)\in [0,1]^{nK}$ such that the regret $\hat R_n=\max_i \sum_t x_{ti} - \sum_t x_{t,A_t}$ is above some threshold $u>0$ with a probability exceeding a prespecified value $\delta\in (0,1)$. For this we want to argue that it suffices to show this when the rewards are randomly chosen. Similarly to the stochastic case we define the “extended” canonical bandit probability space. Since the regret in adversarial bandits depends on non-observed rewards, the outcome space of the extended canonical probability space is $\Omega_n = \R^{nK}\times [K]^n$ and now $X_t: \Omega_n \to \R^K$ and $A_t: \Omega_n \to [K]$ are $X_t(x,a) = x_t$ and $A_t(x,a) = a_t$ where we use the convention that $x= (x_1,\dots,x_n)$ and $a=(a_1,\dots,a_n)$. We also let $\hat R_n = \max_i \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_{t,A_t}$ and define $\PP_{Q,\pi}$ to be the joint distribution of $(X_1,\dots,X_n,A_1,\dots,A_n)$ arising from the interaction of $\pi$ with $X\sim Q$. Finally, as we often need it, for a fixed $\nu\in \R^{nK}$ we abbreviate $\PP_{\delta_\nu,\pi}$ to $\PP_{\nu,\pi}$ where $\delta_\nu$ is the Dirac (probability) measure over $\R^{nK}$ (i.e., $\delta_{\nu}(U) = \one{\nu \in U}$ for $U\subset \R^{nK}$ Borel).

Lemma (Randomization device): For any probability measure $Q$ over $\R^{nK}$, any policy $\pi$, any $u\in \R$ and any $\delta\in (0,1)$,
\begin{align}\label{eq:pqpidelta}
\PP_{Q,\pi}( \hat R_n \geq u ) \geq \delta \implies
\exists \nu \in\mathrm{support}(Q) \text{ such that } \PP_{\nu,\pi}(\hat R_n \geq u) \geq \delta\,.
\end{align}

The lemma is proved by noting that $\PP_{Q,\pi}$ can be disintegrated into the “product” of $Q$ and $\{\PP_{\nu,\pi}\}_{\nu}$. The proof is given at the end of the post.
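To make the lemma concrete, here is a small Python sketch on a toy instance that can be enumerated exactly. Everything in it is an illustrative assumption rather than something taken from the post: two arms, two rounds, binary rewards, a uniform $Q$ over all reward tables, and a made-up randomized policy. The point is only that $\PP_{Q,\pi}(\hat R_n\ge u)$ is the $Q$-average of the per-sequence probabilities $\PP_{\nu,\pi}(\hat R_n\ge u)$, so at least one $\nu$ in the support must do at least as well as the average.

```python
from itertools import product

K, n = 2, 2  # toy sizes chosen for illustration only


def policy_prob(actions, rewards):
    """pi(a|x) for a hypothetical policy: round 1 is uniform; round 2 repeats
    the first action with probability 0.9 if it paid 1, and 0.1 otherwise."""
    a1, a2 = actions
    stay = 0.9 if rewards[0][a1] == 1 else 0.1
    return 0.5 * (stay if a2 == a1 else 1.0 - stay)


def regret(actions, rewards):
    """Random regret: best fixed arm in hindsight minus the collected reward."""
    best = max(sum(rewards[t][j] for t in range(n)) for j in range(K))
    return best - sum(rewards[t][a] for t, a in enumerate(actions))


def tail_prob(rewards, u):
    """P_{nu,pi}(regret >= u) for a fixed reward table nu, by enumeration."""
    return sum(policy_prob(a, rewards)
               for a in product(range(K), repeat=n)
               if regret(a, rewards) >= u)


# Support of Q: all 0/1 reward tables, each with equal probability.
support = [((x[0], x[1]), (x[2], x[3])) for x in product((0, 1), repeat=n * K)]
u = 1
per_env = [tail_prob(nu, u) for nu in support]
mixture = sum(per_env) / len(support)          # P_{Q,pi}(regret >= u)
witness = support[per_env.index(max(per_env))]  # a nu that beats the average
print(mixture, max(per_env), witness)
```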

Given this machinery, let us get into the proof. Fix a policy $\pi$, $n>0$, $K>1$ and a $\delta\in (0,1)$. Our goal is to find some $u>0$ and a reward sequence $x\in [0,1]^{nK}$ such that the random regret of $\pi$ while interacting with $x$ is above $u$ with probability exceeding $\delta$. For this, we will define two reward distributions $Q$ and $Q'$, and show for (at least) one of $\PP_{Q,\pi}$ or $\PP_{Q',\pi}$ that the probability of $\hat R_n \ge u$ exceeds $\delta$.

Instead of the canonical probability models we will find it more convenient to work with two sequences $(X_t,A_t)_t$ and $(X_t',A_t')_t$ of reward-action pairs defined over a common probability space. These are constructed as follows: We let $(\eta_t)_t$ be an i.i.d. sequence of $\mathcal N(0,\sigma^2)$ Gaussian variables and then let
\begin{align*}
X_{tj} = \clip( \mu_j + \eta_t )\,, \qquad X_{tj}' = \clip( \mu_j' + \eta_t)\,, \qquad (t\in [n],j\in [K])\,,
\end{align*}
where $\clip(x) = \max(\min(x,1),0)$ clips its argument to $[0,1]$, and for some $\Delta\in (0,1/4]$ to be chosen later,
\begin{align*}
\mu_j = \frac12 + \one{j=1} \Delta, \qquad (j\in [K])\,.
\end{align*}
The “means” $(\mu_j')_j$ will also be chosen later. Note that apart from clipping, $(X_{ti})_t$ (and also $(X_{ti}')_t$) is merely a shifted version of $(\eta_t)_t$. In particular, thanks to $\Delta>0$, $X_{t1}\ge X_{tj}$ for any $t,j$. Moreover, $X_{t1}$ exceeds $X_{tj}$ by $\Delta$ when neither of them is clipped:
\begin{align}
X_{t1}\ge X_{tj} + \Delta\, \one{\eta_t\in [-1/2,1/2-\Delta]}\,, \quad t\in [n], j\ne 1\,.
\label{eq:hplb_rewardgapxt}
\end{align}
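The construction of $(X_{tj})$ is easy to simulate. The Python sketch below is an illustration under assumed parameter values ($n=1000$, $K=4$, $\Delta=\sigma=0.1$, a fixed seed); the function names are ours, arm index `0` plays the role of arm $1$ in the text, and `gap` stands for $\Delta$. It draws one shared noise sequence, builds the clipped rewards, and checks the two claims just made: arm $1$ is never worse than any other arm, and it is better by $\Delta$ in every unclipped round.

```python
import random


def clip(x: float) -> float:
    """Clip the argument to [0, 1]."""
    return max(min(x, 1.0), 0.0)


def sample_rewards(n, K, gap, sigma, rng):
    """Return (rewards, noise) with rewards[t][j] = clip(mu_j + eta_t).

    The same noise eta_t is shared by all arms in round t, which is what
    correlates the rewards across actions.
    """
    mu = [0.5 + (gap if j == 0 else 0.0) for j in range(K)]
    noise = [rng.gauss(0.0, sigma) for _ in range(n)]
    rewards = [[clip(m + eta) for m in mu] for eta in noise]
    return rewards, noise


rng = random.Random(0)
gap, sigma = 0.1, 0.1
X, eta = sample_rewards(1000, 4, gap, sigma, rng)

# Arm 1 dominates in every round, clipped or not.
dominates = all(row[0] >= x for row in X for x in row)

# In unclipped rounds the advantage is the full gap (up to rounding error).
unclipped = [-0.5 <= e <= 0.5 - gap for e in eta]
full_gap = all(row[0] >= row[j] + gap * u - 1e-12
               for row, u in zip(X, unclipped) for j in range(1, 4))
print(dominates, full_gap, sum(unclipped))
```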

Now, define $(A_t)_t$ to be the random actions that arise from the interaction of $\pi$ and $(X_t)_t$ and let $i = \argmin_{j>1} \EE{ T_j(n) }$ where $T_i(n) = \sum_{t=1}^n \one{A_t=i}$. As before, $\EE{T_i(n)}\le n/(K-1)$. Choose
\begin{align*}
\mu_j' = \mu_j + \one{j=i}\, 2\Delta\,, \quad j\in [K]\,,
\end{align*}
so that $X_{ti}'\ge X_{tj}'$ for $j\in [K]$ and furthermore
\begin{align}
X_{ti}'\ge X_{tj}' + \Delta\, \one{\eta_t\in [-1/2,1/2-2\Delta]}\,, \quad t\in [n], j\ne i\,.
\label{eq:hplb_rewardgapxtp}
\end{align}

Denote by $\hat R_n = \max_j \sum_t X_{tj} - \sum_t X_{t,A_t}$ the random regret of $\pi$ when interacting with $X = (X_1,\dots,X_n)$ and let $\hat R_n' = \max_j \sum_t X'_{tj} - \sum_t X'_{t,A_t'}$ be the random regret of $\pi$ when interacting with $X' = (X_1',\dots,X_n')$, where $(A_t')_t$ are the actions that arise from the interaction of $\pi$ and $(X_t')_t$.
Thus, by the randomization device lemma, it suffices to prove that either $\Prob{ \hat R_n \ge u }\ge \delta$ or $\Prob{ \hat R_n' \ge u} \ge \delta$.

Illustration of the idea of the lower bound proof. An elf randomly chooses a pair of environments from the bag of all possible clipped Gaussian environments and feeds the environment into the policy, which is also randomized. The policy spits out the random regret for both environments. For at least one of the two environments, many of the random regrets will be high with a high chance.

By our earlier remarks, $\hat R_n = \sum_t X_{t1} - \sum_t X_{t,A_t}$ and $\hat R_n' = \sum_t X_{ti}' - \sum_t X'_{t,A_t'}$. Define $U_t =\one{\eta_t\in [-1/2,1/2-\Delta]}$, $U_t'=\one{\eta_t\in [-1/2,1/2-2\Delta]}$, $A_{tj} = \one{A_t=j}$ and $A_{tj}' = \one{A_t'=j}$. From \eqref{eq:hplb_rewardgapxt} we see that
\begin{align*}
\hat R_n \ge \Delta\, \sum_t \one{A_t\ne 1} U_t =\Delta\, \sum_t (1-A_{t1}) U_t \ge \Delta\,(U - T_1(n)) \ge \Delta\,(U + T_i(n) - n)\,,
\end{align*}
where we also defined $U = \sum_t U_t$ and used that $T_1(n)+T_i(n)\le n$. Note that $U_t=1$ indicates that $(X_{tj})_j$ are “unclipped”. Similarly, from \eqref{eq:hplb_rewardgapxtp} we see that
\begin{align*}
\hat R_n' \ge \Delta \, \sum_t \one{A_t'\ne i} U_t' =\Delta\, \sum_t (1-A_{ti}') U_t' \ge \Delta\,( U' - T_i'(n)) \,,
\end{align*}
where $T_i'(n)=\sum_t A_{ti}'$ and $U' = \sum_t U_t'$. Based on the lower bounds on $\hat R_n$ and $\hat R_n'$ we thus see that if $T_i(n)\ge n/2$ and $U\ge 3n/4$ then $\hat R_n \ge u\doteq \frac{n\Delta}{4}$ and if $T_i'(n) < n/2$ and $U'\ge 3n/4$ then $\hat R_n' \ge u$ holds, too. Thus, from union bounds,
\begin{align*}
\Probng{ \hat R_n \ge u } &\ge \Prob{ T_i(n)\ge n/2, U\ge 3n/4 } \ge \Prob{ T_i(n)\ge n/2 } - \Prob{U < 3n/4}\,,\\
\Probng{ \hat R_n' \ge u } &\ge \Prob{ T_i'(n) < n/2, U'\ge 3n/4 } \ge \Prob{ T_i'(n) < n/2 } - \Prob{U' < 3n/4}\,.
\end{align*}
Noticing that $U'\le U$ and hence $\Prob{U<3n/4}\le \Prob{U' <3n/4}$, we get
\begin{align*}
\max(\Probng{ \hat R_n \ge u },\Probng{ \hat R_n' \ge u }) & \ge \frac12 \Bigl(\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }\Bigr) - \Prob{U' < 3n/4}\,.
\end{align*}
The sum $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ will be lower bounded with the help of the high-probability Pinsker inequality. For an upper bound on $\Prob{U' < 3n/4}$, we have the following technical lemma:

Lemma (Control of the number of unclipped rounds): Assume that $\Delta\le 1/8$. If $n \ge 32 \log(2/\delta)$ and $\sigma\le 1/10$ then $\Prob{U'<3n/4} \le \delta/2$.

The proof is based on bounding the tail probability of the mean of the i.i.d. Bernoulli variables $(U_t')_t$ using Hoeffding’s inequality and is given later. Intuitively, it is clear that by making $\sigma^2$ small, the number of times $\eta_t$ falls inside $[-1/2,1/4]\subset [-1/2,1/2-2\Delta]$ can be made arbitrarily close to $n$ with arbitrarily high probability.
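The numbers in the lemma can be checked directly. The Python sketch below (function names are ours) computes the exact probability that a round is unclipped for the worst case allowed by the lemma, $\Delta=1/8$ and $\sigma=1/10$, using the Gaussian distribution function written via `math.erf`, and evaluates the Hoeffding bound $\exp(-n/32)$ at the smallest admissible $n$. The value $\delta = 0.05$ is an arbitrary choice for illustration.

```python
import math


def normal_cdf(x: float, sigma: float) -> float:
    """Distribution function of N(0, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def unclipped_prob(gap: float, sigma: float) -> float:
    """P(eta in [-1/2, 1/2 - 2*gap]) for eta ~ N(0, sigma^2), i.e. P(U'_t = 1)."""
    return normal_cdf(0.5 - 2.0 * gap, sigma) - normal_cdf(-0.5, sigma)


def hoeffding_bound(n: int) -> float:
    """Bound exp(-n/32) on P(U' < 3n/4), valid once P(U'_t = 1) >= 7/8."""
    return math.exp(-n / 32.0)


def min_rounds(delta: float) -> int:
    """Smallest integer n with n >= 32 log(2/delta)."""
    return math.ceil(32.0 * math.log(2.0 / delta))


delta = 0.05
p = unclipped_prob(gap=1.0 / 8.0, sigma=0.1)
n = min_rounds(delta)
print(p, n, hoeffding_bound(n), delta / 2.0)
```

The exact value of $p$ is much closer to one than the $7/8$ used in the proof, so the lemma's conditions are comfortable rather than tight.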

Our goal now is to lower bound $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ by $3\delta$. As suggested before, we aim to use the high-probability Pinsker inequality. One difficulty that we face is that the events $\{T_i(n)\ge n/2\}$ and $\{T_i'(n) < n/2\}$ may not be complementary as they are defined in terms of a potentially distinct set of random variables. This will be overcome by rewriting the above probabilities using the canonical bandit probability spaces. In fact, we will use the non-extended version of these probability spaces (as defined earlier in the context of stochastic bandits). The reason for this is that after Pinsker, we plan to apply the divergence decomposition identity, which decomposes the divergence between distributions of action-reward sequences.

To properly write things let $Q_j$ denote the distribution of $X_{tj}$ and similarly let $Q_j'$ be the distribution of $X_{tj}'$. These are well defined (why?). Define the stochastic bandits $\beta=(Q_1,\dots,Q_K)$ and $\beta'=(Q'_1,\dots,Q'_K)$. Let $\Omega_n = ([K]\times \R)^n$ and let $\tilde Y_t,\tilde A_t:\Omega_n \to \R$ be the usual coordinate projection functions: $\tilde Y_t(a_1,y_1,\dots,a_n,y_n) = y_t$ and $\tilde A_t(a_1,y_1,\dots,a_n,y_n)=a_t$. Also, let $\tilde T_i(n) = \sum_{t=1}^n \one{\tilde A_t=i}$. Recall that $\PP_{\beta,\pi}$ denotes the probability measure over $\Omega_n$ that arises from the interaction of $\pi$ and $\beta$ (and likewise for $\PP_{\beta',\pi}$). Now, since $T_i(n)$ is only a function of $(A_1,X_{1,A_1},\dots,A_n,X_{n,A_n})$ whose probability distribution is exactly $\PP_{\beta,\pi}$, we have
\begin{align*}
\Prob{ T_i(n) \ge n/2 } = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 )\,.
\end{align*}
Similarly,
\begin{align*}
\Prob{ T_i'(n) < n/2 } = \PP_{\beta',\pi}( \tilde T_i(n) < n/2 )\,.
\end{align*}
Now, by the high-probability Pinsker inequality and the divergence decomposition lemma,
\begin{align*}
\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }
& = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 ) + \PP_{\beta',\pi}( \tilde T_i(n) < n/2 ) \\
& \ge \frac12 \exp\left(- \KL( \PP_{\beta,\pi}, \PP_{\beta',\pi} ) \right) \\
& = \frac12 \exp\left(- \E_{\beta,\pi}[\tilde T_i(n)] \KL( Q_i, Q_i' ) \right) \\
& \ge \frac12 \exp\left(- \EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) ) \right)\,,
\end{align*}
where in the last equality we used that $Q_j=Q_j'$ unless $j=i$, while in the last step we used $\E_{\beta,\pi}[\tilde T_i(n)] = \EE{ T_i(n) }$ and also that $\KL( Q_i, Q_i' ) \le \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )$. Where does the last inequality come from, one might ask. The answer is the clipping, which can only reduce information. More precisely, let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \cF)$, let $X:\Omega \to \R$ be a random variable, and let $P_X$ and $Q_X$ be the laws of $X$ under $P$ and $Q$ respectively. Then $\KL(P_X, Q_X) \leq \KL(P, Q)$. Applying this with $P=\mathcal N(\tfrac12,\sigma^2)$, $Q=\mathcal N(\tfrac12+2\Delta,\sigma^2)$ and $X=\clip$ gives the claimed inequality.
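The claim that clipping can only reduce the relative entropy can be checked numerically. A clipped Gaussian has an atom at $0$, an atom at $1$, and a density on $(0,1)$, so its relative entropy splits into two atom terms plus an integral. The Python sketch below (our own helper names, a simple midpoint rule) compares this with the closed form $(\mu_1-\mu_2)^2/(2\sigma^2)$ for unclipped Gaussians. With the post's $\sigma=1/10$ clipping is rare and the two values are very close, so the sketch uses the deliberately exaggerated, purely illustrative values $\sigma=0.3$ and means $0.5$ and $0.9$ to make the effect visible.

```python
import math


def cdf(x: float, mu: float, sigma: float) -> float:
    """Distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kl_gauss(mu1: float, mu2: float, sigma: float) -> float:
    """KL(N(mu1, s^2), N(mu2, s^2)) in closed form."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)


def kl_clipped(mu1: float, mu2: float, sigma: float, steps: int = 20_000) -> float:
    """KL between the laws of clip(Z1) and clip(Z2), Z_k ~ N(mu_k, s^2)."""
    p0, q0 = cdf(0.0, mu1, sigma), cdf(0.0, mu2, sigma)              # atoms at 0
    p1, q1 = 1.0 - cdf(1.0, mu1, sigma), 1.0 - cdf(1.0, mu2, sigma)  # atoms at 1
    total = p0 * math.log(p0 / q0) + p1 * math.log(p1 / q1)
    h = 1.0 / steps
    for k in range(steps):  # midpoint rule on (0, 1)
        x = (k + 0.5) * h
        # log of the density ratio, written out to avoid dividing tiny numbers
        log_ratio = ((x - mu2) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)
        total += h * pdf(x, mu1, sigma) * log_ratio
    return total


sigma, mu1, mu2 = 0.3, 0.5, 0.9  # illustrative values, not the post's
print(kl_clipped(mu1, mu2, sigma), kl_gauss(mu1, mu2, sigma))
```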

Now, by the choice of $i$,
\begin{align*}
\EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )
\le \frac{n}{K-1} \frac{2\Delta^2}{\sigma^2}\,.
\end{align*}
Plugging this into the previous display we get that if
\begin{align*}
\Delta = \sigma \sqrt{\frac{K-1}{2n} \log\frac{1}{6\delta}}
\end{align*}
then $\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }\ge 3\delta$ and thus $\max(\Prob{ \hat R_n \ge u },\Prob{ \hat R_n'\ge u} )\ge \delta$. Recalling the definition $u = n\Delta/4$ and choosing $\sigma=1/10$ gives the following result:

Theorem (High probability lower bound for adversarial bandits): Let $K>1$, $n>0$ and $\delta\in (0,1)$ such that
\begin{align*}
n\ge \max\left( 32 \log \frac2{\delta}, (0.8)^2 \frac{K-1}{2} \log \frac{1}{6\delta}\right)
\end{align*}
holds. Then, for any bandit policy $\pi$ there exists a reward sequence $\nu = (x_1,\dots,x_n)\in [0,1]^{nK}$ such that if $\hat R_n$ is the random regret of $\pi$ when interacting with the environment $\nu$ then
\begin{align*}
\Prob{ \hat R_n \ge 0.025\sqrt{\frac{n(K-1)}{2}\log \frac{1}{6\delta}} } \ge \delta\,.
\end{align*}
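The constants in the theorem follow from a few lines of arithmetic, which the Python sketch below retraces. The function names are ours, `gap` stands for $\Delta$, and the example values $n=10000$, $K=10$, $\delta=0.01$ are arbitrary. The formulas are taken from the derivation above and implicitly require $\delta<1/6$ so that $\log\frac{1}{6\delta}$ is positive.

```python
import math

SIGMA = 0.1  # the noise level chosen in the text


def min_rounds(K: int, delta: float) -> float:
    """Right-hand side of the condition on n in the theorem."""
    return max(32.0 * math.log(2.0 / delta),
               0.8 ** 2 * (K - 1) / 2.0 * math.log(1.0 / (6.0 * delta)))


def gap(n: int, K: int, delta: float, sigma: float = SIGMA) -> float:
    """Delta = sigma * sqrt((K-1)/(2n) * log(1/(6 delta)))."""
    return sigma * math.sqrt((K - 1) / (2.0 * n) * math.log(1.0 / (6.0 * delta)))


def pinsker_lower_bound(n: int, K: int, delta: float, sigma: float = SIGMA) -> float:
    """(1/2) exp(-(n/(K-1)) * 2 Delta^2 / sigma^2), which should equal 3 delta."""
    d = gap(n, K, delta, sigma)
    return 0.5 * math.exp(-(n / (K - 1)) * 2.0 * d ** 2 / sigma ** 2)


def regret_threshold(n: int, K: int, delta: float) -> float:
    """u = n * Delta / 4, the regret level appearing in the theorem."""
    return n * gap(n, K, delta) / 4.0


n, K, delta = 10_000, 10, 0.01
print(min_rounds(K, delta))               # n must be at least this large
print(gap(n, K, delta))                   # must not exceed 1/8
print(pinsker_lower_bound(n, K, delta))   # equals 3 * delta up to rounding
print(regret_threshold(n, K, delta))
```

The second term in `min_rounds` is exactly the condition $\Delta\le 1/8$ rewritten in terms of $n$ for $\sigma = 1/10$.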

So what can one take away from this post? The main thing is to recognize that the upper bounds we proved in the previous post cannot be improved very much, at least in this worst case sense. This includes the important difference between the high-probability regret that is achievable when the confidence level $\delta$ is chosen in advance and what is possible if a single strategy must satisfy a high-probability regret guarantee for all confidence levels simultaneously.

Besides this result we also introduced some new techniques that will be revisited in the future, especially the randomization device lemma. The advantage of using clipped and correlated Gaussian rewards is that it ensures the same arm is always optimal, no matter how the noise behaves.

# Technicalities

The purpose of this section is to lay to rest the two technical results required in the main body. The first is the proof of the lemma that gives us the randomization technique or “device”; the second is the proof of the lemma that controls the number of unclipped rounds.

Proof of the randomization device lemma

The argument underlying this goes as follows: If $A=(A_1,\dots,A_n)\in [K]^n$ are the actions of a stochastic policy $\pi=(\pi_1,\dots,\pi_n)$ when interacting with the environment where the rewards $X=(X_1,\dots,X_n)\in \R^{nK}$ are drawn from $Q$ then for $t=1,\dots,n$, $A_t\sim \pi_t(\cdot|A_1,X_{1,A_1},\dots,A_{t-1},X_{t-1,A_{t-1}})$ and thus the distribution $\PP_{Q,\pi}$ of $(X,A)$ satisfies
\begin{align*}
d\PP_{Q,\pi}(x,a)
&= \pi(a|x) d\rho^{\otimes n}(a) dQ(x) \,,
\end{align*}
where $a=(a_1,\dots,a_n)\in [K]^n$, $x=(x_1,\dots,x_n)\in \R^{nK}$,
\begin{align*}
\pi(a|x)\doteq
\pi_1(a_1) \pi_2(a_2|a_1,x_{1,a_1}) \cdots \pi_n(a_n|a_1,x_{1,a_1}, \dots,a_{n-1},x_{n-1,a_{n-1}})
\end{align*}
and $\rho^{\otimes n}$ is the $n$-fold product of $\rho$ with itself, where $\rho$ is the counting measure on $[K]$. Letting $\delta_x$ be the Dirac (probability) measure on $\R^{nK}$ concentrated at $x$ (i.e., $\delta_x(U) = \one{x\in U}$), we have that $\PP_{Q,\pi}$ can be disintegrated into $Q$ and $\{\PP_{\delta_x,\pi}\}_x$. In particular, a direct calculation verifies that
\begin{align}
d\PP_{Q,\pi}(x,a) = \int_{y\in \R^{nK}} dQ(y) \, d\PP_{\delta_y,\pi}(x,a) \,.
\label{eq:disintegration}
\end{align}

Let $(X_t,A_t)$ be the reward and action of round $t$ in the extended canonical bandit probability space and $\hat R_n$ the random regret defined in terms of these random variables. For any Borel $U\subset \R$,
\begin{align*}
\PP_{Q,\pi}( \hat R_n \in U )
&= \int \one{\hat R_n(x,a)\in U} d\PP_{Q,\pi}(x,a) \\
&= \int_{\R^{nK}} dQ(y) \left( \int_{\R^{nK}\times [K]^n} \one{\hat R_n(x,a)\in U} d\PP_{\delta_y,\pi}(x,a) \right)\\
&= \int_{\R^{nK}} dQ(y) \PP_{\delta_y,\pi}( \hat R_n\in U )\,,
\end{align*}
where the second equality uses \eqref{eq:disintegration} and Fubini. From the above equality it is obvious that it is not possible that $\PP_{Q,\pi}( \hat R_n \in U )\ge \delta$ while for all $y\in \mathrm{support}(Q)$, $\PP_{\delta_y,\pi}( \hat R_n\in U )<\delta$, thus finishing the proof. QED

Proof of the lemma controlling the number of unclipped rounds: First note that $U'\le U$ and thus $\Prob{U < 3n/4}\le \Prob{U' < 3n/4}$, hence it suffices to control the latter. Let $p = \Prob{U_t'=1}$. Since $\Delta\le 1/8$ and $\eta_t$ is a Gaussian with zero mean and variance $\sigma^2$, and in particular $\eta_t$ is $\sigma^2$-subgaussian, we have
\begin{align*}
1 - p = \Prob{U_t' = 0} & \leq \Prob{ |\eta_t| > 1/2-2\Delta }
\leq 2 \exp\left(-\frac{\left(1/2 - 2\Delta\right)^2}{2\sigma^2}\right)\\
& \leq 2 \exp\left(-\frac{1}{2 (4)^2 \sigma^2}\right)
\le \frac{1}{8}\,,
\end{align*}
where the last inequality follows whenever $\sigma^2 \le \frac{1}{32 \log 16}$ which is larger than $0.01$. Therefore $p \ge 7/8$ and
\begin{align*}
\Prob{\sum_{t=1}^n U_t' < \frac{3n}{4}}
&= \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) < -(p-\frac{3}{4}) } \\
&\le \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) \le - \frac{1}{8}} \\
&\leq \exp\left(-\frac{n}{32}\right) \leq \frac{\delta}{2}\,,
\end{align*}
where the second-to-last inequality uses Hoeffding’s bound together with the fact that $U_t'-p$ is $1/4$-subgaussian, and the last holds by our assumption on $n$.
QED

# Notes

Note 1: It so happens that the counter-example construction we used means that the same arm has the best reward in every round (not just the best mean). It is perhaps a little surprising that algorithms cannot exploit this fact, in contrast to the experts setting where this knowledge enormously improves the achievable regret.

Note 2: Adaptivity is all the rage right now. Can you design an adversarial bandit algorithm that exploits “easy data” when available? For example, if the rewards don’t change much over time, or lie in a small range. There are still a lot of open questions in this area. The paper referenced below gives lower bounds for some of these situations.

# References

Sebastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. 2016