# Table of Contents

- Bandits: A new beginning
- Finite-armed stochastic bandits: Warming up
- First steps: Explore-then-Commit
- The Upper Confidence Bound (UCB) Algorithm
- Optimality concepts and information theory
- More information theory and minimax lower bounds
- Instance dependent lower bounds
- Adversarial bandits
- High probability lower bounds
- Contextual bandits, prediction with expert advice and Exp4
- Stochastic Linear Bandits and UCB
- Ellipsoidal confidence sets for least-squares estimators
- Sparse linear bandits
- Lower bounds for stochastic linear bandits
- Adversarial linear bandits
- Adversarial linear bandits and the curious case of linear bandits on the unit ball

# Bandits: A new beginning

Dear Interested Reader,

Together with Tor, we have worked a lot on **bandit problems** in the past and developed a true passion for them. Under pressure from some friends and students (and a potential publisher), and also just to have some fun, we are developing a new graduate course devoted to this subject. This semester, I am teaching this course at the University of Alberta (UofA) and next semester Tor will do the same at Indiana University, where he just moved after finishing his post-doc at the UofA. The focus of the course will be on understanding the core ideas, mathematics and implementation details for current state-of-the-art algorithms. As we go, we plan to update this site on a weekly basis, describing what was taught in the given week, stealing the idea from Seb (e.g., see his excellent posts about convex optimization). The posts should appear **around Sunday**. Eventually, we hope that the posts will also form the basis of a new book on bandits that we are very excited about.

For now, we would like to **invite everyone** interested in bandit problems to follow this site, **give us feedback** by commenting on these pages, ask questions, make suggestions for other topics, or criticize what we write. In other words, we wish to leverage the wisdom of the crowd in this adventure to help us make the course better.

So this is the high level background. Today, in the remainder of this post I will first briefly motivate why anyone should care about bandits and look at where the name comes from. Next, I will introduce the formal language that we will use later and finish by peeking into what will happen in the rest of the semester. This is pretty basic stuff. To whet your appetite, next week, we will continue with a short review of probability theory and concentration results, including a fuss-free crash course on measure-theoretic probability in 30 minutes or so. These topics form the necessary background as we will first learn about the so-called stochastic bandit problems where one can get lost very easily without proper mathematical foundations. The level of discussion will be intended for anyone with undergraduate training in probability. By the end of the week, we will learn about the explore-then-exploit strategies and the upper-confidence bound algorithm.

# Applications and origin

**Why should we care about bandit problems**? Decision making in the face of uncertainty is a significant challenge in machine learning. Which drugs should a patient receive? How should I allocate my study time between courses? Which version of a website will generate the most revenue? What move should be considered next when playing chess/go? All of these questions can be expressed in the multi-armed bandit framework where a learning agent sequentially takes actions, observes rewards and aims to maximise the total reward over a period of time. The framework is now very popular, used in practice by big companies, and growing fast. In particular, Google Scholar reports 1000, 2500, and 7700 papers when searching for the phrase bandit algorithm for the periods of 2001-2005, 2006-2010, and 2011-present (see the figure below), respectively. Even if these numbers are somewhat overblown, they indicate that the field is growing rapidly. This could be a fashion, or maybe there is something interesting happening here. We think that the latter is true!

Fine, so maybe you decided to care about bandit problems. But what are they exactly? Bandit problems were introduced by **William R. Thompson** (one of our heroes, whose name we will see popping up again soon) in a paper in 1933 for the so-called Bayesian setting (that we will also talk about later). Thompson was interested in medical trials and the cruelty of running a trial “blindly”, without adapting the treatment allocations on the fly as the drug appears more or less effective.

Clinical trials are thus one of the first intended applications. But why would anyone call problems like optimizing drug allocation a “bandit-problem”? The name comes from the 1950s when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans. The mice faced the dilemma of choosing to go left or right, after starting in the bottom of a T-shaped maze, not knowing each time at which end they would find food. To study a similar learning setting in humans, a “two-armed bandit” machine was commissioned where humans could choose to pull either the left or the right arm of the machine, each giving a random payoff with the distribution of payoffs for each arm unknown to the human player. The machine was called a “two-armed bandit” in homage to the one-armed bandit, an old-fashioned name for a lever-operated slot machine (“bandit” because they steal your money!).

Now, imagine that you are playing on this two-armed bandit machine and you already pulled each lever 5 times, resulting in the following payoffs:

Left arm: 0, 10, 0, 0, 10

Right arm: 10, 0, 0, 0, 0

The left arm appears to be doing a little better: The average payoff for this arm is 4 (say) dollars per round, while that of the right arm is only 2 dollars per round. Let’s say, you have 20 more trials (pulls) altogether. How would you pull the arms in the remaining trials? Will you keep pulling the left arm, ignoring the right (owing to its initial success), or would you pull the right arm a few more times, thinking that maybe the initial poorer average payoff is due to “bad luck” only? This illustrates the interest in bandit problems: They capture the fundamental dilemma a learner faces when choosing between uncertain options. Should one **explore** an option that looks inferior or should one **exploit** one’s knowledge and just go with the currently best looking option? How to maintain the balance between exploration and exploitation is at the heart of bandit problems.

# The language of bandits

There are many real-world problems where a learner repeatedly chooses between options with uncertain payoffs with the sole goal of maximizing the total gain over some period of time. However, the world is a messy place. To gain some principled understanding of the learner’s problem, we will often make extra modelling assumptions to restrict how the rewards are generated. To be able to talk about these assumptions, we need a more formal language. In this formal language we are talking about the **learner** as someone (or something) that interacts with an **environment** as shown in the figure below.

In the figure the learner is represented by a box on the left and the environment is represented by a box on the right. Two arrows connect the boxes, representing the interaction between the learner and the environment. What the figure fails to capture is that the interaction happens in a sequential fashion over a number of discrete rounds. In particular, the specification starts with the following **interaction protocol**:

For **rounds** $t=1,2,\dots,n$:

1. Learner chooses an **action** $A_t$ from a set $\cA$ of available actions. The chosen action is sent to the environment;

2. The environment generates a response in the form of a real-valued **reward** $X_t\in \R$, which is sent back to the learner.

The goal of the learner is to maximize the sum of rewards that it receives, $\sum_{t=1}^n X_t$.

A couple of things need to be clarified that are hidden by the above short description: When in step 2 we say that the reward is sent back to the learner, we mean that the learner gets to know this number and can thus use it in the future rounds when it needs to decide which action to take. Thus, in step 1 of round $t$, the learner’s decision is based on the **history** of interaction up to the end of round $t-1$, i.e., on $H_{t-1} = (A_1,X_1,\dots,A_{t-1},X_{t-1})$.

Our goal is to equip the learner with a **learning algorithm** to maximize its reward. Most of the time, the word “algorithm” will not be taken too seriously in the sense that a “learning algorithm” will be viewed as a mapping of possible histories to actions (possibly randomized). Nevertheless, throughout the course we will keep an eye on discussing whether such maps can be efficiently implemented on computers (justifying the name “algorithms”).
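To make the protocol concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the names `interact`, `alternating_policy` and `noisy_environment` are hypothetical, actions are indexed from zero, and a learning algorithm is modelled literally as a function from the history $H_{t-1}$ to an action.

```python
import random
from typing import Callable, List, Tuple

# The history H_{t-1} = (A_1, X_1, ..., A_{t-1}, X_{t-1}), stored as pairs.
History = List[Tuple[int, float]]

def interact(policy: Callable[[History], int],
             environment: Callable[[History, int], float],
             n: int) -> History:
    """Run the interaction protocol for n rounds and return the history H_n."""
    history: History = []
    for _ in range(n):
        action = policy(history)               # step 1: learner picks A_t based on H_{t-1}
        reward = environment(history, action)  # step 2: environment responds with X_t
        history.append((action, reward))       # the learner gets to know X_t
    return history

# Hypothetical example: two actions, a learner that simply alternates,
# and an environment paying a noisy reward that depends on the action.
rng = random.Random(0)

def alternating_policy(history: History) -> int:
    return len(history) % 2

def noisy_environment(history: History, action: int) -> float:
    return (1.0 if action == 1 else 0.5) + rng.gauss(0.0, 0.1)

history = interact(alternating_policy, noisy_environment, n=10)
total_reward = sum(x for _, x in history)   # the quantity the learner wants to maximize
```

Any smarter learner only needs to replace `alternating_policy`; the loop itself does not change.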

The next question to discuss is how to evaluate a learner (to simplify language, we identify learners and learning algorithms). One idea is to measure learning speed by what is called the **regret**. The regret of a learner acting in an environment is relative to some action. We will start with a very informal definition (and hope you forgive us). The reason is that there are a number of related definitions, all quite similar, and we do not want to be bogged down with the details in this introductory post.

The **regret** of a learner relative to action $a$ = (total reward gained when $a$ is used for all $n$ rounds) − (total reward gained by the specific learner in $n$ rounds according to its chosen actions).

We often calculate the **worst-case regret** over the possible actions $a$, though in stochastic environments, we may first take expectations for each action and then calculate the worst-case over the actions (later we should come back to discussing the difference between taking the expectation first and then the maximum, versus taking the maximum first and then the expectation). For a fixed environment, the worst-case regret is just a shift and a sign change of the total reward: Maximizing reward is the same as minimizing regret.
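The informal definition can be sketched in a few lines of Python. The sketch assumes a fixed environment given as a table where `rewards[t][a]` is the reward that action `a` would earn in round `t`; this table, the function names and the numbers are hypothetical and serve only as illustration.

```python
from typing import Sequence

def regret_relative_to(rewards: Sequence[Sequence[float]],
                       actions: Sequence[int], a: int) -> float:
    """Total reward of always playing `a` minus the learner's total reward."""
    reward_of_a = sum(row[a] for row in rewards)
    reward_of_learner = sum(row[a_t] for row, a_t in zip(rewards, actions))
    return reward_of_a - reward_of_learner

def worst_case_regret(rewards: Sequence[Sequence[float]],
                      actions: Sequence[int]) -> float:
    """Largest regret over all fixed comparison actions."""
    num_actions = len(rewards[0])
    return max(regret_relative_to(rewards, actions, a) for a in range(num_actions))

# Hypothetical table with n = 4 rounds and two actions (indexed 0 and 1).
rewards = [[0, 1], [1, 0], [0, 1], [0, 1]]
actions = [0, 0, 0, 1]   # the learner's choices A_1, ..., A_4

regret_vs_0 = regret_relative_to(rewards, actions, 0)
regret_vs_1 = regret_relative_to(rewards, actions, 1)
worst = worst_case_regret(rewards, actions)
```

Here the learner collects a total reward of 2, while always playing the second action would have collected 3, so the worst-case regret is 1; relative to the first action the regret is negative.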

A good learner achieves **sublinear worst-case regret**: That is, if $R_n$ is the regret of a learner after $n$ rounds of interaction, then $R_n = o(n)$. In other words, the average per-round regret, $R_n/n$, converges to zero as $n\to\infty$. Of major interest will be the rate of growth of regret for various classes of environments, as well as for various algorithms, and in particular, how small the regret can be over some class of environments and which algorithms can match such **regret lower bounds**.

The astute reader may notice that regret compares payoffs of **fixed actions** to the total payoff of the learner: In particular, in a non-stationary environment, a learner that can change what action it uses over time may achieve negative worst-case regret! In other words, the assumption that the world is **stationary** (i.e., there is no seasonality or other pattern in the environment, there is no payoff drift, etc.) is built into the above simple concept of regret. The notion of regret can be, and has been, extended to make it more meaningful in non-stationary environments, but as a start, we should be content with the assumption of stationarity. Another point of discussion is whether in designing a learning algorithm we can use the **knowledge of the time horizon** $n$ or not. In some learning settings assuming this is perfectly fine (for example, businesses can have natural business cycles such as weeks or months), but sometimes assuming knowledge of the time horizon is unnatural. In those cases we will be interested in algorithms that work well without knowledge of the time horizon, also known as **anytime algorithms**. As we will see, sometimes knowledge of the time horizon will make a difference when it comes to designing learning algorithms, though usually we find that the difference is small.

Let’s say we are fine with stationarity. Still, **why should one use (worst-case) regret** as opposed to simply resorting to comparing learners by the total reward they collected, especially since we said that minimizing the regret is the same as maximizing total reward? Why complicate matters with talking about regret then? First, regret facilitates comparisons across environments by doing a bit of normalization: One can shift rewards by an arbitrary constant amount in some environment and the regret does not change, whereas total reward would be impacted by this change (note that scale is still an issue: multiplying the rewards by a positive constant number, the regret also gets multiplied, so usually we assume some fixed scale, or we need to normalize regret with the scale of rewards). More important however is that regret is well aligned with the intuition that learning should be easy in environments where either *(i)* all actions pay similar amounts (all learners do well in such environments and all of them will have low regret), and also in environments where *(ii)* some actions pay significantly more than others (because in such environments it is easy to find the high-paying action or actions, so many learners will have small regret). As a result, the worst-case regret over various classes of environment is a meaningful quantity, as opposed to worst-case total reward which is vacuous. Generally, regret allows one to study learning in worst-case settings. Whether this leads to conservative behavior is another interesting point for further discussion (and is an active topic of research).
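The normalization remark can be checked with a short calculation. The notation $x_t(a)$ for the reward that action $a$ would earn in round $t$, and $R_n(a)$ for the regret relative to $a$, is introduced here only for illustration, and we assume that the learner picks the same actions $A_1,\dots,A_n$ before and after the rewards are transformed. Shifting every reward by a constant $c$ gives

\begin{align*}
\sum_{t=1}^n (x_t(a) + c) - \sum_{t=1}^n (x_t(A_t) + c) = \sum_{t=1}^n x_t(a) - \sum_{t=1}^n x_t(A_t) = R_n(a),
\end{align*}

while multiplying every reward by a constant $c>0$ gives

\begin{align*}
\sum_{t=1}^n c\, x_t(a) - \sum_{t=1}^n c\, x_t(A_t) = c\, R_n(a).
\end{align*}

The learner's total reward, in contrast, changes by $nc$ under the shift.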

With this, let me turn to discussing the typical restrictions on the **bandit environments**. A particularly simple, yet appealing and rich problem setting is that of **stochastic, stationary bandits**. In this case the environment is restricted to generate the reward in response to each action from a distribution that is specific to that action, and independently of the previous action choices (of the possibly randomizing learner) and rewards. Sometimes, this is also called “stochastic **i.i.d.**” bandits since for a given action, any reward for that action is independent of all the other rewards of the same action and they are all generated from identical distributions (that is, there is one distribution per action). Since “stochastic, stationary” and “stochastic i.i.d.” are a bit too long, in the future, we will just refer to this setting as that of “**stochastic bandits**”. This is the problem setting that we will start discussing next week.

For some applications the assumption that the rewards are generated in a stochastic and stationary way may be too restrictive. In particular, **stochastic assumptions can be hard to justify in the real-world**: The world, for most of what we know about it, is deterministic, even if it is hard to predict and often chaotic looking. Of course, stochasticity has been enormously successful in explaining mass phenomena and patterns in data, and for some this may be sufficient reason to keep it as the modeling assumption. But what if the stochastic assumptions fail to hold? What if they are violated for a single round? Or just for one action, at some rounds? Will all our results become suddenly vacuous? Or will the algorithms developed be robust to smaller or larger deviations from the modeling assumptions? One approach, which is admittedly quite extreme, is to **drop all the assumptions** on how rewards are assigned to arms. More precisely, some minimal assumptions are kept, like that the rewards lie in a bounded interval, or that the reward assignment is done before the interaction begins, or simultaneously with the choice of the learner’s action. In any case, if these are the only assumptions, we get what is called the setting of **adversarial bandits**. Can we still say meaningful things about the various algorithms? Can learning algorithms achieve nontrivial results? Can their regret be sublinear? It turns out that, perhaps surprisingly, this is possible. On second thought, though, this may not be **that** surprising: After all, sublinear regret requires only that the learner uses the **single best action** (best in hindsight) for most of the time, but the learner can still use any other actions for a smaller (and vanishing) fraction of time. What saves learners then is that the identity of the action with the highest total payoff in hindsight cannot change too abruptly and too often, as averages cannot be changed too much as long as the summands in the average are bounded.

# Course contents

Let’s close with what the course will cover. The course will have **three major blocks**. In the first block, we will discuss **finite-armed bandits**, both stochastic and adversarial, as these best help to develop our basic intuition of what makes a learning algorithm good. In the next block we will discuss the beautiful developments in **linear bandits**, again, both **stochastic** and **adversarial**, their connection to **contextual bandits**, and also various **combinatorial** settings, which are of major interest in various applications. The reason to spend much time on linearity is that it is the simplest structure that permits one to address large scale problems, and is a very popular and successful modelling assumption. In the last block, we will discuss **partial monitoring** and learning and acting in **Markovian Decision Processes**, that is, topics which go beyond bandit problems. With this little excursion outside of bandit land we hope to give everyone a better glimpse of how exactly bandits fit the bigger picture of online learning under uncertainty.

# References and additional reading

The paper by Thompson that was the first to consider bandit problems is the following:

- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.

Besides Thompson’s seminal paper, there are already a number of books on bandits that may serve as useful additional reading. The most recent (and also most related) is by Sebastien Bubeck and Nicolo Cesa-Bianchi and is freely available online. This is an excellent book and is warmly recommended. The book largely overlaps the topics we will cover, though we are also planning to cover developments that did not fit the page limit of Seb and Nicolo’s book, as well as newer developments.

There are also two books that focus mostly on the Bayesian setting, which we will address only a little. Both are based on relatively old material, but are still useful references for this line of work and are well worth reading.

# Finite-armed stochastic bandits: Warming up

On Monday last week we did not have a lecture, so the lectures spilled over to this week’s Monday. This week was devoted to building up foundations, and this post will summarize how far we got. The post is pretty long, but then it covers all there is to measure-theoretic probability in a way that it will still be intuitive.

As indicated in the first post, the first topic is finite-armed stochastic bandits. Here, we start with an informal problem description, followed by a simple statement that gives an equivalent, intuitive and ultimately most useful formula for the expected regret. The proof of this formula is a couple of lines. However, the formula and the proof are mainly used just as an excuse to talk about measure-theoretic probability, the framework that we will use to state and prove our results. Here, rather than giving a dry introduction, we will try to explain the intuition behind all the concepts. In the viewpoint presented here, the “dreaded” sigma algebras become a useful tool that allows a concise language when it comes to summarizing “what can be known”. The first part is the bandit part, and then we dive into probability theory. We then wrap up by reflecting on what was learned, reconsidering the bandit framework sketched at the beginning and putting everything that we talk and will talk about on firm foundations.

# A somewhat informal problem definition

The simple **problem statement** of *learning in $K$-armed stochastic bandit problems* is as follows: An environment is given by $K$ distributions over the reals, $P_1, \dots, P_K$ and, as discussed before, the learner and the environment interact sequentially. In particular:

For rounds $t=1,2,\dots,$:

1. Based on its past observations (if any), the learner chooses an action $A_t\in [K] \doteq \{1,\dots,K\}$. The chosen action is sent to the environment;

2. The environment generates a random reward $X_t$ whose distribution is $P_{A_t}$ (in notation: $X_t \sim P_{A_t}$). The generated reward is sent back to the learner.

Note that we allow an indefinitely long interaction, so that we can meaningfully discuss questions like whether a learner chooses a suboptimal action infinitely often.

If we were to write a computer program to **simulate this environment**, we would use a “freshly generated random value” (a call to a random number generator) to obtain the reward of step 2. This is somewhat hidden in the above problem description, but is an important point. In the language of probability theory, we say that $X_t$ has the distribution of $P_{A_t}$ regardless of the history of the interaction of the learner and the environment. This is a notion of *conditional independence*. Sometimes stochastic bandits are also called stationary stochastic bandits, emphasizing that the distributions used to generate the rewards remain fixed.
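Such a simulator might look as follows. This is a minimal sketch under our own assumptions: the class `StochasticBandit` and the helper `bernoulli_bandit` are hypothetical names, arms are indexed $0,\dots,K-1$ rather than $1,\dots,K$, and each distribution $P_k$ is represented by a function that draws a fresh sample from it.

```python
import random

class StochasticBandit:
    """A K-armed stochastic bandit. Arms are indexed 0, ..., K-1 here."""

    def __init__(self, samplers):
        # samplers[k] is a zero-argument function drawing a fresh sample from P_k.
        self.samplers = samplers

    def pull(self, action):
        # The reward depends only on the chosen action and on fresh randomness,
        # never on the history of the interaction.
        return self.samplers[action]()

def bernoulli_bandit(means, seed=None):
    """Bernoulli bandit: arm k pays 1 with probability means[k] and 0 otherwise."""
    rng = random.Random(seed)
    return StochasticBandit([lambda p=p: float(rng.random() < p) for p in means])

env = bernoulli_bandit([0.2, 0.7], seed=1)
rewards = [env.pull(1) for _ in range(5)]   # five fresh draws from the second arm
```

Note that `pull` does not take the history as an argument at all, which is the programmatic counterpart of the conditional independence just described.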

The **learner’s goal** is to maximize its total reward, $S_n = \sum_{t=1}^n X_t$. Since the $X_t$ are random, $S_n$ is also random and in particular it has a distribution. To illustrate this, the figure on the left shows two possible distributions of the total reward of two different learners, call them $A$ and $B$, but for the same environment. Say, you can choose between learners $A$ and $B$ when running your company whose revenue will be whatever the learner makes. Which one would you choose? One choice is to go with the learner whose total reward distribution has the larger *expected* value. This will be our choice for most of the class when discussing stochastic environments. Thus, we will prefer algorithms that make the expectation of the total reward gained, symbolically, $\EE{S_n}$, large. This is not the only choice but for now we will stick to it (see some notes on this and other things at the end of this post).

Note that we have not said for *which* value of $n$ we would like $S_n$ to be large, that is, what horizon we want to optimize for. We will not go into this issue in detail, but roughly there are two choices. The simplest case is when $n$ is the total number of rounds to be played and is known in advance. Alternatively, if the number of rounds is not known, then one often tries to make $S_n$ large for all $n$ simultaneously. The latter is somewhat delicate, because one can easily believe that there are trade-offs to be made. For now we leave it at that and return to this issue later.

The Bernoulli, Gaussian and uniform distributions are often used as **examples** for illustrating some specific property of learning in stochastic bandit problems. The Bernoulli distribution (i.e., when the reward is binary valued) is in fact a natural choice – think of applications like maximizing click-through rates in a web-based environment: A click is worth a unit reward, no click means no (zero) reward. If this environment is stationary and stochastic, we get what is called *Bernoulli bandits*. Similarly, Gaussian bandits means that the payoff distributions are given by Gaussians. Of course, one arm could also have a Bernoulli, another arm could have a Gaussian, the third could have a uniform distribution. More generally, we can have bandits where we only know that the range of rewards is bounded to some interval, or some weaker information, such as the reward distribution has thin tails (i.e., not fat tails).

## Decomposing the regret

To formulate the statement about the regret decomposition, let $\mu_k = \int_{-\infty}^{\infty} x P_k(dx)$ be the expected reward gained when action $k$ is chosen (we assume that $P_k$ is such that $\mu_k$ exists and is finite), $\mu^* = \max_k \mu_k$ be the maximal expected reward and $\Delta_k = \mu^* - \mu_k$. (The range of $k$ here is $[K]$, i.e., $\mu_k$ and $\Delta_k$ are defined for each $k\in [K]$. To save some typing and to avoid clutter, in the future we will assume that the reader can work out ranges of symbols when we think this can be done in a unique way.)

The value of $\Delta_k$ is the expected amount that the learner loses in a single round by choosing action $k$ compared to using an optimal action. This is called the *immediate regret* of action $k$, and is also known as the *action-gap* or *sub-optimality gap* of action $k$. Further, let $T_k(n) = \sum_{t=1}^n \one{A_t = k}$ be the number of times action $k$ was chosen by the learner before the *end* of round $n$ (including the choice made in round $n$). Here, $\one{\cdot}$ stands for what is called the *indicator function*, which we use to convert the value of a logical expression (the argument of the indicator function) to either zero (when the argument evaluates to logical false), or one (when the argument evaluates to logical true).

Note that, in general, $T_k(n)$ is also *random*. This may be surprising if we think about a deterministic learner (a learner which, given the past, decides upon the next action following a specific function, also known as the *policy* of the learner, without any extra randomization). So why is $T_k(n)$ random in this case? The reason is that for $t>1$, $A_t$ depends on the rewards observed in rounds $1,2,\dots,t-1$, which are random, hence $A_t$ will also inherit their randomness.

The total expected regret of the learner is $R_n = n\mu^* - \EE{\sum_{t=1}^n X_t}$: You should convince yourself that *given the knowledge of* $(P_1,\dots,P_K)$, a learner could make $n\mu^*$ in expectation, but no learner can make more than this in expectation. That is, $n\mu^*$ is the optimal total expected reward in $n$ rounds. Thus, our $R_n$ above is the difference between how much an omniscient decision maker could have made and how much the learner made. At first sight this definition may sound quite a bit stronger than asking for the difference relative to how much an omniscient decision maker *who is restricted to choosing the same action in every round* could have made, the definition we used in the first post. However, in stationary stochastic bandit environments there exist optimal omniscient decision makers that would choose the same action in every round. Thus, the two concepts coincide.

The statement mentioned beforehand is as follows:

Lemma (Basic Regret Decomposition Identity): $R_n = \sum_{k=1}^K \Delta_k \EE{ T_k(n) }$.

The lemma simply decomposes the regret in terms of the loss due to using each of the arms. Of course, if $k$ is optimal (i.e., $\mu_k = \mu^*$) then $\Delta_k = 0$. The lemma is useful in that it tells us that to keep the regret small, the learner should try to minimize the *weighted* sum of expected action-counts, where the weights are the respective action gaps. In particular, a good learner should aim to use an arm with a larger action gap proportionally fewer times.
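The identity can also be checked numerically. The sketch below is only an illustration under our own assumptions: the three Bernoulli means are made up, the learner is a simple greedy rule (pull each arm once, then always pull the arm with the best empirical mean), and both expectations are replaced by Monte Carlo averages, so the two sides agree only up to simulation noise.

```python
import random

means = [0.5, 0.6, 0.4]                  # hypothetical mu_k of three Bernoulli arms
mu_star = max(means)
gaps = [mu_star - mu for mu in means]    # the action gaps Delta_k

def run_greedy(n, rng):
    """Play n rounds; return the total reward S_n and the counts T_k(n)."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0
    for t in range(n):
        if t < K:
            a = t                        # pull each arm once
        else:
            a = max(range(K), key=lambda k: sums[k] / counts[k])
        x = 1.0 if rng.random() < means[a] else 0.0   # X_t ~ P_{A_t}
        counts[a] += 1
        sums[a] += x
        total_reward += x
    return total_reward, counts

rng = random.Random(0)
n, runs = 100, 2000
avg_reward = 0.0
avg_counts = [0.0] * len(means)
for _ in range(runs):
    s_n, counts = run_greedy(n, rng)
    avg_reward += s_n / runs
    for k, c in enumerate(counts):
        avg_counts[k] += c / runs

lhs = n * mu_star - avg_reward                       # estimate of n mu^* - E[S_n]
rhs = sum(g * c for g, c in zip(gaps, avg_counts))   # estimate of sum_k Delta_k E[T_k(n)]
print(f"n mu^* - E[S_n] ~ {lhs:.3f}   sum_k Delta_k E[T_k(n)] ~ {rhs:.3f}")
```

The optimal arm contributes nothing to the right-hand side because its gap is zero, no matter how often it is pulled.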

**Proof:**

Since $R_n$ is based on summing over rounds, and the right hand side is based on summing over actions, to convert one sum into the other one we introduce indicators. In particular, note that for any fixed $t$ we have $\sum_k \one{A_t=k} = 1$. Hence, $S_n = \sum_t X_t = \sum_t \sum_k X_t \one{A_t=k}$ and thus

\begin{align*}
R_n = n\mu^* - \EE{S_n} = \sum_{k=1}^K \sum_{t=1}^n \EE{(\mu^*-X_t) \one{A_t=k}}.
\end{align*}

Now, knowing $A_t$, the expected reward is $\mu_{A_t}$. Thus we have

\begin{align*}
\EE{ (\mu^*-X_t) \one{A_t=k}\,|\,A_t} = \one{A_t=k} \EE{ \mu^*-X_t\,|\,A_t} = \one{A_t=k} (\mu^*-\mu_{A_t}) = \one{A_t=k} (\mu^*-\mu_{k}).
\end{align*}

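Taking expectations of both sides (by the tower rule, the expectation of the left-hand side is $\EE{(\mu^*-X_t) \one{A_t=k}}$), using the definition of $\Delta_k$, plugging the result into the right-hand side of the previous display, and finally using the definition of $T_k(n)$ gives the result.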

QED.

The proof is as simple as it gets. The technique of using indicators for switching from sums over one variable to another will be used quite a few times, so it is worth remembering. Readers who are happy with the level of detail and feel comfortable with the math can stop reading here. Readers who are asking themselves about what $\EE{X_t\,|\,A_t}$ means, or why various steps of this proof are valid, or even what is the proper definition of all the random variables mentioned, in short, readers who want to develop a rigorous understanding of what went into the above proof, should read on. The rest of the post is not short, but by the end, readers following the discussion should gain a quite solid foundation for the remaining posts. We also promise to provide a somewhat unusual, and hopefully more insightful than usual description of these foundations.

# Rigorous foundations

The rigorous foundation to the above, as well as all subsequent results, is based on probability theory, or more precisely, on **measure-theoretic probability theory**. There are so many textbooks on measure-theoretic probability theory that the list could easily fill a full book itself. At the end, we will list a few of our favourites. Now, measure-theoretic probability theory may sound off-putting to many as something that is overly technical. People who were exposed to measure-theoretic probability often think that it is only needed when dealing with continuous spaces (like the real numbers), and often the attitude is that measure-theoretic probability is only needed for the sake of rigour and is not more than a mere technical annoyance. There is much truth to this (measure-theoretic probability is needed for all these reasons and can be annoying, too), but we claim that **measure-theoretic probability can offer much more than rigor if treated properly**. This is what we intend to do in the rest of the post.

## Probability spaces and random elements

Imagine the following random game: I throw a dice. If the result is four, I throw two more dice. If the result is not four, I throw one dice only. Looking at the newly thrown dice (one or two), I repeat the same, for a total of three rounds. Afterwards, I pay you the sum of the values on the faces of the dice, which becomes the “value” of the game. How much are you willing to pay me to play this game? Games like these were what drove the development of probability theory at the beginning. The game described looks complicated because the number of dice used is itself random.

Now, here is a simple way of dealing with this complication: instead of considering rolling the dice one by one, imagine that **sufficiently many** (say, at least 7 in this case) **dice were rolled before the game has even started** (why 7? because in the first round we use one, in the next round at most 2, and in the last round at most 4 dice). Then, with the dice that have been rolled already, the game can be emulated easily. For example, we can order the rolled dice and just take the value of the first dice in the chosen ordering in the first round of the game. If we see a four, we look at the next two dice in the ordering, otherwise we look at the single next dice. Note that this way we completely **separate the random acts** (rolling the dice) **and the game mechanism** (which produces values).

The advantage of this is that we get a **simple calculus** for the probabilities of all kinds of *events*. First, the probability of any single outcome of rolling 7 dice, an element of $\Omega \doteq [6]^7$, is simply $1/6^7$ (since all outcomes should be equally probable if the dice are not loaded). The probability of the game payoff taking some value $v$ can then be calculated by calculating the total probability assigned to all those outcomes $\omega\in \Omega$ that, through the complicated game mechanism, would result in the value of $v$. In principle, this is trivial to do thanks to the separation of everything that is probabilistic from the rest. This separation is the essence of probability theory as proposed by Kolmogorov. Or at least the essence of half of probability theory. Here, the set $\Omega$ is called the *outcome space*, its elements are the *outcomes*. The figure below illustrates this idea: Random outcomes are generated on the left, while on the right, various mechanisms are used to arrive at values, some of which may be “observed”, some not.

There will be much benefit to being just a bit more formal about how we come up with the value of the game. The major realization is that the process in which the game gets its value is nothing but a *function* that maps $\Omega$ to the set of natural numbers $\N$. While we view the value of the game as random, this map, call it $X$, is *deterministic*: The randomness is completely removed from $X$! First point of irony: Functions like $X$, whose domain is the space of outcomes, are called *random variables* (we will add some extra restriction on these functions soon, but this is a good working definition for now). Furthermore, $X$ is not a variable in a programming language sense and, again, there is nothing random about $X$ itself. The randomness is in the argument that $X$ is acting on, producing “randomly changing” results.

Pick some natural number $v\in \N$. What is the probability of seeing $X=v$? As described above, this probability is just $(1/6)^7$ times the cardinality of the set $\set{\omega\in \Omega\,:\,X(\omega)=v}$, which we will denote by $X^{-1}(v)$ and which is called the preimage (a.k.a. “inverse image”) of $v$ under $X$. More generally, the probability that $X$ takes its value in some set $A\subset \N$ is given by $(1/6)^7$ times the cardinality of $X^{-1}(A) \doteq \set{\omega\in \Omega\,:\,X(\omega)\in A}$ (we have overloaded $X^{-1}$). Here, $X^{-1}(A)$ is also called the preimage of $A$. Note how we always only needed to talk about probabilities assigned to *subsets* of $\Omega$ regardless of the question asked! **It seems that probabilities assigned to subsets are all that one needs** for these sorts of calculations.
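
The counting recipe can be carried out mechanically. The Python sketch below enumerates all of $\Omega=[6]^7$ and tabulates $\Prob{X=v} = |X^{-1}(v)|/6^7$. The name `game_value` and the exact rule used (each four among the dice of a round brings two dice into the next round, any other face brings one) are our reading of the game description, so treat them as assumptions rather than as the definitive mechanism.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def game_value(omega):
    """The random variable X: a deterministic map from an outcome to the payoff."""
    pos, n_dice, total = 0, 1, 0
    for _ in range(3):                      # three rounds
        faces = omega[pos:pos + n_dice]     # dice "thrown" in this round
        pos += n_dice
        total += sum(faces)
        # assumed rule: a four brings two dice into the next round, anything else one
        n_dice = sum(2 if f == 4 else 1 for f in faces)
    return total

def distribution_of_X():
    """P(X = v) = |X^{-1}(v)| / 6^7, obtained by brute-force counting."""
    outcomes = product(range(1, 7), repeat=7)   # Omega = [6]^7
    counts = Counter(game_value(w) for w in outcomes)
    return {v: Fraction(c, 6**7) for v, c in sorted(counts.items())}

dist = distribution_of_X()
```

Note that all the randomness lives in the choice of `omega`; `game_value` itself is an ordinary deterministic function.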

To make this a bit more general, let us introduce a map $\PP$ that assigns probabilities to (certain) subsets of $\Omega$. The intuitive meaning of $\PP$ is now this: Random outcomes are generated in $\Omega$. The probability that an outcome falls into a set $A\subset \Omega$ is $\Prob{A}$. If $A$ is not in the domain of $\PP$, there is no answer to the question of the probability of the outcome falling in $A$. But let’s postpone until later the discussion of why $\PP$ should be restricted to only certain subsets of $\Omega$. In the above example with the dice, $\Prob{ A } = (1/6)^7 |A|$.

With this new notation, the answer to the question of what is the probability of seeing $X$ taking the value of $v$ (or $X=v$) becomes $\Prob{ X^{-1}(v) }$. To minimize clutter, the more readable notation for this probability is $\Prob{X=v}$. It is important to realize though that the above is exactly the definition of this more familiar symbol sequence! More generally, we also use $\Prob{ \mathrm{predicate}(X,U,V,\dots)} = \Prob{\{\omega \in \Omega : \mathrm{predicate}(X,U,V,\dots) \text{ is true}\}}$ with any predicate, i.e., expression evaluating to true or false, where $X,U,V, \dots$ are functions with domain $\Omega$.

What are the properties that $\PP$ should satisfy? First, we need $\PP(\Omega)=1$ (i.e., it should be defined for $\Omega$ and the probability of an outcome falling into $\Omega$ should be one). Next, for any subset $A\subset \Omega$ for which $\PP$ is defined, $\PP(A)\ge 0$ should hold (no negative probabilities). If $\PP(A^c)$ is also defined where $A^c\doteq \Omega\setminus A$ is the complement of $A$ in $\Omega$, then $\PP(A^c) = 1-\PP(A)$ should hold (negation rule). Finally, if $A,B$ are disjoint (i.e., $A\cap B=\emptyset$) and $\PP(A)$, $\PP(B)$ and $\PP(A\cup B)$ are all defined, then $\PP(A \cup B) = \PP(A) + \PP(B)$ should hold. This is what is called the finite additivity property of $\PP$.

Now, it looks silly to say that $\PP(A^c)$ may not be defined when $\PP(A)$ is defined: if $\PP(A^c)$ was not defined, we should just define it as $1-\PP(A)$! Similarly, it is silly if $\PP(A\cup B)$ is not defined for disjoint $A,B$ for which $\PP$ is already defined, since in this case we should define this value as $\PP(A) + \PP(B)$.

By a logical jump, we will also require the additivity property to hold for countably infinitely many sets: if $\{A_i\}_i$ are disjoint and $\PP(A_i)$ is defined for each $i\in \N$, then $\PP(\cup_i A_i) = \sum_i \PP(A_i)$ should hold. It follows that the set system $\cF$ for which $\PP$ is defined should include $\Omega$, should be closed under complements and countable unions. Such a set system is called a $\sigma$**-algebra** (sometimes $\sigma$-field). Apart from the requirement of being closed under *countable* unions (as opposed to finite unions) we see that it is a rather minimal requirement to demand that the set system for which $\PP$ is defined is a $\sigma$-algebra. If $\cF \subseteq 2^\Omega$ is a $\sigma$-algebra and $\PP$ satisfies the above properties then $\PP$ will be called a *probability measure*. The elements of $\cF$ are called *measurable sets*: They are measurable in the sense that $\PP$ assigns values to them. The pair $(\Omega,\cF)$ alone is called a *measurable space*, while the triplet $(\Omega,\cF,\PP)$ is called a *probability space*. If the condition that $\PP(\Omega)=1$ is lifted $\PP$ is called a *measure*. If the condition that $\PP(A)\ge 0$ is also lifted then $\PP$ is called *a signed measure*. We note in passing that both for measures and signed measures it would be unusual to use the symbol $\PP$, which is mostly reserved for probabilities.
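
On a finite outcome space these requirements can be checked by brute force. The following sketch (the function names are ours, chosen for illustration) tests whether a family of subsets is a $\sigma$-algebra and whether a set function on it satisfies the properties listed above. On a finite space, closure under pairwise unions is all that countable unions can ask for.

```python
from fractions import Fraction
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    if omega not in family:
        return False
    if any(omega - a not in family for a in family):       # closed under complements
        return False
    return all(a | b in family for a, b in combinations(family, 2))  # closed under unions

def is_probability_measure(omega, family, prob):
    """Check P(omega) = 1, P(A) >= 0 and additivity over disjoint pairs.
    Assumes that family is a sigma-algebra and prob is defined on all of it."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    if prob(omega) != 1 or any(prob(a) < 0 for a in family):
        return False
    return all(prob(a | b) == prob(a) + prob(b)
               for a, b in combinations(family, 2) if not (a & b))
```

For example, with `omega = {1, ..., 6}` the family `[set(), {1, 2, 3}, {4, 5, 6}, omega]` passes the first check, and the uniform set function `lambda a: Fraction(len(a), 6)` passes the second.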

Random variables, like $X$, lead to new probability measures. In particular, in our example, $A \mapsto \Prob{ X^{-1}(A) }$ itself is a probability measure defined for all the subsets $A$ of $\N$ for which $\Prob{ X^{-1}(A) }$ is defined. This probability map is called the *distribution induced by* $X$ and $\PP$, and is denoted by $\PP_X$. Alternatively, it is also called the *pushforward* of $\PP$ under $X$. If we want to answer any probabilistic question related to $X$, all we need is just $\PP_X$! It is worth noting that if we keep $X$ but change $\PP$ (e.g., switch to a loaded dice), the distribution of $X$ changes. We will often use arguments that will do exactly this, especially when proving lower bounds about the regret of bandit algorithms.

Now *which* probability questions related to $X$ can we answer based on a probability map $\PP:\cF \to [0,1]$? The predicates in all these questions take the form $X\in A$ with some $A\subset \N$. Now, to be able to get an answer, we need that $X^{-1}(A)$ is in $\cF$. It can be checked that if we take the set of all the sets of the above form then we get a $\sigma$-algebra. We can also insist that we want to be able to answer questions on the probability of $X\in A$ for any $A\in \cG$ where $\cG$ is some fixed set system (not necessarily a $\sigma$-algebra). If this can be done, we call $X$ an *$\cF/\cG$-measurable map* (read: $\cF$-to-$\cG$ measurable). When $\cG$ is obvious from the context, $X$ is often called a measurable map. What is a typical choice for $\cG$? When $X$ is real-valued (being a little more general than in our example) then the typical choice is to let $\cG$ be the set of all open intervals. The reader can verify that if $X$ is $\cF/\cG$-measurable, it is also $\cF/\sigma(\cG)$-measurable where $\sigma(\cG)$ is the *smallest* $\sigma$*-algebra that contains* $\cG$. This can be seen to exist and to contain exactly those sets $A$ that are in every $\sigma$-algebra that contains $\cG$. (It is good practice to prove these statements.) When $\cG$ is the set of open intervals, $\sigma(\cG)$ is the so-called *Borel $\sigma$-algebra* and $X$ is called a *random variable*. One can of course consider other range spaces than the set of reals; the ideas still apply. In these cases, instead of random variable, we say that $X$ is a *random element*.
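
For a finite $\Omega$ both $\sigma(\cG)$ and measurability can be computed directly. The sketch below is only meant to make the definitions tangible (the helper names are hypothetical): it builds the generated $\sigma$-algebra by repeatedly adding complements and unions, and tests $\cF/\cG$-measurability by checking that every preimage lies in $\cF$.

```python
from itertools import combinations

def generated_sigma_algebra(omega, generators):
    """Smallest sigma-algebra on a finite omega containing the generators:
    keep adding complements and pairwise unions until nothing new appears."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = {omega - a for a in family}
        new |= {a | b for a, b in combinations(family, 2)}
        if new <= family:
            return family
        family |= new

def preimage(X, omega, A):
    """X^{-1}(A) = {w in omega : X(w) in A}."""
    return frozenset(w for w in omega if X(w) in A)

def is_measurable(X, omega, F, G):
    """X is F/G-measurable if the preimage of every set in G lies in F."""
    return all(preimage(X, omega, A) in F for A in G)
```

With `omega = range(1, 7)` and `F = generated_sigma_algebra(omega, [{1, 2, 3}])`, the indicator `lambda w: int(w > 3)` is measurable with respect to `G = [{0}, {1}]`, while the identity map is not measurable with respect to `G = [{1}]`, because $\{1\}\notin\cF$.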

You may be wondering why we insist that $\PP$ may not be defined for all subsets of $\Omega$? In fact, in our simple example it *would be defined* for all subsets of $\Omega$, i.e., $\cF = 2^\Omega$ (and would be in fact given by $\PP(A)=(1/6)^7 | A|$, where $|A|$ is the number of elements of $A$). To understand the advantage of defining $\PP$ only for certain subsets we need to dig a little deeper.

## The real reason to use $\sigma$-algebras

Consider this: Let’s say in the game described I tell you the value of $X$. What does the **knowledge of** $X$ entail? It is clear that knowing the value of $X$ we will also know the value of, say, $X-1$ or $X^2$ or any nice function of $X$. So anything presented explicitly as “$f(X)$” with nice functions $f$ (e.g., Borel) will be known with certainty when the value of $X$ is known. But what if you have a variable that is not presented explicitly as a function of $X$? What can we say about it based on knowing $X$?

For instance, can you tell with certainty the value of the first dice roll $F$? Or can you tell whether the first dice roll is a four, i.e., whether $F=4$? Or the value of $V=\one{F=4,X\ge 7}$, i.e., whether $F=4$ and $X\ge 7$ hold simultaneously? Or the value of $U=\one{F=4,X>24}$? Note that $F,V,U$ are all functions whose domain is $\Omega$. The question whether the value of, say, $U$ can be determined with certainty given the value of $X$ is the same as asking whether there exists a function $f$ whose domain is the range of $X$ and whose range is the range of $U$ (binary in this case) such that $U(\omega)=f(X(\omega))$ holds no matter the value of $\omega\in \Omega$. The answer to this specific question is yes: If $X>24$ then $F=4$ because for $F\ne 4$ the value of $X$ cannot exceed $24$. Thus, $f(x) = \one{x>24}$ will work in this case. In the case of $V$, the answer to the question whether $V$ can be written as a function of $X$ is no. Why?

What are those random variables that can be determined with certainty given the knowledge of some random variable $X$? As it turns out, there is a very nice answer to this question: Denoting by $\sigma(X)$ the smallest $\sigma$-algebra over $\Omega$ that makes $X$ measurable (this is also called the $\sigma$-algebra generated by $X$), it holds that a random variable $U$ on the same measurable space that carries $X$ can be written as a measurable deterministic function of $X$ if and only if $U$ is $\sigma(X)$-measurable. An alternative way of saying this is that $U$ is $\sigma(X)$-measurable if and only if $U$ factorizes into $f$ and $X$ in the sense that $U = f\circ X$ for some measurable $f:\R \to \R$ ($\circ$ stands for the composition of its two arguments). This result is known as the **factorization lemma** and is one of the core results in probability theory.

The significance of this result cannot be overstated. The result essentially says that $\sigma(X)$ contains all that there is to know about $X$ in the sense that it suffices to know $\sigma(X)$ if we want to figure out whether some random variable $U$ can be “calculated” based on $X$. In particular, one can forget the values of $X$(!) and still be able to see if $U$ is determined by $X$. One just needs to check whether $\sigma(U) \subset \sigma(X)$. (This may look like a daunting task. The trick to something like this is to work with subsets $\mathcal{E}$ of $\sigma(U)$ such that $\sigma(U)\subset \sigma(\mathcal{E})$.) It also follows that if $X$ and $Y$ generate the same $\sigma$-algebra (i.e., $\sigma(X) = \sigma(Y)$) then $U$ is determined by $X$ if and only if it is determined by $Y$. In short, **$\sigma(X)$ summarizes all that there is to know about $X$ when it comes to determining what can be computed based on the knowledge of $X$**, without ever explicitly constructing the functions specifying this computation.
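
On a finite outcome space, being $\sigma(X)$-measurable simply means being constant on the level sets of $X$, so the question "is $U$ a function of $X$?" can be settled by enumeration. The sketch below does this for the dice game. As before, `game_value` encodes our reading of the game (a four brings two dice into the next round, any other face one) and `factor_through` is a hypothetical helper name.

```python
from itertools import product

def game_value(omega):
    """Payoff X of the dice game for an outcome omega in [6]^7 (assumed mechanism)."""
    pos, n_dice, total = 0, 1, 0
    for _ in range(3):
        faces = omega[pos:pos + n_dice]
        pos += n_dice
        total += sum(faces)
        n_dice = sum(2 if f == 4 else 1 for f in faces)
    return total

def factor_through(U, X, outcomes):
    """Return f (as a dict) with U(w) = f[X(w)] for all outcomes, or None if no such f exists."""
    f = {}
    for w in outcomes:
        x, u = X(w), U(w)
        if f.setdefault(x, u) != u:     # two outcomes with the same X but different U
            return None
    return f

OMEGA = list(product(range(1, 7), repeat=7))
U = lambda w: int(w[0] == 4 and game_value(w) > 24)   # F is the first die, w[0]
V = lambda w: int(w[0] == 4 and game_value(w) >= 7)
```

Calling `factor_through(U, game_value, OMEGA)` returns a table that agrees with $f(x)=\one{x>24}$, while `factor_through(V, game_value, OMEGA)` returns `None`: there are outcomes with the same payoff, some with $F=4$ and some with $F\ne 4$.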

Since we see that $\sigma$-algebras are in a way more fundamental than random variables, often we keep the $\sigma$-algebras only, without mentioning any random variables that would have generated them. One can also notice that if $\cG$ and $\cH$ are $\sigma$-algebras and $\cG\subset \cH=\sigma(Y)$ then $\cH$ contains more information in the sense that if a random variable $X$ is $\cG$-measurable, then $X$ is also $\cH$-measurable, hence, $X$ can be written as a deterministic function of $Y$; thus $Y$ (or $\cH$) contains more information than $\cG$. When we think of $\sigma$-algebras, we are often thinking of all the random variables that are measurable with respect to them.

Now let $X$ be a real-valued random variable and consider taking the restriction of $\PP$ to $\sigma(X)$: $\PP': \sigma(X) \to [0,1]$, $\PP'(A) = \PP(A)$, $A\in \sigma(X)$. Clearly, $(\Omega,\sigma(X),\PP')$ is also a probability space. Why should we care? This is a probability space that allows us to answer any question based on knowing $X$ and of all those probability spaces this is the smallest. To break down the first part: Taking any $A\in \sigma(X)$, one can show that there exists a Borel-measurable set $B$ such that $A = X^{-1}(B)$. That is, all probabilities for all sets $A$ included in the considered $\sigma$-algebra are probabilities of $X$ being an element of some (Borel-measurable) set $B$. Observe that this probability space has less (not more and potentially much less) detail than the original. Sometimes removing details is what one needs! The process of gradually adding detail is captured by a *filtration*. More precisely, a filtration is a sequence $(\cF_t)_t$ of $\sigma$-algebras such that $\cF_t \subset \cF_{s}$ for any $t<s$.

## Independence and conditional probability

Very often we will want to talk about **independent events** or **independent random variables**. Let $(\Omega, \cF, \PP)$ be a probability space and let $X$ and $Y$ be random variables, which by definition means that $X$ and $Y$ are $\cF$-measurable. We say that $X$ and $Y$ are independent if

\begin{align*}

\Prob{X^{-1}(A) \cap Y^{-1}(B)} = \Prob{X^{-1}(A)} \Prob{Y^{-1}(B)}

\end{align*}

for all measurable $A$ and $B$ (note that here $A, B \subseteq \R$ are not elements of $\cF$, but rather of the Borel $\sigma$-algebra that makes the reals a measurable space). A more familiar way of writing the above equation is perhaps

\begin{align*}

\Prob{ X\in A, Y\in B } = \Prob{X\in A} \Prob{Y\in B }\,.

\end{align*}

An alternative way of defining the independence of random variables is in terms of the $\sigma$-algebras they generate. If $\cG$ and $\cH$ are two sub-$\sigma$-algebras of $\cF$, then we say $\cG$ and $\cH$ are independent if $\Prob{A \cap B} = \Prob{A} \Prob{B}$ for all $A \in \cG$ and $B \in \cH$. Then random variables $X$ and $Y$ are independent if and only if the $\sigma$-algebras they generate are independent. An event can be represented by a binary random variable. Let $A, B \in \cF$ be two events, then $A$ and $B$ are independent if $\one{A}$ and $\one{B}$ are independent. If $\Prob{B} > 0$, then we can also define the **conditional probability** of $A$ given $B$ by

\begin{align*}

\Prob{A|B} = \frac{\Prob{A \cap B}}{\Prob{B}}\,.

\end{align*}

In the special case that $A$ and $B$ are independent this reduces to the statement that $\Prob{A|B} = \Prob{A}$, which reconciles our understanding of independence and conditional probabilities: Intuitively, $A$ and $B$ are independent if knowing whether $B$ happens does not change the probability of whether $A$ happens, and knowing whether $A$ happens does not change the probability of whether $B$ happens.
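
These definitions are easy to test numerically on a small space. The sketch below uses two fair dice with the uniform measure; the names `prob`, `independent` and `cond_prob` are ours. On a finite space it suffices to check the product rule for single values, which is what `independent` does.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))   # two fair dice

def prob(event):
    """Uniform measure; event is a predicate on outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def independent(X, Y):
    """Check P(X=a, Y=b) = P(X=a) P(Y=b) for all values a, b."""
    xs, ys = {X(w) for w in OMEGA}, {Y(w) for w in OMEGA}
    return all(prob(lambda w: X(w) == a and Y(w) == b)
               == prob(lambda w: X(w) == a) * prob(lambda w: Y(w) == b)
               for a in xs for b in ys)

def cond_prob(A, B):
    """P(A|B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(B)
    if p_b == 0:
        raise ValueError("P(B) = 0: the elementary definition does not apply")
    return prob(lambda w: A(w) and B(w)) / p_b
```

The two dice are independent of each other, but the first die and the sum of the two are not: for instance, knowing that the sum is 12 forces the first die to be a six.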

Note that if $\Prob{B} = 0$, then the above definition of conditional probability cannot be used. While in some cases the quantity $\Prob{A|B}$ cannot have a reasonable definition, often it is still possible to define this meaningfully, as we shall see in a later section.

# Integration and Expectation

Near the beginning of this post we defined the mean of arm $i$ as $\mu_i = \int^\infty_{-\infty} x dP_i(x)$.

This is also called the expectation of $X$ where $X:\Omega \to \R$ is a random variable sampled from $P_i$. We are now in a good position to reconcile this definition with the measure-theoretic view of probability given above. Let $(\Omega, \cF, \PP)$ be a probability space and $X: \Omega \to \R$ be a random variable. We would like to give a definition of the expectation of $X$ (written $\EE{X}$) that does not require us to write a density or to take a discrete sum (of what?). Rather, we want to use only the definitions of $X$ and the probability space $(\Omega, \cF, \PP)$ directly.

We start with the simple case where $X$ can take only finitely many values (e.g., $X$ is the outcome of a fair dice). Suppose these values are $\alpha_1,\ldots,\alpha_k$. Then the expectation is defined by

\begin{align*}

\EE{X} = \sum_{i=1}^k \Prob{X^{-1}(\{\alpha_i\})} \alpha_i\,.

\end{align*}

Remember that $X$ is measurable, which means that $X^{-1}(\{\alpha_i\})$ is in $\cF$ and the expression above makes sense. For the remainder we aim to generalize this natural definition to the case when $X$ can take on infinitely many values (even uncountably many!).

Rather than use the expectation notation, we prefer to simply define the expectation in terms of the **Lebesgue integral** and then define the latter in a way that is consistent with the usual definitions of the expectation (and for that matter with Riemann integration).

\begin{align*}

\EE{X} = \int_{\Omega} X d\PP

\end{align*}

Our approach for defining the Lebesgue integral follows a standard path. Recall that $X$ is a measurable function from $\Omega$ to $\R$. We call $X$ a **positive simple function** if $X$ can be written as

\begin{align*}

X(\omega) = \sum_{i=1}^k \one{\omega \in A_i} \alpha_i\,,

\end{align*}

where $\alpha_i$ is a positive real number for all $i$ and $A_1,\ldots,A_k \in \cF$. (Often, in the definition it is required that the sets $(A_i)_i$ are pairwise disjoint, but this condition is in fact superfluous: Our definition gives the same set of simple functions as the standard one.) Then we define the **Lebesgue integral** of this kind of simple random variable $X$ by

\begin{align*}

\int_\Omega X d\PP \doteq \sum_{i=1}^k \Prob{A_i} \alpha_i\,.

\end{align*}

This definition corresponds naturally to the discrete version given above by letting $A_i = X^{-1}(\{\alpha_i\})$. We now extend the definition of the Lebesgue integral to positive measurable functions. The idea is to approximate the function using simple functions. Formally, if $X$ is a non-negative random variable (that is, it is a measurable function from $\Omega$ to $[0,\infty)$), then

\begin{align*}

\int_{\Omega} X d\PP \doteq \sup \left\{\int_\Omega h d\PP : h \text{ is simple and } 0 \leq h \leq X\right\}\,,

\end{align*}

where $h \leq X$ if $h(\omega) \leq X(\omega)$ for all $\omega \in \Omega$. Of course this quantity need not exist (the limit hidden by the supremum could tend to infinity). If it does not exist then we say the integral (or expectation) of $X$ is unbounded or does not exist.
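
To see the supremum at work, here is a small numerical sketch under assumptions of our own choosing: $\Omega=[0,1)$ with the Lebesgue measure and $X(\omega)=\omega^2$. The simple functions $h_n = \lfloor 2^n X\rfloor / 2^n$ satisfy $0\le h_n \le X$, and each $h_n$ takes the value $k/2^n$ on an interval whose measure is just its length, so its integral is a finite sum.

```python
import math

def integral_of_dyadic_approximation(n):
    """Integral of h_n = floor(2^n X) / 2^n for X(w) = w^2 on [0, 1) with Lebesgue measure.

    h_n equals k/2^n on {k/2^n <= X < (k+1)/2^n}, which is the interval
    [sqrt(k/2^n), sqrt((k+1)/2^n)); its measure is its length."""
    m = 2 ** n
    return sum((k / m) * (math.sqrt((k + 1) / m) - math.sqrt(k / m)) for k in range(m))

approximations = [integral_of_dyadic_approximation(n) for n in range(1, 13)]
```

The values in `approximations` increase with $n$ and approach $1/3$, which is also the value of the Riemann integral $\int_0^1 x^2\,dx$.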

Finally, if $X$ is any measurable function (i.e., it may also take negative values), then define $X^+(\omega) = X(\omega) \one{X(\omega) > 0}$ and $X^{-}(\omega) = -X(\omega) \one{X(\omega) < 0}$ so that $X(\omega) = X^+(\omega) - X^-(\omega)$. Then $X^+$ and $X^-$ are non-negative random variables and, provided that $\int_{\Omega} X^+ d\PP$ and $\int_\Omega X^- d\PP$ both exist, we define

\begin{align*}

\int_\Omega X d\PP \doteq \int_\Omega X^+ d\PP - \int_\Omega X^-d\PP\,.

\end{align*}

None of what we have done depends on $\PP$ being a probability measure, that is, on the normalization $\Prob{\Omega} = 1$. The definitions all hold more generally for any measure. In particular, they apply when $\Omega = \R$ is the real line, $\cF$ is the Lebesgue $\sigma$-algebra (defined in the notes below) and the measure is the so-called Lebesgue measure $\lambda$, which is the unique measure such that $\lambda((a,b)) = b-a$ for any $a \leq b$. In this scenario, if $f:\R \to \R$ is a measurable function, then we can write the Lebesgue integral of $f$ with respect to the Lebesgue measure as

\begin{align*}

\int_{\R} f \,d\lambda\,.

\end{align*}

Perhaps unsurprisingly this almost always coincides with the improper Riemann integral of $f$, which is normally written as $\int^\infty_{-\infty} f(x) dx$. Precisely, if $|f|$ is both Lebesgue integrable and Riemann integrable, then the integrals are equal. There do, however, exist functions that are Riemann integrable and not Lebesgue integrable, and also the other way around (although examples of the former are more unusual than the latter).

Having defined the expectation of a random variable in terms of the Lebesgue integral, one might ask about the properties of the expectation. By far the most important property is its linearity. Let $(X_i)_i$ be a finite collection of random variables on the same probability space and assume that $\EE{X_i}$ exists for all $i$ and furthermore that, for $X = \sum_i X_i$, $\EE{X}$ also exists. (For countably infinite collections extra conditions are needed, e.g., that all the $X_i$ are non-negative.) Then

\begin{align*}

\EE{X} = \sum_i \EE{X_i}\,.

\end{align*}

This property, “swapping the order of $\E$ and $\sum_i$”, is the source of much magic in probability theory because it holds *even if $X_i$ are not independent*. This means that (unlike probabilities) we can very often decouple the expectations of dependent random variables, which often proves extremely useful. We will not prove this statement here, but as usual suggest the reader do so for themselves. The other requirement for linearity is that if $c \in \R$ is a constant, then $\EE{c X} = c \EE{X}$, which is also true and rather easy to prove. Note that if $X$ and $Y$ are independent, then one can show that $\EE{X Y} = \EE{X} \EE{Y}$, but this is not true in general for dependent random variables (try to come up with a simple example demonstrating this).
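
A tiny check on a fair dice illustrates both claims. In the sketch below (with names of our choosing), $Y$ is a function of $X$, so the two are as dependent as they can be: the expectation of the sum still splits, while the expectation of the product does not factorize.

```python
from fractions import Fraction

OMEGA = range(1, 7)                     # a fair dice

def expect(Z):
    """E[Z] under the uniform measure on OMEGA."""
    return sum(Fraction(1, 6) * Z(w) for w in OMEGA)

X = lambda w: w
Y = lambda w: int(w > 3)                # a function of X, hence not independent of X

sum_of_expectations = expect(X) + expect(Y)
expectation_of_sum = expect(lambda w: X(w) + Y(w))
expectation_of_product = expect(lambda w: X(w) * Y(w))
product_of_expectations = expect(X) * expect(Y)
```

Here `expectation_of_sum` and `sum_of_expectations` both equal $4$, whereas `expectation_of_product` is $5/2$ and `product_of_expectations` is $7/4$.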

## Conditional expectation

Besides the expectation, we will also need **conditional expectation**, which allows us to talk about the expectation of a random variable given the value of another random variable. To illustrate with an example, let $\Omega = [6]$ (the outcomes of a dice), $\cF = 2^\Omega$ and let $\PP$ be the uniform measure on $\Omega$ so that $\Prob{A} = |A|/6$ for each $A \in \cF$. Define two random variables $X,Y$, with $Y(\omega) = \one{\omega > 3}$ and $X(\omega) = \omega$ and suppose that only $Y$ is observed. Given a specific value of $Y$ (say, $Y = 1$), what can be said about the expectation of $X$? Arguing intuitively, we might notice that $Y=1$ means that the unobserved $X$ must be either $4$, $5$ or $6$, and that each of these outcomes is equally likely and so the expectation of $X$ given that $Y=1$ should be $(4+5+6)/3 = 5$. Similarly, the expectation of $X$ given $Y=0$ should be $(1+2+3)/3=2$. If we want a concise summary, we can just write that “the expectation of $X$ given $Y$” is $5Y + 2(1-Y)$. Notice how this is a random variable itself! The notation for the expectation of $X$ conditioned on the knowledge of $Y$ is $\EE{X|Y}$. Thus, in the above example, $\EE{X|Y} = 5Y + 2(1-Y)$.

A similar question: Say, $X$ is Bernoulli with parameter $P$, where $P$ is a uniformly distributed random variable on $[0,1]$. Since the expectation of a Bernoulli random variable with fixed (nonrandom) parameter $q$ is $q$, intuitively, knowing $P$, we should get that the expectation of $X$ given $P$ is just $P$: $\EE{X|P}=P$. More generally, for any $X,Y$, $\EE{X|Y}$ should be a function of $Y$. In other words, $\EE{X|Y}$ should be $\sigma(Y)$-measurable!

Also, if we replace $Y$ with $Z$ such that $\sigma(Z) = \sigma(Y)$, should $\EE{X|Z}$ be different than $\EE{X|Y}$? Take for example $Q=1-P$ and let $X$ be Bernoulli with parameter $P$ still. We have $\EE{X|P} = P$. What is $\EE{X|Q}$? Again, intuitively, $\EE{X|Q} = 1-Q$ should hold (if $Q=1$, then $P=0$ and the expected value of $X$ with parameter zero is zero, etc.). But $1-Q= P$, hence $\EE{X|Q}=P$. In the end, what matters is not what specific random variable we are conditioning on, but the knowledge contained in that random variable, which is encoded by the $\sigma$-algebra it generates. So, how should we construct, or define, conditional expectations to align them with the above intuitions, and to be rigorous (even in cases like the above where for any particular $p\in [0,1]$, $\Prob{P=p}=0$)?

This is done as follows: Let $(\Omega, \cF, \PP)$ be a probability space, $X: \Omega \to \R$ be a random variable and $\cH$ be a sub-$\sigma$-algebra of $\cF$. The conditional expectation of $X$ given $\cH$ is denoted by $\E[X|\cH]$ and defined to be *any* $\cH$-measurable random variable such that for all $H \in \cH$,

\begin{align*}

\int_H \E[X|\cH] d\PP = \int_H X d\PP\,.

\end{align*}

Given an $\cF$-measurable random variable $Y$, the conditional expectation of $X$ *given* $Y$ is $\EE{X|Y} = \EE{X|\sigma(Y)}$. Again, at the risk of being a little overly verbose, what is the meaning of all this? Returning to the dice example above we see that $\EE{X|Y} = \EE{X|\cH}$ with $\cH = \sigma(Y) = \{\{1,2,3\}, \{4,5,6\}, \emptyset, \Omega\}$. Now the condition that $\E[X|\cH]$ is $\cH$-measurable can only be satisfied if $\E[X|\cH](\omega)$ is constant on $\{1,2,3\}$ and on $\{4,5,6\}$, and then the display equation above immediately implies that

\begin{align*}

\EE{X|\cH}(\omega) = \begin{cases}

2, & \text{if } \omega \in \{1,2,3\}\,; \\

5, & \text{if } \omega \in \{4,5,6\}\,.

\end{cases}

\end{align*}
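
The same computation can be written as a short program. On a finite space with all outcomes having positive probability, $\EE{X|Y}$ is obtained by averaging $X$ over each level set of $Y$; the sketch below (the function name `cond_exp` is ours) does exactly this for the dice example and makes it easy to verify the defining identity on every set of $\sigma(Y)$.

```python
from fractions import Fraction

OMEGA = range(1, 7)
P = {w: Fraction(1, 6) for w in OMEGA}      # uniform measure on a dice

def cond_exp(X, Y):
    """E[X|Y] as a dict on OMEGA: on each level set {Y = y} (an atom of sigma(Y))
    it equals the P-weighted average of X over that set. Assumes atoms of positive mass."""
    result = {}
    for y in {Y(w) for w in OMEGA}:
        atom = [w for w in OMEGA if Y(w) == y]
        mass = sum(P[w] for w in atom)
        average = sum(P[w] * X(w) for w in atom) / mass
        result.update({w: average for w in atom})
    return result

X = lambda w: w
Y = lambda w: int(w > 3)
ce = cond_exp(X, Y)
```

The resulting `ce` takes the value $2$ on $\{1,2,3\}$ and $5$ on $\{4,5,6\}$, i.e., it coincides with $5Y+2(1-Y)$.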

Finally, we want to emphasize that the definition of conditional expectation given above is not constructive. Even more off-putting is that $\E[X|\cH]$ is not even uniquely defined, although it is uniquely defined up to a set of measure zero that can safely be disregarded in calculations. A related notation that will be useful is as follows: If $X=Y$ up to a zero measure set according to $\PP$, i.e., the “exceptional set” $\{\omega\in \Omega\,: X(\omega) \ne Y(\omega)\}$ has a zero $\PP$-measure, then we write $X=Y$ $\PP$-almost surely, or, in abbreviated form $X=Y$ ($\PP$-a.s.). Sometimes this is expressed in the literature by “$X=Y$ with probability one”, which agrees with $\Prob{X\ne Y}=0$.

Let us summarize some important properties of conditional expectations (which follow from the definition directly):

Theorem (Properties of Conditional Expectations): Let $(\Omega,\cF,\PP)$ be a probability space, $\cG\subset \cF$ a sub-$\sigma$-algebra of $\cF$, $X,Y$ random variables on $(\Omega,\cF,\PP)$.

- $\EE{X|\cG} \ge 0$ if $X\ge 0$ ($\PP$-a.s.);
- $\EE{1|\cG} = 1$ ($\PP$-a.s.);
- $\EE{ X+Y|\cG } = \EE{ X|\cG } + \EE{ Y|\cG }$ ($\PP$-a.s.), assuming the expression on the right-hand side is defined;
- $\EE{ XY|\cG } = Y \EE{X|\cG}$ ($\PP$-a.s.) if $\EE{XY}$ exists and $Y$ is $\cG$-measurable;
- if $\cG_1\subset \cG_2 \subset \cF$ then ($\PP$-a.s.) $\EE{ X|\cG_1 } = \EE{ \EE{ X|\cG_2} | \cG_1 }$;
- if $\sigma(X)$ and $\cG$ are independent then ($\PP$-a.s.) $\EE{X|\cG} = \EE{X}$. In particular, if $\cG = \{\emptyset,\Omega\}$ is the trivial $\sigma$-algebra then $\EE{ X|\cG } = \EE X$ ($\PP$-a.s.).
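
Some of these properties can be sanity-checked on the dice space. In the sketch below, which is an illustration and not a proof, $\sigma$-algebras are represented by the partitions that generate them, and `cond_exp`, `G1`, `G2` are hypothetical names: `G2` refines `G1`, so $\sigma(G_1)\subset\sigma(G_2)$, and `Y` is constant on the blocks of `G1`, hence measurable with respect to it.

```python
from fractions import Fraction

OMEGA = range(1, 7)
P = {w: Fraction(1, 6) for w in OMEGA}

def cond_exp(X, partition):
    """E[X|G] for G generated by a partition of OMEGA into blocks of positive mass.
    X and the result are dicts indexed by outcomes."""
    out = {}
    for block in partition:
        average = sum(P[w] * X[w] for w in block) / sum(P[w] for w in block)
        out.update({w: average for w in block})
    return out

X = {w: Fraction(w * w) for w in OMEGA}
G1 = [{1, 2, 3}, {4, 5, 6}]                    # coarser partition
G2 = [{1}, {2, 3}, {4, 5}, {6}]                # finer partition
Y = {w: (1 if w > 3 else -2) for w in OMEGA}   # constant on the blocks of G1

tower_lhs = cond_exp(X, G1)
tower_rhs = cond_exp(cond_exp(X, G2), G1)
pull_out_lhs = cond_exp({w: X[w] * Y[w] for w in OMEGA}, G1)
pull_out_rhs = {w: Y[w] * cond_exp(X, G1)[w] for w in OMEGA}
```

Both pairs agree, and conditioning on the trivial partition `[set(OMEGA)]` returns the constant $\EE{X} = 91/6$.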

# Notes and departing thoughts

This was a long post. If you are still with us, good. Here are some extra thoughts. At the end of this section the notes return to the bandit problem that we started with.

Note 1: It is not obvious why the expected value is a good summary of the reward distribution. Decision makers who base their decisions on expected values are called risk-neutral. In the example shown in the figure above, a risk-averse decision maker may actually prefer the distribution labeled as $A$ because occasionally distribution $B$ may incur a very small (even negative) reward. Risk-seeking decision makers, if they exist at all, would prefer distributions with occasional large rewards to distributions that give mediocre rewards only. There is a formal theory of what makes a decision maker rational (in a nutshell, a decision maker is rational if he/she does not contradict himself/herself). Rational decision makers compare stochastic alternatives based on the alternatives’ expected utilities, according to a theorem of von Neumann and Morgenstern. Humans are known not to do this, i.e., they are irrational. No surprise here.

Note 2: Note that in our toy example instead of $\Omega=[6]^7$, we could have chosen $\Omega = [6]^8$ (considering rolling eight dice instead of 7, with one dice never used). There are many other possibilities. We can consider coin flips instead of dice rolls (think about how this could be done). To make this easy, we could use weighted coins (e.g., a coin that lands on its head with probability 1/6), but we don’t actually need weighted coins (this may be a little tricky to see). The main point is that there are many ways to emulate one randomization device by using another. The difference between these is the set $\Omega$. What makes a choice of $\Omega$ viable is if we can emulate the game mechanism on top of $\Omega$ so that in the end the probability of seeing any particular value remains the same. In other words, the choice of $\Omega$ is far from unique. The same is true for the way we calculate the value of the game! For example, the dice could be reordered, if we stay with the first construction. The biggest irony in all probability theory is that we first make a big fuss about introducing $\Omega$ and then it turns out that the actual construction of $\Omega$ does not matter.

Note 3: The Lebesgue $\sigma$-algebra is obtained as the completion of the Borel $\sigma$-algebra with the following process: Take the null-sets in the Borel $\sigma$-algebra, i.e., the sets which have zero Lebesgue measure (one first constructs the Lebesgue measure for the Borel sets). Add all subsets of these null-sets to the Borel sets and then close the resulting set system to make it again a $\sigma$-algebra. The resulting set system is the Lebesgue $\sigma$-algebra and the Lebesgue measure is then extended to it. With the same process, we can complete any $\sigma$-algebra with respect to some chosen measure. Incomplete $\sigma$-algebras are annoying to work with as one can meet sets that have a zero measure superset but whose measure is not defined.

Note 4: We did not talk about this, but there is a whole lot to say about why the sum or the product of random variables are also random variables, or why $\inf_n X_n$, $\sup_n X_n$, $\liminf_n X_n$, $\limsup_n X_n$ are measurable when the $X_n$ are, just to list a few things. For studying sums, products, etc., the key point is to show first that the composition of measurable maps is also a measurable map and that continuous maps are measurable, and then apply these results. For $\limsup_n X_n$, just rewrite it as $\lim_{m\to\infty} \sup_{n\ge m} X_n$, note that $\sup_{n\ge m} X_n$ is decreasing in $m$ (we take suprema of smaller sets as $m$ increases), hence $\limsup_n X_n = \inf_m \sup_{n\ge m} X_n$, reducing the question to studying $\inf_n X_n$ and $\sup_n X_n$. Finally, for $\inf_n X_n$ note that it suffices if $\{\omega\,:\,\inf_n X_n \ge t\}$ is measurable for any real $t$. Now, $\inf_n X_n \ge t$ if and only if $X_n\ge t$ for all $n$. Hence, $\{\omega\,:\,\inf_n X_n \ge t\} = \cap_n \{\omega\,:\,X_n \ge t\}$, which is a countable intersection of measurable sets, hence measurable.

Note 5: The factorization lemma (attributed to Joseph L. Doob, the developer of the theory of martingales and author of the classic book “Stochastic Processes”) sneakily uses the properties of real numbers (think about why). So what we said about $\sigma$-algebras containing all information is mostly true. There are extensions, e.g., to Polish spaces, essentially covering all the interesting cases. In some sense the key is that the $\sigma$-algebra of the image space of the variable whose factorization we are seeking should be rich enough.

Note 6: We did not talk about basic results, like Lebesgue’s dominated and monotone convergence theorems, Fatou’s lemma, or Jensen’s inequality. Of these, we will definitely use the last.

Note 7: The Radon-Nikodym derivative is what helps one to define conditional expectations with the required properties.

Note 8: With the help of conditional expectations, the protocol in the bandit problem requires that for any Borel set $U$,

\begin{align}

\label{eq:condprobconstraint}

\Prob{X_t \in U|A_1,X_1,\dots,A_{t-1},X_{t-1},A_t} = P_{A_t}(U)\,.

\end{align}

Now, you should go back and see whether you can prove the lemma based on the information in this post.

Note 9: In full formality, the lemma, and many of the subsequent results would read as follows: For all probability spaces $(\Omega,\cF,\PP)$ and for any infinite sequence of random variables $(A_1,X_1,\dots)$ such that for all $t\ge 1$, $A_t$ is $\sigma(A_1,X_1,\dots,A_{t-1},X_{t-1})$-measurable, and $X_t$ satisfies $\eqref{eq:condprobconstraint}$ for any Borel set $U$, it holds that $\dots$. Think about this: There are many choices for all these things. The statement guarantees universality: No matter how we choose these things, the statement will hold. This is the sense in which probability spaces are always there, even though they are never mentioned. Why did we just say that $A_t$ is $\sigma(A_1,X_1,\dots,A_{t-1},X_{t-1})$-measurable? What is the meaning of this?

Note 10: If we want to account for randomized algorithms, all that we have to do is to replace in $\eqref{eq:condprobconstraint}$ $\sigma(A_1,X_1,\dots,A_{t-1},X_{t-1},A_t)$ ($t\ge 1$) with an increasing sequence of $\sigma$-algebras $(\cF_t)_{t\ge 1}$ such that $A_t$ is $\cF_t$-measurable, and $X_t$ is $\cF_{t+1}$-measurable. Why?

Note 11: But can we even construct a probability space $(\Omega,\cF,\PP)$ that can hold the infinitely many random variables $(X_1,A_1,\dots)$ that we need for our protocol with the required properties? There is a theorem by Ionescu Tulcea, which assures that this is the case. Kolmogorov’s extension theorem is another well-known result of this type, but for our purposes the theorem of Ionescu Tulcea is more suitable.

Note 12: We also wanted to add a few words about regular conditional probability measures, but the post is already too long. Maybe another time.

We mentioned that we will give some references. Here is at least one: In terms of its notation, David Pollard’s “A measure theoretic guide to probability” is perhaps somewhat unusual (it follows de Finetti’s notation, where $\PP$ is eliminated as redundant and gets replaced by $\mathbb E$ and sets stand for their indicator functions, etc.). This may put off some people. It should not! First, the notation is in fact quite convenient (if unusual). But the main reason I recommend it is because it is very carefully written and explains much more about the why and how than many other books. Our text was also inspired by this book.

# First steps: Explore-then-Commit

With most of the background material out of the way, we are almost ready to start designing algorithms for finite-armed stochastic bandits and analyzing their properties, and especially the regret. The last thing we need is an introduction to **concentration of measure**. As we have already seen, the regret measures the difference between the rewards you would expect to obtain if you choose the optimal action in every round, and the rewards you actually expect to see by using your policy. Of course the optimal action is the one with the largest mean and the mean payoffs of the actions are not known from the beginning but must be learned from data. We now ask how long it takes to learn about the mean reward of an action.

Suppose that $X_1,X_2,\ldots,X_n$ is a sequence of independent and identically distributed random variables (see the notes of the last post to see how one sets this up formally). Let $X$ be another random variable sharing the common distribution of the $X_i$s. Assume that the mean $\mu = \EE{X}$ exists (by construction, $\mu = \EE{X_i}$ for $i\in [n]$). Having observed $X_1,X_2,\ldots,X_n$ we would like to define an **estimator** of the common mean $\mu$ of the $X_i$ random variables. Of course the natural choice to estimate $\mu$ is to use the average of the observations, also known as the **sample mean** or **empirical mean**:

\begin{align*}

\hat \mu \doteq \frac{1}{n} \sum_{i=1}^n X_i\,.

\end{align*}

Note that $\hat \mu$ is itself a random variable. The question is how far from $\mu$ do we expect $\hat \mu$ to be? First, by the linearity of $\E$, we can notice that $\EE{\hat \mu} = \mu$. A simple measure of the spread of the distribution of a random variable $Z$ is its variance, defined by $\VV{Z} \doteq \EE{ (Z-\EE{Z})^2 }$. Applying this to the sample mean $\hat \mu$, a quick calculation shows that $\VV{\hat \mu} = \sigma^2 / n$ where $\sigma^2$ is the variance of $X$. From this, we get

\begin{align}\label{eq:iidvarbound}

\EE{(\hat \mu - \mu)^2} = \frac{\sigma^2}{n}\,,

\end{align}

which means that we expect the squared distance between $\mu$ and $\hat \mu$ to shrink as $n$ grows large at rate $1/n$ and scale linearly with the variance of $X$ (so larger variance means larger expected squared difference). While the expected squared error is important, it does not tell us very much about the *distribution of the error*. To do this we usually analyze the probability that $\hat \mu$ overestimates or underestimates $\mu$ by more than some value $\epsilon > 0$. Precisely, how do the following quantities depend on $\epsilon$?

\begin{align*}

\Prob{\hat \mu \geq \mu + \epsilon} \quad\text{ and }\quad \Prob{\hat \mu \leq \mu - \epsilon}

\end{align*}

The quantities above (as a function of $\epsilon$) are often called the **tail probabilities** of $\hat \mu - \mu$, see the figure below. In particular, the first is called an upper tail probability, while the second is called a lower tail probability. Analogously, the probability $\Prob{|\hat \mu -\mu|\geq \epsilon}$ is called a two-sided tail probability.

By combining $\eqref{eq:iidvarbound}$ with **Chebyshev’s inequality** we can bound the two-sided tail directly by

\begin{align*}

\Prob{|\hat \mu - \mu| \geq \epsilon} \leq \frac{\sigma^2}{n \epsilon^2}\,.

\end{align*}

This result is nice because it was so easily bought and relied on no assumptions other than the existence of the mean and variance. The downside is that in many cases the inequality is extremely loose and that huge improvement is possible if the distribution of $X$ is itself “nicer”. In particular, by assuming that higher moments of $X$ exist, Chebyshev’s inequality can be greatly improved, by applying Markov’s inequality to $|\hat \mu - \mu|^k$ with the positive integer $k$ to be chosen so that the resulting bound is optimized. This is a bit cumbersome and while this method is essentially as effective as the one we present below (which can be thought of as the continuous analog of choosing $k$), the method we present looks more elegant.

To calibrate our expectations on what gains to expect over Chebyshev’s inequality, let us first discuss the **central limit theorem**. Let $S_n = \sum_{t=1}^n (X_t - \mu)$. Informally, the central limit theorem (CLT) says that $S_n / \sqrt{n}$ is approximately distributed like a Gaussian with mean zero and variance $\sigma^2$. This would suggest that

\begin{align*}

\Prob{\hat \mu \geq \mu + \epsilon}

&= \Prob{S_n / \sqrt{n} \geq \epsilon \sqrt{n}} \\

&\approx \int^\infty_{\epsilon \sqrt{n}} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx\,.

\end{align*}

The integral has no closed form solution, but is easy to bound:

\begin{align*}

\int^\infty_{\epsilon \sqrt{n}} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx

&\leq \frac{1}{\epsilon \sqrt{2n\pi \sigma^2}} \int^\infty_{\epsilon \sqrt{n}} x \exp\left(-\frac{x^2}{2\sigma^2}\right) dx \\

&= \sqrt{\frac{\sigma^2}{2\pi n \epsilon^2}} \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right)\,.

\end{align*}

This quantity is almost always much smaller than what we obtained using Chebyshev’s inequality. In particular, it decays slightly faster than the negative exponential of $n \epsilon^2/(2\sigma^2)$. Thus, the distribution of the sample mean $\hat \mu$ rapidly concentrates around its mean $\mu$. Unfortunately, since the central limit theorem is *asymptotic*, we cannot use it to study the regret when the number of rounds is a fixed finite number. Despite the folk-rule that $n = 30$ is sufficient for the Gaussian approximation brought by CLT to be reasonable, this is simply *not true*. One well known example is provided by Bernoulli variables with parameter $p \approx 1/n$, in which case the distribution of the sum $\sum_{i=1}^n X_i$ is known to be much better approximated by the Poisson distribution with parameter one, which is nowhere near similar to the Gaussian distribution. Furthermore, it is not hard to believe that the asymptotic behavior of an algorithm is a poor predictor of its finite time behavior, hence it is vital to take a closer look at measure concentration and prove a “version” of the CLT that is true even for small values of $n$.
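
To make the Bernoulli example concrete, here is a small sketch (standard library only) that compares the exact distribution of the sum of $n = 30$ Bernoulli$(1/n)$ variables with its Poisson and Gaussian approximations. The helper names and the use of a continuity correction for the Gaussian are illustrative choices, not part of the argument above.

```python
import math

def binomial_pmf(k, n, p):
    """Exact P(S = k) for S ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def normal_cdf(x, mean, var):
    """CDF of a Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def compare_approximations(n, k):
    """Return (exact, poisson, gaussian) values for P(S_n = k) when p = 1/n."""
    p = 1.0 / n
    exact = binomial_pmf(k, n, p)
    poisson = poisson_pmf(k, n * p)
    mean, var = n * p, n * p * (1.0 - p)
    # Gaussian approximation with a continuity correction (an assumed convention).
    gaussian = normal_cdf(k + 0.5, mean, var) - normal_cdf(k - 0.5, mean, var)
    return exact, poisson, gaussian

for k in range(4):
    exact, poisson, gaussian = compare_approximations(30, k)
    print(f"k={k}: exact={exact:.4f} poisson={poisson:.4f} gaussian={gaussian:.4f}")
```

For $k = 0$ the Poisson value is within about $0.01$ of the exact probability, while the Gaussian value is off by roughly $0.1$.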

To prove our measure concentration results, it is necessary to make some assumptions. In order to move rapidly towards bandits we start with a straightforward and relatively fundamental assumption on the distribution of $X$, known as the **subgaussian** assumption.

Definition (subgaussianity): A random variable $X$ is $\sigma^2$-subgaussian if for all $\lambda \in \R$ it holds that $\EE{\exp(\lambda X)} \leq \exp\left(\lambda^2 \sigma^2 / 2\right)$.

The function $M_X(\lambda) = \EE{\exp(\lambda X)}$ is called the *moment generating function* of $X$ and appears often in probability theory. Note that $M_X(\lambda)$ need not exist for all random variables and the definition of a subgaussian random variable is implicitly assuming this existence.

Lemma: Suppose that $X$ is $\sigma^2$-subgaussian and $X_1$ and $X_2$ are independent and $\sigma^2_1$ and $\sigma^2_2$-subgaussian respectively, then:

- $\E[X] = 0$ and $\VV{X} \leq \sigma^2$.
- $cX$ is $c^2\sigma^2$-subgaussian for all $c \in \R$.
- $X_1 + X_2$ is $(\sigma^2_1 + \sigma^2_2)$-subgaussian.

The proof of the lemma is left as an exercise (hint: Taylor series). Note that without independence $X_1+X_2$ is $(\sigma_1+\sigma_2)^2$-subgaussian; independence improves this to $\sigma_1^2 + \sigma_2^2$. We are now ready for our key **concentration inequality**.

Theorem: If $X$ is $\sigma^2$-subgaussian, then for any $\epsilon \geq 0$, $\Prob{X \geq \epsilon} \leq \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$.

**Proof**

We take a generic approach called Chernoff’s method. Let $\lambda > 0$ be some constant to be tuned later. Then

\begin{align*}

\Prob{X \geq \epsilon}

&= \Prob{\exp\left(\lambda X\right) \geq \exp\left(\lambda \epsilon\right)} \\

\tag{Markov’s inequality} &\leq \EE{\exp\left(\lambda X\right)} \exp\left(-\lambda \epsilon\right) \\

\tag{Def. of subgaussianity} &\leq \exp\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \epsilon\right)\,.

\end{align*}

Now $\lambda$ was any positive constant, and in particular may be chosen to minimize the bound above, which is achieved by $\lambda = \epsilon / \sigma^2$.
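
To spell out the tuning step (a routine calculation, added here for completeness and assuming $\epsilon > 0$): the exponent $g(\lambda) = \lambda^2\sigma^2/2 - \lambda\epsilon$ is a convex quadratic in $\lambda$, so its minimizer is found by setting the derivative to zero:

\begin{align*}
g'(\lambda) = \lambda \sigma^2 - \epsilon = 0
\quad\Longrightarrow\quad
\lambda = \frac{\epsilon}{\sigma^2}\,,
\qquad
g\left(\frac{\epsilon}{\sigma^2}\right) = \frac{\epsilon^2}{2\sigma^2} - \frac{\epsilon^2}{\sigma^2} = -\frac{\epsilon^2}{2\sigma^2}\,.
\end{align*}

Plugging this value of the exponent into the last display gives the claimed bound; for $\epsilon = 0$ the bound is trivially true since its right-hand side equals one.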

QED

Combining the previous theorem and lemma leads to a very straightforward analysis of the tails of $\hat \mu - \mu$ under the assumption that $X_i - \mu$ are $\sigma^2$-subgaussian. Since $X_i$ are assumed to be independent, by the lemma it holds that $\hat \mu - \mu = \sum_{i=1}^n (X_i - \mu) / n$ is $\sigma^2/n$-subgaussian. The theorem then leads to a bound on the tails of $\hat \mu$:

Corollary (Hoeffding’s bound): Assume that $X_i-\mu$ are independent, $\sigma^2$-subgaussian random variables. Then, their average $\hat\mu$ satisfies

\begin{align*}

\Prob{\hat \mu \geq \mu + \epsilon} \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right)

\quad \text{ and } \quad \Prob{\hat \mu \leq \mu -\epsilon} \leq \exp\left(-\frac{n\epsilon^2}{2\sigma^2}\right)\,.

\end{align*}

By the inequality $\exp(-x) \leq 1/(ex)$, which holds for all $x > 0$, we can see that except for very small $\epsilon$ the above inequality is strictly stronger than what we obtained via Chebyshev’s inequality, and exponentially smaller (tighter) if $n \epsilon^2$ is large relative to $\sigma^2$.
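
A quick numerical comparison may help here. The sketch below evaluates both two-sided bounds, where the subgaussian one is obtained by adding the two one-sided bounds of the corollary (a union bound). The function names are ours and $\sigma^2 = 1$ is just a default.

```python
import math

def chebyshev_two_sided(n, eps, sigma2=1.0):
    """Chebyshev bound on P(|mu_hat - mu| >= eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

def hoeffding_two_sided(n, eps, sigma2=1.0):
    """Sum of the two one-sided subgaussian tail bounds, capped at 1."""
    return min(1.0, 2.0 * math.exp(-n * eps ** 2 / (2.0 * sigma2)))

for eps in (0.15, 0.3, 0.5):
    c = chebyshev_two_sided(100, eps)
    h = hoeffding_two_sided(100, eps)
    print(f"eps={eps}: chebyshev={c:.3e} hoeffding={h:.3e}")
```

With $n = 100$, the two-sided subgaussian bound is the weaker one at $\epsilon = 0.15$ (the factor of two from the union bound matters when $n\epsilon^2/\sigma^2$ is small), but it is smaller by several orders of magnitude at $\epsilon = 0.5$.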

Before we finally return to bandits, one might be wondering what variables *are* subgaussian? We give two fundamental examples. The first is if $X$ is Gaussian with zero mean and variance $\sigma^2$, then $X$ is $\sigma^2$-subgaussian. The second is bounded zero mean random variables. If $\EE{X}=0$ and $|X| \leq B$ almost surely for some $B \geq 0$, then $X$ is $B^2$-subgaussian. A special case is when $X$ is a shifted Bernoulli with $\Prob{X = 1 – p} = p$ and $\Prob{X = -p} = 1-p$. In this case it also holds that $X$ is $1/4$-subgaussian.

We will be returning to measure concentration many times throughout the course, and note here that it is a very interesting (and still active) topic of research. What we have done and seen is only the very tip of the iceberg, but it is enough for now.

# The explore-then-commit strategy

With the background on concentration out of the way we are ready to present the first strategy of the course. The approach, which is to try each action/arm a fixed number of times (exploring) and subsequently choose the arm that had the largest average payoff (exploiting), is possibly the simplest approach there is and it is certainly one of the first that comes to mind immediately after learning about the definition of a bandit problem. In what follows we assume the **noise** for all rewards is $1$-subgaussian, by which we mean that $X_t - \E[X_t]$ is $1$-subgaussian for all $t$.

Remark: We assume $1$-subgaussian noise for convenience, in order to avoid endlessly writing $\sigma^2$. All results hold for other values of $\sigma^2$ just by scaling. The important point is that all the algorithms that follow rely on the knowledge of $\sigma^2$. More about this in the notes at the end of the post.

The **explore-then-commit** strategy is characterized by a natural number $m$, which is the number of times each arm will be explored before committing. Thus the algorithm will explore for $mK$ rounds before choosing a single action for the remaining $n - mK$ rounds. We can write this strategy formally as

\begin{align*}

A_t = \begin{cases}

i\,, & \text{if } (t \operatorname{mod} K)+1 = i \text{ and } t \leq mK\,; \\

\argmax_i \hat \mu_i(mK)\,, & t > mK\,,

\end{cases}

\end{align*}

where ties in the argmax will be broken in a fixed arbitrary way and $\hat \mu_i(t)$ is the average pay-off for arm $i$ up to round $t$:

\begin{align*}

\hat \mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \one{A_s = i} X_s\,.

\end{align*}

Recall that $T_i(t) = \sum_{s=1}^t \one{A_s=i}$ is the number of times action $i$ was chosen up to the end of round $t$. In the ETC strategy, $T_i(t)$ is deterministic for $t \leq mK$ and $\hat \mu_i(t)$ is never used for $t > mK$. However, in other strategies (that also use $\hat\mu_i(t)$) $T_i(t)$ could be genuinely random (i.e., not concentrated on a single deterministic quantity like here). In those cases, the concentration analysis of $\hat \mu_i(t)$ will need to take the randomness of $T_i(t)$ properly into account. In the case of ETC, the concentration analysis of $\hat \mu_i(mK)$ is simple: $\hat \mu_i(mK)$ is the average of $m$ independent, identically distributed random variables, hence the corollary above applies immediately.
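
The strategy is short enough to sketch in code. The following is a minimal simulation, assuming unit-variance Gaussian rewards (one example of $1$-subgaussian noise), arms indexed from $0$ rather than $1$, and ties broken in favour of the smallest index. The function name `explore_then_commit` and its `seed` parameter are illustrative.

```python
import random

def explore_then_commit(means, m, n, seed=0):
    """Simulate ETC on Gaussian arms with the given means and unit variance.

    Returns the list of chosen arms and the list of observed rewards.
    """
    K = len(means)
    if m < 1 or n < m * K:
        raise ValueError("need m >= 1 and n >= m * K")
    rng = random.Random(seed)
    sums = [0.0] * K          # sum of rewards per arm during exploration
    committed = None          # arm chosen after exploration
    actions, rewards = [], []
    for t in range(n):
        if t < m * K:
            arm = t % K       # round-robin exploration
        else:
            if committed is None:
                # Every arm has exactly m samples, so compare empirical means.
                committed = max(range(K), key=lambda i: sums[i] / m)
            arm = committed
        x = rng.gauss(means[arm], 1.0)
        if t < m * K:
            sums[arm] += x
        actions.append(arm)
        rewards.append(x)
    return actions, rewards

means = [0.0, 0.5, 0.2]
actions, rewards = explore_then_commit(means, m=5, n=100, seed=1)
# Pseudo-regret of this particular run: sum of the gaps of the chosen arms.
print(sum(max(means) - means[a] for a in actions))
```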

The formal definition of the explore-then-commit strategy leads rather immediately to the analysis of its regret. First, recall from the previous post that the regret can be written as

\begin{align*}

R_n = \sum_{t=1}^n \EE{\Delta_{A_t}}\,.

\end{align*}

Now in the first $mK$ rounds the strategy given above is completely deterministic, choosing each action exactly $m$ times. Subsequently it chooses a single action that gave the largest average payoff during exploring. Therefore, by splitting the sum we have

\begin{align*}

R_n = m \sum_{i=1}^K \Delta_i + (n-mK) \sum_{i=1}^K \Delta_i \Prob{i = \argmax_j \hat \mu_j(mK)}\,,

\end{align*}

where again we assume that ties in the argmax are broken in a fixed specific way. Now we need to bound the probability in the second term above. Assume without loss of generality that the optimal arm is $i = 1$ so that $\mu_1 = \mu^* = \max_i \mu_i$ (though the learner does not know this). Then,

\begin{align*}

\Prob{i = \argmax_j \hat \mu_j(mK)}

&\leq \Prob{\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0} \\

&= \Prob{\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i}\,.

\end{align*}

The next step is to check that $\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1$ is $2/m$-subgaussian, which by the properties of subgaussian random variables follows from the definitions of $\hat \mu_i$ and the algorithm. Therefore, by the theorem in the previous section (also by our observation above) we have

\begin{align*}

\Prob{\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i} \leq \exp\left(-\frac{m \Delta_i^2}{4}\right)

\end{align*}

and by straightforward substitution we obtain

\begin{align}

\label{eq:etcregretbound}

R_n \leq m \sum_{i=1}^K \Delta_i + (n - mK) \sum_{i=1}^K \Delta_i \exp\left(-\frac{m\Delta_i^2}{4}\right)\,.

\end{align}

To summarize:

Theorem (Regret of ETC): Assume that the noise of the reward of each arm in a $K$-armed stochastic bandit problem is $1$-subgaussian. Then, after $n\ge mK$ rounds, the expected regret $R_n$ of ETC which explores each arm exactly $m$ times before committing is bounded as shown in $\eqref{eq:etcregretbound}$.
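
To get a feel for the bound in the theorem, the sketch below simply evaluates its right-hand side for a given vector of gaps. The function name and the example values are ours.

```python
import math

def etc_regret_bound(gaps, m, n):
    """Right-hand side of the ETC regret bound for 1-subgaussian noise."""
    K = len(gaps)
    if n < m * K:
        raise ValueError("the bound requires n >= m * K")
    exploration = m * sum(gaps)
    commitment = (n - m * K) * sum(d * math.exp(-m * d ** 2 / 4.0) for d in gaps)
    return exploration + commitment

gaps = [0.0, 0.5]  # two arms, the first one optimal
for m in (1, 50, 400):
    print(m, round(etc_regret_bound(gaps, m, n=1000), 2))
```

With these values the bound is about $469$ for $m = 1$, about $45$ for $m = 50$ and $200$ for $m = 400$: too little exploration and too much exploration are both costly.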

Although we will discuss many disadvantages of this algorithm, the above bound cleanly illustrates the fundamental challenge faced by the learner, which is the trade-off between exploration and exploitation. If $m$ is large, then the strategy explores very often and the first term will be relatively larger. On the other hand, if $m$ is small, then the probability that the algorithm commits to the wrong arm will grow and the second term becomes large. The big question is how to choose $m$? If we limit ourselves to $K = 2$, then $\Delta_1 = 0$ and by using $\Delta = \Delta_2$ the above display simplifies to

\begin{align*}

R_n \leq m \Delta + (n - 2m) \Delta \exp\left(-\frac{m\Delta^2}{4}\right)

\leq m\Delta + n \Delta \exp\left(-\frac{m\Delta^2}{4}\right)\,.

\end{align*}

Provided that $n$ is reasonably large this quantity is minimized (up to a possible rounding error) by

\begin{align*}

m = \ceil{\frac{4}{\Delta^2} \log\left(\frac{n \Delta^2}{4}\right)}

\end{align*}

and for this choice the regret is bounded by

\begin{align}

\label{eq:regret_g}

R_n \leq \Delta + \frac{4}{\Delta} \left(1 + \log\left(\frac{n \Delta^2}{4}\right)\right)\,.

\end{align}
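
As a sanity check on the tuning, the following sketch computes the suggested $m$ and the resulting bound for $K = 2$. The guard that returns $m = 1$ when $n\Delta^2 \le 4$ (where the logarithm is not positive) is our own assumption, since the text only treats the case of reasonably large $n$.

```python
import math

def etc_tuned_m(gap, n):
    """Exploration length m = ceil((4 / gap^2) * log(n * gap^2 / 4))."""
    arg = n * gap ** 2 / 4.0
    if arg <= 1.0:
        return 1  # assumed fallback: the formula would give m <= 0 here
    return max(1, math.ceil(4.0 / gap ** 2 * math.log(arg)))

def etc_tuned_bound(gap, n):
    """Bound gap + (4 / gap) * (1 + log(n * gap^2 / 4)), for n * gap^2 > 4."""
    return gap + 4.0 / gap * (1.0 + math.log(n * gap ** 2 / 4.0))

gap, n = 0.5, 1000
m = etc_tuned_m(gap, n)
direct = m * gap + n * gap * math.exp(-m * gap ** 2 / 4.0)
print(m, round(direct, 2), round(etc_tuned_bound(gap, n), 2))
```

For $\Delta = 0.5$ and $n = 1000$ this gives $m = 67$, and the directly evaluated two-term bound (about $41.1$) indeed sits below the closed-form bound (about $41.6$).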

As we will eventually see, this result is not far from being optimal, but there is one big caveat. **The choice of $m$ given above (which defines the strategy) depends on $\Delta$ and $n$.** While sometimes it might be reasonable that the horizon is known in advance, it is practically never the case that the sub-optimality gaps are known. So is there a reasonable way to choose $m$ that does not depend on the unknown gap? This is a useful exercise, but it turns out that one can choose an $m$ that depends on $n$, but not the gap in such a way that the regret satisfies $R_n = O(n^{2/3})$ and that you cannot do better than this using an explore-then-commit strategy. At the price of killing the suspense, there are *other* algorithms that do not even need to know $n$ that can improve this rate significantly to $O(n^{1/2})$.

Setting this issue aside for a moment and returning to the regret guarantee above, we notice that $\Delta$ appears in the denominator of the regret bound, which means that as $\Delta$ becomes very small the regret grows unboundedly. Is that reasonable and why is it happening? By the definition of the regret, when $K = 2$ we have a very trivial bound that follows because $\sum_{i=1}^K \E[T_i(n)] = n$, giving rise to

\begin{align*}

R_n = \sum_{i=1}^2 \Delta_i \E[T_i(n)] = \Delta \E[T_2(n)] \leq n\Delta\,.

\end{align*}

This means that if $\Delta$ is very small, then the regret must also be very small because even choosing the suboptimal arm leads to only minimal pain. The reason the regret guarantee above does not capture this is because we assumed that $n$ was very large when solving for $m$, but for small $n$ it is not possible to choose $m$ large enough that the best arm can actually be identified at all, and at this point the best reasonable bound on the regret is $O(n\Delta)$. To summarize, in order to get a bound on the regret that makes sense we can take the minimum of the bound proven in \eqref{eq:regret_g} and $n\Delta$:

\begin{align*}

R_n \leq \min\left\{n\Delta,\, \Delta + \frac{4}{\Delta}\left(1 + \log\left(\frac{n\Delta^2}{4}\right)\right)\right\}\,.

\end{align*}

We leave it as an exercise to the reader to check that provided $\Delta \leq \sqrt{n}$, then $R_n = O(\sqrt{n})$. Bounds of this nature are usually called **worst-case** or **problem-free** or **problem independent**. The reason is that the regret depends only on the distributional assumption, but not on the means. In contrast, bounds that depend on the sub-optimality gaps are called **problem/distribution dependent**.
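
A numerical look at the exercise can be reassuring. The sketch below maximizes the combined bound over a grid of gaps in $(0, 1]$ and reports the result divided by $\sqrt{n}$. Two assumptions of ours: gaps are restricted to at most $1$, and for $n\Delta^2 \le 4$ (where the tuned $m$ would not be positive) only the trivial bound $n\Delta$ is used.

```python
import math

def etc_two_armed_bound(gap, n):
    """min{n * gap, gap + (4 / gap) * (1 + log(n * gap^2 / 4))} with a guard."""
    if n * gap ** 2 <= 4.0:
        return n * gap  # assumed: the tuned bound is not meaningful here
    tuned = gap + (4.0 / gap) * (1.0 + math.log(n * gap ** 2 / 4.0))
    return min(n * gap, tuned)

def worst_case_ratio(n, grid_size=1000):
    """Largest bound over gaps k / grid_size, divided by sqrt(n)."""
    gaps = [k / grid_size for k in range(1, grid_size + 1)]
    return max(etc_two_armed_bound(g, n) for g in gaps) / math.sqrt(n)

for n in (100, 10_000, 1_000_000):
    print(n, round(worst_case_ratio(n), 3))
```

The ratio stays below $3$ for these horizons, consistent with $R_n = O(\sqrt{n})$. The maximizing gap is of order $1/\sqrt{n}$.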

# Notes

Note 1: The Berry-Esseen theorem quantifies the speed of convergence in the CLT. It essentially says that the distance between the Gaussian and the actual distribution decays at a rate of $1/\sqrt{n}$ under some mild assumptions. This is known to be tight for the class of probability distributions that appear in the Berry-Esseen result. However, this is a poor result when the tail probabilities themselves are much smaller than $1/\sqrt{n}$. Hence, the need for alternate results.

Note 2: The Explore-then-Commit strategy is also known as the “Explore-then-Exploit” strategy. A similar idea is to “force” a certain amount of exploration. Policies that are based on forced exploration ensure that each arm is explored (chosen) sufficiently often. The exploration may happen upfront like in ETC, or spread out in time. In fact, the limitation of ETC that it needs to know $n$ even to achieve the $O(n^{2/3})$ regret can be addressed by this (e.g., using the so-called “doubling trick”, where the same strategy that assumes a fixed horizon is used on horizons of length that increase at some rate, e.g., they could double). The $\epsilon$-greedy strategy chooses a random action with probability $\epsilon$ in every round, and otherwise chooses the action with the highest empirical mean. Oftentimes, $\epsilon$ is chosen as a function of time to slowly decrease to zero. As it turns out, $\epsilon$-greedy (even with this modification) shares the same difficulties that ETC faces.

# The Upper Confidence Bound Algorithm

We now describe the celebrated Upper Confidence Bound (UCB) algorithm that overcomes all of the limitations of strategies based on exploration followed by commitment, including the need to know the horizon and sub-optimality gaps. The algorithm has many different forms, depending on the distributional assumptions on the noise.

The algorithm is based on the principle of **optimism in the face of uncertainty**, which is to choose your actions as if the environment (in this case bandit) is as nice as is **plausibly possible**. By this we mean that the unknown mean payoff of each arm is as large as plausibly possible based on the data that has been observed (unfounded optimism will not work; see the illustration on the right!). The intuitive reason that this works is that when acting optimistically one of two things happens. Either the optimism was justified, in which case the learner is acting optimally, or the optimism was not justified. In the latter case the agent takes some action that they believed might give a large reward when in fact it does not. If this happens sufficiently often, then the learner will learn what is the true payoff of this action and not choose it in the future. The careful reader may notice that this explains why this rule will eventually get things right (it will be “consistent” in some sense), but the argument does not quite explain why an optimistic algorithm should actually be a good algorithm among all consistent ones. However, before getting to this, let us clarify what we mean by **plausible**.

Recall that if $X_1, X_2,\ldots, X_n$ are independent and $1$-subgaussian (which means that $\E[X_i] = 0$) and $\hat \mu = \sum_{t=1}^n X_t / n$, then

\begin{align*}

\Prob{\hat \mu \geq \epsilon} \leq \exp\left(-n\epsilon^2 / 2\right)\,.

\end{align*}

Equating the right-hand side with $\delta$ and solving for $\epsilon$ leads to

\begin{align}

\label{eq:simple-conc}

\Prob{\hat \mu \geq \sqrt{\frac{2}{n} \log\left(\frac{1}{\delta}\right)}} \leq \delta\,.

\end{align}

This analysis immediately suggests a definition of “as large as plausibly possible”. Using the notation of the previous post, we can say that when the learner is deciding what to do in round $t$ it has observed $T_i(t-1)$ samples from arm $i$ and observed rewards with an empirical mean of $\hat \mu_i(t-1)$ for it. Then a good candidate for the largest plausible estimate of the mean for arm $i$ is

\begin{align*}

\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)}\,.

\end{align*}

Then the algorithm chooses the action $i$ that maximizes the above quantity. If $\delta$ is chosen very small, then the algorithm will be more optimistic and if $\delta$ is large, then the optimism is less certain. We have to be very careful when comparing the above display to \eqref{eq:simple-conc} because in one the number of samples is the constant $n$ and in the other it is a *random variable* $T_i(t-1)$. Nevertheless, this is in some sense a technical issue (that needs to be taken care of properly, of course) and the intuition remains that $\delta$ is approximately an upper bound on the probability of the event that the above quantity is an underestimate of the true mean.

The value of $1-\delta$ is called the *confidence level* and different choices lead to different algorithms, each with their pros and cons, and sometimes different analysis. For now we will choose $1/\delta = f(t)= 1 + t \log^2(t)$, $t=1,2,\dots$. That is, $\delta$ is time-dependent, and is decreasing to zero slightly faster than $1/t$. Readers are not (yet) expected to understand this choice whose pros and cons we will discuss later. In summary, in round $t$ the UCB algorithm will choose arm $A_t$ given by

\begin{align}

A_t = \begin{cases}

\argmax_i \left(\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}}\right)\,, & \text{if } t > K\,; \\

t\,, & \text{otherwise}\,.

\end{cases}

\label{eq:ucb}

\end{align}

The reason for the cases is that the term inside the square root is undefined if $T_i(t-1) = 0$ (as it is when $t = 1$), so we will simply have the algorithm spend the first $K$ rounds choosing each arm once. The value inside the argmax is called the **index** of arm $i$. Generally speaking, an **index** algorithm chooses the arm in each round that maximizes some value (the index), which usually only depends on current time-step and the samples from that arm. In the case of UCB, the index is the sum of the empirical mean of rewards experienced and the so-called *exploration bonus*, also known as the *confidence width*.
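
The index rule translates directly into code. Below is a minimal simulation sketch, assuming unit-variance Gaussian rewards (one instance of $1$-subgaussian noise), arms indexed from $0$, and ties in the argmax broken in favour of the smallest index. The names `f`, `ucb` and `seed` are illustrative.

```python
import math
import random

def f(t):
    """The choice 1 / delta = f(t) = 1 + t * log(t)^2 used in the text."""
    return 1.0 + t * math.log(t) ** 2

def ucb(means, n, seed=0):
    """Simulate UCB on Gaussian arms; return (actions, pull counts)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # T_i(t-1)
    sums = [0.0] * K      # running sum of rewards per arm
    actions = []
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1   # first K rounds: play each arm once
        else:
            log_f = math.log(f(t))
            # Index = empirical mean + exploration bonus.
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i] + math.sqrt(2.0 * log_f / counts[i]),
            )
        x = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += x
        actions.append(arm)
    return actions, counts

actions, counts = ucb([0.0, 2.0], n=1000, seed=0)
print(counts)
```

In a run like this one, the better arm should receive the vast majority of the pulls, while the other arm is still revisited occasionally because its bonus grows slowly with $\log f(t)$.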

Besides the slightly vague “optimism guarantees optimality or learning” intuition we gave before, it is worth exploring other intuitions for this choice of index. At a very basic level, we should explore arms more often if they are (a) promising (in that $\hat \mu_i(t-1)$ is large) or (b) not well explored ($T_i(t-1)$ is small). As one can plainly see from the definition, the UCB index above exhibits this behaviour. This explanation is unsatisfying because it does not explain why the form of the functions is just so.

An alternative explanation comes from thinking of what we expect from any reasonable algorithm. Suppose in some round we have played some arm (let’s say arm $1$) much more frequently than the others. If we did a good job designing our algorithm we would hope this is the optimal arm. Since we played it so much we can expect that $\hat \mu_1(t-1) \approx \mu_1$. To confirm the hypothesis that arm $1$ is indeed optimal the algorithm better be highly confident that the other arms are indeed worse. This leads very naturally to confidence intervals and the requirement that $T_i(t-1)$ for other arms $i\ne 1$ better be so large that

\begin{align}\label{eq:ucbconstraint}

\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_1\,,

\end{align}

because, at a confidence level of $1-\delta$ this guarantees that $\mu_i$ is smaller than $\mu_1$ and if the above inequality did not hold, the algorithm would not be justified in choosing arm $1$ much more often than arm $i$. Then, planning for $\eqref{eq:ucbconstraint}$ to hold makes it reasonable to follow the UCB rule as this will eventually guarantee that this inequality holds when arm $1$ is indeed optimal and arm $i$ is suboptimal. But how to choose $\delta$? If the confidence interval fails, by which we mean, if actually it turns out that arm $i$ is optimal and by unlucky chance it holds that

\begin{align*}

\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_i\,,

\end{align*}

then arm $i$ can be disregarded even though it is optimal. In this case the algorithm may pay linear regret (in $n$), so it better be the case that the failure occurs with about $1/n$ probability to fix the upper bound on the expected regret to be constant for the case when the confidence interval fails. Approximating $n \approx t$ leads then (after a few technicalities) to the choice of $f(t)$ in the definition of UCB given in \eqref{eq:ucb}. With this much introduction, we state the main result of this post:

Theorem (UCB Regret): The regret of UCB is bounded by

\begin{align} \label{eq:ucbbound}

R_n \leq \sum_{i:\Delta_i > 0} \inf_{\epsilon \in (0, \Delta_i)} \Delta_i\left(1 + \frac{5}{\epsilon^2} + \frac{2}{(\Delta_i - \epsilon)^2} \left( \log f(n) + \sqrt{\pi \log f(n)} + 1\right)\right)\,.

\end{align}

Furthermore,

\begin{align} \label{eq:asucbbound}

\displaystyle \limsup_{n\to\infty} R_n / \log(n) \leq \sum_{i:\Delta_i > 0} \frac{2}{\Delta_i}\,.

\end{align}

Note that in the first display, $\log f(n) \approx \log(n) + 2\log\log(n)$. We thus see that this bound scales logarithmically with the length of the horizon and is able to essentially reproduce the bound that we obtained for the infeasible version of ETC with $K=2$ (when we tuned the exploration time based on the knowledge of $\Delta_2$). We shall discuss further properties of this bound later, but now let us present a simpler version of the above bound, avoiding all these epsilons and infimums that make for a confusing theorem statement. Choosing $\epsilon = \Delta_i/2$ inside the sum leads to the following corollary:

Corollary (UCB Simplified Regret): The regret of UCB is bounded by

\begin{align*}

R_n \leq \sum_{i:\Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i}\left(8 \log f(n) + 8\sqrt{\pi \log f(n)} + 28\right)\right)\,.

\end{align*}

In particular, there exists some universal constant $C>0$ such that for all $n\ge 2$, $R_n \le \sum_{i:\Delta_i>0} \left(\Delta_i + \frac{C \log n}{\Delta_i}\right)$.

Note that taking the limit of the ratio of the bound above and $\log(n)$ does not result in the same rate as in the theorem, which is the main justification for introducing the epsilons in the first place. In fact, as we shall see the asymptotic bound on the regret given in \eqref{eq:asucbbound}, which is derived from \eqref{eq:ucbbound} by choosing $\epsilon = \log^{-1/4}(n)$, is **unimprovable** in a strong sense.
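
The relation between the theorem and the corollary can be checked numerically. The sketch below evaluates one summand of the theorem's bound for a given $\epsilon$, approximates the infimum over a grid of $\epsilon$ values (the grid is our simplification and only gives an upper estimate of the infimum), and compares it with the corollary's summand.

```python
import math

def log_f(n):
    """log f(n) with f(n) = 1 + n * log(n)^2."""
    return math.log(1.0 + n * math.log(n) ** 2)

def ucb_term(gap, n, eps):
    """One summand of the theorem's bound for a fixed eps in (0, gap)."""
    L = log_f(n)
    return gap * (1.0 + 5.0 / eps ** 2
                  + 2.0 / (gap - eps) ** 2 * (L + math.sqrt(math.pi * L) + 1.0))

def ucb_term_grid_inf(gap, n, grid_size=100):
    """Smallest value of ucb_term over eps = gap * k / grid_size, 0 < k < grid_size."""
    return min(ucb_term(gap, n, gap * k / grid_size) for k in range(1, grid_size))

def ucb_simplified_term(gap, n):
    """One summand of the corollary's bound (the choice eps = gap / 2)."""
    L = log_f(n)
    return gap + (8.0 * L + 8.0 * math.sqrt(math.pi * L) + 28.0) / gap

gap, n = 0.5, 1000
print(ucb_term(gap, n, gap / 2), ucb_simplified_term(gap, n), ucb_term_grid_inf(gap, n))
```

The first two printed values agree up to floating-point error, and the grid minimum is no larger, which is all the infimum in the theorem promises.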

The proof of the theorem relies on the basic regret decomposition identity that expresses the expected regret as the weighted sum of the expected number of times the suboptimal actions are chosen. So why will $\EE{T_i(n)}$ be small for a suboptimal action $i$? This is based on a couple of simple observations: First, (disregarding the initial period when all arms are chosen once) the suboptimal action $i$ can only be chosen if its UCB index is higher than that of an optimal arm. Now, this can only happen if the UCB index of action $i$ is “too high”, i.e., higher than $\mu^*-\epsilon>\mu_i$ **or** the UCB index of that optimal arm is “too low”, i.e., if it is below $\mu^*-\epsilon<\mu^*$. Since the UCB index of any arm is with reasonably high probability an upper bound on the arm’s mean, we don’t expect the index of any arm to be below its mean. Hence, the total number of times when the optimal arm’s index is “too low” (as defined above) is expected to be negligibly small. Furthermore, if the sub-optimal arm $i$ is played sufficiently often, then its exploration bonus becomes small and simultaneously the empirical estimate of its mean converges to the true value, making the expected total number of times when its index stays above $\mu^*-\epsilon$ small.

We start with a useful lemma that will help us quantify the *last* argument.

Lemma: Let $X_1,X_2,\ldots$ be a sequence of independent $1$-subgaussian random variables, $\hat \mu_t = \sum_{s=1}^t X_s / t$, $a > 0$, $\epsilon > 0$ and

\begin{align*}

\kappa = \sum_{t=1}^n \one{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon}\,.

\end{align*}

Then, $\displaystyle \E[\kappa] \leq 1 + \frac{2}{\epsilon^2} (a + \sqrt{\pi a} + 1)$.

Because the $X_i$ are $1$-subgaussian and independent we have $\E[\hat \mu_t] = 0$, so we cannot expect $\hat \mu_t + \sqrt{2a/t}$ to be smaller than $\epsilon$ until $t$ is at least $2a/\epsilon^2$. The lemma confirms that this is indeed of the right order as an estimate for $\EE{\kappa}$.

**Proof**

Let $u = 2a \epsilon^{-2}$. Then, by the concentration theorem for subgaussian variables,

\begin{align*}

\E[\kappa]

&\leq u + \sum_{t=\ceil{u}}^n \Prob{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon} \\

&\leq u + \sum_{t=\ceil{u}}^n \exp\left(-\frac{t\left(\epsilon - \sqrt{\frac{2a}{t}}\right)^2}{2}\right) \\

&\leq 1 + u + \int^\infty_u \exp\left(-\frac{t\left(\epsilon - \sqrt{\frac{2a}{t}}\right)^2}{2}\right) dt \\

&= 1 + \frac{2}{\epsilon^2}(a + \sqrt{\pi a} + 1)\,.

\end{align*}

QED
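The lemma can be sanity-checked by simulation. The sketch below is only an illustration: it assumes standard Gaussian $X_t$ (one particular $1$-subgaussian choice), and the values of $a$, $\epsilon$, $n$ and the function names are hypothetical. It estimates $\E[\kappa]$ by Monte Carlo and compares it with $u = 2a/\epsilon^2$ and with the bound of the lemma.

```python
import math
import random


def kappa_bound(a: float, eps: float) -> float:
    """Right-hand side of the lemma: 1 + (2 / eps^2) * (a + sqrt(pi * a) + 1)."""
    return 1.0 + 2.0 / eps**2 * (a + math.sqrt(math.pi * a) + 1.0)


def sample_kappa(n: int, a: float, eps: float, rng: random.Random) -> int:
    """One realisation of kappa with standard Gaussian X_t (an assumed noise model)."""
    total = 0.0
    kappa = 0
    for t in range(1, n + 1):
        total += rng.gauss(0.0, 1.0)
        mean_t = total / t                       # hat mu_t
        if mean_t + math.sqrt(2.0 * a / t) >= eps:
            kappa += 1
    return kappa


def estimate_expected_kappa(n, a, eps, runs=200, seed=0):
    """Monte Carlo average of kappa over independent runs."""
    rng = random.Random(seed)
    return sum(sample_kappa(n, a, eps, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    a, eps, n = 2.0, 0.5, 2000                   # hypothetical parameter values
    print("Monte Carlo estimate of E[kappa]:", estimate_expected_kappa(n, a, eps))
    print("u = 2a / eps^2                  :", 2.0 * a / eps**2)
    print("bound from the lemma            :", kappa_bound(a, eps))
```

The estimate should land between roughly $u$ and the bound, in line with the remark that $2a/\epsilon^2$ is the right order of magnitude.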

Before the proof of the UCB regret theorem we need a brief diversion back to the bandit model. We have defined $\hat \mu_i(t)$ as the empirical mean of the $i$th arm after the $t$th round, which served us well enough for the analysis of the explore-then-commit strategy where the actions were chosen following a deterministic rule. For UCB it is very useful also to have $\hat \mu_{i,s}$, the empirical average of the $i$th arm *after $s$ observations from that arm*, which occurs at a random time (or maybe not at all). To define $\hat \mu_{i,s}$ rigorously, we argue that without loss of generality one may assume that the reward $X_t$ received in round $t$ comes from choosing the $T_i(t)$th element from the reward sequence $(Z_{i,s})_{1\le s \le n}$ associated with arm $i$, where $(Z_{i,s})_s$ is an i.i.d. sequence with $Z_{i,s}\sim P_i$. Formally,

\begin{align}\label{eq:rewardindepmodel}

X_t = Z_{A_t,T_{A_t}(t)}\,.

\end{align}

The advantage of introducing $(Z_{i,s})_s$ is that it allows a clean definition (without $Z_{i,s}$, how does one even define $\hat \mu_{i,s}$ if $T_i(n) < s$?). In particular, we let

\begin{align*}

\hat \mu_{i,s} &= \frac{1}{s} \sum_{u=1}^s Z_{i,u}\,.

\end{align*}

Note that $\hat \mu_{i,s} = \hat \mu_i(t)$ when $T_i(t)=s$ (formally: $\hat \mu_{i,T_i(t)} = \hat \mu_i(t)$).
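The reward model $X_t = Z_{A_t,T_{A_t}(t)}$ can be made concrete with a small sketch. The class and method names below are hypothetical, and unit-variance Gaussian rewards are an assumption made only for the illustration. Each arm owns a pre-drawn sequence, a pull consumes the next unused entry, and $\hat\mu_{i,s}$ is available for every $s\le n$ regardless of how often the arm was actually played.

```python
import random


class StackedRewardBandit:
    """Bandit in which arm i owns a pre-drawn i.i.d. sequence Z[i][0..n-1].

    Pulling arm i for the s-th time returns Z[i][s-1], i.e. X_t = Z_{A_t, T_{A_t}(t)}.
    Gaussian unit-variance rewards are an assumption made for this illustration.
    """

    def __init__(self, means, n, seed=0):
        rng = random.Random(seed)
        self.Z = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu in means]
        self.T = [0] * len(means)        # T_i(t): number of pulls of arm i so far
        self.sums = [0.0] * len(means)   # running sum of observed rewards per arm

    def pull(self, arm):
        self.T[arm] += 1
        x = self.Z[arm][self.T[arm] - 1]  # consume the next unused reward of this arm
        self.sums[arm] += x
        return x

    def mu_hat_round(self, arm):
        """hat mu_i(t): mean of the rewards observed so far (needs at least one pull)."""
        return self.sums[arm] / self.T[arm]

    def mu_hat_count(self, arm, s):
        """hat mu_{i,s}: mean of the first s stacked rewards, defined for all 1 <= s <= n."""
        return sum(self.Z[arm][:s]) / s


if __name__ == "__main__":
    env = StackedRewardBandit(means=[0.5, 0.0], n=10, seed=1)
    for arm in [0, 1, 0, 0, 1]:
        env.pull(arm)
    # The two estimates coincide at s = T_i(t) ...
    print(env.mu_hat_round(0), env.mu_hat_count(0, env.T[0]))
    # ... while hat mu_{i,s} is also defined for s larger than the number of pulls.
    print(env.mu_hat_count(1, 7))
```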

**Proof of Theorem**

As in the analysis of the explore-then-commit strategy we start by writing the regret decomposition.

\begin{align*}

R_n = \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]\,.

\end{align*}

The rest of the proof revolves around bounding $\E[T_i(n)]$. Let $i$ be some sub-optimal arm (so that $\Delta_i > 0$). Following the suggested intuition we decompose $T_i(n)$ into two terms. The first measures the number of times the index of the optimal arm is less than $\mu_1 - \epsilon$. The second term measures the number of times that $A_t = i$ and its index is larger than $\mu_1 - \epsilon$.

\begin{align}

T_i(n)

&= \sum_{t=1}^n \one{A_t = i} \nonumber \\

&\leq \sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \leq \mu_1 - \epsilon} + \nonumber \\

&\qquad \sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 - \epsilon \text{ and } A_t = i}\,. \label{eq:ucb1}

\end{align}

The proof of the first part of the theorem is completed by bounding the expectation of each of these two sums. Starting with the first, we again use the concentration guarantee.

\begin{align*}

\EE{\sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 - \epsilon}}

&= \sum_{t=1}^n \Prob{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 - \epsilon} \\

&\leq \sum_{t=1}^n \sum_{s=1}^n \Prob{\hat \mu_{1,s} + \sqrt{\frac{2 \log f(t)}{s}} \leq \mu_1 - \epsilon} \\

&\leq \sum_{t=1}^n \sum_{s=1}^n \exp\left(-\frac{s\left(\sqrt{\frac{2 \log f(t)}{s}} + \epsilon\right)^2}{2}\right) \\

&\leq \sum_{t=1}^n \frac{1}{f(t)} \sum_{s=1}^n \exp\left(-\frac{s\epsilon^2}{2}\right) \\

&\leq \frac{5}{\epsilon^2}\,.

\end{align*}

The first inequality follows from the union bound over all possible values of $T_1(t-1)$. This is an important point. The concentration guarantee cannot be applied directly because $T_1(t-1)$ is a random variable and not a constant. The last inequality is an algebraic exercise. The function $f(t)$ was chosen precisely so this bound would hold. If $f(t) = t$ instead, then the sum would diverge. Since $f(n)$ appears in the numerator below we would like $f$ to be large enough that its reciprocal is summable and otherwise as small as possible. For the second term in \eqref{eq:ucb1} we use the previous lemma.

\begin{align*}

&\EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 - \epsilon \text{ and } A_t = i}} \\

&\qquad\leq \EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(n)}{T_i(t-1)}} \geq \mu_1 - \epsilon \text{ and } A_t = i}} \\

&\qquad\leq \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} + \sqrt{\frac{2 \log f(n)}{s}} \geq \mu_1 - \epsilon}} \\

&\qquad= \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} - \mu_i + \sqrt{\frac{2 \log f(n)}{s}} \geq \Delta_i - \epsilon}} \\

&\qquad\leq 1 + \frac{2}{(\Delta_i - \epsilon)^2} \left(\log f(n) + \sqrt{\pi \log f(n)} + 1\right)\,.

\end{align*}

The first part of the theorem follows by substituting the results of the previous two displays into \eqref{eq:ucb1}. The second part follows by choosing $\epsilon = \log^{-1/4}(n)$ and taking the limit as $n$ tends to infinity.

QED
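The "algebraic exercise" behind the $5/\epsilon^2$ bound in the proof can be checked numerically. The sketch below assumes $f(t) = 1 + t\log^2(t)$; this choice is not restated in this part of the post, so treat it as an assumption, and the function names are hypothetical. It evaluates the double sum for a few values of $\epsilon$ and also shows how the outer sum keeps growing when $f(t)=t$ is used instead.

```python
import math


def f(t: int) -> float:
    """Assumed confidence function f(t) = 1 + t * log(t)^2 (an assumption, see lead-in)."""
    return 1.0 + t * math.log(t) ** 2


def first_term_bound(n: int, eps: float) -> float:
    """sum_{t<=n} 1/f(t) times sum_{s<=n} exp(-s * eps^2 / 2)."""
    outer = sum(1.0 / f(t) for t in range(1, n + 1))
    inner = sum(math.exp(-s * eps**2 / 2.0) for s in range(1, n + 1))
    return outer * inner


def harmonic(n: int) -> float:
    """Outer sum when f(t) = t is used instead: the harmonic series."""
    return sum(1.0 / t for t in range(1, n + 1))


if __name__ == "__main__":
    n = 100_000
    for eps in (0.05, 0.2, 1.0):
        print(eps, first_term_bound(n, eps), 5.0 / eps**2)
    # With f(t) = t the outer sum grows like log(n) instead of staying bounded.
    print(harmonic(n), math.log(n))
```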

Next week we will see that UCB is close to optimal in several ways. As with the explore-then-commit strategy, the bound given in the previous theorem is not meaningful when the gaps $\Delta_i$ are small. As with that algorithm, it is possible to prove a *distribution-free* bound for UCB by treating the arms $i$ with small $\Delta_i$ differently. Fix $\Delta>0$ to be chosen later. Then, from the proof of the bound on the regret of UCB we can derive that $\EE{T_i(n)} \le \frac{C \log(n)}{\Delta_i^2}$ holds for all $n\ge 2$ with some universal constant $C>0$. Hence, the regret can be bounded without dependence on the sub-optimality gaps by

\begin{align*}

R_n

&= \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]

= \sum_{i:\Delta_i < \Delta} \Delta_i \E[T_i(n)] + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)] \\

&< n \Delta + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)]

\leq n \Delta + \sum_{i:\Delta_i \geq \Delta} \frac{C \log n}{\Delta_i} \\

&\leq n \Delta + K\frac{C \log n}{\Delta}

= 2\sqrt{C K n \log(n)}\,,

\end{align*}

where in the last step we chose $\Delta = \sqrt{K C \log(n) / n}$, which optimizes the upper bound.
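The choice of $\Delta$ can be verified with a few lines of code. This is a minimal sketch with hypothetical function names; $C=1$ is a placeholder because the text only says that $C$ is a universal constant. At the minimiser the two terms $n\Delta$ and $KC\log(n)/\Delta$ are equal, so the bound is twice either of them.

```python
import math


def tradeoff(delta: float, n: int, K: int, C: float) -> float:
    """Upper bound n * Delta + K * C * log(n) / Delta as a function of Delta."""
    return n * delta + K * C * math.log(n) / delta


def best_delta(n: int, K: int, C: float) -> float:
    """Minimiser obtained by setting the derivative n - K*C*log(n)/Delta^2 to zero."""
    return math.sqrt(K * C * math.log(n) / n)


if __name__ == "__main__":
    n, K, C = 10_000, 10, 1.0      # C = 1 is a placeholder value
    d_star = best_delta(n, K, C)
    print("Delta*              :", d_star)
    print("bound at Delta*     :", tradeoff(d_star, n, K, C))
    print("2 * sqrt(C K n log n):", 2.0 * math.sqrt(C * K * n * math.log(n)))
    # Moving away from Delta* in either direction makes the bound larger.
    for scale in (0.5, 2.0):
        print("scale", scale, "->", tradeoff(scale * d_star, n, K, C))
```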

There are many directions to improve or generalize this result. For example, if more is known about the noise model besides that it is subgaussian, then this can often be exploited to improve the regret. The main example is the Bernoulli case, where one should make use of the fact that the variance is small when the mean is close to zero or one. Another direction is improving the worst-case regret to match the lower bound of $\Omega(\sqrt{Kn})$ that we will see next week. This requires a modification of the confidence level and a more complicated analysis.

# Notes

Note 1: Here we argue that there is no loss in generality in assuming that the rewards experienced satisfy $\eqref{eq:rewardindepmodel}$. Indeed, let $T' = (A'_1,X'_1,\dots,A'_n,X'_n)$ be any sequence of random variables satisfying that $A_t' = f_t(A'_1,X'_1,\dots,A'_{t-1},X'_{t-1})$ and that for any open interval $U\subset \R$

\begin{align*}

\Prob{X_t'\in U\,|\,A'_1,X'_1,\dots,A'_{t-1},X'_{t-1},A'_t} = P_{A'_t}(U)\,,

\end{align*}

where $1\le t\le n$. Then, choosing $(Z_{i,s})_s$ as described in the paragraph before $\eqref{eq:rewardindepmodel}$, we let $T=(A_1,X_1,\dots,A_n,X_n)$ be such that $A_t = f_t(A_1,X_1,\dots,A_{t-1},X_{t-1})$ and $X_t$ be so that it satisfies $\eqref{eq:rewardindepmodel}$. It is not hard to see then that the distributions of $T$ and $T'$ agree. Hence, there is no loss of generality in assuming that the rewards are generated by $\eqref{eq:rewardindepmodel}$.

Note 2: The view that $n$ rewards are generated ahead of time for each arm and the algorithm consumes these rewards as it chooses an action was helpful in the proof as it reduced the argument to the study of averages of independent random variables. The analysis could also have been done directly without relying on the “virtual” rewards $(Z_{i,s})_s$ with the help of martingales, which we will meet later.

A third model of how $X_t$ is generated could have been that $X_t = Z_{A_t,t}$. We will meet this “skipping model” later when studying adversarial bandits. For the stochastic bandit models we study here, all these models coincide (they are indistinguishable in the sense described in the first note above).

Note 3: So is the optimism principle universal? Does it always give good algorithms, even in more complicated settings? Unfortunately, the answer is no. The optimism principle leads to reasonable algorithms when using an action gives feedback that informs the learner about how much the action is worth. If this is not true (i.e., in models where you have to choose action $B$ to learn about the rewards of action $A$, and choosing action $A$ would not give you information about the reward of action $A$), the principle fails! (Why?) Furthermore, even if all actions give information about their own value, the optimistic principle may give rise to algorithms whose regret is overly large compared to what could be achieved with more clever algorithms. Thus, in a way, finite-armed stochastic bandits is a perfect fit for optimistic algorithms. While the more complex feedback models may not make much sense at the moment, we will talk about them later.

# References

The idea of using upper confidence bounds appeared in ’85 in the landmark paper of Lai and Robbins. In this paper they introduced a strategy which plays the leader of the “often sampled” actions except that for any action $j$ in every $K$th round the strategy checks whether the UCB index of arm $j$ is higher than the estimated reward of the leader. They proved that this strategy, when appropriately tuned, is asymptotically unimprovable in the same way that UCB, as we defined it, is asymptotically unimprovable (we still owe the definition of this and a proof, which will come soon). The cleaner UCB idea must have been ready to be found in ’95 because Agrawal and Katehakis & Robbins discovered this idea independently in that year. Auer et al. later modified the strategy slightly and proved a finite-time analysis.

- Tzu L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules, 1985
- Rajeev Agrawal. Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem, 1995
- Michael N Katehakis and Herbert Robbins. Sequential choice from several populations, 1995
- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem, 2002

# Optimality concepts and information theory

In this post we introduce the concept of minimax regret and reformulate our previous result on the upper bound on the worst-case regret of UCB in terms of the minimax regret. We briefly discuss the strengths and weaknesses of using minimax regret as a performance yardstick. Then we discuss with more precision what we meant when we mentioned that UCB’s worst-case regret cannot be greatly improved, and sketch the proof (that will be included in all its glory in the next post). We finish by introducing a few of the core concepts of information theory, and especially the relative entropy that will be crucial for the formal lower bound proofs.

# How to measure regret?

Previously we have seen that in the worst case the UCB algorithm suffers a regret of at most $O(\sqrt{Kn \log(n)})$. It will prove useful to restate this result here. The following notation will be helpful: Let $\cE_K$ stand for the class of $K$-action stochastic environments where the rewards for action $i\in [K]$ are generated from a distribution $P_i$ with mean $\mu_i\in [0,1]$ where $P_i$ is such that if $X\sim P_i$ then $X-\mu_i$ is $1$-subgaussian (environments with this property will be called $1$-subgaussian). Our previous result then reads as follows:

Theorem (Worst-case regret-bound for UCB): There exists a constant $C>0$ such that the following holds: For all $K>0$, $E\in \cE_K$, $n>1$, if $R_n$ is the (expected) regret of UCB when interacting with $E$, then

\begin{align}

\label{eq:UCBbound}

R_n \le C \sqrt{K n \log(n)}\,.

\end{align}

(In the future when we formulate statements like this, we will say that $C$ is a *universal constant*. The constant is universal in the sense that its value does not depend on the particulars of $K$, $n$, or $E$ as long as these satisfy the constraints mentioned.)

We call the bound on the right-hand side of the above display a **worst-case bound**, because it holds even for the environment that is the worst for UCB. Worst-case bounds can point to certain weaknesses of an algorithm. But what did we mean when we said that this bound cannot be improved in any significant manner? In simple terms, this means that no matter the policy, there will be an environment on which the policy suffers almost the same regret as the upper bound above.

This can be concisely written in terms of the **minimax regret**. The minimax regret is defined for a given horizon $n>0$ and a fixed number of actions $K>0$ as follows. Let $\cA_{n,K}$ stand for the set of all possible policies on horizon $n$ and action set $[K]$. Then, the minimax regret for horizon $n$ and environment class $\cE$ (such as $\cE_K$ above) is defined as

\begin{align*}

R_n^*(\cE) = \inf_{A\in \cA_{n,K}} \sup_{E\in \cE} R_n(A,E)\,.

\end{align*}

Note how this value is independent of any specific choice of a policy, but only depends on $n$, $\cE$ and $K$ (the dependence on $K$ is hidden in $\cE$). Further, no matter how a policy $A$ is chosen, $R_n^*(\cE)\le R_n^*(A,\cE) \doteq \sup_{E\in \cE} R_n(A,E)$, i.e., the $n$-round minimax regret of $\cE$ is a lower bound on the worst-case regret of any policy $A$. If we lower bound the minimax regret, the lower bound tells us that *no policy at all* exists that can do better than the lower bound. In other words, such a bound is independent of the policy chosen; it shows the fundamental hardness of the problem.
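As a toy illustration of the definition, suppose that both the set of policies and the set of environments were finite, so that the infimum and supremum become a minimum and a maximum over a table of regrets. The regret values and policy names below are made up for the example and the function names are hypothetical.

```python
def worst_case_regret(regrets):
    """R_n^*(A, class): the largest regret of one policy over all environments."""
    return max(regrets)


def minimax_regret(regret_table):
    """min over policies of the max over environments, for a finite table."""
    return min(worst_case_regret(row) for row in regret_table.values())


if __name__ == "__main__":
    # Hypothetical regrets R_n(A, E) of three policies on three environments.
    regret_table = {
        "policy_a": [12.0, 30.0, 18.0],
        "policy_b": [20.0, 22.0, 21.0],
        "policy_c": [5.0, 60.0, 9.0],
    }
    for name, row in regret_table.items():
        print(name, "worst case:", worst_case_regret(row))
    print("minimax regret:", minimax_regret(regret_table))
```

In this table the policy with the best worst case is not the one with the smallest regret on any single environment, which is one reason why the minimax regret is only one of several possible yardsticks.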

A minimax optimal policy is a policy $A$ whose worst-case regret is equal to the minimax regret: $R_n^*(A,\cE) = R_n^*(\cE)$. Finding a minimax optimal policy is often exceptionally hard; hence we often resort to the weaker goal of finding a **near-minimax** policy, i.e., a policy whose worst-case regret is at most a constant multiple of $R_n^*(\cE)$. Technically, this makes sense only if this holds universally over $n$, $K$ and $\cE$ (within some limits): To define this formally, we would need to demand that the policy takes as input $n,K$, but we skip the formal definition, hoping that the idea is clear. UCB, as we specified earlier, can be thought of as such a “universal” policy for the class of all $1$-subgaussian environments (and it does not even use the knowledge of $n$). When near-minimaxity is too strong as defined above, we may relax it by demanding that the ratio $R_n^*(A,\cE)/R_n^*(\cE)$ is bounded by a logarithmic function of $\max(1,R_n^*(\cE))$; the idea being that the logarithm of any number is so substantially smaller than itself that it can be neglected. UCB is near-minimax optimal in this sense.

The value $R_n^*(\cE)$ is of interest on its own. In particular, a small value of $R_n^*(\cE)$ indicates that the underlying bandit problem is less challenging in the worst-case sense. A core activity in bandit theory (and learning theory, more generally) is to understand what makes $R_n^*(\cE)$ large or small, often focusing on its behavior as a function of the number of rounds $n$. This focus is justified when $n$ gets large, but often other quantities, such as the number of actions $K$, are also important.

Since $R_n^*(\cE_K) \le R_n^*(\mathrm{UCB},\cE_K)$, the above theorem immediately gives that

\begin{align}

\label{eq:minimaxUCBupper}

R_n^*(\cE_K) \le C \sqrt{ K n \log(n)}\,,

\end{align}

while our earlier discussion of the explore-then-commit strategies $\mathrm{ETC}_{n,m}$ with a fixed commitment time $m$ can be concisely stated as

\begin{align}

\inf_{m} R_n^*(\mathrm{ETC}_{n,m}, \cE_K) \asymp n^{2/3}\,,

\end{align}

where $\asymp$ hides a constant proportionality factor which is independent of $n$ (the factor scales with $K^{1/3}$).

Thus, we see that UCB does better than ETC with the best a priori $n$-dependent choice of $m$ in a worst-case sense, as $n$ gets large. But perhaps there are even better policies than UCB? The answer was already given above, but let us state it this time in the form of a theorem:

Theorem (Worst-case regret lower-bound): For any $K\ge 2$, $n\ge 2$,

\begin{align}\label{eq:Karmedminimaxlowerbound}

R_n^*(\cE_K) \ge c \sqrt{Kn}

\end{align}

for some universal constant $c>0$.

In particular, we see that UCB is near-minimax optimal. But how does one prove the above result? The intuition is relatively simple and can be understood by just studying Gaussian tails.

# Lower bounding ideas

Given $n$ i.i.d. observations from a Gaussian distribution with a known variance of say one and an unknown mean $\mu$, the sample mean of the observations is a sufficient statistic for the mean. This means that the observations can be replaced by the sample mean $\hat\mu$ without losing any information. The distribution of the sample mean is also Gaussian, with the same mean as the unknown mean and a variance of $1/n$. Now, assume that we are told that the unknown mean can take on only two values: Either it is (say) zero, or $\Delta$, which, without loss of generality, we assume to be positive. The task is to decide whether we are in the first case (when $\mu=0$), or the second case (when $\mu=\Delta$). As we said before, there is no loss of generality in assuming that the decision is based on $\hat\mu$ only. If $n$ is large compared to $1/\Delta^2$, intuitively we feel that the decision is easy: If $\hat\mu$ is closer to zero than to $\Delta$, choose zero, otherwise choose $\Delta$. No matter then whether $\mu$ was indeed zero or $\Delta$, the probability of an incorrect choice will be low. How low? An easy upper bound based on our earlier upper bound on the tails of a Gaussian (see here) is

\begin{align*}

(2\pi n (\Delta/2)^2)^{-1/2} \exp(-n (\Delta/2)^2/2) \le \exp(-\frac{n(\Delta/2)^2}{2})\,,

\end{align*}

where the inequality assumed that $2\pi n (\Delta/2)^2\ge 1$. We see that the error probability decays exponentially with $n$, but the upper bound would still be useless if $\Delta$ was small compared to $1/\sqrt{n}$! Can we do better than this? One might believe the decision procedure could be improved, but the symmetry of the problem makes this seem improbable.

The other possibility is that the upper bound on the Gaussian tails that we use is loose. This turns out not to be the case either. With a calculation slightly more complex than the one given before (using integration by parts, e.g., see here), we can show that if $X$ has the standard normal distribution,

\begin{align*}

\Prob{X>\eps} \ge \left(\frac{1}{\eps}-\frac{1}{\eps^3}\right)\, \frac{\exp( - \frac{\eps^2}{2} )}{\sqrt{2\pi}}

\end{align*}

also holds, showing that there is basically no room for improvement. In particular, this means that if $n(\Delta/2)^2/2=c$ ($\Delta \le \sqrt{8c/n}$, or, equivalently, $n\le 8c/\Delta^2$) then the probability that *any* procedure will make a mistake in either of the two cases when $\mu \in \{0,\Delta\}$ is at least $C c^{-1/2} (1-1/c)\exp(-c)>0$. This puts a lower limit on how reliably we can decide between the two given alternatives. The take home message is that if $n$ is small compared to the differences in the means that we need to *test* for, then we are facing an impossible mission.
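The two-sided control of the Gaussian tail is easy to inspect numerically. The sketch below (hypothetical function names, standard library only) computes the exact error probability of the midpoint test, which is $\Prob{X > \sqrt{n}\Delta/2}$ for a standard normal $X$ under either hypothesis, together with the upper and lower bounds quoted above.

```python
import math


def gaussian_tail(x: float) -> float:
    """P(X > x) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_upper(x: float) -> float:
    """Upper bound exp(-x^2/2) / (x * sqrt(2*pi)), valid for x > 0."""
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))


def tail_lower(x: float) -> float:
    """Lower bound (1/x - 1/x^3) * exp(-x^2/2) / sqrt(2*pi), informative for x > 1."""
    return (1.0 / x - 1.0 / x**3) * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def midpoint_test_error(n: int, delta: float) -> float:
    """Error probability of 'pick the closer of 0 and Delta' under either hypothesis."""
    return gaussian_tail(math.sqrt(n) * delta / 2.0)


if __name__ == "__main__":
    delta = 0.2  # hypothetical gap between the two candidate means
    for n in (100, 400, 900, 2500):
        x = math.sqrt(n) * delta / 2.0
        lower = tail_lower(x) if x > 1.0 else 0.0
        print(n, lower, midpoint_test_error(n, delta), tail_upper(x))
```

For small $n\Delta^2$ the error probability stays bounded away from zero, which is the quantitative content of the "impossible mission" remark.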

The problem as described is the most classic hypothesis testing problem there is. The ideas underlying this argument are core to any arguments that show an *impossibility* result in a statistical problem. The next task is to reduce our bandit problem to a hypothesis testing problem as described here.

The high level idea is to select two bandit problem instances. For simplicity, we select two bandit instances $E,E'$ where the reward distributions are Gaussians with a unit variance. The vectors of means are in $[0,1]^K$, thus the instances are in $\cE_K$. Let the means of the two instances be $\mu$ and $\mu'$, respectively. Our goal is to choose $E,E'$ (alternatively, $\mu$ and $\mu'$) in such a way that the following two conditions hold simultaneously:

- Competition: Algorithms doing well on one instance cannot do well on the other instance.
- Similarity: The instances are “close” so that no matter what policy interacts with them, given the observations, the two instances are indistinguishable as in the hypothesis testing example above.

In particular, for the second requirement we want that any way of interacting with the instances results in data such that no decision procedure that needs to decide which instance it was running on can achieve a low error probability on both instances. The two requirements are clearly conflicting. The first requirement makes us want to choose $\mu$ and $\mu'$ far from each other, while the second requirement makes us want to choose them to be close to each other. A lower bound on the regret then follows from a simple reasoning: We lower bound the regret of a policy in terms of how well the policy does on a hypothesis testing problem. The conflict between the two objectives is resolved by choosing the means so that the lower bound is maximized.

This strategy as described works in simple cases, but for our case a slight amendment is necessary. The twist we add is this: We choose $\mu$ in a specific way so that one arm has a slightly higher mean (by a constant $\Delta>0$ to be chosen later) than all the other arms; call this arm $i_E$. Then we find the arm $i_{E'}\ne i_E$ that was least favored in expectation by the algorithm $A$ and increase the mean payoff of this arm by $2\Delta$ to create environment $E'$. In particular, all other means are the same in $E$ and $E'$, but $i_{E'}$ is the optimal action in $E'$ and $i_E$ is the optimal action in $E$. Further, the absolute gap between the immediate rewards of these actions is $\Delta$ in both environments.

Note that $i_{E'}$ cannot be used more than $\approx n/K$ times in expectation when $A$ is run on $E$. And the two environments differ only in the mean payoff of action $i_{E'}$. To make the two environments essentially indistinguishable, set $\Delta = 1/\sqrt{n/K}$. This will also mean that when $A$ is run on $E'$, since $\Delta$ is so small, the interaction of $A$ and $E'$ will produce a sequence of action-choices whose distribution is nearly identical to the one produced when $A$ was used on $E$.

Now, when $A$ is run on $E$ and $i_E$ is chosen fewer than $n/2$ times in expectation then it will have more than $\Delta (n-n/2) = \sqrt{Kn}/2$, i.e., $c\sqrt{Kn}$ regret (here, $c$ is a universal constant “whose value can change line by line”). How about algorithms that choose $i_E$ more than $n/2$ times when interacting with $E$? By the indistinguishability of $E$ and $E'$ by $A$, this algorithm will choose $i_E$ almost the same number of times on $E'$, too, which means that it cannot choose $i_{E'}$ too often, and in particular, it will have a regret of at least $c\sqrt{Kn}$ on $E'$. Thus, we see that no algorithm will be able to do well uniformly on all instances.

The worst-case instance in this construction clearly depends on $A$, $n$ and $K$. Thus, the analysis has little relevance, for example, if an algorithm $A$ is run on a fixed instance $E$ and we consider the regret on this instance in the long run. We will return to this question later.
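The construction of the two instances can be sketched in code. Everything here is an illustration under stated assumptions: the function name is hypothetical, the expected pull counts are made-up inputs standing in for $\E[T_i(n)]$ of some policy on $E$, and the suboptimal arms of $E$ are given mean zero (so $n\ge 4K$ is needed for the means to stay in $[0,1]$).

```python
import math


def build_instances(expected_pulls, n, best_arm=0):
    """Return (mu, mu_prime, i_alt) for the two-instance construction.

    expected_pulls[i] is a hypothetical value of E[T_i(n)] when the policy runs on E.
    """
    K = len(expected_pulls)
    delta = math.sqrt(K / n)               # Delta = 1 / sqrt(n / K)
    mu = [0.0] * K                         # assumed baseline mean for suboptimal arms
    mu[best_arm] = delta                   # arm i_E is better by Delta in E
    others = [i for i in range(K) if i != best_arm]
    i_alt = min(others, key=lambda i: expected_pulls[i])   # least-played arm
    mu_prime = list(mu)
    mu_prime[i_alt] = mu[i_alt] + 2.0 * delta   # raise its mean by 2*Delta to get E'
    return mu, mu_prime, i_alt


if __name__ == "__main__":
    n = 1000
    pulls = [700.0, 150.0, 50.0, 100.0]    # made-up expected pull counts summing to n
    mu, mu_prime, i_alt = build_instances(pulls, n)
    print("mu      :", mu)
    print("mu_prime:", mu_prime)
    print("least-played arm:", i_alt, "with", pulls[i_alt], "expected pulls")
```

Because the expected pull counts sum to $n$, the least-played arm among the $K-1$ candidates is used at most $n/(K-1)$ times in expectation, which is where the $\approx n/K$ in the argument comes from.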

# Information theory

In order to make the above argument rigorous (and easily generalizable to multiple other settings) we will rely on some classic tools from information theory and statistics. In particular, we will need the concept of **relative entropy**, also known as the Kullback-Leibler divergence, named after Solomon Kullback and Richard Leibler (**KL divergence**, for short).

Relative entropy puts a numerical value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ when one expected that $X$’s distribution will be $Q$. This can also be said to be the extra information we gain when $X$ is observed, again relative to the expectation that $X$’s distribution would have been $Q$. As such, relative entropy can be used to answer questions like how many observations we need to notice that the distribution of our observations follows $P\ne Q$, hence its relevance to our problems! To explain these ideas, the best is to start with the case when $X$ is finite-valued.

Assume thus that $X$ takes on finitely many values. Let us first discuss how to define the **amount of information that observing $X$ conveys**. One way to start is to define information as the amount of communication needed if we want to “tell” a friend about the value we observed. If $X$ takes on $N$ distinct values, say, $X\in [N]$ (we can assume this for simplicity and without loss of generality), we may be thinking of using $\lceil\log_2(N)\rceil$ bits no matter the value of $X$. However, if we have the luxury of knowing the probabilities $p_i = \Prob{X=i}$, $i=1,\dots,N$ *and* we also have the luxury to agree with our friend on a specific coding to be used (before even observing any values) then we can do much better.

The idea is to **use a code that respects the probability distribution of $X$**. But how many bits should the code of $i$ use? Intuitively, the smaller $p_i$ is, the larger the length of the code of $i$ should be. The coding also has to be such that our friend can tell when the code of a symbol ended or it will be useless (this is impossible if and only if there exists $i\ne j$ such that the respective codes $c_i,c_j$ are such that $c_i$ is a prefix of $c_j$). It turns out that the optimal bit-length for observation $i$ under the previous conditions is $\log_2 1/p_i$ and is achieved by Huffman-coding. This is meant in a limiting sense, when we observe $X_1,X_2,\dots$, which are independent copies of $X$. Thus, the **optimal average code length** for this setting is

\begin{align}\label{eq:entropy}

H(p) = \sum_i p_i \log \frac{1}{p_i}\,.

\end{align}

The astute reader may have noticed that we switched from $\log_2$ to $\log$ here. Effectively, this means that we changed the units in which we measure the length of codes from bits to something else, which happens to be called *nats*. Of course, changing units does not change the meaning of the definition, just the scale, hence the change should be of no concern. The switch is justified because it simplifies some formulae later.

But is $H$ well-defined at all? (This is a question that should be asked after every definition!) What happens if $p_i=0$ for some $i$? This comes up, for example, when we take $N$ to be larger than the number of possible values that $X$ really takes on, i.e., if we introduce “impossible” values. We expect that introducing such superfluous impossible values should not change the value of $H$. Indeed, it is not hard to see that $\lim_{x\to 0+} x\log(1/x)=0$, suggesting that we should define $p \log(1/p)=0$ when $p=0$ and this is indeed what we will do, which also means that we can introduce as many superfluous symbols as we want without ever changing the value of $H$.
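A minimal sketch of the entropy in nats, with the convention $p\log(1/p)=0$ for $p=0$ implemented by skipping zero-probability symbols (the function name is hypothetical):

```python
import math


def entropy(p):
    """H(p) = sum_i p_i * log(1 / p_i) in nats, using the convention 0 * log(1/0) = 0."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0.0)


if __name__ == "__main__":
    print(entropy([0.25, 0.25, 0.25, 0.25]), math.log(4))  # uniform: maximal, equals log N
    print(entropy([1.0, 0.0, 0.0]))                        # deterministic: no uncertainty
    # Padding with impossible symbols leaves the value unchanged.
    print(entropy([0.5, 0.5]), entropy([0.5, 0.5, 0.0, 0.0]))
```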

Thus, we can agree that for $X\in [N]$, $H(p)$ with $p_i = \Prob{X=i}$ measures the (expected) information content of observing $X$ (actually, in a repeated setting, as described above). Another interpretation of $H$ takes a more global view. According to this view, $H$ measures the **amount of uncertainty** in $p$ (or, equivalently but redundantly, in $X$). But what do we mean by uncertainty? One approach to define uncertainty is to think of how much one should be surprised to see a particular value of $X$. If $X$ is completely deterministic ($p_i=1$ for some $i$) then there is absolutely no surprise in observing $X$, thus the uncertainty measure should be zero and this is the smallest uncertainty there is. If $X$ is uniformly distributed, then we should be equally surprised by seeing any values of $i$ and the uncertainty should be maximal. If we observe a pair, $(X,Y)$, which are independent of each other, the uncertainty of the pair should be the sum of the uncertainties of $X$ and $Y$. Long story short, it turns out that reasonable definitions of uncertainty actually give rise to $H$ as defined by $\eqref{eq:entropy}$. Since in physics, **entropy** is the expression used to express the uncertainty of a system, the quantity $H$ is also called the entropy of $p$ (or the entropy of $X$ giving rise to $p$).

Of course, optimal code lengths can also be connected to uncertainty directly: Think of recording the independent observations $X_1,X_2,\dots,X_n$ for $n$ large where $X_i \sim X$ in some notebook. The more uncertainty the distribution of $X$ has, the longer the transcript of $X_1,\dots,X_n$ will be.

We can summarize our discussion so far as follows:

Information = Optimal code length per symbol = Uncertainty

Now, let us return to the definition of relative entropy: As described earlier, relative entropy puts a value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ (equivalently, $p=(p_i)_i$, where $p_i$ is the probability of seeing $X=i$) when one expected that $X$’s distribution will be $Q$ (equivalently, $q=(q_i)$, where $q_i$ is the probability of seeing $X=i$). We also said that relative entropy is the information contained in observing $X$ when $X$ actually comes from $P$ (equivalently, $p$) relative to one’s a priori expecting that $X$ will follow distribution $Q$ (equivalently, $q$).

Let us discuss the second approach in more detail. Information is how much we need to communicate with our friend with whom we agreed on a coding protocol. That we expect $X\sim q$ means that before communication even starts, we agree on using a protocol respecting this knowledge. But then if $X\sim p$ then when we send our friend the messages about our observations, we will use more bits than we could have used had we chosen the protocol that is the best for $p$. More bits means more information? Well, yes and no. More bits means more information *relative* to one’s expectation that $X\sim q$! However, more bits is of course not more information relative to knowing the true distribution. The code we use has some extra redundancy, in other words it has some excess, which we can say measures the extra information that $X$ contains, *relative* to our expectation on the distribution of $X$. Working out the math gives that the excess information is

\begin{align}\label{eq:discreterelentropy}

\KL(p,q) &\doteq \sum_i p_i \log \frac{1}{q_i} \, - \,\sum_i p_i \log \frac{1}{p_i} \\

&=\sum_i p_i \log \frac{p_i}{q_i}\,.

\end{align}

If this quantity is larger, we also expect to be able to tell that $p\ne q$ with fewer independent observations sharing $p$ than if this quantity was smaller. For example, when $p_i>0$ and $q_i=0$ for some $i$, the first time we see $i$, we can tell that $p\ne q$. In fact, in this case, $\KL(p,q)=+\infty$: With positive probability a single observation can tell the difference between $p$ and $q$.
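For finite distributions the definition translates directly into code. The sketch below (hypothetical function name) returns $+\infty$ when $p$ puts mass on a symbol that $q$ rules out, and skips symbols with $p_i=0$, which matches the convention $x\log(1/x)=0$ at $x=0$ adopted for the entropy.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i) in nats for finite distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # symbols that p never produces contribute nothing
        if qi == 0.0:
            return math.inf       # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    p, q = [0.5, 0.5], [0.9, 0.1]
    print(kl_divergence(p, p))                        # identical distributions
    print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric in general
    print(kl_divergence([0.5, 0.5], [1.0, 0.0]))      # infinite: q excludes a possible symbol
    print(kl_divergence([1.0, 0.0], [0.5, 0.5]), math.log(2))
```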

Still poking around the definition, what happens when $q_i=0$ and $p_i=0$? This means that the symbol $i$ is superfluous and the value of $\KL(p,q)$ should not be impacted by introducing superfluous symbols. By our earlier convention that $x \log(1/x)=0$ when $x=0$, this works out fine based on the definition after $\doteq$. We also see that the sufficient and necessary condition for $\KL(p,q)<+\infty$ is that for each $i$ such that $q_i=0$, we also have that $p_i=0$. The condition we discovered is also expressed as saying that $p$ is absolutely continuous w.r.t. $q$, which is also written as $p\ll q$.
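
The conventions just discussed are easy to get wrong in code, so here is a minimal sketch of the discrete formula $\eqref{eq:discreterelentropy}$. The function name `kl_discrete` and the example vectors are made up for illustration; logarithms are natural, so the result is in nats.

```python
import math

def kl_discrete(p, q):
    """Relative entropy sum_i p_i log(p_i / q_i) for finite distributions (in nats)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention x log(1/x) = 0 at x = 0; covers p_i = q_i = 0 too
        if q_i == 0.0:
            return math.inf  # p_i > 0 but q_i = 0: p is not absolutely continuous w.r.t. q
        total += p_i * math.log(p_i / q_i)
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.75, 0.0]  # the third symbol is superfluous under both p and q
r = [1.0, 0.0, 0.0]    # r puts zero mass on a symbol that p can produce

print(kl_discrete(p, q))  # finite
print(kl_discrete(p, r))  # infinite, since p is not dominated by r
print(kl_discrete(r, p))  # finite, since r is dominated by p
```

Note the asymmetry in the last two calls: swapping the arguments changes whether absolute continuity holds.
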
More generally, for two measures $P,Q$ on a common measurable space $(\Omega,\cF)$, we say that $P$ is **absolutely continuous** with respect to $Q$ (and write $P\ll Q$) if for any $A\in \cF$, $Q(A)=0$ implies that $P(A)=0$ (intuitively, $\ll$ is like $\le$ except that it only constrains the values when the right-hand side is zero). This brings us back to defining relative entropy between two arbitrary probability distributions $P,Q$ defined over a common measurable space. The difficulty we face is that if $X\sim P$ takes on uncountably many values then we cannot really use the ideas that rely on communication because no matter what coding we use, we would need infinitely many symbols to describe some values of $X$. How can the entropy of $X$ then be defined at all? This seems to be a truly fundamental difficulty! Luckily, this impasse gets resolved automatically if we only consider relative entropy. While we cannot communicate $X$, for any *finite* “discretization” of the possible values that $X$ can take on, the discretized values can be communicated finitely and all our definitions will work. Formally, if $X$ takes values in the measurable space $(\cX,\cG)$, with $\cX$ possibly having uncountably many elements, a discretization to $[N]$ levels would be specified using some function $f:\cX \to [N]$ that is a $\cG/2^{[N]}$-measurable map. Then, the entropy of $P$ relative to $Q$, $\KL(P,Q)$, can be defined as

\begin{align*}

\KL(P,Q) \doteq \sup_{f} \KL( P_f, Q_f )\,,

\end{align*}

where $P_f$ is the distribution of $Y=f(X)$ when $X\sim P$, $Q_f$ is the distribution of $Y=f(X)$ when $X\sim Q$, and the supremum is over all $N\in \N$ and all maps $f$ as defined above. In words, we take all possible discretizations $f$ (with no limit on the “fineness” of the discretization) and define $\KL(P,Q)$ as the supremum of the excess information when expecting to see $f(X)$ with $X\sim Q$ while reality is $X\sim P$. If this is well-defined (i.e., finite), we expect this to be a reasonable definition. As it turns out, and as we shall soon see, this is indeed a reasonable definition.
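
To get a feel for the supremum over discretizations, the following sketch bins the real line into finitely many cells and computes the discrete relative entropy of the binned distributions for $P=\mathcal N(0,1)$ and $Q=\mathcal N(1,1)$. Everything here (the function names, the truncation range $[-6,7]$, the choice of equal-width cells) is an assumption made for the illustration. Since each partition refines the previous one, the values should not decrease, and they should approach $1/2$, the relative entropy between these two Gaussians.

```python
import math

def normal_cdf(x, mean):
    """CDF of a unit-variance Gaussian with the given mean."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def discretized_kl(num_bins, mean_p=0.0, mean_q=1.0, lo=-6.0, hi=7.0):
    """KL(P_f, Q_f) where f maps x to its cell: two tails plus num_bins equal cells on [lo, hi]."""
    edges = [lo + (hi - lo) * k / num_bins for k in range(num_bins + 1)]
    cuts = [-math.inf] + edges + [math.inf]
    total = 0.0
    for a, b in zip(cuts[:-1], cuts[1:]):
        p_cell = normal_cdf(b, mean_p) - normal_cdf(a, mean_p)
        q_cell = normal_cdf(b, mean_q) - normal_cdf(a, mean_q)
        if p_cell > 0.0 and q_cell > 0.0:  # skip cells that underflow numerically
            total += p_cell * math.log(p_cell / q_cell)
    return total

# Doubling the number of cells refines the previous partition.
for k in range(1, 9):
    print(2 ** k, discretized_kl(2 ** k))
```

Coarse discretizations throw information away, which is why the supremum (rather than any fixed discretization) appears in the definition.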

To state this, we need the concept of **densities**, or, **Radon-Nikodym derivatives**, as they are called in measure theory. Given $P,Q$ as above, the density of $P$ with respect to $Q$ is defined as a $\cF/\cB$-measurable map $f:\Omega \to [0,\infty)$ such that for all $A\in \cF$,

\begin{align}\label{eq:densitydef}

\int_A f(\omega) dQ(\omega) = P(A)\,.

\end{align}

(Here, $\cB$ is the Borel $\sigma$-algebra restricted to $[0,\infty)$.) Recalling that $P(A) = \int_A dP(\omega)$, we would write the above as $f dQ = dP$, meaning that if we integrated this equality over any set $A$, we would get an identity. Because of this shorthand notation, when the density of $P$ with respect to $Q$ exists, we also write it as

\begin{align*}

f=\frac{dP}{dQ}\,.

\end{align*}

Note that this equation defines the symbol $\frac{dP}{dQ}$ in terms of the function $f$ that satisfies $\eqref{eq:densitydef}$. In particular, $\frac{dP}{dQ}$ is a nonnegative-valued random variable on $(\Omega,\cF)$. When $\frac{dP}{dQ}$ exists then it follows immediately that $P$ is absolutely continuous with respect to $Q$ (just look at $\eqref{eq:densitydef}$). It turns out that this is both necessary and sufficient for the existence of the density of $P$ with respect to $Q$. This is stated as a theorem in a slightly more general form in that $P,Q$ are not restricted to be probability measures (the condition $P(\Omega)=Q(\Omega)=1$ is thus lifted for them). The definition of density for such measures is still given by $\eqref{eq:densitydef}$. To avoid some pathologies we need to assume that $Q$ is $\sigma$-finite, which means that while $Q(\Omega)=\infty$ may hold, there exists a countable covering $\{A_i\}$ of $\Omega$ with $\cF$-measurable sets such that $Q(A_i)<+\infty$ for each $i$.

Theorem (Radon-Nikodym Theorem):

Let $P,Q$ be measures on a common measurable space $(\Omega,\cF)$ and assume that $Q$ is $\sigma$-finite. Then, the density of $P$ with respect to $Q$, $\frac{dP}{dQ}$, exists if and only if $P$ is absolutely continuous with respect to $Q$. Furthermore, $\frac{dP}{dQ}$ is uniquely defined up to a $Q$-null set, i.e., for any $f_1,f_2$ satisfying $\eqref{eq:densitydef}$, $f_1=f_2$ holds $Q$-almost surely.

Densities work as expected: Recall that the central limit theorem states that under mild conditions, $\sqrt{n} \hat\mu_n$, where $\hat\mu_n$ is the sample mean of $n$ iid random variables $X_1,\dots,X_n$ with common zero mean and variance one, converges in distribution to a standard normal random variable $Z$. Without much thinking, we usually write that the density of this random variable is $g(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$. As it happens, this is indeed the density, but now we know that we need to be more precise: This is the density of $Z$ with respect to the Lebesgue measure (which we shall often denote by $\lambda$). The densities of “classical” distributions are almost always defined with respect to the Lebesgue measure. A very useful result is the **chain rule**, which states that if $P\ll Q \ll S$, then $\frac{dP}{dQ} \frac{dQ}{dS} = \frac{dP}{dS}$. The proof of this result follows from the definition of the densities and the “usual machinery”, which means proving the result holds for simple functions and then applying the monotone convergence theorem to take the limit for any measurable function. The chain rule is often used to reduce the calculation of densities to calculation with known densities. Another useful piece of knowledge that we will need is that if $\nu$ is the counting measure on $([N],2^{[N]})$ and $P$ is a distribution on $([N],2^{[N]})$, then $\frac{dP}{d\nu}(i) = P(\{i\})$ for all $i\in [N]$.

With this much preparation, we are ready to state the promised result:

Theorem (Relative Entropy):

Let $(\Omega, \cF)$ be a measurable space and let $P$ and $Q$ be probability measures on this space.

Then,

\begin{align*}

\KL(P, Q) =\begin{cases}

\int \log \left(\frac{dP}{dQ}(\omega)\right) dP(\omega)\,, & \text{if } P \ll Q\,;\\

+\infty\,, & \text{otherwise}\,.

\end{cases}

\end{align*}

Note that by our earlier remark, this reduces to $\eqref{eq:discreterelentropy}$ for discrete measures. Also note that the integral is always well defined (but may diverge to positive infinity). If $\lambda$ is a common dominating $\sigma$-finite measure for $P$ and $Q$ (i.e., $P\ll \lambda$ and $Q\ll \lambda$ both hold) then letting $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$, if also $P\ll Q$, the chain rule gives $\frac{dP}{dQ} \frac{dQ}{d\lambda} = \frac{dP}{d\lambda}$, which lets us write

\begin{align*}

\KL(P,Q) = \int p \log( \frac{p}{q} ) d\lambda\,,

\end{align*}

which is perhaps the best known expression for relative entropy, and which is also often used as a definition. Note that for probability measures, a common dominating $\sigma$-finite measure can always be found. For example, $\lambda = P+Q$ always dominates both $P$ and $Q$.

Relative entropy is a kind of “distance” measure between distributions $P$ and $Q$. In particular, if $P = Q$, then $\KL(P, Q) = 0$ and otherwise $\KL(P, Q) > 0$. Strictly speaking, relative entropy is not a distance because it satisfies neither the triangle inequality nor is it symmetric. Nevertheless, it serves the same purpose.

The relative entropy between many standard distributions is often quite easy to compute. For example, the relative entropy between two Gaussians with means $\mu_1, \mu_2 \in \R$ and common variance $\sigma^2$ is

\begin{align*}

\KL(\mathcal N(\mu_1, \sigma^2), \mathcal N(\mu_2, \sigma^2)) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\,.

\end{align*}
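
For readers who want to see where this comes from, here is a short derivation sketch using the density form $\KL(P,Q)=\int p \log(p/q)\, d\lambda$ with $\lambda$ the Lebesgue measure. The symbols $p$ and $q$ denote the two Gaussian densities and $X\sim \mathcal N(\mu_1,\sigma^2)$; this notation is introduced only for this illustration.

\begin{align*}
\log \frac{p(x)}{q(x)} &= \frac{(x-\mu_2)^2 - (x-\mu_1)^2}{2\sigma^2} = \frac{2x(\mu_1-\mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2}\,, \\
\E\left[ \log \frac{p(X)}{q(X)} \right] &= \frac{2\mu_1(\mu_1-\mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2} = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}\,.
\end{align*}

The normalizing constants cancel because the variances are equal, and the second line only uses $\E[X]=\mu_1$.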

The dependence on the difference in means and the variance is consistent with our intuition. If $\mu_1$ is close to $\mu_2$, then the “difference” between the distributions should be small, but if the variance is very small, then there is little overlap and the difference is large. The relative entropy between two Bernoulli distributions with means $p,q \in [0,1]$ is

\begin{align*}

\KL(\mathcal B(p), \mathcal B(q)) = p \log \left(\frac{p}{q}\right) + (1-p) \log\left(\frac{(1-p)}{(1-q)}\right)\,,

\end{align*}

where $0 \log (\cdot) = 0$. The divergence is infinite if $q \in \set{0,1}$ and $p \in (0,1)$ (because absolute continuity is violated).
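
Both closed forms are one-liners in code. The sketch below uses the hypothetical names `kl_gaussian` and `kl_bernoulli`; the only subtlety is the treatment of the boundary cases, which follows the convention $0 \log(\cdot) = 0$ and returns infinity when absolute continuity fails.

```python
import math

def kl_gaussian(mu1, mu2, sigma2=1.0):
    """KL(N(mu1, sigma2), N(mu2, sigma2)) for a common variance sigma2."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma2)

def kl_bernoulli(p, q):
    """KL(B(p), B(q)) with the convention 0 log(.) = 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf  # mass under the first argument where the second has none
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

print(kl_gaussian(0.0, 1.0))         # unit variance
print(kl_gaussian(0.0, 1.0, 0.25))   # smaller variance, larger divergence
print(kl_bernoulli(0.5, 0.25), kl_bernoulli(0.25, 0.5))  # not symmetric
print(kl_bernoulli(0.5, 0.0))        # infinite
```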

To summarize, we have developed the minimax optimality concept and given a heuristic argument that no algorithm can improve on $O(\sqrt{Kn})$ regret in the worst case when the noise is Gaussian. The formal proof of this result will appear in the next post and relies on several of the core concepts from information theory, which we introduced here. Finally we also stated the definition and existence of the Radon-Nikodym derivative (or density), that unifies (and generalizes) the probability distribution (for discrete spaces as learned in high school) and a probability density.

# Notes

Note 1: The worst-case regret has a natural game-theoretic interpretation. This is best described if you imagine a **game** between a protagonist and an antagonist that works as follows: Given $K>0$ and $n>1$, the protagonist proposes a bandit policy $A$, while the antagonist, *after looking at the policy chosen*, picks a bandit environment $E$ from the class of environments considered (in our case from the class $\cE_K$). After both players have made their choices, the game unfolds by deploying the bandit policy in the chosen environment. This leads to some value for the expected regret, which we denote by $R_n(A,E)$. The payoff for the antagonist is then $R_n(A,E)$, while the payoff for the protagonist is $-R_n(A,E)$, i.e., the game is **zero sum**. Both players aim at maximizing their payoffs. The game is completely described by $n,K$ and $\cE$. One characteristic value in a game is its minimax value. As described above, this is a sequential game (the protagonist moves first, followed by the antagonist). The minimax value of this game, from the perspective of the antagonist, is then exactly $R_n^*(\cE)$, while it is $\sup_A \inf_E -R_n(A,E) = -R_n^*(\cE)$ from the perspective of the protagonist.

Note 2: By its definition, $R_n^*(\cE)$ is a worst-case measure. Should we worry that an algorithm that optimizes for the worst case may perform very poorly on specific instances? Intuitively, a bandit problem where the action gaps are quite large or quite small should be easier, the former because large gaps require fewer observations to detect, the latter because small gaps can just be ignored. Yet, we can perhaps imagine policies that are worst-case optimal whose regret is $R_n^*(\cE)$ regardless of the nature of the instance that they are running on. When such a policy is used on an “easy” instance as described above, we would certainly be disappointed to see it suffer a large regret. Ideally, what we would like is if in some sense the policy we end up using would be optimal for every instance, or, in short, if it was **instance-optimal**. Instance-optimality in its vanilla form is a **highly dubious idea**: After all, the best regret on any given instance is always zero! How could an algorithm then achieve this for *every* instance of some nontrivial class $\cE$ (a class $\cE$ is trivial if it contains environments where it is always optimal to choose, say, action one)? This is clearly impossible. The problem is that the best policy for a given instance is a very dumb policy for almost all the other instances. We certainly do not wish to choose such a dumb policy! In other words, we want good performance on individual instances while also wanting good worst-case performance on the same horizon $n$. Or we may want good performance on individual instances while also demanding good asymptotic behavior on any instance. In any case, the idea is to restrict the class $\cA_{n,K}$ by ruling out all the dumb policies. If $\cA_{n,K}^{\rm smart}$ is the resulting class, then the target is modified to achieve $R_n^*(E) \doteq \inf_{A\in \cA_{n,K}^{\rm smart}} R_n(A,E)$, up to, say, a constant factor.
We will return to various notions of instance-optimality later and in particular will discuss whether for some selections of the class of smart policies, instance-optimal (or near-instance optimal) policies exist.

Note 3: The theorem that connects our definition of relative entropies to densities, i.e., the “classic definition”, can be found, e.g., in Section 5.2 of the information theory book of Gray.

# More information theory and minimax lower bounds

Continuing the previous post, we prove the claimed minimax lower bound.

We start with a useful result that quantifies the difficulty of identifying whether or not an observation is drawn from similar distributions $P$ and $Q$ defined over the same measurable space $(\Omega,\cF)$. In particular, the difficulty will be shown to increase as the relative entropy of $P$ with respect to $Q$ decreases. In the result below, as usual in probability theory, for $A\subset \Omega$, $A^c$ denotes the complement of $A$ with respect to $\Omega$: $A^c = \Omega \setminus A$.

Theorem (High-probability Pinsker): Let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \cF)$ and let $A \in \cF$ be an arbitrary event. Then,

\begin{align}\label{eq:pinskerhp}

P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)\,.

\end{align}

The logic of the result is as follows: If $P$ and $Q$ are similar to each other (e.g., in the sense that $\KL(P,Q)$ is small) then for any event $A\in \cF$, $P(A)\approx Q(A)$ and $P(A^c) \approx Q(A^c)$. Since $P(A)+P(A^c)=1$, we thus also expect $P(A)+Q(A^c)$ to be “large”. How large is “large” is the content of the result. In particular, using $2 \max(a,b)\ge a+b$, it follows that for any $0<\delta\le 1$, to guarantee $\max(P(A),Q(A^c))\le \delta$, it is necessary that $\KL(P,Q)\ge \log(\frac{1}{4\delta})$, that is, the relative entropy of $P$ with respect to $Q$ must be at least as large as the right-hand side. Connecting the dots, this means that the information coming from an observation drawn from $P$ relative to expecting an observation from $Q$ must be just large enough.

Note that if $P$ is not absolutely continuous with respect to $Q$ then $\KL(P,Q)=\infty$ and the result is vacuous. Also note that the result is symmetric. We could replace $\KL(P, Q)$ with $\KL(Q, P)$, which sometimes leads to a stronger result because the relative entropy is *not symmetric*.

This theorem is useful for proving lower bounds because it implies that at least one of $P(A)$ and $Q(A^c)$ is large. Then, in an application where after some observations a decision needs to be made, $P$ and $Q$ are selected such that a correct decision under $P$ is incorrect under $Q$ and vice versa. Then one lets $A$ be the event that the decision is incorrect under $P$, so that $A^c$ is the event that the decision is incorrect under $Q$. The result then tells us that it is impossible to design a decision rule such that the probability of an incorrect decision is small under both $P$ and $Q$.

To illustrate this, consider the normal means problem when $X\sim N(\mu,1)$ is observed with $\mu \in \{0,\Delta\}$. In this case, we let $P$ be the distribution of $X$ under $\mu=0$, $Q$ be the distribution of $X$ under $\mu=\Delta$ and, if $f: \R \to \{0,\Delta\}$ is the measurable function that is proposed as the decision rule, $A = \{\, x\in \R \,: f(x) \ne 0 \}$. Then, using e.g. $\Delta=1$, the theorem tells us that

\begin{align*}

P(A) + Q(A^c) \geq \frac{1}{2}\, \exp\left(-\KL(P, Q)\right)

= \frac{1}{2}\, \exp\left(-\frac{\Delta^2}{2\sigma^2}\right)

= \frac{1}{2}\, \exp\left(-1/2\right) \geq 3/10\,.

\end{align*}

This means that no matter how we choose our decision rule $f$, we simply do not have enough data to make a decision for which the probability of error is smaller than $3/20$ under both $P$ and $Q$. All that remains is to apply this idea to bandits and we will have our lower bound. The proof of the above result is postponed to the end of the post.
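
For this example one can also compute $P(A)+Q(A^c)$ exactly for simple rules and compare with the bound. The sketch below does this for threshold rules $f(x) = \Delta\, \mathbb{I}\{x > \tau\}$, scanning $\tau$ over a grid; the names `error_sum`, `best` and the grid are assumptions made for the illustration. Threshold rules are only a subclass of all decision rules, but for this particular problem the likelihood ratio is monotone in $x$, so the rule with $\tau = \Delta/2$ minimizes the sum.

```python
import math

def normal_cdf(x, mean=0.0):
    """CDF of a unit-variance Gaussian with the given mean."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def error_sum(threshold, delta):
    """P(A) + Q(A^c) for the rule that outputs delta iff x > threshold."""
    p_a = 1.0 - normal_cdf(threshold, 0.0)  # says "delta" although mu = 0
    q_ac = normal_cdf(threshold, delta)     # says "0" although mu = delta
    return p_a + q_ac

delta = 1.0
lower_bound = 0.5 * math.exp(-delta ** 2 / 2.0)  # (1/2) exp(-KL(P, Q))
best = min(error_sum(t / 100.0, delta) for t in range(-300, 401))

print(lower_bound, best)
```

The gap between the two printed numbers shows that the bound is not tight here, which is fine for our purposes: only the order of magnitude matters for the lower bound.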

After this short excursion into information theory, let us return to the world of $K$-action stochastic bandits. In what follows we fix $n>0$, the number of rounds and $K>0$, the number of actions. Recall that given a $K$-action bandit environment $\nu = (P_1,\dots,P_K)$ ($P_i$ is the distribution of rewards for action $i$) and a policy $\pi$, the outcome of connecting $\pi$ and $\nu$ was defined as the random variables $(A_1,X_1,\dots,A_n,X_n)$ such that for $t=1,2,\dots,n$, the distribution of $A_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1}$ is uniquely determined by the policy $\pi$, while the distribution of $X_t$ given the history $A_1,X_1,\dots,A_{t-1},X_{t-1},A_t$ was required to be $P_{A_t}$. While the probability space $(\Omega,\cF,\PP)$ that carries these random variables can be chosen in many different ways, you should convince yourself that the constraints mentioned uniquely determine the distribution of the random element $H_n = (A_1,X_1,\dots,A_n,X_n)$ (or wait two paragraphs). As such, despite that $(\Omega,\cF,\PP)$ is non-unique, we can legitimately define $\PP_{\nu,\pi}$ to be the distribution of $H_n$ (suppressing dependence on $n$).

In fact, we will find it useful to write this distribution with a density. To do this, recall from the Radon-Nikodym theorem that if $P,Q$ are measures over a common measurable space $(\Omega,\cF)$ such that $Q$ is $\sigma$-finite and $P\ll Q$ (i.e., $P$ is absolutely continuous w.r.t. $Q$, which is sometimes written as **$Q$ dominates $P$**) then there exists a random variable $f:\Omega \to \R$ such that

\begin{align}\label{eq:rnagain}

\int_A f dQ = \int_A dP, \qquad \forall A\in \cF\,.

\end{align}

The function $f$ is said to be the density of $P$ with respect to $Q$, also denoted by $\frac{dP}{dQ}$. Since writing $\forall A\in \cF$, $\int_A \dots = \int_A \dots $ is a bit redundant, to save typing, following standard practice, in the future **we will write $f dQ = dP$** (or $f(\omega) dQ(\omega) = dP(\omega)$ to detail the dependence on $\omega\in \Omega$) as a shorthand for $\eqref{eq:rnagain}$. This also explains why we write $f = \frac{dP}{dQ}$.

With this, it is not hard to verify that for any $(a_1,x_1,\dots,a_n,x_n)\in \cH_n \doteq ([K]\times \R)^n$ we have

\begin{align}

\begin{split}

d\PP_{\nu,\pi}(a_1,x_1,\dots,a_n,x_n)

& = \pi_1(a_1) p_{a_1}(x_1) \, \pi_2(a_2|a_1,x_1) p_{a_2}(x_2)\cdot \\

& \quad \cdots \pi_n(a_n|a_1,x_1,\dots,a_{n-1},x_{n-1} )\,p_{a_n}(x_n) \\

& \qquad \times d\lambda(x_1) \dots d\lambda(x_n) d\rho(a_1) \dots d\rho(a_n)\, ,

\end{split}

\label{eq:jointdensity}

\end{align}

where

- $\pi = (\pi_t)_{1\le t \le n}$ with $\pi_t(a_t|a_1,x_1,\dots,a_{t-1},x_{t-1})$ being the probability that policy $\pi$ in round $t$ chooses action $a_t\in [K]$ when the sequence of actions and rewards in the previous rounds are $(a_1,x_1,\dots,a_{t-1},x_{t-1})\in \cH_{t-1}$;
- $p_i$ is defined by $p_i d\lambda = d P_i$, where $\lambda$ is a common dominating measure for $P_1,\dots,P_K$, and
- $\rho$ is the counting measure on $[K]$: For $A\subset [K]$, $\rho(A) = |A \cap [K]|$.

There is no loss of generality in assuming that $\lambda$ exists: For example, we can always use $\lambda = P_1+\dots+P_K$. In a note at the end of the post, for full mathematical rigor, we add another condition to the definition of $\pi$. (Can you guess what this condition should be?)

Note that $\PP_{\nu,\pi}$, as given by the right-hand side of the above displayed equation, is a distribution over $(\cH_n,\cF_n)$ where $\cF_n=\mathcal L(\R^{2n})|_{\cH_n}$. Here, $\mathcal L(\R^{2n})$ is the Lebesgue $\sigma$-algebra over $\R^{2n}$ and $\mathcal L(\R^{2n})|_{\cH_n}$ is its restriction to $\cH_n$.
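
The product form of $\eqref{eq:jointdensity}$ translates directly into code. The sketch below assumes unit-variance Gaussian rewards and represents a policy as a function mapping a history to a probability vector over actions; all names (`sample_history`, `log_joint_density`, the two toy policies) are hypothetical. It samples a history and evaluates the log of the joint density. It also checks numerically that the policy factors are common to any two environments, so the log-density ratio between two environments is the same no matter which policy is plugged in.

```python
import math
import random

def gaussian_logpdf(x, mean):
    """Log density of N(mean, 1) with respect to the Lebesgue measure."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2

def uniform_policy(history, num_arms):
    return [1.0 / num_arms] * num_arms

def epsilon_greedy(history, num_arms, eps=0.2):
    sums, counts = [0.0] * num_arms, [0] * num_arms
    for a, x in history:
        sums[a] += x
        counts[a] += 1
    # Unplayed arms get +inf so that they are preferred by the greedy step.
    est = [sums[i] / counts[i] if counts[i] else math.inf for i in range(num_arms)]
    best = max(range(num_arms), key=lambda i: est[i])
    return [eps / num_arms + (1.0 - eps) * (i == best) for i in range(num_arms)]

def sample_history(policy, means, n, rng):
    history = []
    for _ in range(n):
        probs = policy(history, len(means))
        a = rng.choices(range(len(means)), weights=probs)[0]
        history.append((a, rng.gauss(means[a], 1.0)))
    return history

def log_joint_density(history, policy, means):
    """Sum over rounds of log pi_t(a_t | past) + log p_{a_t}(x_t)."""
    total = 0.0
    for t, (a, x) in enumerate(history):
        probs = policy(history[:t], len(means))
        total += math.log(probs[a]) + gaussian_logpdf(x, means[a])
    return total

rng = random.Random(0)
mu, mu_alt = [0.5, 0.0, 0.0], [0.5, 0.0, 1.0]
h = sample_history(uniform_policy, mu, 20, rng)
ratio_uniform = log_joint_density(h, uniform_policy, mu) - log_joint_density(h, uniform_policy, mu_alt)
ratio_greedy = log_joint_density(h, epsilon_greedy, mu) - log_joint_density(h, epsilon_greedy, mu_alt)
print(ratio_uniform, ratio_greedy)  # agree up to rounding
```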

As mentioned in the previous post, given any two bandit environments, $\nu=(P_1,\dots,P_K)$ and $\nu'=(P'_1,\dots,P'_K)$, we expect $\PP_{\nu,\pi}$ and $\PP_{\nu',\pi}$ to be close to each other if all of $P_i$ and $P_i'$ are close to each other, regardless of the policy $\pi$ (a sort of continuity result). Our next result will make this precise; in fact we will show an identity that expresses $\KL( \PP_{\nu,\pi}, \PP_{\nu',\pi})$ as a function of $\KL(P_i,P'_i)$, $i=1,\dots,K$.

Before stating the result, we need to introduce the concept of ($n$-round $K$-action) **canonical bandit models**. The idea is to fix the measurable space $(\Omega,\cF)$ (that holds the variables $(X_t,A_t)_t$) in a special way. In particular, we set $\Omega=\cH_n$ and $\cF=\cF_n$. Then, we define $A_t,X_t$, $t=1,\dots,n$ to be the coordinate projections:

\begin{align}

\begin{split}

A_t( a_1,x_1,\dots,a_n,x_n ) &= a_t\,, \text{ and }\\

X_t( a_1,x_1,\dots,a_n,x_n) &= x_t,\, \quad \forall t\in [n]\,\text{ and } \forall (a_1,x_1,\dots,a_n,x_n)\in \cH_n\,.

\end{split}

\label{eq:coordproj}

\end{align}

Now, given some policy $\pi$ and bandit environment $\nu$, if we set $\PP=\PP_{\nu,\pi}$ then, trivially, the distribution of $H_n=(A_1,X_1,\dots,A_n,X_n)$ is indeed $\PP_{\nu,\pi}$ as expected. Thus, $A_t,X_t$ as defined by $\eqref{eq:coordproj}$ satisfy the requirements we imposed on the random variables that describe the outcome of the interaction of a policy and an environment. However, $A_t,X_t$ are fixed, no matter the choice of $\nu,\pi$. This is in fact the reason why it is convenient to use the canonical bandit model! Furthermore, since all our past and future statements concern only the properties of $\PP_{\nu,\pi}$ (and not the minute details of the definition of the random variables $(A_1,X_1,\dots,A_n,X_n)$), there is no loss of generality in assuming that in all our expressions $(A_1,X_1,\dots,A_n,X_n)$ is in fact defined over the canonical bandit model using $\eqref{eq:coordproj}$. (We initially avoided the canonical model because up to now there was no advantage to introducing and using it.)

Before we state our promised result, we introduce one last piece of notation that we need: Since we use the same measurable space $(\cH_n,\cF_n)$ with multiple environments $\nu$ and policies $\pi$, which give rise to various probability measures $\PP = \PP_{\nu,\pi}$, when writing expectations we will index $\E$ the same way as $\PP$ is indexed. Thus, $\E_{\nu,\pi}$ denotes the expectation underlying $\PP_{\nu,\pi}$ (i.e., $\E_{\nu,\pi}[X] = \int X d\PP_{\nu,\pi}$). When $\pi$ is fixed, we will also drop $\pi$ from the indices and just write $\PP_{\nu}$ (and $\E_{\nu}$) for $\PP_{\nu,\pi}$ (respectively, $\E_{\nu,\pi}$).

With this, we are ready to state the promised result:

Lemma (Divergence decomposition): Let $\nu=(P_1,\ldots,P_K)$ be the reward distributions associated with one $K$-armed bandit, and let $\nu'=(P'_1,\ldots,P'_K)$ be the reward distributions associated with another $K$-armed bandit. Fix some policy $\pi$ and let $\PP_\nu = \PP_{\nu,\pi}$ and $\PP_{\nu'} = \PP_{\nu',\pi}$ be the distributions induced by the interconnection of $\pi$ and $\nu$ (respectively, $\pi$ and $\nu'$). Then

\begin{align}

\KL(\PP_{\nu}, \PP_{\nu'}) = \sum_{i=1}^K \E_{\nu}[T_i(n)] \, \KL(P_i, P'_i)\,.

\label{eq:infoproc}

\end{align}

In particular the lemma implies that $\PP_{\nu}$ and $\PP_{\nu'}$ will be close if for all $i$, $P_i$ and $P'_i$ are close. A nice feature of the lemma is that it provides not only a bound on the divergence of $\PP_{\nu}$ from $\PP_{\nu'}$, but it actually expresses this with an identity.

**Proof**

First, assume that $\KL(P_i,P'_i)<\infty$ for all $i\in [K]$. From this, it follows that $P_i\ll P'_i$. Define $\lambda = \sum_i (P_i + P'_i)$ and let $p_i = \frac{dP_i}{d\lambda}$ and $p'_i = \frac{dP'_i}{d\lambda}$. By the chain rule of Radon-Nikodym derivatives, $\KL(P_i,P'_i) = \int p_i \log( \frac{p_i}{p'_i} ) d\lambda$.

Let us now evaluate $\KL( \PP_{\nu}, \PP_{\nu'} )$. By the fundamental identity for the relative entropy, this is equal to $\E_\nu[ \log \frac{d\PP_\nu}{d\PP_{\nu'}} ]$, where $\frac{d\PP_\nu}{d\PP_{\nu'}}$ is the Radon-Nikodym derivative of $\PP_\nu$ with respect to $\PP_{\nu'}$. It is easy to see that both $\PP_\nu$ and $\PP_{\nu'}$ can be written in the form given by $\eqref{eq:jointdensity}$ (this is where we use that $\lambda$ dominates all of $P_i$ and $P_i'$). Then, using the chain rule we get

\begin{align*}%\label{eq:logrn}

\log \frac{d\PP_\nu}{d\PP_{\nu'}} = \sum_{t=1}^n \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)}\,

\end{align*}

(the terms involving the policy cancel!). Now, taking expectations of both sides, we get

\begin{align*}

\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu'}} \right]

= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \right]\,,

\end{align*}

and

\begin{align*}

\E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \right]

& = \E_{\nu}\left[ \E_{\nu}\left[\log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \, \big|\, A_t \right] \, \right]

= \E_{\nu}\left[ \KL(P_{A_t},P'_{A_t}) \right]\,,

\end{align*}

where in the second equality we used that under $\PP_{\nu}$, conditionally on $A_t$, the distribution of $X_t$ is $dP_{A_t} = p_{A_t} d\lambda$. Plugging back into the previous display,

\begin{align*}

\E_{\nu}\left[ \log \frac{d\PP_\nu}{d\PP_{\nu'}} \right]

&= \sum_{t=1}^n \E_{\nu}\left[ \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \right] \\

&= \sum_{t=1}^n \E_{\nu}\left[ \KL(P_{A_t},P'_{A_t}) \right] \\

&= \sum_{i=1}^K \E_{\nu}\left[ \sum_{t=1}^n \one{A_t=i} \KL(P_{A_t},P'_{A_t}) \right] \\

&= \sum_{i=1}^K \E_{\nu}\left[ T_i(n)\right]\KL(P_{i},P'_{i})\,.

\end{align*}

When the right-hand side of $\eqref{eq:infoproc}$ is infinite, by our previous calculation, it is not hard to see that the left-hand side will also be infinite.

QED
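
The identity can also be checked by simulation. The following sketch runs a simple $\epsilon$-greedy policy on a unit-variance Gaussian bandit, averages the log-likelihood ratio $\sum_t \log (p_{A_t}(X_t)/p'_{A_t}(X_t))$ over many runs to estimate the left-hand side of $\eqref{eq:infoproc}$, and averages the pull counts to estimate the right-hand side. The policy, the two mean vectors, the horizon and the number of runs are arbitrary choices made for this illustration, and the agreement is only up to Monte Carlo error.

```python
import random

def run_once(means, alt_means, n, rng, eps=0.2):
    """Play epsilon-greedy on the bandit `means`; return (log-likelihood ratio, pull counts)."""
    k = len(means)
    sums, counts = [0.0] * k, [0] * k
    log_ratio = 0.0
    for _ in range(n):
        if rng.random() < eps or 0 in counts:
            a = rng.randrange(k)  # explore (also while some arm is still unplayed)
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i])
        x = rng.gauss(means[a], 1.0)
        # log p_a(x) - log p'_a(x) for unit-variance Gaussians
        log_ratio += 0.5 * (x - alt_means[a]) ** 2 - 0.5 * (x - means[a]) ** 2
        sums[a] += x
        counts[a] += 1
    return log_ratio, counts

rng = random.Random(1)
mu, mu_alt = [0.5, 0.0, 0.0], [0.5, 0.0, 1.0]
n, runs = 50, 10000
lhs = 0.0
avg_counts = [0.0] * len(mu)
for _ in range(runs):
    lr, counts = run_once(mu, mu_alt, n, rng)
    lhs += lr / runs
    for i, c in enumerate(counts):
        avg_counts[i] += c / runs

# Right-hand side: sum_i E[T_i(n)] * KL(N(mu_i, 1), N(mu'_i, 1))
rhs = sum(avg_counts[i] * (mu[i] - mu_alt[i]) ** 2 / 2.0 for i in range(len(mu)))
print(lhs, rhs)
```

Only the third arm differs between the two environments, so only its expected pull count enters the right-hand side.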

Using the previous theorem and lemma we are now in a position to present and prove the main result of the post, a worst-case lower bound on the regret of any algorithm for finite-armed stochastic bandits with Gaussian noise. Recall that previously we used $\cE_K$ to denote the class of $K$-action bandit environments where the reward distributions have means in $[0,1]^K$ and the noise in the rewards has 1-subgaussian tails. Denote by $G_\mu\in \cE_K$ the bandit environment where all distributions are Gaussian with unit variance and the vector of means is $\mu\in [0,1]^K$. To make the dependence of the regret on the policy and the environment explicit, in accordance with the notation of the previous post, we will use $R_n(\pi,E)$ to denote the regret of policy $\pi$ on a bandit instance $E$.

Theorem (Worst-case lower bound): Fix $K>1$ and $n\ge K-1$. Then, for any policy $\pi$ there exists a mean vector $\mu \in [0,1]^K$ such that

\begin{align*}

R_n(\pi,G_\mu) \geq \frac{1}{27}\sqrt{(K-1) n}\,.

\end{align*}

Since $G_\mu \in \cE_K$, it follows that the minimax regret for $\cE_K$ is lower bounded by the right-hand side of the above display as soon as $n\ge K-1$:

\begin{align*}

R_n^*(\cE_K) \geq \frac{1}{27}\sqrt{(K-1) n}\,.

\end{align*}

The idea of the proof is illustrated on the following figure:

**Proof**

Fix a policy $\pi$. Let $\Delta\in [0,1/2]$ be some constant to be chosen later. As described in the previous post, we choose two Gaussian environments with unit variance. In the first environment, the vector of means of the rewards per arm is

\begin{align*}

\mu = (\Delta, 0, 0, \dots,0)\,.

\end{align*}

This environment and $\pi$ give rise to the distribution $\PP_{G_\mu,\pi}$ on the canonical bandit model $(\cH_n,\cF_n)$. For brevity, we will use $\PP_{\mu}$ in place of $\PP_{G_\mu,\pi}$. The expectation under $\PP_{\mu}$ will be denoted by $\E_{\mu}$. Recall that on the measurable space $(\cH_n,\cF_n)$, $A_t,X_t$ are just the coordinate projections defined using $\eqref{eq:coordproj}$.

To choose the second environment, pick

\begin{align*}

i = \argmin_{j\ne 1} \E_{\mu}[ T_j(n) ]\,.

\end{align*}

Note that $\E_{\mu}[ T_i(n) ] \le n/(K-1)$ because otherwise we would have $\sum_{j\ne 1} \E_{\mu}[T_j(n)]>n$, which is impossible. Note also that $i$ depends on $\pi$. The second environment has means

\begin{align*}

\mu' = (\Delta,0,0, \dots, 0, 2\Delta, 0, \dots, 0 )\,,

\end{align*}

where specifically $\mu'_i = 2\Delta$. In particular, $\mu_j = \mu'_j$ except at index $i$. Note that the optimal action in $G_{\mu}$ is action one, while the optimal action in $G_{\mu'}$ is action $i\ne 1$. We again use the shorthand $\PP_{\mu'} = \PP_{G_{\mu'},\pi}$.

Intuitively, if $\pi$ chooses action one infrequently while interacting with $G_{\mu}$ (i.e., if $T_1(n)$ is small with high probability), then $\pi$ should do poorly in $G_{\mu}$, while if it chooses action one frequently while interacting with $G_{\mu'}$, then it will do poorly in $G_{\mu'}$. In particular, by the basic regret decomposition identity and a simple calculation,

\begin{align*}

R_n(\pi,G_\mu) %= \E_\mu[ (n-T_1(n)) \Delta ] \ge \E_\mu[ \one{T_1(n)\le n/2} ]\, \left(n-\frac{n}{2}\right)\, \Delta =

\ge \PP_{\mu}(T_1(n)\le n/2) \frac{n\Delta}{2}\, \quad \text{and} \quad

R_n(\pi,G_{\mu'}) > \PP_{\mu'}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.

%R_n(\pi,G_{\mu’}) %\ge \E_{\mu’}[ T_1(n) \Delta ] > \E_{\mu’}[ \one{T_1(n)> n/2} ]\, \frac{n\Delta}{2} =

%\ge \PP_{\mu’}(T_1(n)> n/2)\, \frac{n\Delta}{2}\,.

\end{align*}

Chaining this with the high-probability Pinsker inequality,

\begin{align*}

R_n(\pi,G_\mu) + R_n(\pi,G_{\mu'})

& > \frac{n\Delta}{2}\,\left( \PP_{\mu}(T_1(n)\le n/2) + \PP_{\mu'}(T_1(n)> n/2) \right) \\

& \ge \frac{n\Delta}{4}\, \exp( - \KL(\PP_{\mu},\PP_{\mu'}) )\,.

\end{align*}

It remains to upper bound $\KL(\PP_{\mu},\PP_{\mu'})$, i.e., to show that $\PP_\mu$ and $\PP_{\mu'}$ are indeed not far. For this, we use the divergence decomposition lemma, the definitions of $\mu$ and $\mu'$ and then exploit the choice of $i$ to get

\begin{align*}

\KL(\PP_\mu, \PP_{\mu'}) = \E_{\mu}[T_i(n)] \KL( \mathcal N(0,1), \mathcal N(2\Delta,1) ) =

\E_{\mu}[T_i(n)]\, \frac{(2\Delta)^2}{2} \leq \frac{2n\Delta^2}{K-1} \,.

\end{align*}

Plugging this into the previous display, we find that

\begin{align*}

R_n(\pi,G_\mu) + R_n(\pi,G_{\mu'}) \ge \frac{n\Delta}{4}\, \exp\left( - \frac{2n\Delta^2}{K-1} \right)\,.

\end{align*}

The result is completed by tuning $\Delta$. In particular, the optimal choice for $\Delta$ happens to be $\Delta = \sqrt{(K-1)/(4n)}$, which is at most $1/2$ exactly when $n\ge K-1$, as postulated in the conditions of the theorem. The final steps are lower bounding $\exp(-1/2)$ and using $2\max(a,b) \ge a+b$.

QED
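
The last step of the proof is pure arithmetic and can be sanity-checked numerically. The sketch below, with hypothetical helper names, evaluates the two-environment bound at the tuned $\Delta$, halves it (using $2\max(a,b)\ge a+b$), and compares the result with $\frac{1}{27}\sqrt{(K-1)n}$ for a few arbitrary pairs $(n,K)$ satisfying $n\ge K-1$.

```python
import math

def two_env_bound(delta, n, k):
    """(n * Delta / 4) * exp(-2 n Delta^2 / (K - 1)), the bound on the sum of the two regrets."""
    return n * delta / 4.0 * math.exp(-2.0 * n * delta ** 2 / (k - 1))

def tuned_delta(n, k):
    """The maximizer of two_env_bound over Delta > 0."""
    return math.sqrt((k - 1) / (4.0 * n))

for n, k in [(10, 2), (100, 5), (10_000, 10), (9, 10)]:
    d = tuned_delta(n, k)
    per_env = two_env_bound(d, n, k) / 2.0  # max(a, b) >= (a + b) / 2
    target = math.sqrt((k - 1) * n) / 27.0
    print(n, k, d <= 0.5, per_env >= target)
```

At the tuned value the exponent equals $-1/2$, so the per-environment bound is $\frac{\exp(-1/2)}{16}\sqrt{(K-1)n}$, and $\exp(-1/2)/16 \ge 1/27$.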

This concludes the proof of the worst-case lower bound, showing that no algorithm can do better than $O(\sqrt{nK})$ over all problems simultaneously. In the next post we will show that the asymptotic logarithmic regret of UCB is essentially not improvable when the noise is Gaussian.

In the remainder of the post, we give the proof of the “high probability” Pinsker inequality, followed by miscellaneous thoughts.

# Proof of the Pinsker-type inequality

It remains to show the proof of the Pinsker-type inequality stated above. In the proof, to save space, we use the convention of denoting the minimum of $a$ and $b$ by $a\wedge b$, while we denote their maximum by $a \vee b$.

If $\KL(P,Q)=\infty$, there is nothing to be proven. On the other hand, by our previous theorem on relative entropy, $\KL(P,Q)<\infty$ implies that $P \ll Q$. Take $\nu = P+Q$. Note that $P,Q\ll \nu$. Hence, we can define $p = \frac{dP}{d\nu}$ and $q = \frac{dQ}{d\nu}$ and use the chain rule for Radon-Nikodym derivatives to derive that $\frac{dP}{dQ} \frac{dQ}{d\nu} = \frac{dP}{d\nu}$, or $\frac{dP}{dQ} = \frac{\frac{dP}{d\nu}}{\frac{dQ}{d\nu}}$. Thus,

\begin{align*}

\KL(P,Q) = \int p \log(\frac{p}{q}) d\nu\,.

\end{align*}

For brevity, when writing integrals with respect to $\nu$, in this proof, we will drop $d\nu$. Thus, we will write, for example $\int p \log(p/q)$ for the above integral.

Instead of $\eqref{eq:pinskerhp}$, we prove the stronger result that

\begin{align}\label{eq:pinskerhp2}

\int p \wedge q \ge \frac12 \, \exp( - \KL(P,Q) )\,.

\end{align}

This indeed is sufficient since $\int p \wedge q = \int_A p \wedge q + \int_{A^c} p \wedge q \le \int_A p + \int_{A^c} q = P(A) + Q(A^c)$.

We start with an inequality attributed to Lucien Le Cam, which lower bounds the left-hand side of $\eqref{eq:pinskerhp2}$. The inequality states that

\begin{align}\label{eq:lecam}

\int p \wedge q \ge \frac12 \left( \int \sqrt{pq} \right)^2\,.

\end{align}

Starting from the right-hand side above, using $pq = (p\wedge q) (p \vee q)$ and then Cauchy-Schwarz, we get

\begin{align*}

\left( \int \sqrt{pq} \right)^2

= \left(\int \sqrt{(p\wedge q)(p\vee q)} \right)^2

\le \left(\int p \wedge q \right)\, \left(\int p \vee q\right)\,.

\end{align*}

Now, using $p\wedge q + p \vee q = p+q$, the proof is finished by substituting $\int p \vee q = 2-\int p \wedge q \le 2$ and dividing both sides by two.

Thus, it remains to lower bound the right-hand side of $\eqref{eq:lecam}$. For this, we use Jensen’s inequality. First, we write $(\cdot)^2$ as $\exp (2 \log (\cdot))$ and then move the $\log$ inside the integral:

\begin{align*}

\left(\int \sqrt{pq} \right)^2 &= \exp\left( 2 \log \int \sqrt{pq} \right)

= \exp\left( 2 \log \int p \sqrt{\frac{q}{p}} \right) \\

&\ge \exp\left( 2 \int p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\

&= \exp\left( 2 \int_{pq>0} p \,\frac12\,\log \left(\frac{q}{p}\right)\, \right) \\

&= \exp\left( - \int_{pq>0} p \log \left(\frac{p}{q}\right)\, \right) \\

&= \exp\left( - \int p \log \left(\frac{p}{q}\right)\, \right)\,.

\end{align*}

Here, in the fourth and the last step we used that since $P\ll Q$, $q=0$ implies $p=0$, hence $p>0$ implies $q>0$, and eventually $pq>0$. Putting together the inequalities finishes the proof.

QED
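For readers who like to see the chain of inequalities on a concrete example, here is a small Python sketch for distributions on a finite set, where the integrals become sums. The two probability vectors are arbitrary illustrative choices (with $P \ll Q$), and the helper names are ours.

```python
import itertools
import math

def overlap(p, q):
    """Sum of min(p_i, q_i): the discrete analogue of the integral of p ∧ q."""
    return sum(min(a, b) for a, b in zip(p, q))

def bhattacharyya(p, q):
    """Sum of sqrt(p_i * q_i): the discrete analogue of the integral of sqrt(pq)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def kl(p, q):
    """Relative entropy KL(P, Q); assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.2, 0.3, 0.4]

lhs = overlap(p, q)                       # integral of p ∧ q
middle = 0.5 * bhattacharyya(p, q) ** 2   # Le Cam's lower bound
rhs = 0.5 * math.exp(-kl(p, q))           # bound in terms of the relative entropy
print(lhs, middle, rhs)

# Smallest value of P(A) + Q(A^c) over all events A (all subsets of outcomes).
outcomes = range(len(p))
best_event_sum = min(
    sum(p[i] for i in A) + sum(q[i] for i in outcomes if i not in A)
    for r in range(len(p) + 1)
    for A in itertools.combinations(outcomes, r)
)
print(best_event_sum)
```

The printed values should be non-increasing from `lhs` to `rhs`, and no event $A$ should make $P(A)+Q(A^c)$ smaller than `lhs`.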

# Notes

Note 1: We used the Gaussian noise model because the KL divergences are so easily calculated in this case, but all that we actually used was that $\KL(P_i, P_i') = O((\mu_i - \mu_i')^2)$ when the gap between the means $\Delta = \mu_i - \mu_i'$ is small. While this is certainly not true for *all* distributions, it very often is. Why is that? Let $\{P_\mu : \mu \in \R\}$ be some parametric family of distributions on $\Omega$ and assume that distribution $P_\mu$ has mean $\mu$. Then assuming the densities are twice differentiable and that everything is sufficiently “nice” that integrals and derivatives can be exchanged (as is almost always the case), we can use a Taylor expansion about $\mu$ to show that

\begin{align*}

\KL(P_\mu, P_{\mu+\Delta})

&\approx \KL(P_\mu, P_\mu) + \left.\frac{\partial}{\partial \Delta} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta + \frac{1}{2}\left.\frac{\partial^2}{\partial\Delta^2} \KL(P_\mu, P_{\mu+\Delta})\right|_{\Delta = 0} \Delta^2 \\

&= \left.\frac{\partial}{\partial \Delta} \int_{\Omega} \log\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) dP_\mu \right|_{\Delta = 0}\Delta

+ \frac12 I(\mu) \Delta^2 \\

&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \log\left(\frac{dP_{\mu+\Delta}}{dP_{\mu}}\right) \right|_{\Delta=0} dP_\mu \Delta

+ \frac12 I(\mu) \Delta^2 \\

&= -\left.\int_{\Omega} \frac{\partial}{\partial \Delta} \frac{dP_{\mu+\Delta}}{dP_{\mu}} \right|_{\Delta=0} dP_\mu \Delta

+ \frac12 I(\mu) \Delta^2 \\

&= - \frac{\partial}{\partial \Delta} \left.\int_{\Omega} \frac{dP_{\mu+\Delta}}{dP_{\mu}} dP_\mu \right|_{\Delta=0} \Delta

+ \frac12 I(\mu) \Delta^2 \\

&= -\left.\frac{\partial}{\partial \Delta} \int_{\Omega} dP_{\mu+\Delta} \right|_{\Delta=0}\Delta

+ \frac12 I(\mu) \Delta^2 \\

&= \frac12 I(\mu)\Delta^2\,,

\end{align*}

where $I(\mu)$, introduced in the second line, is called the Fisher information of the family $(P_\mu)_{\mu}$ at $\mu$. Note that if $\lambda$ is a common dominating measure for $(P_{\mu+\Delta})$ for $\Delta$ small, $dP_{\mu+\Delta} = p_{\mu+\Delta} d\lambda$ and we can write

\begin{align*}

I(\mu) = -\int \left.\frac{\partial^2}{\partial \Delta^2} \log\, p_{\mu+\Delta}\right|_{\Delta=0}\,p_\mu d\lambda\,,

\end{align*}

which is the form that is usually given in elementary texts. The upshot of all this is that $\KL(P_\mu, P_{\mu+\Delta})$ for $\Delta$ small is indeed quadratic in $\Delta$, with the scaling provided by $I(\mu)$, and as a result the worst-case regret is always $\Omega(\sqrt{nK})$, provided the class of distributions considered is sufficiently rich and not too bizarre.
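To see the quadratic approximation at work outside the Gaussian case, the sketch below compares the exact relative entropy between two Bernoulli distributions with $\frac12 I(\mu)\Delta^2$, using the standard fact that the Fisher information of the Bernoulli family parametrized by its mean is $I(\mu) = 1/(\mu(1-\mu))$. The Bernoulli family and the particular value of $\mu$ are our choices for illustration only.

```python
import math

def kl_bernoulli(mu, mu_alt):
    """KL(Bernoulli(mu), Bernoulli(mu_alt)) for parameters strictly inside (0, 1)."""
    return mu * math.log(mu / mu_alt) + (1 - mu) * math.log((1 - mu) / (1 - mu_alt))

def fisher_information_bernoulli(mu):
    """Fisher information of the Bernoulli family at mean mu."""
    return 1.0 / (mu * (1.0 - mu))

mu = 0.3  # illustrative mean
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(mu, mu + delta)
    quadratic = 0.5 * fisher_information_bernoulli(mu) * delta ** 2
    print(delta, exact, quadratic, exact / quadratic)
```

The last printed column is the ratio of the exact value to the quadratic approximation; it should approach one as $\Delta$ shrinks.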

Note 2: In the previous post we showed that the regret of UCB is at most $O(\sqrt{nK \log(n)})$, which does not match the lower bound that we just proved. It is possible to show two things. First, that UCB does not achieve $O(\sqrt{nK})$ regret. The logarithm is real, and not merely a consequence of lazy analysis. Second, there is a modification of UCB (called MOSS) that matches the lower bound up to constant factors. We hope to cover this algorithm in future posts, but the general approach is to use a less conservative confidence level.

Note 3: We have now shown a lower bound that is $\Omega(\sqrt{nK})$, while in the previous post we showed an upper bound for UCB that was $O(\log(n))$. What is going on? Do we have a contradiction? Of course, the answer is no and the reason is that the logarithmic regret guarantees depended on the inverse gaps between the arms, which may be very large.

Note 4: Our lower bounding theorem was only proved when $n$ is at least linear in $K$. What happens when $n$ is small compared to $K$ (in particular, when $n\le K-1$)? A careful investigation of the proof of the theorem shows that in this case a linear lower bound of size $\frac{n}{8}\exp(-1/2)$ also holds for the regret (choose $\Delta=1/2$). Note that for small values of $n$ compared to $K$, the square-root lower bound gives larger than linear regret, owing to the fact that $\sqrt{x}\ge x$ for $0\le x\le 1$. If the mean rewards are restricted to the $[0,1]$ interval, then the regret cannot be larger than $n$, i.e., for $n$ small, the lower bound is vacuous (in particular, this happens when $n\le (K-1)/27$). Hence, the square-root form is simply too strong to hold for small values of $n$. A lower bound that joins the two cases, and thus holds for any value of $n$ regardless of $K$, is $\frac{\exp(-1/2)}{16}\, \min( n, \sqrt{(K-1)n} )$.

Note 5: To guarantee that the right-hand side of $\eqref{eq:jointdensity}$ defines a measure, we also must have that for any $t\ge 1$, $\pi_t(i|\cdot): \cH_{t-1} \to [0,1]$ is a measurable map. In fact, $\pi_t$ is a simple version of a so-called **probability kernel**. For the interested reader, we add the general definition for a probability kernel $K$: Let $(\cX,\cF)$, $(\cY,\cG)$ be measurable spaces. $K:\cX \times \cG \to [0,1]$ is a probability kernel if it satisfies the following two properties: *(i)* For any $x\in \cX$, $K(x,\cdot)$ as the $U \mapsto K(x,U)$ ($U\in \cG$) map is a probability measure; *(ii)* For any $U\in \cG$, $K(\cdot,U)$ as the $x\mapsto K(x,U)$ ($x\in \cX$) map is measurable. When $\cY$ is finite and $\cG= 2^{\cY}$, instead of $K(x,U)$ for $U\subset \cY$, it is common that we specify the values of $K(x,\{y\})$ for $x\in \cX$ and $y\in \cY$ and in these cases we also often write $K(x,y)$. It is also common to write $K(U|x)$ (or in the discrete case $K(y|x)$).

Note 6: There is another subtlety in connection with the definition of $\PP_{\nu,\pi}$: we emphasized that $\lambda$ should be a measure over the Lebesgue $\sigma$-algebra $\mathcal L(\R)$. This will be typically easy to satisfy and holds for example when $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. This requirement would not be needed though if we did not insist on using the canonical bandit model. In particular, by their original definition, $X_1,\dots,X_n$ (as maps from $\Omega$ to $\R$) need only be $\cF/\mathcal B(\R)$ measurable (where $\mathcal B(\R)$ is the Borel $\sigma$-algebra over $\R$) and thus at first sight it may seem strange to require that $P_1,\dots,P_K$ are defined over $\mathcal L(\R)$. The reason we require this is that in the canonical model we wish to use the Lebesgue $\sigma$-algebra of $\R^{2n}$ (in particular its restriction to $\cH_n$), which is in fact demanded by our wish to work with complete measure-spaces. This latter, as explained at the link, is useful because it essentially allows us not to worry about (and drop) sets of zero measure. Now, $\PP_{\nu,\pi}$, as the joint distribution of $(A_1,X_1,\dots,A_n,X_n)$ would only be defined for Borel measurable sets. Then, $\eqref{eq:jointdensity}$ could be first derived for the restriction of $\lambda$ to Borel measurable sets and then $\eqref{eq:jointdensity}$ with $\lambda$ not restricted to Borel measurable sets gives the required extension of $\PP_{\nu,\pi}$ to Lebesgue measurable sets. We can also take $\eqref{eq:jointdensity}$ as our definition of $\PP_{\nu,\pi}$, in which case the issue does not arise.

Note 7: (A simple version of) **Jensen's inequality** states that for any convex set $U\subset \R$ with non-empty interior, any convex function $f: U \to \R$ and any random variable $X\in U$ such that $\EE{X}$ exists, $f(\EE{X})\le \EE{f(X)}$. The proof is simple if one notes that for such a convex function $f$, at every point $m\in \R$ in the interior of $U$, there exists a “slope” $a\in \R$ such that $a(x-m)+f(m)\le f(x)$ for all $x\in U$ (if $f$ is differentiable at $m$, take $a = f'(m)$). Indeed, if such a slope exists, taking $m = \EE{X}$ and replacing $x$ by $X$ we get $a(X-\EE{X}) + f(\EE{X}) \le f(X)$. Then, taking the expectation of both sides we arrive at Jensen's inequality. The idea can be generalized in multiple directions, e.g., the domain of $f$ could be a convex set in a vector space.

Note 8: We cannot help but point out that the quantity $\int \sqrt{p q}$ that appeared in the previous proof is called the **Bhattacharyya coefficient** and that $\sqrt{1 - \int \sqrt{pq}}$ is the **Hellinger distance**, which is also a measure of the distance between probability measures $P$ and $Q$ and is frequently useful (perhaps especially in mathematical statistics) as a more well-behaved distance than the relative entropy (it is bounded, symmetric and satisfies the triangle inequality!).
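The properties listed in Note 8 are easy to observe on small examples. The following sketch uses the form $\sqrt{1-\int\sqrt{pq}}$ given above, specialized to distributions on a finite set; the three probability vectors and the helper names are arbitrary illustrative choices.

```python
import math

def hellinger(p, q):
    """sqrt(1 - sum_i sqrt(p_i * q_i)) for probability vectors p and q."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))              # clamp guards against rounding

def kl(p, q):
    """Relative entropy KL(P, Q); assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
r = [0.3, 0.4, 0.3]

print("Hellinger:", hellinger(p, q), hellinger(q, p))   # symmetric
print("KL:", kl(p, q), kl(q, p))                        # not symmetric in general
print("triangle:", hellinger(p, q), "<=", hellinger(p, r) + hellinger(r, q))
```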

Note 9: Earlier, what in the latest version of the post is called the divergence decomposition lemma was called the information processing lemma. This previous choice of name, however, was not very fortunate as it collides with the information processing inequality, which is a related inequality, but still quite different. Perhaps we should call this the bandit divergence decomposition lemma, or its claim the bandit divergence decomposition identity, though very little of the specifics of bandit problems is used here. In particular, the identity holds whenever a sequential policy interacts with a stochastic environment where the feedback to the policy is based on a fixed set of distributions (swap the reward distributions with feedback distributions in the more general setting).

# References

The first minimax lower bound that we know of was given by Auer et al. below. Our proof uses a slightly different technique, but otherwise the idea is the same.

- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995.

The high-probability Pinsker inequality appears as Lemma 2.6 in Sasha Tsybakov's book on nonparametric estimation. According to him, the result is originally due to Bretagnolle and Huber and the original reference is:

- Bretagnolle, J. and Huber, C. (1979) Estimation des densités: risque minimax. Z. für Wahrscheinlichkeitstheorie und verw. Geb., 47, 119-137.

One of the authors learned this inequality from the following paper:

- Sébastien Bubeck, Vianney Perchet, Philippe Rigollet: Bounded regret in stochastic multi-armed bandits. COLT 2013: 122-134

# Instance dependent lower bounds

In the last post we showed that under mild assumptions ($n = \Omega(K)$ and Gaussian noise), the regret in the worst case is at least $\Omega(\sqrt{Kn})$. More precisely, we showed that for every policy $\pi$ and $n\ge K-1$ there exists a $K$-armed stochastic bandit problem $\nu\in \mathcal E_K$ with unit variance Gaussian rewards and means in $[0,1]$ such that the regret $R_n(\pi,\nu)$ of $\pi$ on $\nu$ satisfies $R_n(\pi,\nu) \geq c\sqrt{nK}$ for some universal constant $c > 0$, or

\begin{align*}

R_n^*(\cE_K)\doteq\inf_\pi \sup_{\nu \in \cE_K} R_n(\pi,\nu)\ge c \sqrt{nK}\,.

\end{align*}

Earlier we have also seen that the regret of UCB on such problems is at most $C \sqrt{nK \log(n)}$ with some universal constant $C>0$: $R_n(\mathrm{UCB},\cE_K)\le C \sqrt{nK\log(n)}$. Thus, UCB is near minimax-optimal, in the sense that except for a logarithmic factor its upper bound matches the lower bound for $\cE_K$.

Does this mean that UCB is necessarily a good algorithm for class $\cE_K$? Not quite! Consider the following modification of UCB, which we will call by the descriptive, but not particularly flattering name UCBWaste. Choose $0 < C' \ll C$, where $C$ is the universal constant in the upper bound on the regret of UCB (if we actually calculated the value of $C$, we could see that e.g. $C'=0.001$ would do). Then, in the first $m=C'\sqrt{Kn}$ rounds, UCBWaste will explore each of the $K$ actions equally often, after which it switches to the UCB strategy. It is easy to see that from the perspective of worst-case regret, UCBWaste is almost as good as UCB (being at most $C'\sqrt{Kn}$ worse). Would you accept UCBWaste as a good policy?

Perhaps not. To see why, consider for example a problem $\nu\in \cE_K$ such that on $\nu$ the immediate regret for choosing any suboptimal action is close to one. From the regret analysis of UCB we can show that after seeing a few rewards from some suboptimal action, with high probability UCB will stop using that action. As a result, UCB on $\nu$ would achieve a very small regret. In fact, it follows from our previous analysis that for sufficiently large $n$, UCB will achieve a regret of $\sum_{i:\Delta_i(\nu)>0} \frac{C\log(n)}{\Delta_i(\nu)} \approx C K \log(n)$, where $\Delta_i(\nu)$ is the suboptimality gap of action $i$ in $\nu$. However, UCBWaste, true to its name, will suffer a regret of at least $C'\sqrt{Kn}/2$ on $\nu$ (at least half of the $C'\sqrt{Kn}$ exploration rounds are spent on suboptimal actions), a quantity that for $n$ large is much larger than the logarithmic bound on the regret of UCB.
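To get a feeling for the orders of magnitude, the sketch below compares a lower bound on the regret that UCBWaste pays during its forced round-robin phase with the logarithmic proxy $\sum_i C\log(n)/\Delta_i$ for UCB, on an instance where every suboptimal gap equals one. The constants (`c_prime = 0.5`, `c = 1.0`), the value of $K$ and the helper names are hypothetical and chosen only to make the effect visible at moderate $n$; they are not the constants of the analysis.

```python
import math

def forced_exploration_regret(n, K, c_prime, gap=1.0):
    """Lower bound on the regret of a round-robin phase of m = c_prime*sqrt(K*n)
    rounds when one arm is optimal and all other arms have the same gap."""
    m = int(c_prime * math.sqrt(K * n))
    optimal_pulls = math.ceil(m / K)  # round-robin pulls the optimal arm at most this often
    return gap * (m - optimal_pulls)

def log_regret_proxy(n, K, c=1.0, gap=1.0):
    """The proxy sum_i c*log(n)/gap over the K-1 suboptimal arms."""
    return (K - 1) * c * math.log(n) / gap

K, c_prime = 10, 0.5  # hypothetical values
ratios = []
for n in (10**3, 10**5, 10**7):
    waste = forced_exploration_regret(n, K, c_prime)
    proxy = log_regret_proxy(n, K)
    ratios.append(waste / proxy)
    print(n, waste, round(proxy, 1), round(waste / proxy, 2))
```

The ratio in the last column grows with $n$, reflecting that $\sqrt{n}$ eventually dominates $\log(n)$ regardless of the constants.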

Thus, on the “easy”, or “nice” instances, UCBWaste's regret seems to be unnecessarily large at least as $n\to\infty$, even though UCBWaste is a near worst-case optimal algorithm for any $n$. If we care about what happens for $n$ large (and why should we not? After all, having higher standards should not hurt), it may be worthwhile to look beyond finite-time worst-case optimality.

Algorithms that are nearly worst-case optimal may fail to take advantage of environments that are not the worst case. What is more desirable is to have algorithms which are near worst-case optimal, while their performance gets better on “nicer” instances.

But what makes an instance “nice”? For a fixed $K$-action stochastic bandit environment $\nu$ with $1$-subgaussian reward-noise, the regret of UCB was seen to be logarithmic and in particular

\begin{align}

\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq c^*(\nu) \doteq \sum_{i:\Delta_i(\nu) > 0} \frac{2}{\Delta_i(\nu)}\,.

\label{eq:cnudef}

\end{align}

Then perhaps $c^*(\nu)$ could be used as an intrinsic measure of the difficulty of learning on $\nu$. Or perhaps there exists some constant $c'(\nu)$ that is even smaller than $c^*(\nu)$ such that some strategy suffers at most $c'(\nu) \log(n)$ regret asymptotically? If, for example, $c'(\nu)$ is smaller by a factor of two than $c^*(\nu)$, then this could be a big difference!
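The constant $c^*(\nu)$ depends on the instance only through the suboptimality gaps, so it is straightforward to compute from the vector of mean rewards. A minimal sketch (the function names and the two example instances are ours):

```python
def suboptimality_gaps(means):
    """Gap of each action: best mean minus the action's own mean."""
    best = max(means)
    return [best - m for m in means]

def c_star(means):
    """Sum of 2/gap over the suboptimal actions."""
    return sum(2.0 / gap for gap in suboptimality_gaps(means) if gap > 0)

easy = [1.0, 0.0, 0.0]   # large gaps: a "nice" instance
hard = [1.0, 0.9, 0.9]   # small gaps: a harder instance
print(c_star(easy), c_star(hard))
```

Instances with small gaps have a large $c^*(\nu)$, matching the intuition that near-optimal actions take longer to rule out.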

In this post we will show that, in some sense, $c^*(\nu)$ is the best possible. In particular, we will show that no *reasonable* strategy can outperform UCB in the asymptotic sense above on *any problem instance* $\nu$. This is a big difference from the previous approach because the “order of quantifiers” is different: In the minimax result proven in the last post we showed that for any policy $\pi$ there exists a bandit instance $\nu \in \cE_K$ on which $\pi$ suffers large regret. Here we show that for all reasonable policies $\pi$, the policy’s asymptotic regret is always at least as large as that of UCB.

We have not yet said what we mean by reasonable. Clearly for any $\nu$, there are policies for which $\limsup_{n\to\infty} R_n / \log(n) = 0$, for example, the policy that always chooses $A_t = \argmax_i \mu_i(\nu)$. But this is not a reasonable policy because it suffers linear regret for any $\nu'$ such that the optimal action of $\nu$ is not optimal in $\nu'$. At least, we should require that the policy achieves sublinear regret over all problems. This may still look a bit wasteful because UCB achieves logarithmic asymptotic regret for any $1$-subgaussian problem. A compromise between the undemanding sublinear and the perhaps overly restrictive logarithmic growth is to require that the policy has *subpolynomial* regret growth, a choice that has historically been the most used, and which we will also take. The resulting policies are said to be *consistent*.

# Asymptotic lower bound

To firm up the ideas, define $\cE_K'$ as the class of $K$-action stochastic bandits where each action has a reward distribution with $1$-subgaussian noise. Further, let $\cE_K'(\mathcal N)\subset \cE_K'$ be those environments in $\cE_K'$ where the reward distribution is Gaussian. Note that while for minimax lower bounds it is important to assume that the sub-optimality gaps are bounded from above, this is not required in the asymptotic setting. The reason is that any reasonable (that word again) strategy should choose each arm at least once, which does not affect the asymptotic regret at all, but *does* affect the finite-time minimax regret.

Definition (consistency): A policy $\pi$ is *consistent* over $\cE$ if for all bandits $\nu \in \cE$ and for all $p>0$ it holds that

\begin{align}\label{eq:subpoly}

R_n(\pi,\nu) = O(n^p)\, \quad \text{ as } n \to\infty\,.

\end{align}

We denote the class of consistent policies over $\cE$ by $\Pi_{\text{cons}}(\cE)$.

Note that $\eqref{eq:subpoly}$ trivially holds for $p\ge 1$. Note also that because $n \mapsto n^p$ is positive-valued, the above is equivalent to saying that for each $\nu \in \cE$ there exists a constant $C_\nu>0$ such that $R_n(\pi,\nu) \le C_\nu n^p$ for *all* $n\in \N$. In fact, one can also show that above we can also replace $O(\cdot)$ with $o(\cdot)$ with no effect on whether a policy is called consistent or not. Just like in statistics, consistency is a purely asymptotic notion.
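As an illustration of the remark that consistency can be stated with a constant valid for *all* $n$, consider a hypothetical policy whose regret is exactly $\log(n)$. Maximizing $\log(x)/x^p$ over $x>0$ (the maximum is attained at $x = e^{1/p}$) shows that $\log(n) \le \frac{1}{pe} n^p$ for every $n\ge 1$. The sketch below checks this numerically; the function name and the range of $n$ tested are our choices.

```python
import math

def log_regret_constant(p):
    """A constant C with log(n) <= C * n**p for every n >= 1,
    obtained by maximizing log(x) / x**p over x > 0."""
    return 1.0 / (p * math.e)

for p in (0.5, 0.1, 0.01):
    C = log_regret_constant(p)
    worst_ratio = max(math.log(n) / n ** p for n in range(1, 200_001))
    print(p, C, worst_ratio)
```

For every $p$ the largest observed ratio stays below the constant, and the constant blows up as $p\to 0$, which is why consistency needs a separate constant for each $p$.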

We can see that UCB is consistent (its regret is logarithmic) over $\cE_K'$. On the other hand, the strategy that always chooses a fixed action $i\in [K]$ is not consistent (if $K > 1$) over any reasonable $\cE$, since its regret is linear in any $\nu$ where $i$ is suboptimal.

The main theorem of this post is that no consistent strategy can outperform UCB asymptotically in the Gaussian case:

Theorem (Asymptotic lower bound for Gaussian environments): For any policy $\pi$ consistent over the $K$-action unit-variance Gaussian environments $\cE_K'(\mathcal N)$ and any $\nu\in \cE_K'(\mathcal N)$, it holds that

\begin{align*}

\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \geq c^*(\nu)\,,

\end{align*}

where recall that $c^*(\nu)= \sum_{i:\Delta_i(\nu)>0} \frac{2}{\Delta_i(\nu)}$.

Since UCB is consistent for $\cE_K'\supset \cE_K'(\mathcal N)$,

\begin{align*}

c^*(\nu) \le \inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}

\le

\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)}\le c^*(\nu)\,,

\end{align*}

that is

\begin{align*}

\inf_{\pi \in \Pi_{\text{cons}}(\cE_K'(\mathcal N))} \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)}

=

\limsup_{n\to\infty} \frac{R_n(\text{UCB},\nu)}{\log(n)} = c^*(\nu)\,.

\end{align*}

Thus, we see that **UCB is asymptotically optimal** over the class of unit variance Gaussian environments $\cE_K'(\mathcal N)$.

**Proof**

Pick a $\cE_K'(\mathcal N)$-consistent policy $\pi$ and a unit variance Gaussian environment $\nu =(P_1,\dots,P_K)\in \cE_K'(\mathcal N)$ with mean rewards $\mu \in \R^K$. We need to lower bound $R_n \doteq R_n(\pi,\nu)$. Let $\Delta_i \doteq \Delta_i(\nu)$. Based on the basic regret decomposition identity $R_n = \sum_i \Delta_i \E_{\nu,\pi}[T_i(n)]$, it is not hard to see that the result will follow if for any action $i\in [K]$ that is suboptimal in $\nu$ we prove that

\begin{align}

\liminf_{n\to\infty} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge \frac{2}{\Delta_i^2}\,.

\label{eq:armlowerbound}

\end{align}

As in the previous lower bound proof, we assume that all random variables are defined over the canonical bandit measurable space $(\cH_n,\cF_n)$, where $\cH_n = ([K]\times \R)^n$, $\cF_n$ is the restriction of $\mathcal L(\R^{2n})$ to $\cH_n$, and $\PP_{\nu,\pi}$ (which induces $\E_{\nu,\pi}$) is the distribution of $n$-round action-reward histories induced by the interconnection of $\nu$ and $\pi$.

*The intuitive logic of the lower bound is as follows*: $\E_{\nu,\pi}[T_i(n)]$ must be (relatively) large because the regret in all environments $\nu'$ where (as opposed to what happens in $\nu$) action $i$ is optimal must be small by our assumption that $\pi$ is a “reasonable” policy and for this $\E_{\nu',\pi}[T_i(n)]$ must be large. However, if the gap $\Delta_i$ is small and $\nu'$ is close to $\nu$, $\E_{\nu,\pi}[T_i(n)]$ will necessarily be large when $\E_{\nu',\pi}[T_i(n)]$ is large because the distributions $\PP_{\nu,\pi}$ and $\PP_{\nu',\pi}$ will also be close.

As in the previous lower bound proof, the divergence decomposition identity and the high-probability Pinsker inequality will help us to carry out this argument. In particular, from the divergence decomposition lemma, we get that for any $\nu'=(P'_1,\dots,P_K')$ such that $P_j=P_j'$ except for $j=i$,

\begin{align}

\KL(\PP_{\nu,\pi}, \PP_{\nu',\pi}) = \E_{\nu,\pi}[T_i(n)] \, \KL(P_i, P'_i)\,,

\label{eq:infoplem}

\end{align}

while the high-probability Pinsker inequality gives that for any event $A$,

\begin{align*}

\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)\ge \frac12 \exp(-\KL(\PP_{\nu,\pi}, \PP_{\nu',\pi}) )\,,

\end{align*}

or equivalently,

\begin{align}

\KL(\PP_{\nu,\pi},\PP_{\nu',\pi}) \ge \log \frac{1}{2 (\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c))}\,.

\label{eq:pinsker}

\end{align}

We now upper bound $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ in terms of $R_n+R_n'$ where $R_n' = R_n(\pi,\nu')$. For this choose $P_i' = \mathcal N(\mu_i+\lambda,1)$ with $\lambda>\Delta_i$ to be selected later. Further, we choose $A= \{ T_i(n)\ge n/2 \}$ and thus $A^c = \{ T_i(n) < n/2 \}$. These choices make $\PP_{\nu,\pi}(A)+\PP_{\nu',\pi}(A^c)$ small if $R_n$ and $R_n'$ are both small. Indeed, since $i$ is suboptimal in $\nu$, $i$ is optimal in $\nu'$, and for any $j\ne i$, $\Delta_j(\nu') \ge \lambda-\Delta_i$,
\begin{align*}
R_n \ge \frac{n\Delta_i}{2} \PP_{\nu,\pi}(T_i(n)\ge n/2 ) \, \text{ and }
R_n' \ge \frac{n (\lambda-\Delta_i) }{2} \PP_{\nu',\pi}( T_i(n) < n/2 )\,.
\end{align*}

Summing these up, lower bounding $\Delta_i/2$ and $(\lambda-\Delta_i)/2$ by $\kappa(\Delta_i,\lambda) \doteq \frac{\min( \Delta_i, \lambda-\Delta_i )}{2}$, and then reordering gives

\begin{align*}

\PP_{\nu,\pi}( T_i(n) \ge n/2 ) + \PP_{\nu',\pi}( T_i(n) < n/2 )
\le \frac{R_n+R_n'}{ \kappa(\Delta_i,\lambda) \,n}
\end{align*}
as promised.
Combining this inequality with $\eqref{eq:infoplem}$ and $\eqref{eq:pinsker}$ and using $\KL(P_i,P'_i) = \lambda^2/2$ we get
\begin{align}
\frac{\lambda^2}{2}\E_{\nu,\pi}[T_i(n)] \ge
\log\left( \frac{\kappa(\Delta_i,\lambda)}{2} \frac{n}{R_n+R_n'} \right)
=
\log\left( \frac{\kappa(\Delta_i,\lambda)}{2}\right)
+ \log(n) - \log(R_n+R_n')\,.
\end{align}
Dividing both sides by $\log(n)$, the first term on the right-hand side vanishes as $n\to\infty$, so we see that it remains to upper bound $\frac{\log(R_n+R_n')}{\log(n)}$. For this, note that since $\pi$ is consistent and both $\nu$ and $\nu'$ belong to $\cE_K'(\mathcal N)$, for any $p>0$ we have $R_n+ R_n' \le C n^p$ for all $n$ and some constant $C>0$. Hence, $\log(R_n+R_n') \le \log(C) + p\log(n)$, from which we get that $\limsup_{n\to\infty} \frac{\log(R_n+R_n')}{\log(n)} \le p$. Since $p>0$ was arbitrary, it follows that this also holds for $p=0$. Hence,

\begin{align}

\liminf_{n\to\infty}

\frac{\lambda^2}{2} \frac{\E_{\nu,\pi}[T_i(n)]}{\log(n)} \ge

1 - \limsup_{n\to\infty} \frac{\log(R_n+R_n')}{\log(n)} \ge 1\,.

\end{align}

Since $\lambda>\Delta_i$ was arbitrary, dividing both sides by $\lambda^2/2$ and letting $\lambda \downarrow \Delta_i$ gives $\eqref{eq:armlowerbound}$, thus finishing the proof.

QED
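The key inequality of the proof, $\frac{\lambda^2}{2}\E_{\nu,\pi}[T_i(n)] \ge \log(\kappa(\Delta_i,\lambda)/2) + \log(n) - \log(R_n+R_n')$, can be explored numerically. The sketch below plugs in a hypothetical policy whose regret sum is $R_n+R_n' = c\log(n)$ and shows that the implied lower bound on $\E_{\nu,\pi}[T_i(n)]/\log(n)$ approaches $2/\lambda^2$ as $n$ grows. The values of $\Delta_i$, $\lambda$ and $c$, and the function name, are assumptions made purely for illustration.

```python
import math

def pull_lower_bound(n, gap, lam, regret_sum):
    """Lower bound on E[T_i(n)] implied by the inequality above.

    gap is Delta_i, lam > gap is the shift of arm i's mean in the alternative
    environment, and regret_sum stands for R_n + R_n'.
    """
    kappa = min(gap, lam - gap) / 2.0
    return (2.0 / lam ** 2) * (
        math.log(kappa / 2.0) + math.log(n) - math.log(regret_sum)
    )

gap, lam, c = 0.5, 0.55, 10.0   # hypothetical values with lam > gap
limit = 2.0 / lam ** 2          # asymptotic value of the bound divided by log(n)

ratios = []
for n in (1e6, 1e50, 1e300):
    regret_sum = c * math.log(n)  # hypothetical logarithmic regret on both instances
    ratios.append(pull_lower_bound(n, gap, lam, regret_sum) / math.log(n))
print(ratios, limit, 2.0 / gap ** 2)
```

The convergence is slow because the correction terms are only of order $\log\log(n)/\log(n)$; taking $\lambda$ closer to $\Delta_i$ moves the limit $2/\lambda^2$ towards $2/\Delta_i^2$ at the price of a more negative constant term.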

# Instance-dependent finite-time lower bounds

Is it possible to go beyond asymptotic instance optimality? Yes, of course! All we have to do is to redefine what is meant by a “reasonable” algorithm. One idea is to call an algorithm reasonable when its finite-time(!) regret is close to minimax optimal. The situation is illustrated in the graph below:

In this part we will show instance-optimal finite-time lower bounds building on the ideas of the previous proof. This proof by and large uses the same ideas as the proof of the minimax result in the last post. This suggests that for future reference it may be useful to extract the common part, which summarizes what can be obtained by chaining the high-probability Pinsker inequality with the divergence decomposition lemma:

Lemma (Instance-dependent lower bound): Let $\nu = (P_i)_i$ and $\nu'=(P'_i)_i$ be $K$-action stochastic bandit environments that differ only in the distribution of the rewards for action $i\in [K]$. Further, assume that $i$ is suboptimal in $\nu$ and is the unique optimal action in $\nu'$. Let $\lambda = \mu_i(\nu')-\mu_i(\nu)$. Then, for any policy $\pi$, with $\Delta_i = \Delta_i(\nu)$, $R_n = R_n(\pi,\nu)$ and $R_n' = R_n(\pi,\nu')$,

\begin{align}\label{eq:idalloclb}

\KL(P_i,P_i') \E_{\nu,\pi} [T_i(n)] \ge \log\left( \frac{\min\set{\lambda - \Delta_i, \Delta_i}}{4}\right) + \log(n) - \log(R_n+R_n')\,.

\end{align}

Note that the lemma holds for finite $n$ and any $\nu$ and can be used to derive *finite-time instance-dependent lower bounds* for any environment class $\cE$ that is rich enough to include a suitable $\nu'$ for every $\nu \in \cE$, and for policies that are uniformly good over $\cE$. For illustration consider the Gaussian unit variance environments $\cE\doteq \cE_K(\mathcal N)$ and policies $\pi$ whose regret, on any $\nu\in \cE$, is bounded by $C n^p$ for some $C>0$ and $p>0$. Call this class $\Pi(\cE,C,n,p)$, the class of $p$-order policies. Note that UCB is in this class with $C = C'\sqrt{K \log(n)}$ and $p=1/2$ with some $C'>0$. Thus, for $p\ge 1/2$ and suitable $C$ this class is not empty.

As an immediate corollary of the previous lemma we get the following result:

Theorem (Finite-time, instance-dependent lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$ and $\nu \in \cE$, the regret of $\pi$ on $\nu$ satisfies

\begin{align}

R_n(\pi,\nu) \ge

\frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{(1-p)\log(n) + \log(\frac{\Delta_i}{8C})}{\Delta_i}\right)^+\,,

\label{eq:finiteinstancebound}

\end{align}

where $(x)^+ = \max(x,0)$ is the positive part of $x\in \R$.

**Proof**: Given $\nu=(\mathcal N(\mu_i,1))_i$ and $i$ such that action $i$ is suboptimal in $\nu$, choose $\nu' = (\mathcal N(\mu_j',1))_j$ such that $\mu_j'=\mu_j$ unless $j=i$, in which case set $\mu_i' = \mu_i+\lambda$ with $\Delta_i < \lambda\le 2\Delta_i$, where $\Delta_i = \Delta_i(\nu)$. Then, $\nu'\in \cE$, $\KL(P_i,P_i') = \lambda^2/2$, and from $\log(R_n+R_n') \le \log(2C)+p\log(n)$ and $\eqref{eq:idalloclb}$, we get

\begin{align*}

\E_{\nu,\pi} [T_i(n)]

\ge \frac{2}{\lambda^2}\left(\log\left(\frac{n}{2Cn^p}\right) + \log\left(\frac{\min\set{\lambda - \Delta_i, \Delta_i}}{4}\right)\right)\,.

\end{align*}

Choosing $\lambda = 2\Delta_i$ and plugging this into the basic regret decomposition identity gives the result.

QED
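The bound $\eqref{eq:finiteinstancebound}$ is explicit enough to evaluate directly. The following sketch implements its right-hand side; the function name and the example values of the gaps, $n$, $C$ and $p$ are ours and only serve as an illustration.

```python
import math

def finite_time_lower_bound(gaps, n, C, p):
    """Right-hand side of the finite-time instance-dependent lower bound:
    (1/2) * sum over suboptimal arms of the positive part of
    ((1 - p) * log(n) + log(gap / (8 * C))) / gap."""
    total = 0.0
    for gap in gaps:
        if gap > 0:
            term = ((1 - p) * math.log(n) + math.log(gap / (8 * C))) / gap
            total += max(term, 0.0)  # positive part
    return 0.5 * total

n, C, p = 10**6, 1.0, 0.5  # illustrative values
print(finite_time_lower_bound([0.0, 0.5], n, C, p))
print(finite_time_lower_bound([0.0, 0.5, 1e-6], n, C, p))  # the tiny gap adds nothing
```

The second call shows the role of the positive part: an arm whose gap is tiny relative to $n^{p-1}$ contributes zero to the bound.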

With $p=1/2$, $C = C'\sqrt{K}$ with $C'>0$ suitable so that $\Pi(\cE,C'\sqrt{K},n,1/2)\ne \emptyset$ we get for any policy $\pi$ that is “near-minimax optimal” for $\cE$ that

\begin{align}

R_n(\pi,\nu) \ge \frac{1}{2} \sum_{i: \Delta_i>0} \left(\frac{\frac12\log(n) + \log(\frac{\Delta_i}{8C' \sqrt{K}})}{\Delta_i}\right)^+\,.

\label{eq:indnearminimaxlb}

\end{align}

There are two main differences to the asymptotic lower bound. First, when $\Delta_i$ is very small, the corresponding term is eliminated in the lower bound: for small $\Delta_i$ and small $n$, the impact of choosing action $i$ is negligible. Second, even when the $\Delta_i$ are all relatively large, the leading term in this lower bound is only a constant fraction of that of $c^*(\nu)$ (the coefficient of $\log(n)$ is $\frac{1}{4\Delta_i}$ here, compared to $\frac{2}{\Delta_i}$ in $c^*(\nu)$). This effect may be real: The class of policies considered is larger than in the asymptotic lower bound, hence there is the possibility that the policy that is best tuned for a given environment achieves a smaller regret. At the same time, we only see a constant factor difference.

The lower bound $\eqref{eq:indnearminimaxlb}$ should be compared to the upper bound derived for UCB. In particular, recall that the simplified form of an upper bound was $\sum_{i:\Delta_i>0} \frac{C \log(n)}{\Delta_i}$. We see that the main difference is that the upper bound lacks the second term that appears in $\eqref{eq:indnearminimaxlb}$. With some extra work, this gap can also be closed.

We now argue that the previous lower bound is not too weak in that in some sense it allows us to recover our previous minimax lower bound. In particular, choosing $\Delta_i= 8 \rho C n^{p-1}$ uniformly for all actions except a single optimal one, we get $(1-p)\log(n)+\log(\frac{\Delta_i}{8C}) = \log(\rho)$. Plugging this into $\eqref{eq:finiteinstancebound}$ and lower bounding $\sup_{\rho>0}\log(\rho)/\rho = \exp(-1)$ by $0.35$ gives the following corollary:

Corollary (Finite-time lower bound for $p$-order policies): Take $\cE$ and $\Pi(\cE,C,n,p)$ as in the previous paragraph. Then, for any $\pi \in \Pi(\cE,C,n,p)$,

\begin{align*}

\sup_{\nu\in \cE} R_n(\pi,\nu) \ge \frac{0.35}{16} \frac{K-1}{C} n^{1-p} \,.

\end{align*}

When $C = C'\sqrt{K}$ and $p=1/2$, we get the familiar $\Omega( \sqrt{Kn} )$ lower bound. However, note the difference: Whereas the previous lower bound was true for any policy, this lower bound holds only for policies in $\Pi(\cE,C'\sqrt{K},n,1/2)$. Nevertheless, it is reassuring that the instance-dependent lower bound is able to recover the minimax lower bound for the “near-minimax” policies. To emphasize the distinction even further: in our minimax lower bound we only showed that there exists an environment for which the lower bound holds, while the result above holds no matter which arm we choose to be the optimal one. This is only possible because we restricted the class of algorithms.
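The derivation of the corollary can be replayed numerically: set all suboptimal gaps to $8\rho C n^{p-1}$ with $\rho = e$ (the maximizer of $\log(\rho)/\rho$), evaluate the finite-time instance-dependent bound, and compare with the corollary's right-hand side. In the sketch below the function name and the values of $K$, $n$ and $C$ (we take $C=\sqrt{K}$, i.e. a hypothetical $C'=1$) are assumptions for illustration.

```python
import math

def finite_time_lower_bound(gaps, n, C, p):
    """(1/2) * sum over suboptimal arms of the positive part of
    ((1 - p) * log(n) + log(gap / (8 * C))) / gap."""
    total = 0.0
    for gap in gaps:
        if gap > 0:
            term = ((1 - p) * math.log(n) + math.log(gap / (8 * C))) / gap
            total += max(term, 0.0)
    return 0.5 * total

K, n, p = 10, 10**6, 0.5
C = math.sqrt(K)                   # hypothetical: C = C' * sqrt(K) with C' = 1
rho = math.e                       # maximizer of log(rho) / rho
gap = 8 * rho * C * n ** (p - 1)   # common gap of the K - 1 suboptimal arms
gaps = [0.0] + [gap] * (K - 1)

instance_bound = finite_time_lower_bound(gaps, n, C, p)
exact = (K - 1) * n ** (1 - p) / (16 * C * math.e)
corollary_bound = 0.35 / 16 * (K - 1) / C * n ** (1 - p)
print(instance_bound, exact, corollary_bound)
```

The first two numbers should coincide up to rounding, and both should be slightly above the third, since $0.35 < \exp(-1)$.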

# Summary

So UCB is close to minimax optimal (last post) and now we have seen that at least among the class of consistent strategies it is also asymptotically optimal, and also almost optimal for *any instance* amongst the near-minimax optimal algorithms even in a non-asymptotic sense. One might ask if there is anything left to do, and as always the answer is: Yes, lots!

For one, we have only stated the results for the Gaussian case. Similar results are easily shown for alternative classes. A good exercise is to repeat the derivations in this post for the class of Bernoulli bandits, where the rewards of all arms are sampled from a Bernoulli distribution.

There is also the question of showing bounds that hold with high probability rather than in expectation. So far we have not discussed any properties of the *distribution of the regret* other than its mean, which we have upper bounded for UCB and lower bounded here and in the last post. In the next few posts we will develop algorithms for which the regret is bounded with high probability and so the lower bounds of this nature will be delayed for a little while.

# Notes, references

There is now a wide range of papers with lower bounds for bandits. For asymptotic results the first article below is the earliest that we know of, while the others are generalizations.

- Lai and Robbins. Asymptotically Efficient Adaptive Allocation Rules. 1985
- Lai and Graves. Asymptotically Efficient Adaptive Choice of Control Laws in Controlled Markov Chains. 1997
- Burnetas and Katehakis. Optimal Adaptive Policies for Sequential Allocation Problems. 1996

For non-minimax results that hold in finite time, a variety of approaches have appeared very recently (along with one older one). They are:

- Garivier, Stoltz and Ménard. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems. 2016
- Kulkarni and Lugosi. Finite-time lower bounds for the two-armed bandit problem. 2000
- Wu, György and Szepesvári. Online Learning with Gaussian Payoffs and Side Observations. NIPS, 2015
- Lattimore. Regret Analysis of the Anytime Optimally Confident UCB. 2016

Finally, for the sake of completeness we note that minimax bounds are available in the following articles, the second of which deals with the high-probability case mentioned above.

- Auer, Cesa-Bianchi, Freund and Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. 1995
- Gerchinovitz and Lattimore. Refined Lower Bounds for Adversarial Bandits. 2016

Note 1. All of the above results treat all arms symmetrically. Sometimes one might want an algorithm that treats the arms differently and for which the regret guarantee depends on which arm is optimal. We hope eventually to discuss the most natural approach to this problem, which is to use a Bayesian strategy (though we will be interested in it for other reasons). Meanwhile, there are modifications of UCB that treat the actions asymmetrically so that the regret is small if some arms are optimal and large otherwise.

We simply refer the interested reader to a paper on the Pareto regret frontier for bandits, which fully characterizes the trade-offs available to asymmetric algorithms in terms of the worst-case expected regret when comparing to a fixed arm.

Note 2. The leading constant in the instance dependent bound is sub-optimal. This can be improved by a more careful choice of $\lambda$ that is closer to $\Delta_i$ than $2\Delta_i$ and dealing with the lower-order terms that this introduces.

# Adversarial bandits

A stochastic bandit with $K$ actions is completely determined by the distributions of rewards, $P_1,\dots,P_K$, of the respective actions. In particular, in round $t$, the distribution of the reward $X_t$ received by a learner choosing action $A_t\in [K]$ is $P_{A_t}$, regardless of the past rewards and actions. If we look carefully at any “real-world” problem, be it drug design, recommending items on the web, or anything else, we will soon find out that there are in fact no suitable distributions. First, it will be hard to argue that the reward is truly randomly generated, and even if it were randomly generated, the rewards could be correlated over time. Taking all these effects into account would make a stochastic model potentially quite complicated. An alternative is to take a pragmatic approach, where almost nothing is assumed about the mechanism that generates the rewards, while still keeping the objective of competing with the best action in hindsight. This leads to the so-called *adversarial bandit* model, the subject of this post. That we are still competing with the *single* best action in hindsight expresses the prior belief that the world is stationary, an assumption that we made quite explicit in the stochastic setting. Because a learning algorithm still competes with the best action in hindsight, as we shall see soon, learning is still possible in the sense that the regret can be kept sublinear. But, as we shall also see, the algorithms need to be radically different from those used in the stochastic setting.

In the remainder of the post, we first introduce the formal framework of “adversarial bandits” and the appropriately adjusted concept of minimax regret. Next, we point out that adversarial bandits are at least as hard as stochastic bandits, which immediately gives a lower bound on the minimax regret. The next question is whether there are algorithms that can meet the lower bound. Towards this goal we discuss the so-called Exp3 (read: “e-x-p 3”) algorithm and show a bound on its expected regret, which almost matches the lower bound, thus showing that, from a worst-case standpoint, adversarial bandits are effectively not harder than stochastic bandits. While having a small expected regret is great, Exp3 is randomizing and hence its regret will also be random. Could it be that the upper tail of the regret is much larger than its expectation? This question is investigated at the end of the post, where we introduce Exp3-IX (Exp3 “with implicit exploration”), a variant of Exp3, which is shown to also enjoy a “small” regret with high probability (not only in expectation).

# Adversarial bandits

An adversarial bandit environment $\nu$ is given by an arbitrary sequence of reward vectors $x_1,\dots,x_n\in [0,1]^K$. Given the environment $\nu = (x_1,\dots,x_n)\in [0,1]^{Kn}$, the model says that in round $t$ the reward of action $i\in [K]$ is just $x_{ti}$ (this is the $i$th entry in vector $x_t$). Note that we are using lower-case letters for the rewards. This is because the reward table is just a big table of numbers: the rewards are non-random. In fact, since arbitrary sequences are allowed, random rewards would not increase generality, hence we just work with deterministic rewards.

The interconnection of a policy $\pi$ and an environment $\nu = (x_1,\dots,x_n)\in [0,1]^{nK}$ works as follows: A policy $\pi$ interacting with an adversarial bandit environment $\nu$ chooses actions in a sequential fashion in $n$ rounds. In the first round, the policy chooses $A_1 \sim \pi_1(\cdot)$, possibly in a random fashion (hence, the upper case for $A_1$). As a result, the reward $X_1 = x_{1,A_1}$ is “sent back” to the policy. Note that $X_1$ is random since $A_1$ is random. More generally, in round $t\in [n]$, $A_t$ is chosen based on $A_1,X_1,\dots,A_{t-1},X_{t-1}$: $A_t \sim \pi_t(\cdot| A_1,X_1,\dots,A_{t-1},X_{t-1} )$. The reward incurred as a result is then $X_t = x_{t,A_t}$. This is repeated $n$ times, where $n$ may or may not be given a priori.

The goal is to design a policy $\pi$ that keeps the regret small no matter what rewards $\nu = (x_1,\dots,x_n)$ are assigned to the actions. Here, the regret (more precisely, expected regret) of a policy $\pi$ choosing $A_1,\dots,A_n$ while interacting with the environment $\nu = (x_1,\dots,x_n)$ is defined as

\begin{align*}

R_n(\pi,\nu) = \EE{\max_{i\in [K]} \sum_{t=1}^n x_{ti} - \sum_{t=1}^n x_{t,A_t}}\,.

\end{align*}

That is, performance is measured by the expected revenue lost when compared with the best action in hindsight. When $\pi$ and $\nu$ are clear from the context, we may just write $R_n$ in place of $R_n(\pi,\nu)$. Now, the quality of a policy is defined through its *worst-case regret*

\begin{align*}

R_n^*(\pi) = \sup_{\nu\in [0,1]^{nK}} R_n(\pi,\nu)\,.

\end{align*}

Similarly to the stochastic case, the main question is whether there exist policies $\pi$ such that the growth of $R_n^*(\pi)$ (as a function of $n$) is sublinear.

The problem is only interesting when $K>1$, which we assume from now on. Then, it is clear that unless $\pi$ is randomizing, its worst-case regret can be forced to be at least $n(1-1/K)$, which is linear in $n$. Indeed, if $\pi$ is not randomizing, in every round $t$, we can use the past actions $a_1,\dots,a_{t-1}$ and rewards $x_{1,a_1},\dots,x_{t-1,a_{t-1}}$ to query $\pi$ for the next action. If we find that this is $a_t$, we set $x_{t,a_t} = 0$ and $x_{t,i}=1$ for all other $i\ne a_t$. This way, the policy collects zero reward, while the least-played action is chosen at most $n/K$ times and hence has a total reward of at least $n-n/K$. We thus obtain a sequence $\nu= (x_1,\dots,x_n)$ such that $R_n(\pi,\nu) \ge n(1-1/K)$, with $R_n(\pi,\nu)=n$, the largest possible value of the regret, whenever some action is never chosen.
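The “second-guessing” construction can be sketched in a few lines of Python. This is only an illustration under our own assumptions: actions are indexed from $0$, a deterministic policy is modeled as a function of the history of (action, reward) pairs, and the names `second_guess`, `regret`, `round_robin` and `stubborn` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a deterministic policy maps the history of (action, reward)
# pairs to the next action in {0, ..., K-1}.
History = Sequence[Tuple[int, float]]
Policy = Callable[[History], int]


def second_guess(policy: Policy, n: int, K: int) -> List[List[float]]:
    """Build a reward table on which the deterministic policy earns nothing."""
    history: List[Tuple[int, float]] = []
    table: List[List[float]] = []
    for _ in range(n):
        a = policy(history)        # query the policy for its next action
        row = [1.0] * K            # every other action gets reward 1 ...
        row[a] = 0.0               # ... and the chosen action gets reward 0
        table.append(row)
        history.append((a, 0.0))   # the policy only ever observes reward 0
    return table


def regret(policy: Policy, table: List[List[float]]) -> float:
    """Regret of a deterministic policy on a fixed reward table."""
    history: List[Tuple[int, float]] = []
    collected = 0.0
    for row in table:
        a = policy(history)
        collected += row[a]
        history.append((a, row[a]))
    best = max(sum(row[i] for row in table) for i in range(len(table[0])))
    return best - collected


def round_robin(history: History) -> int:
    """Hypothetical policy that cycles through three actions."""
    return len(history) % 3


def stubborn(history: History) -> int:
    """Hypothetical policy that always plays action 0."""
    return 0
```

With $n=9$ and $K=3$, the round-robin policy plays every action three times, so its regret on the constructed table is $9(1-1/3)=6$, while the policy that never leaves action $0$ suffers the full regret of $9$.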

When $\pi$ randomizes, the above “second-guessing” strategy does not work anymore. As we shall see soon, randomization will indeed result in policies with good worst-case guarantees. Below, we will show that a regret that is almost as small as in the stochastic setting is possible. How is this possible? The key observation is that the goal of learning is modest: No environment can simultaneously generate large rewards for some actions and prevent a suitable learner from detecting which action these large rewards are associated with. This is because the rewards are bounded and hence a large total reward means that the reward of the action has to be large often enough, which makes it relatively easy for a learner to identify the action with such large rewards. Before discussing how this can be achieved, we should first compare stochastic and adversarial models in a little more detail.

# Relation between stochastic and adversarial bandits

We already noted that deterministic strategies will have linear worst-case regret. Thus, some strategies, including those that were seen to achieve a small worst-case regret for stochastic bandits, will give rise to large regret in the adversarial case. How about the other direction? Will an adversarial bandit strategy have small expected regret in the stochastic setting?

Note the subtle difference between the notions of regret in the stochastic and the adversarial cases: In the stochastic case, the total expected reward is compared to $\max_i \EE{ \sum_{t=1}^n X_{t,i}}$, the maximum expected reward, where $X_{t,i} \sim P_i$ iid, while in the adversarial case it is compared to the maximum reward. If the rewards are random themselves, the comparison is to the expectation of the maximum reward. Now, since $\EE{ \max_i \sum_{t=1}^n X_{t,i} } \ge \max_i \EE{ \sum_{t=1}^n X_{t,i}}$, we thus get that

\begin{align}

\EE{ \max_i \sum_{t=1}^n X_{t,i} - X_{t,A_t} } \ge \max_i \EE{ \sum_{t=1}^n X_{t,i} - X_{t,A_t} }\,.

\label{eq:stochadvregret}

\end{align}

(It is not hard to see that the distribution of $A_1,X_1,\dots,A_n,X_n$ with $X_t = X_{t,A_t}$ indeed satisfies the constraints that need to hold in the stochastic case; hence, the right-hand side is indeed the regret of the policy choosing $A_t$ in the environment $\nu = (P_i)_i$.)

The left-hand side of $\eqref{eq:stochadvregret}$ is the expected adversarial regret, while the right-hand side is the expected regret as defined for stochastic bandits. Thus, an algorithm designed to keep the adversarial regret small will achieve a small (worst-case) regret on stochastic problems, too. Conversely, the above inequality also implies that the worst-case regret for adversarial problems is lower bounded by the worst-case regret on stochastic problems. In particular, from our minimax lower bound for stochastic bandits, we get the following:

Theorem (Lower bound on worst-case adversarial regret): For any $n\ge K$, the minimax adversarial regret, $R_n^*= \inf_\pi \sup_{\nu\in [0,1]^{nK}} R_n(\pi,\nu)$, satisfies

\begin{align*}

R_n^* \ge c\sqrt{nK}\,,

\end{align*}

where $c>0$ is a universal constant.

The careful reader may notice that there is a problem with using the previous minimax lower bound developed for stochastic bandits, in that this bound was developed for environments with Gaussian reward distributions. But Gaussian rewards are unbounded, and in the adversarial setting the rewards must lie in a bounded range. In fact, if in the adversarial setting the range of rewards is unbounded, all policies will suffer unbounded regret in the very first round! To fix the issue, one needs to re-prove the minimax lower bound for stochastic bandits using reward distributions that guarantee a bounded range for the rewards themselves. One possibility is to use Bernoulli rewards and this is indeed what is typically done. We leave the modification of the previous proof to the reader. One hint is that in the proof, instead of using zero means, one should use means equal to $0.5$. The reason is that the variance of a Bernoulli variable decreases as its mean approaches zero, which makes learning about a mean close to zero easier. Keeping the means bounded away from zero (and one), however, allows our previous proof to go through with almost no changes.

# Trading off exploration and exploitation with Exp3

The most standard algorithm for the adversarial bandit setting is the so-called Exp3 algorithm, where “Exp3” stands for “**E**xponential-weight algorithm for **E**xploration and **E**xploitation”. The meaning of the name will become clear once we have discussed how Exp3 works.

In every round, the computation that Exp3 does has the following three steps:

- Sampling an action $A_t$ from a previously computed distribution $P_{ti}$;
- Estimating rewards for all the actions based on the observed reward $X_t$;
- Using the estimated rewards to update $P_{ti}$, $i\in [K]$.

Initially, in round one, $P_{1i}$ is set to the uniform distribution: $P_{1i} = 1/K$, $i\in [K]$.

Sampling from $P_{ti}$ means selecting $A_t$ randomly such that given the past $A_1,X_1,\dots,A_{t-1},X_{t-1}$, $\Prob{A_t=i|A_1,X_1,\dots,A_{t-1},X_{t-1}}=P_{ti}$. This can be implemented, for example, by generating a sequence of independent random variables $U_1,\dots,U_n$ uniformly distributed over $[0,1)$ and then in round $t$ choosing $A_t$ to be the unique index $i$ such that $U_t$ falls into the interval $[\sum_{j=1}^{i-1} P_{tj}, \sum_{j=1}^i P_{tj})$. How rewards are estimated and how they are used to update $P_{ti}$ are the subjects of the next two sections.
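The interval-based sampling rule just described can be sketched as follows. This is a minimal illustration, assuming actions are indexed from $0$ rather than $1$; the function name `sample_action` is ours.

```python
import random
from typing import Sequence


def sample_action(probs: Sequence[float], u: float) -> int:
    """Return the index i with sum(probs[:i]) <= u < sum(probs[:i + 1])."""
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    # Guard against rounding: the cumulative sum may end slightly below 1.
    return len(probs) - 1


rng = random.Random(0)                        # U_t is drawn uniformly from [0, 1)
action = sample_action([0.25, 0.25, 0.5], rng.random())
```

The final `return` is only a safeguard for floating-point rounding; with exact arithmetic the loop always returns.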

## Reward estimation

We discuss reward estimation more generally as it is a useful building block of many algorithms. Thus, for some policy $\pi = (\pi_1,\dots,\pi_n)$, let $P_{ti} = \pi_t(i|A_1,X_1,\dots,A_{t-1},X_{t-1})$. (Note that $P_{ti}$ is also random as it depends on $A_1,X_1,\dots,A_{t-1},X_{t-1}$ which were random.) Assume that $P_{ti}>0$ almost surely (we will later see that Exp3 does satisfy this).

Perhaps surprisingly, reward estimation also benefits from randomization! How? Recall that $P_{ti} = \Prob{A_t = i| A_1,X_1,\dots,A_{t-1},X_{t-1}}$. Then, after $X_t$ is observed, we can use the so-called (vanilla) *importance sampling estimator*:

\begin{align}

\hat{X}_{ti} = \frac{\one{A_t=i}}{P_{ti}} \,X_t\,.

\label{eq:impestimate}

\end{align}

Note that $\hat{X}_{ti}\in \R$ is random (it depends on $A_t,P_t,X_t$, and all these are random). Is $\hat X_{ti}$ a “good” estimate of $x_{ti}$? A simple way to get a first impression of this is to calculate the mean and variance of $\hat X_{ti}$. Is the mean of $\hat{X}_{ti}$ close to $x_{ti}$? Does $\hat{X}_{ti}$ have a small variance? To study these, let us introduce the conditional expectation operator $\Et{\cdot}$: $\Et{Z} \doteq \EE{Z|A_1,X_1,\dots,A_{t-1},X_{t-1} }$. As far as the mean is concerned, we have

\begin{align}

\Et{ \hat{X}_{ti} } = x_{ti}\,,

\label{eq:unbiasedness}

\end{align}

i.e., $\hat{X}_{ti}$ is an *unbiased estimate* of $x_{ti}$. While it is not hard to see why $\eqref{eq:unbiasedness}$ holds, we include the detailed argument as some of the notation will be useful later. In particular, writing $A_{ti} \doteq \one{A_t=i}$, we have $X_t A_{ti} = x_{t,i} A_{ti}$ and thus,

\begin{align*}

\hat{X}_{ti} = \frac{A_{ti}}{P_{ti}} \,x_{ti}\,.

\end{align*}

Now, $\Et{A_{ti}} = P_{ti}$ (thus $A_{ti}$ is the “random $P_{ti}$”) and since $P_{ti}$ is a function of $A_1,X_1,\dots,A_{t-1},X_{t-1}$, we get

\begin{align*}

\Et{ \hat{X}_{ti} }

& =

\Et{ \frac{A_{ti}}{P_{ti}} \,x_{ti} }

=

\frac{x_{ti}}{P_{ti}} \, \Et{ A_{ti} }

=

\frac{x_{ti}}{P_{ti}} \, P_{ti} = x_{ti}\,,

\end{align*}

proving $\eqref{eq:unbiasedness}$. (Of course, $\eqref{eq:unbiasedness}$ also implies that $\EE{\hat X_{ti}}=x_{ti}$.) Let us now discuss the variance of $\hat X_{ti}$. Similarly to the mean, we will consider the *conditional* variance $\Vt{\hat X_{ti}}$ of $\hat X_{ti}$, where $\Vt{\cdot}$ is defined by $\Vt{U} \doteq \Et{ (U-\Et{U})^2 }.$ In particular, $\Vt{\hat X_{ti}}$ is the variance of $\hat X_{ti}$ given the past, i.e., the variance due to the randomness of $A_t$ alone, disregarding the randomness of the history. The conditional variance is more meaningful than the full variance as the estimator itself has absolutely no effect on what happened in the *past*! Why, then, would we want to account for the variations due to the randomness of the history when discussing the quality of the estimator?

Hoping that the reader is convinced that it is the conditional variance that one should care about, let us see how big this quantity can be. As is well known, for $U$ random, $\Var(U) = \EE{U^2} - \EE{U}^2$. The same (trivially) holds for $\Vt{\cdot}.$ Hence, knowing that the estimate is unbiased, we see that we only need to compute the second moment of $\hat X_{ti}$ to get the conditional variance. Since $\hat{X}_{ti}^2 = \frac{A_{ti}}{P_{ti}^2} \,x_{ti}^2$, we have $\Et{ \hat{X}_{ti}^2 }=\frac{x_{ti}^2}{P_{ti}}$ and thus,

\begin{align}

\Vt{\hat{X}_{ti}} = x_{ti}^2 \frac{1-P_{ti}}{P_{ti}}\,.

\label{eq:vtimp}

\end{align}

In particular, we see that this can be quite large if $P_{ti}$ is small. We shall later see to what extent this can cause trouble.
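As a sanity check of $\eqref{eq:unbiasedness}$ and $\eqref{eq:vtimp}$, the conditional mean and variance can be computed exactly by averaging over the $K$ possible values of $A_t$. The sketch below does this for a fixed round; the function names and the example numbers are our own choices, and actions are indexed from $0$.

```python
from typing import List, Sequence, Tuple


def importance_estimates(probs: Sequence[float], x_t: float, action: int) -> List[float]:
    """Vanilla importance sampling estimates for all arms, computed from the
    played action and the single observed reward x_t."""
    return [x_t / p if i == action else 0.0 for i, p in enumerate(probs)]


def conditional_mean_and_variance(
    probs: Sequence[float], rewards: Sequence[float], arm: int
) -> Tuple[float, float]:
    """Exact conditional mean and variance of the estimate for `arm`,
    obtained by enumerating the possible values of A_t."""
    values = [
        importance_estimates(probs, rewards[a], a)[arm]  # estimate if A_t = a
        for a in range(len(probs))
    ]
    mean = sum(p * v for p, v in zip(probs, values))
    second_moment = sum(p * v * v for p, v in zip(probs, values))
    return mean, second_moment - mean ** 2


probs = [0.5, 0.25, 0.25]      # P_{t1}, P_{t2}, P_{t3}
rewards = [1.0, 0.5, 0.25]     # x_{t1}, x_{t2}, x_{t3}
mean, variance = conditional_mean_and_variance(probs, rewards, arm=1)
```

For the second arm in this example the mean equals $x_{ti} = 0.5$ and the variance equals $x_{ti}^2 (1-P_{ti})/P_{ti} = 0.75$, in agreement with the formulas above.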

While the estimate $\eqref{eq:impestimate}$ is perhaps the simplest one, this is not the only possibility. One possible alternative is to use

\begin{align}

\hat X_{ti} = 1- \frac{\one{A_t=i}}{P_{ti}}\,(1-X_t)\,.

\label{eq:lossestimate}

\end{align}

It is not hard to see that we still have $\Et{\hat X_{ti}} = x_{ti}$. Introducing $y_{ti} = 1-x_{ti}$, $Y_t = 1-X_t$, $\hat Y_{ti} = 1-\hat X_{ti}$, we can also see that the above formula can be written as

\begin{align*}

\hat Y_{ti} = \frac{\one{A_t=i}}{P_{ti}}\,Y_t\,.

\end{align*}

This is in fact the same formula as $\eqref{eq:impestimate}$ except that $X_t$ has been replaced by $Y_t$! Since $y_{ti}$, $Y_t$ and $\hat{Y}_{ti}$ have the natural interpretation of *losses*, we may notice that had we set up the framework with losses, this would have been the estimator that would have first come to mind. Hence, the estimator is called the *loss-based importance sampling estimator*. Unsurprisingly, the conditional variance of this estimator is also very similar to that of the previous one:

\begin{align*}

\Vt{\hat X_{ti}} = \Vt{\hat Y_{ti}} = y_{ti}^2 \frac{1-P_{ti}}{P_{ti}}\,.

\end{align*}

Since we expect $P_{ti}$ to be large for “good” actions (actions with large rewards, or small losses), based on this expression we see that the second estimator has smaller variance for “better” actions, while the first estimator has smaller variance for “poor” actions (actions with small rewards). Thus, the second estimator is more accurate for good actions, while the first is more accurate for bad actions. At this stage, one could be suspicious about the role of “zero” in all this argument. Can we change the estimator (either one of them) so that it is more accurate for actions whose reward is close to some specific value $v$? Of course! Just change the estimator so that $v$ is subtracted from the observed reward (or loss), then use the importance sampling formula, and then add back $v$.

While both estimators are unbiased, and apart from the small difference just mentioned, they seem to have similar variance, they are quite different in some other aspects. In particular, we may observe that the first estimator takes values in $[0,\infty)$, while the second estimator takes on values in $(-\infty,1]$. As we shall see, this will have a large impact on how useful these estimators are when used in Exp3.
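The comparison between the two estimators can be made concrete with a small sketch that computes their exact conditional mean and variance for a single arm with reward `x_i` and selection probability `p_i`. The helper names are hypothetical and the numbers are chosen only for illustration.

```python
from typing import Callable, Tuple

Estimator = Callable[[float, bool, float], float]


def reward_based(p_i: float, played: bool, x_t: float) -> float:
    """Vanilla importance sampling estimate of the reward of arm i."""
    return x_t / p_i if played else 0.0


def loss_based(p_i: float, played: bool, x_t: float) -> float:
    """Loss-based estimate: importance-weight the loss 1 - x_t instead."""
    return 1.0 - (1.0 - x_t) / p_i if played else 1.0


def mean_and_variance(estimator: Estimator, x_i: float, p_i: float) -> Tuple[float, float]:
    """Exact conditional mean and variance of the estimate for arm i."""
    hit = estimator(p_i, True, x_i)     # value when A_t = i (probability p_i)
    miss = estimator(p_i, False, 0.0)   # value otherwise; x_t is then unused
    mean = p_i * hit + (1.0 - p_i) * miss
    second_moment = p_i * hit ** 2 + (1.0 - p_i) * miss ** 2
    return mean, second_moment - mean ** 2


good_reward = mean_and_variance(reward_based, 0.75, 0.5)   # variance 0.5625
good_loss = mean_and_variance(loss_based, 0.75, 0.5)       # variance 0.0625
```

Both estimators return the mean $0.75$ here, but for this “good” arm the loss-based estimator has the smaller variance. Note also that `loss_based` never returns a value above one, while `reward_based` is never negative.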

## Probability computation

Let $\hat S_{ti}$ stand for the total estimated reward by the end of round $t$:

\begin{align*}

\hat S_{ti} \doteq \sum_{s=1}^{t} \hat{X}_{si}\,.

\end{align*}

A natural idea is to set the action-selection probabilities $(P_{ti})_i$ so that actions with higher total estimated reward $\hat S_{ti}$ receive larger probabilities. While there are many ways to turn $\hat S_{ti}$ into probabilities, one particularly simple and popular way is to use an exponential weighting scheme. The advantage of this scheme is that it allows the algorithm to quickly increase the probability of outstanding actions, while quickly reducing the probability of poor ones. This may be especially beneficial if the action-set is large.

The formula that follows the above idea is to set

\begin{align}

P_{ti} \doteq \frac{\exp(\eta \hat S_{t-1,i}) }{\sum_j \exp( \eta \hat S_{t-1,j} ) } \,.

\label{eq:exp3update}

\end{align}

Here $\eta>0$ is a tuning parameter that controls how aggressively $P_{ti}$ is pushed towards the “one-hot distribution” that concentrates on the action whose estimated total reward is the largest: As $\eta\to\infty$, the probability mass in $P_{t,\cdot}$ quickly concentrates on $\argmax_i \hat S_{t-1,i}$. While there are many schemes for setting the value of $\eta$, in this post, for the sake of simplicity, we will consider the simplest setting where $\eta$ is set based on the problem’s major parameters, such as $K$ and $n$. What we will learn about setting $\eta$ can then be generalized, for example, to the setting when $n$, the number of rounds, is not known in advance.

For practical implementations, and also for the proof that follows, it is useful to note that the action-selection probabilities can also be updated in an incremental fashion:

\begin{align}

P_{t+1,i} = \frac{P_{ti} \exp(\eta \hat X_{ti})}{\sum_j P_{tj} \exp(\eta \hat X_{tj})}\,.

\label{eq:probupdate}

\end{align}

The Exp3 algorithm is a bit tricky to implement in a numerically stable manner when the number of rounds and $K$ are both large: In particular, the normalizing term above requires care when the smallest and the largest probabilities differ by many orders of magnitude. An easy way to avoid numerical instability is to incrementally calculate $\tilde S_{ti} = \hat S_{ti} - \min_j \hat S_{tj}$ and then use that $P_{t+1,i} = \exp(\eta \tilde S_{ti})/\sum_j \exp(\eta \tilde S_{tj} )$, where again one needs to be careful when calculating $\sum_j \exp(\eta \tilde S_{tj} )$.
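One common way to handle the normalizing term is sketched below. Note that this sketch is a variation on the recipe in the text and not the same computation: it shifts the totals by their *maximum* rather than their minimum, so that every exponent is non-positive and `math.exp` cannot overflow. The function name `exp_weights` is ours.

```python
import math
from typing import List, Sequence


def exp_weights(s_hat: Sequence[float], eta: float) -> List[float]:
    """Exponential-weights probabilities computed from total estimated rewards.

    Shifting all totals by the same constant leaves the probabilities
    unchanged, so we subtract the maximum before exponentiating.
    """
    m = max(s_hat)
    weights = [math.exp(eta * (s - m)) for s in s_hat]  # each weight lies in (0, 1]
    total = sum(weights)                                # at least 1: the best arm contributes exp(0)
    return [w / total for w in weights]


probs = exp_weights([1e6, 1e6 - 1.0], eta=1.0)
```

A direct evaluation of `math.exp(1e6)` would raise an `OverflowError`, whereas the shifted computation returns the probabilities $1/(1+e^{-1})$ and $e^{-1}/(1+e^{-1})$. Very small weights may still underflow to zero, which is harmless for the normalization but means the corresponding action is never sampled.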

To summarize, Exp3 (exponential weights for exploration and exploitation) uses either the reward estimates based on $\eqref{eq:impestimate}$ or the estimates based on $\eqref{eq:lossestimate}$ to estimate the rewards in every round. Then, the reward estimates are used to update the action selection probabilities using the exponential weighting scheme given by $\eqref{eq:probupdate}$. Commonly, the probabilities $P_{1i}$ are initialized uniformly, though the algorithm and the theory can also work with non-uniform initializations, which can be beneficial to express prior beliefs.
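Putting the three steps together, a minimal sketch of Exp3 with the loss-based estimates $\eqref{eq:lossestimate}$ might look as follows. Several choices here are our own assumptions rather than part of the algorithm's definition: the reward table is passed in as a list of rows (a real learner would only see the reward of the chosen action), actions are indexed from $0$, sampling is delegated to `random.Random.choices`, and the probabilities are computed from the totals after subtracting their maximum for numerical stability.

```python
import math
import random
from typing import List, Sequence


def exp3(rewards: Sequence[Sequence[float]], eta: float, seed: int = 0) -> List[int]:
    """Run Exp3 with loss-based estimates on a fixed reward table.

    rewards[t][i] is the reward of action i in round t, assumed to lie in
    [0, 1]. Returns the list of actions played.
    """
    rng = random.Random(seed)
    K = len(rewards[0])
    s_hat = [0.0] * K                 # total estimated rewards, initially zero
    actions: List[int] = []
    for x_t in rewards:
        # Exponential weighting of the totals (uniform in the first round).
        m = max(s_hat)
        weights = [math.exp(eta * (s - m)) for s in s_hat]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Sample the action and observe only its reward.
        a = rng.choices(range(K), weights=probs)[0]
        observed = x_t[a]
        actions.append(a)
        # Loss-based estimates: 1 for unplayed actions, and
        # 1 - (1 - X_t) / P_{t,a} for the played action.
        for i in range(K):
            s_hat[i] += 1.0
        s_hat[a] -= (1.0 - observed) / probs[a]
    return actions
```

As a usage example, on a table with $n$ identical rows `[0.0, 1.0]` every play of the first action lowers its probability, so the learner quickly settles on the second action.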

# The expected regret of Exp3

In this section we will analyze the expected regret of Exp3. In all theorem statements here, Exp3 is assumed to be initialized with the uniform distribution, and it is assumed that Exp3 uses the “loss-based” reward estimates given by $\eqref{eq:lossestimate}$.

Our first result is as follows:

Theorem (Expected regret of Exp3):

For an arbitrary assignment $(x_{ti})_{ti}\in [0,1]^{nK}$ of rewards, the expected regret of Exp3 as described above and when $\eta$ is appropriately selected, satisfies

\begin{align*}

R_n \le 2 \sqrt{ n K \log(K) }\,.

\end{align*}

This, together with the previous result shows that up to a $\sqrt{\log(K)}$ factor, Exp3 achieves the minimax regret $R_n^*$.

**Proof**: Notice that it suffices to bound $R_{ni} = \sum_{t=1}^n x_{ti} - \EE{ \sum_{t=1}^n X_t }$, the regret relative to action $i$ being used in all the rounds.

By the unbiasedness property of $\hat X_{ti}$, $\EE{ \hat S_{ni} } = \sum_{t=1}^n x_{ti}$. Also, $\Et{X_t} = \sum_j P_{tj} x_{tj} = \sum_j P_{tj} \Et{\hat X_{tj} }$ and thus $\EE{ \sum_{t=1}^n X_t } = \EE{ \sum_{t,j} P_{tj} \hat X_{tj} }$. Thus, defining $\hat S_n = \sum_{t,j} P_{tj} \hat X_{tj}$, we see that it suffices to bound the expectation of

\begin{align*}

\hat S_{ni} - \hat S_n\,.

\end{align*}

For this, we develop a bound on the exponentiated difference $\exp(\eta(\hat S_{ni} - \hat S_n))$. As we will see, this is rather easy to bound. We start with bounding $\exp(\eta\hat S_{ni})$. First note that when $i$ is the index of the best arm, we won’t lose much by bounding this with $W_n \doteq \sum_{j} \exp(\eta\hat S_{nj})$. Define $\hat S_{0i} = 0$, so that $W_0 = K$. Then,

\begin{align*}

\exp(\eta \hat S_{ni} ) \le \sum_{j} \exp(\eta(\hat S_{nj})) = W_n = W_0 \frac{W_1}{W_0} \dots \frac{W_n}{W_{n-1}}\,.

\end{align*}

Thus, we need to study $\frac{W_t}{W_{t-1}}$:

\begin{align*}

\frac{W_t}{W_{t-1}} = \sum_j \frac{\exp(\eta \hat S_{t-1,j} )}{W_{t-1}} \exp(\eta \hat X_{tj} )

= \sum_j P_{tj} \exp(\eta \hat X_{tj} )\,.

\end{align*}

Now, $\exp(x) \le 1 + x + x^2$ holds for any $x\le 1$, and $\hat X_{tj}$ by its construction is never above one (this is where we use for the first time that $\hat X_{tj}$ is defined by $\eqref{eq:lossestimate}$ and not by $\eqref{eq:impestimate}$), so that $\eta \hat X_{tj}\le 1$ whenever $\eta\le 1$. Applying the inequality with $x = \eta \hat X_{tj}$, we get

\begin{align*}

\frac{W_t}{W_{t-1}} \le 1 + \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2

\le \exp( \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2 )\,,

\end{align*}

where we also used that $1+x \le \exp(x)$ which holds for any $x\in\R$. Putting the inequalities together we get

\begin{align*}

\exp( \eta \hat S_{ni} ) \le K \exp( \eta \hat S_n + \eta^2 \sum_{t,j} P_{tj} \hat X_{tj}^2 )\,.

\end{align*}

Taking the logarithm of both sides, dividing by $\eta>0$, and reordering gives

\begin{align}

\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \eta \sum_{t,j} P_{tj} \hat X_{tj}^2\,.

\label{eq:exp3core}

\end{align}

As noted earlier, the expectation of the left-hand side is $R_{ni}$, the regret against action $i$. Thus, it remains to bound the expectation of $\sum_{t,j} P_{tj} \hat X_{tj}^2$. A somewhat lengthy calculation shows that $\Et{\sum_j P_{tj} \hat X_{tj}^2 } = \sum_j P_{tj}(1-2y_{tj}) + \sum_j y_{tj}^2 \le K$, where recall that $y_{tj} = 1-x_{tj}$ is the round $t$ “loss” of action $j$.

Hence,

\begin{align*}

R_{ni} \le \frac{\log(K)}{\eta} + \eta n K\,.

\end{align*}

Choosing $\eta = \sqrt{\log(K)/(nK)}$ gives the desired result.

QED
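For readers who want to see the tuning step spelled out, here is a short sketch for generic constants $a,b>0$ (these two symbols are introduced only for this illustration). The map $\eta\mapsto a/\eta + b\eta$ is convex on $(0,\infty)$ and is minimized where its derivative vanishes:

\begin{align*}
\frac{d}{d\eta}\left(\frac{a}{\eta} + b\,\eta\right) = -\frac{a}{\eta^2} + b = 0
\quad\Longleftrightarrow\quad
\eta = \sqrt{\frac{a}{b}}\,,
\qquad\text{with minimum value}\qquad
\frac{a}{\sqrt{a/b}} + b\,\sqrt{\frac{a}{b}} = 2\sqrt{ab}\,.
\end{align*}

With $a = \log(K)$ and $b = nK$ this gives $\eta = \sqrt{\log(K)/(nK)}$ and the bound $2\sqrt{nK\log(K)}$ stated in the theorem.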

The heart of the proof is the use of the inequalities $\exp(x)\le 1 + x + x^2$ (true for $x\le 1$), followed by using $1+x\le \exp(x)$. We will now show an improved bound, which will have some additional benefits, as well. The idea is to replace the upper bound on $\exp(x)$ by

\begin{align}

\exp(x) \le 1 + x + \frac12 x^2\,,

\label{eq:expxupperbound}

\end{align}

which holds for $x\le 0$. The mentioned upper and lower bounds on $\exp(x)$ are shown in the figure below. From the figure, it is quite obvious that the second approximation is a great improvement over the first one when $x\le 0$. In fact, the second approximation consists of exactly the first three terms of the Taylor series expansion of $\exp(x)$ and it is the best second order approximation of $\exp(x)$ at $x=0$.
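The ranges on which the two quadratic upper bounds are valid can also be checked numerically. The snippet below is only a sanity check on a finite grid, not a proof; the function names are ours.

```python
import math


def loose_bound(x: float) -> float:
    """1 + x + x^2, an upper bound on exp(x) for x <= 1."""
    return 1.0 + x + x * x


def tight_bound(x: float) -> float:
    """1 + x + x^2 / 2, an upper bound on exp(x) for x <= 0."""
    return 1.0 + x + 0.5 * x * x


grid = [-5.0 + 0.01 * k for k in range(601)]   # points from -5 to 1
tol = 1e-12                                    # slack for rounding near x = 0

loose_ok = all(math.exp(x) <= loose_bound(x) + tol for x in grid)
tight_ok = all(math.exp(x) <= tight_bound(x) + tol for x in grid if x <= 0.0)
gap_at_minus_one = loose_bound(-1.0) - tight_bound(-1.0)   # equals 0.5
```

The restrictions on $x$ matter: at $x=0.5$ the tighter expression is already below $\exp(x)$, and at $x=2$ so is the looser one.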

Let us now put $\eqref{eq:expxupperbound}$ to good use in proving a tighter upper bound on the expected regret. Since by construction $\hat X_{tj}\le 1$, we compute

\begin{align*}

\exp(\eta \hat X_{tj} ) = \exp(\eta) \exp( \eta (\hat X_{tj}-1) )

\le \exp(\eta) \left\{1+ \eta (\hat X_{tj}-1) + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right\}\,,

\end{align*}

and thus, with more algebra and using $1+x\le \exp(x)$ again we get,

\begin{align*}
\sum_j P_{tj} \exp(\eta \hat X_{tj} )
&\le
\exp(\eta) \sum_j P_{tj} \left\{1+ \eta (\hat X_{tj}-1) + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right\} \\
&=
\exp(\eta) \left\{1+ \eta \sum_j P_{tj} (\hat X_{tj}-1) + \frac{\eta^2}{2} \sum_j P_{tj} (\hat X_{tj}-1)^2\right\} \\
&\le
\exp(\eta)\exp\left( \eta \sum_j P_{tj} (\hat X_{tj}-1) + \frac{\eta^2}{2} \sum_j P_{tj}(\hat X_{tj}-1)^2\right)\\
&=
\exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \frac{\eta^2}{2} \sum_j P_{tj}(\hat X_{tj}-1)^2\right)\,.
\end{align*}

We see that here we need to bound $\sum_j P_{tj} (\hat X_{tj}-1)^2$. To save on writing, it will be beneficial to switch to $\hat Y_{tj} = 1-\hat X_{tj}$ ($j\in [K]$), the estimates of the losses $y_{tj} = 1-x_{tj}$, $j\in [K]$. As done before, we also introduce $A_{tj} = \one{A_t=j}$ so that $\hat Y_{tj} = \frac{A_{tj}}{P_{tj}} y_{tj}$. Thus, $P_{tj} (\hat X_{tj}-1)^2 = P_{tj} \hat Y_{tj} \hat Y_{tj}$ and also $P_{tj} \hat Y_{tj} = A_{tj} y_{tj}\le 1$. Therefore, using that $\hat Y_{tj}\ge 0$, we find that $P_{tj} (\hat X_{tj}-1)^2 \le \hat Y_{tj}$.

This shows a second advantage of using $\eqref{eq:expxupperbound}$: When we use this inequality, the “second moment term” $\sum_j P_{tj} (\hat X_{tj}-1)^2$ can be seen to be bounded by $K$ in expectation (thanks to $\Et{\hat Y_{tj}}\le 1$). Finishing as before, we get

\begin{align}

\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t,j} \hat Y_{tj}\,.

\label{eq:hpstartingpoint}

\end{align}

Now taking expectations of both sides, upper bounding $\EE{\sum_{t,j} \hat Y_{tj}} = \sum_{t,j} y_{tj}$ by $nK$, and then choosing $\eta$ to minimize the resulting right-hand side, we derive

\begin{align*}

R_n \le \sqrt{ 2 n K \log(K) }\,.

\end{align*}

In summary, the following improved result also holds:

Theorem (Improved upper bound on the expected regret of Exp3): For an arbitrary assignment $(x_{ti})_{ti}\in [0,1]^{nK}$ of rewards, the expected regret of Exp3 satisfies

\begin{align*}

R_n \le \sqrt{ 2 n K \log(K) }\,.

\end{align*}

The new bound is smaller than the previous one by a factor of $\sqrt{2} = 1.4142\dots$, which shaves off approximately 30 percent of the previous bound!

# High-probability bound on the regret: Exp3-IX

It is reassuring that the *expected* regret of Exp3 on an *adversarial* problem is of the same size as the worst-case expected regret on *stochastic* bandit problems with bounded rewards. However, does this also mean that the random regret

\begin{align*}

\hat R_n = \max_i \sum_{t=1}^n x_{t,i} - \sum_{t=1}^n X_{t}\,,

\end{align*}

experienced in a single run will be small? While no such guarantee can be given (in the worst case $\hat R_n=n$ can easily hold — for some very bad sequence of outcomes), at least we would expect that the random regret is small with high probability. (In fact, we should have asked the same question for UCB and the other algorithms developed for the stochastic setting, a problem we may return to later.)

Before “jumping” into discussing the details of how such a high-probability bound can be derived, let us discuss the type of the results we expect to see. Looking at the regret definition, we may notice that the first term is deterministic, while the second is the sum of $n$ random variables each lying in $[0,1]$. If these were independent of each other, they would be subgaussian, thus the sum would also be subgaussian and we would expect to see a “small” tail. Hence, we may perhaps expect that the tail of $\hat R_n$ is also subgaussian? The complication is that the random variables in the sequence $(X_t)_t$ are far from being independent, but in fact are highly “correlated”. Indeed, $X_t = x_{t,A_t}$, where $A_t$ depends on *all* the previous $X_1,\dots,X_{t-1}$ in some complicated manner.

As it happens, the mentioned correlations can and often will destroy the thin tail property of the regret of the vanilla Exp3 algorithm that was described above. To understand why this happens, note that the magnitude of the fluctuations of the sum $\sum_t U_t$ of independent random variables $U_t$ is governed by $\Var(\sum_t U_t)= \sum_t \Var(U_t)$. While $(\hat X_{ti})_t$ is not an independent sequence, the size of $\sum_t \hat X_{ti}$ is also governed by a similar quantity, the sum of predictable variations, $\sum_t \Vt{\hat X_{ti}}$. As noted earlier, for both estimators considered so far the conditional variance in round $t$ is $\Vt{\hat X_{ti}}\sim 1/P_{ti}$, hence we expect the reward estimates to have high fluctuations when $P_{ti}$ gets small. As written, nothing in the algorithm prevents this. If the reward estimates fluctuate wildly, so will the probabilities $(P_{ti})_{ti}$, which means that perhaps the random regret $\hat R_n$ will also be highly varying. How can this be avoided? One idea is to change the algorithm to make sure that $P_{ti}$ is never too small: This can be done by mixing $(P_{ti})_i$ with the uniform distribution (pulling $(P_{ti})_i$ towards the uniform distribution). Shifting $(P_{ti})_i$ towards the uniform distribution increases the randomness of the actions and as such it can be seen as an explicit way of “forcing exploration”. Another option, which we will explore here, and which leads to a better algorithm, is to change the reward estimates to control their predictable variations. (In fact, forcing a certain amount of exploration alone is still insufficient to tame the large variations in the regret.)

Before considering how the reward estimates should be changed, it will be worthwhile to summarize what we know of the behavior of the random regret of Exp3. For doing this, we will switch to “losses” as this will remove some clutter in some formulae later. We start by rewriting inequality $\eqref{eq:hpstartingpoint}$ (which holds regardless of how the reward/loss estimates are constructed!) in terms of losses:

\begin{align*}

\hat L_n - \hat L_{ni} \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{j} \hat L_{nj}\,.

\end{align*}

Here, $\hat L_n$ and $\hat L_{ni}$ are defined by

\begin{align*}

\hat L_n = \sum_{t=1}^n \sum_{j=1}^K P_{t,j} \hat Y_{tj}\, \quad \text{ and }\,\quad

\hat L_{ni} = \sum_{t=1}^n \hat Y_{ti}\,

\end{align*}

where as usual $\hat Y_{tj} = 1-\hat X_{tj}$. Now, defining

\begin{align*}

\tilde L_n = \sum_{t=1}^n y_{t,A_t}\, \quad\text{ and }\,\quad

L_{ni} = \sum_{t=1}^n y_{ti}

\end{align*}

(recall that $y_{tj} = 1-x_{tj}$), the random regret $\hat R_{ni} = \sum_{t=1}^n x_{ti} - \sum_{t=1}^n X_t$ against the $i$th expert becomes $\tilde L_n - L_{ni}$, and thus,

\begin{align}

\hat R_{ni}

& = \tilde L_n - L_{ni}

= (\tilde L_n - \hat L_n) + (\hat L_n - \hat L_{ni}) + (\hat L_{ni} - L_{ni}) \nonumber \\

& \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\, \sum_{j} L_{nj}\,\, +

(\tilde L_n - \hat L_n) + (\hat L_{ni} - L_{ni})

+ \frac{\eta}{2}\, \sum_j (\hat L_{nj}-L_{nj}) \,.

\label{eq:mainexp3ix}

\end{align}

Thus, to bound the random regret all we have to do is to bound $\tilde L_n - \hat L_n$ and the terms $\hat L_{nj}-L_{nj}$.

To do this, we will need the modified reward estimates. Recalling that the goal was to control the estimate’s variance, and the large variance of $\hat X_{ti}$ was the result of dividing by a potentially small $P_{ti}$, a simple “fix” is to use

\begin{align*}

\hat X_{ti} = 1-\frac{\one{A_t=i}}{P_{ti}+\gamma} \, (1-X_t)\,,

\end{align*}

or, when using losses,

\begin{align*}

\hat Y_{ti} = \frac{\one{A_t=i}}{P_{ti}+\gamma} \, Y_t\,.

\end{align*}

Here $\gamma\ge 0$ is a small constant whose value, as usual, will be chosen later. When these estimates are used in the standard “exponential update” (cf. $\eqref{eq:probupdate}$) we get an algorithm called **Exp3-IX**. The suffix IX refers to the fact that the estimator makes the algorithm e**x**plore in an **i**mplicit fashion, as will be explained next.

Since a larger value is used in the denominator, for $\gamma>0$, $\hat Y_{ti}$ is biased. In particular, the bias in $\hat Y_{ti}$ is “downwards”, i.e., $\Et{\hat Y_{ti}}$ will be a lower bound on $y_{ti}$. Symmetrically, the value of $\hat X_{ti}$ is inflated. In other words, the estimator will be *optimistically biased*. But will optimism help? And did anyone say optimism is for fools? Anyhow, the previous story repeats: optimism facilitates exploration. In particular, as the estimates are optimistic, the probabilities in general will be pushed up. The effect is the largest for the smallest probabilities, thus increasing “exploration” in an implicit fashion. Hence the suffix “IX”.
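A minimal Python sketch of Exp3-IX with loss estimates may make the role of $\gamma$ concrete. The function name `exp3_ix`, the loss-table input and the `seed` argument are our illustrative choices and not part of the original description; the update is the exponential-weights rule applied to the cumulative estimates $\sum_s \hat Y_{si}$.

```python
import math
import random


def exp3_ix(losses, eta, gamma, seed=0):
    """Run Exp3-IX on a fixed n x K table of losses in [0, 1].

    Returns the list of chosen actions and the sampling distribution
    that was used in the last round.
    """
    rng = random.Random(seed)
    n, K = len(losses), len(losses[0])
    cum_est = [0.0] * K  # cumulative loss estimates, one per action
    actions = []
    probs = [1.0 / K] * K
    for t in range(n):
        # Exponential weights on the estimated cumulative losses
        # (shifted by the minimum for numerical stability).
        m = min(cum_est)
        weights = [math.exp(-eta * (est - m)) for est in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        a = rng.choices(range(K), weights=probs)[0]
        actions.append(a)
        # IX estimate: only the played action gets a nonzero estimate,
        # and gamma in the denominator caps it at 1 / gamma.
        cum_est[a] += losses[t][a] / (probs[a] + gamma)
    return actions, probs
```

Because each estimate is at most $1/\gamma$, a single unlucky round cannot push a probability arbitrarily close to zero, which is the implicit exploration described above.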

Let us now return to developing a bound on the random regret. Thanks to the optimistic bias of the new estimator, we expect $\hat L_{ni}-L_{ni}$ to be negative. This is quite helpful given that this term appears multiple times in $\eqref{eq:mainexp3ix}$! But how about $\tilde L_n - \hat L_n$? Writing $Y_t = \sum_j A_{tj} y_{tj}$, we calculate

\begin{align*}

Y_t - \sum_j P_{tj} \hat Y_{tj}

& = \sum_j \left(1 - \frac{P_{tj}}{P_{tj}+\gamma} \right) \, A_{t,j} y_{t,j}

= \gamma \sum_j \frac{A_{t,j}}{P_{tj}+\gamma} y_{t,j} = \gamma \sum_j \hat Y_{tj}\,.

\end{align*}
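The identity in the last display can be checked numerically for a single round. The sketch below is only a sanity check under our own naming (`ix_identity_gap` is a hypothetical helper); it evaluates both sides for a given distribution, loss vector and played action.

```python
import random


def ix_identity_gap(probs, losses, gamma, action):
    """Return (Y_t - sum_j P_tj * Yhat_tj) - gamma * sum_j Yhat_tj for one round."""
    K = len(probs)
    # IX loss estimates: nonzero only for the played action.
    y_hat = [
        losses[j] / (probs[j] + gamma) if j == action else 0.0
        for j in range(K)
    ]
    lhs = losses[action] - sum(p * yh for p, yh in zip(probs, y_hat))
    rhs = gamma * sum(y_hat)
    return lhs - rhs
```

Up to floating-point rounding, the returned gap should be zero for any valid input.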

Therefore, $\tilde L_n - \hat L_n = \gamma \sum_j \hat L_{nj} = \gamma \sum_j L_{nj} + \gamma \sum_j (\hat L_{nj}-L_{nj})$. Hmm, $\hat L_{nj}-L_{nj}$ again! Plugging the expression developed into $\eqref{eq:mainexp3ix}$, we get

\begin{align}

\hat R_{ni}

& \le \frac{\log(K)}{\eta}

+ \left(\gamma+\frac{\eta}{2}\right)\, \sum_{j} L_{nj}\,\, + (\hat L_{ni} - L_{ni})

+ \left(\gamma+\frac{\eta}{2}\right)\, \sum_j (\hat L_{nj}-L_{nj}) \\

& \le \frac{\log(K)}{\eta}

+ \left(\gamma+\frac{\eta}{2}\right) nK \,\, + (\hat L_{ni} - L_{ni})

+ \left(\gamma+\frac{\eta}{2}\right)\, \sum_j (\hat L_{nj}-L_{nj}) \,. \label{eq:exp3hp3}

\end{align}

The plan to bound $\hat L_{ni} - L_{ni}$ is to use an adaptation of Chernoff’s method, which we discussed earlier for bounding the tails of subgaussian random variables $X$. In particular, according to Chernoff’s method, $\Prob{X>\epsilon}\le \EE{\exp(X)} \exp(-\epsilon)$ and so to bound the tail of $X$ it suffices to bound $\EE{\exp(X)}$. Considering that $\exp(\hat L_{ni} - L_{ni}) = \prod_{t=1}^n M_{ti}$ where $M_{ti}=\exp( \hat Y_{ti}-y_{ti})$, we see that the key is to understand the magnitude of the conditional expectations of the terms $\exp( \hat Y_{ti}) = \exp(\frac{A_{ti}y_{ti}}{P_{ti}+\gamma})$. As usual, we aim to use a polynomial upper bound. In this case, we consider a linear upper bound:

Lemma: For any $0\le x \le 2\lambda$,

\begin{align*}

\exp( \frac{x}{1+\lambda} ) \le 1 + x\,.

\end{align*}

Note that $1+x\le \exp(x)$. What the lemma shows is that by slightly discounting the argument of the exponential function, in a bounded neighborhood of zero, $1+x$ can be an upper bound, or, equivalently, slightly inflating the linear term in $1+x$, the linear lower bound becomes an upper bound.

**Proof**: We rely on two well-known inequalities. The first inequality is $\frac{2u}{1+u} \le \log(1+ 2u)$ which holds for $u\ge 0$ (note that as $u\to 0+$, the inequality becomes tight). Thanks to this inequality,

\begin{align*}

\frac{x}{1+\lambda}

= \frac{x}{2\lambda} \, \frac{2\lambda}{1+\lambda}

\le \frac{x}{2\lambda} \, \log(1+2\lambda)

\le \log\left( 1+ 2 \lambda \frac{x}{2\lambda} \right) = \log(1+x)\,,

\end{align*}

where the second inequality uses $x \log(1+y) \le \log(1+xy)$, which holds for any $x\in [0,1]$ and $y>-1$.

QED.
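As a quick numerical illustration of the lemma (not a substitute for the proof), the following sketch evaluates the slack $(1+x) - \exp(x/(1+\lambda))$ on a grid of $x\in[0,2\lambda]$. The helper names and the grid resolution are our own assumptions.

```python
import math


def linear_bound_slack(x, lam):
    """Return (1 + x) - exp(x / (1 + lam)).

    The lemma says this is nonnegative whenever 0 <= x <= 2 * lam.
    """
    return (1.0 + x) - math.exp(x / (1.0 + lam))


def min_slack_on_grid(lams, points=200):
    """Smallest slack over an evenly spaced grid of x in [0, 2 * lam]."""
    return min(
        linear_bound_slack(2.0 * lam * k / points, lam)
        for lam in lams
        for k in range(points + 1)
    )
```

Outside the stated range the inequality can fail: for instance, `linear_bound_slack(10.0, 1.0)` is negative, since $\exp(5) > 11$.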

Based on this, we are ready to prove the following result, which shows that the upper tail of a negatively biased sum is small. The result is tweaked so that it can save us some constant factors later:

Lemma (Upper tail of negatively biased sums): Let $(\cF_t)_{1\le t \le n}$ be a filtration and for $i\in [K]$ let $(\tilde Y_{ti})_t$ be $(\cF_t)_t$-adapted (i.e., for each $t$, $\tilde Y_{ti}$ is $\cF_t$-measurable) such that for any $S\subset [K]$ with $|S|>1$, $\E_{t-1}[ \prod_{i\in S} \tilde Y_{ti} ] \le 0$, while for any $i\in [K]$, $\E_{t-1}[\tilde Y_{ti}]= y_{ti}$. Further, let $(\alpha_{ti})_{ti}$, $(\lambda_{ti})_{ti}$ be real-valued predictable random sequences (i.e., $\alpha_{ti}$ and $\lambda_{ti}$ are $\cF_{t-1}$-measurable). Assume further that for all $t,i$, $0\le \alpha_{ti} \tilde Y_{ti} \le 2\lambda_{ti}$. Then, for any $0\le \delta \le 1$, with probability at least $1-\delta$,

\begin{align*}

\sum_{t,i} \alpha_{ti} \left( \frac{\tilde Y_{ti}}{1+\lambda_{ti}} - y_{ti} \right) \le \log\left( \frac1\delta \right)\,.

\end{align*}

The proof, which is postponed to the end of the post, just combines Chernoff’s method and the previous lemma as suggested above. Equipped with this result, we can easily bound the terms $\hat L_{ni}-L_{ni}$:

Lemma (Loss estimate tail upper bound):

Fix $0\le \delta \le 1$. Then the probability that the inequalities

\begin{align}

\max_i (\hat L_{ni} - L_{ni}) \le \frac{\log(\frac{K+1}{\delta})}{2\gamma}\,, \qquad \text{and} \qquad

\sum_i (\hat L_{ni} - L_{ni}) \le \frac{\log(\frac{K+1}{\delta}) }{2\gamma}\,

\label{eq:losstailbounds}

\end{align}

both hold is at least $1-\delta$.

**Proof**: Fix $0\le \delta' \le 1$ to be chosen later. We have

\begin{align*}

\sum_i (\hat L_{ni} - L_{ni} )

& = \sum_{ti} \left( \frac{A_{ti}y_{ti}}{ P_{ti}+\gamma } - y_{ti} \right)

= \frac{1}{2\gamma} \,

\sum_{ti} 2\gamma \left( \frac{1}{1+\frac{\gamma}{P_{ti}}}\frac{A_{ti} y_{ti}}{ P_{ti} } - y_{ti}\right)\,.

\end{align*}

Introduce $\lambda_{ti} = \frac{\gamma}{P_{ti}}$, $\tilde Y_{ti} = \frac{A_{ti} y_{ti}}{ P_{ti} }$ and $\alpha_{ti} = 2\gamma$. It is not hard to see that the conditions of the previous lemma are satisfied: for $S\subset [K]$ with $|S|>1$ and $i,j\in S$ such that $i\ne j$, we have $\tilde Y_{ti} \tilde Y_{tj} = 0$ because $A_{ti}A_{tj}=0$ holds for $i\ne j$; for any $i\in [K]$, $\E_{t-1}[\tilde Y_{ti}]=y_{ti}$; and $0\le \alpha_{ti}\tilde Y_{ti} = 2\gamma \frac{A_{ti}y_{ti}}{P_{ti}} \le \frac{2\gamma}{P_{ti}} = 2\lambda_{ti}$. Therefore, outside of an event $\cE_0$ that has a probability of at most $\delta'$,

\begin{align}\label{eq:sumbound}

\sum_i (\hat L_{ni} - L_{ni} ) \le \frac{\log(1/\delta')}{2\gamma}.

\end{align}

Similarly, we get that for any fixed $i$, outside of an event $\cE_i$ that has a probability of at most $\delta'$,

\begin{align}

\hat L_{ni} - L_{ni}\le \frac{\log(1/\delta')}{2\gamma}\,.

\label{eq:indbound}

\end{align}

To see the latter, just use the previous argument but now set $\alpha_{tj}=\one{j=i} 2\gamma$. Thus, outside of the event $\cE =\cE_0 \cup \cE_1 \cup \dots \cup \cE_K$, all of $\eqref{eq:sumbound}$ and $\eqref{eq:indbound}$ hold (the latter for all $i$). The total probability of $\cE$ is $\Prob{\cE}\le \sum_{i=0}^K \Prob{\cE_i}\le (K+1)\delta'$. Choosing $\delta' = \delta/(K+1)$ gives the result.

QED.

Putting everything together, we get the main result of this section:

Theorem (High-probability regret of Exp3-IX):

Let $\hat R_n$ be the random regret of Exp3-IX run with $\gamma = \eta/2$. Then, choosing $\eta = \sqrt{\frac{2\log(K+1)}{nK}}$, for any $0\le \delta \le 1$, the inequality

\begin{align}

\label{eq:exp3unifhp}

\hat R_n \le \sqrt{8.5 nK\log(K+1)} +\left(\sqrt{ \frac{nK}{2\log(K+1)} } +1\right)\,

\log(1/\delta)\,

\end{align}

holds with probability at least $1-\delta$. Further, for any $0\le \delta \le 1$, if $\eta = \sqrt{\frac{\log(K)+\log(\frac{K+1}{\delta})}{nK}}$, then

\begin{align}

\label{eq:exp3deltadephp}

\hat R_{n}

\le 2 \sqrt{ (2\log(K+1) + \log(1/\delta) ) nK } + \log\left(\frac{K+1}{\delta}\right)

\end{align}

holds with probability at least $1-\delta$.

Note that the choice of $\eta$ and $\gamma$ that leads to $\eqref{eq:exp3unifhp}$ is independent of $\delta$, while the choice that leads to $\eqref{eq:exp3deltadephp}$ depends on the value of $\delta$. As expected, the second bound is tighter.
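To get a feel for the two tunings, the sketch below simply evaluates the parameter choices and the right-hand sides of $\eqref{eq:exp3unifhp}$ and $\eqref{eq:exp3deltadephp}$ as stated in the theorem. The function name and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import math


def exp3_ix_bounds(n, K, delta):
    """Evaluate both tunings and bounds from the theorem, with gamma = eta / 2."""
    log_k1 = math.log(K + 1)
    log_inv_delta = math.log(1.0 / delta)

    # Tuning that does not depend on delta.
    eta_unif = math.sqrt(2.0 * log_k1 / (n * K))
    bound_unif = (
        math.sqrt(8.5 * n * K * log_k1)
        + (math.sqrt(n * K / (2.0 * log_k1)) + 1.0) * log_inv_delta
    )

    # Tuning that uses the confidence level delta.
    B = math.log((K + 1) / delta)
    eta_delta = math.sqrt((math.log(K) + B) / (n * K))
    bound_delta = 2.0 * math.sqrt((2.0 * log_k1 + log_inv_delta) * n * K) + B

    return {
        "eta_unif": eta_unif,
        "gamma_unif": eta_unif / 2.0,
        "bound_unif": bound_unif,
        "eta_delta": eta_delta,
        "gamma_delta": eta_delta / 2.0,
        "bound_delta": bound_delta,
    }
```

For example, calling `exp3_ix_bounds(10000, 10, 0.01)` lets one compare `bound_unif` and `bound_delta` for a concrete horizon, number of actions and confidence level.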

**Proof**: Introduce $B = \log( \frac{K+1}{\delta} )$. Consider the event when $\eqref{eq:losstailbounds}$ holds. On this event, thanks to $\eqref{eq:exp3hp3}$, simultaneously, for all $i\in [K]$,

\begin{align*}

\hat R_{ni}

& \le \frac{\log(K)}{\eta}

+ \left(\gamma+\frac{\eta}{2}\right) nK

+ \left(\gamma+\frac{\eta}{2}+1\right)\, \frac{B}{2\gamma}.

\end{align*}

Choose $\gamma=\eta/2$ and note that $\hat R_n = \max_i \hat R_{ni}$. Hence, $\hat R_{n} \le \frac{\log(K)+B}{\eta} + \eta\, nK + B$. Choosing $\eta = \sqrt{\frac{\log(K)+B}{nK}}$, we get

\begin{align*}

\hat R_{n} & \le 2 \sqrt{ (\log(K)+B) nK } + B

\le 2 \sqrt{ (2\log(K+1) + \log(1/\delta) ) nK } + \log(\frac{K+1}{\delta})\,,

\end{align*}

while choosing $\eta = \sqrt{\frac{2\log(K+1)}{nK}}$, we get

\begin{align*}

\hat R_{n}

& \le (2\log(K+1)+\log(1/\delta)) \sqrt{ \frac{nK}{2\log(K+1)} }

+ \sqrt{ \frac{2\log(K+1)}{nK} }\, nK + \log(\frac{K+1}{\delta}) \\

& \le 2\sqrt{2nK\log(K+1)} +\left(\sqrt{ \frac{nK}{2\log(K+1)} } +1\right)\,

\log(\frac{K+1}{\delta})\,.

\end{align*}

By the previous lemma, the event when $\eqref{eq:losstailbounds}$ holds has a probability of at least $1-\delta$, thus finishing the proof.

QED.

It is apparent from the proof that the bound can be slightly improved by choosing $\eta$ and $\gamma$ to optimize the upper bounds. A more pressing issue though is that the choice of these parameters is tied to the number of rounds. Our earlier comments apply: The result can be re-proved for decreasing sequences $(\eta_t)$, $(\gamma_t)$, and by appropriately selecting these we can derive a bound which is only slightly worse than the one presented here. An upper bound on the expected regret of Exp3-IX can be easily obtained by “tail integration” (i.e., by using that for $X\ge 0$, $\E{X} = \int_0^\infty \Prob{X>x} dx$). The upper bound that can be derived this way is only slightly larger than that derived for Exp3. In practice, Exp3-IX is highly competitive and is expected to be almost always better than Exp3.

# Summary

In this post we introduced the adversarial bandit model as a way of deriving guarantees that do not depend on the specifics of how the rewards observed are generated. In particular, the only assumption made was that the rewards belong to a known bounded range. As before, performance is quantified as the loss due to failing to choose the action that has the highest total reward. As expected, algorithms developed for the adversarial model will also “deliver” in the stochastic bandit model. In fact, the adversarial minimax regret is lower bounded by the minimax regret of stochastic problems, which resulted in a lower bound of $\Omega(\sqrt{nK})$. The expected regret of Exp3, an algorithm which combines reward estimation and exponential weighting, was seen to match this up to a $\sqrt{\log K}$ factor. We discussed multiple variations of Exp3, comparing benefits and pitfalls of using various reward estimates. In particular, we discussed Exp3-IX, a variant that uses optimistic estimates for the rewards. The primary benefit of using optimistic estimates is a better control of the variance of the reward estimates and as a result Exp3-IX has a better control of the upper tail of the random regret.

This post, however, only scratched the surface of all the research that is happening on adversarial bandits. In the next post we will consider Exp4, which builds on the ideas presented here, but addresses a more challenging and practical problem. In addition we also hope to discuss lower bounds on the high probability regret, which we have omitted so far.

# The remaining proof

Here we prove the lemma on the upper tail of negatively biased sums.

**Proof**: Let $\E_t[\cdot]$ be the conditional expectation with respect to $\cF_t$: $\E_t[U]\doteq \EE{U|\cF_t}$. Thanks to $\exp(x/(1+\lambda))\le 1+x$, which was shown to hold for $0\le x \le 2\lambda$, and the assumption that $0\le \alpha_{ti} \tilde Y_{ti} \le 2\lambda_{ti}$, we have

\begin{align}

\E_{t-1} \, \exp\left(\sum_i \frac{\alpha_{ti} \tilde Y_{ti}}{1+\lambda_{ti}} \right)

& \le \E_{t-1} \prod_i (1+\alpha_{ti} \tilde Y_{ti})\nonumber \\

& \le 1+\E_{t-1} \sum_i \alpha_{ti} \tilde Y_{ti}\nonumber \\

& = 1+ \sum_i \alpha_{ti} y_{ti} \nonumber \\

& \le \exp(\sum_i \alpha_{ti} y_{ti})\,.

\label{eq:basicexpineq}

\end{align}

Here, the second inequality follows from the assumption that for $S\subset [K]$, $|S|>1$, $\E_{t-1} \prod_{i\in S}\tilde Y_{ti} \le 0$, the equality after it follows from the assumption that $\E_{t-1} \tilde Y_{ti} = y_{ti}$, while the last inequality follows from $1+x\le \exp(x)$.

Define

\begin{align*}

Z_t = \exp\left( \sum_i \alpha_{ti} \left(\frac{\tilde Y_{ti}}{1+\lambda_{ti}}-y_{ti}\right) \right)\,

\end{align*}

and let $M_t = Z_1 \dots Z_t$, $t\in [n]$ with $M_0 = 1$. By $\eqref{eq:basicexpineq}$, $\E_{t-1}[Z_t]\le 1$. Therefore,

\begin{align*}

\E[M_t] = \E[ \E_{t-1}[M_t] ] = \E[ M_{t-1} \E_{t-1}[Z_t] ] \le \E[ M_{t-1} ] \le \dots \le \E[ M_0 ] =1\,.

\end{align*}

Applying this with $t=n$, we get

$\Prob{ \log(M_n) \ge \log(1/\delta) } =\Prob{ M_n \delta \ge 1} \le \EE{M_n} \delta \le \delta$.

QED.

# References

The main reference, which introduced Exp3 (and some variations of it) is by Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert E. Schapire:

- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire: The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, Volume, 2003, pp. 48-77.

The Exp3-IX algorithm is quite new and is due to Gergely Neu. The main reference to this paper is:

- Gergely Neu: Explore no more: Improved high-probability regret bounds for non-stochastic bandits, NIPS 2015. https://arxiv.org/pdf/1506.03271.pdf

# Notes

Note 1: There are many other ways to relax the rigidity of stationary stochastic environments. Some alternatives, other than considering a fully adversarial setting, include making mixing assumptions about the rewards, assuming hidden or visible states, or drifts, just to mention a few. Drifts (nonstationarity) can be and are studied in the adversarial setting, too.

Note 2: What happens when the range of the rewards is unknown? This has been studied e.g., by Allenberg et al. See: Allenberg, C., Auer, P., Gyorfi, L., and Ottucsak, Gy. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In ALT, pages 229–243

Note 3: A perhaps even more basic problem than the one considered here is when the learner receives all $(x_{t,i})_i$ by the end of round $t$. This is sometimes called the full-information setting. The algorithm can simply fall back to just using exponential weighting with the total rewards. This is sometimes called the Hedge algorithm, or the “Exponential Weights Algorithm” (EWA). The underlying problem is sometimes called “prediction with expert advice”. The proof as written goes through, but one should replace the polynomial upper bound on $\exp(x)$ with Hoeffding’s lemma. This analysis gives a regret of $\sqrt{\frac{1}{2} n \log(K)}$, which is optimal in an asymptotic sense.

Note 4: As noted earlier, it is possible to remove the $\sqrt{\log(K)}$ factor from the upper bound on the expected regret. For details, see this paper by Sebastian Bubeck and Jean-Yves Audibert.

Note 5: By initializing $(P_{0i})_i$ to be biased towards some actions, the regret of Exp3 improves under the favorable environments when the “prior” guess is correct in that the action that $(P_{0i})_i$ is biased towards has a large total reward. However, this does not come for free, as shown in this paper by Tor.

Note 6: An alternative to the model considered is to let the environment choose $(x_{ti})$ based on the learner’s past actions. This is a harsher environment model. Nevertheless, the result we obtained can be generalized to this setting with no problems. It is another question whether in such “reactive” environments, the usual regret definition makes sense. Sometimes it does (when the environment arises as part of a “reduction”, i.e., it is made up by an algorithm itself operating in a bigger context). Sometimes the environments that do not react are called oblivious, while the reactive ones are said to be non-oblivious.

Note 7: A common misconception in connection to the adversarial framework is that it is a good fit for nonstationary environments. While the adversarial framework does not rule out nonstationary environments, the regret concept used has stationarity built into it and an algorithm that keeps the regret (as defined here) small will typically be unsuitable for real nonstationary environments, where conditions change or evolve. What happens in a truly nonstationary setting is that the single best action in hindsight will not collect much reward. Hence, the goal here should be to compete with the best action sequences computed in hindsight. A reasonable goal is to design algorithms which compete simultaneously with many action sequences of varying complexity and make the regret degrade in a graceful manner as the complexity of the competing action sequence increases (complexity can be measured by the number of action-switches over time).

Note 8: As noted before, regardless of which estimator is used to estimate the rewards, $\Vt{\hat X_{ti}} \sim 1/P_{ti}$. The problem then is that if $P_{ti}$ gets very very small, the predictable variance will be huge. It is instructive to think through whether and how $P_{ti}$ can take on very small values. Consider first the loss-based estimator given by $\eqref{eq:lossestimate}$. For this estimator, when $P_{t,A_t}$ and $X_t$ are both small, $\hat X_{t,A_t}$ can take on a large negative value. Through the update formula $\eqref{eq:exp3update}$ this then translates into $P_{t+1,A_t}$ being squashed aggressively towards zero. A similar issue arises with the reward-based estimator given by $\eqref{eq:impestimate}$. The difference is that now it will be a “positive surprise” ($P_{t,A_t}$ small, $X_t$ large) that pushes the probabilities towards zero. In fact, for such positive surprises, the probabilities of *all actions but $A_t$* will be pushed towards zero. This means that in this version dangerously small probabilities can be expected to be seen even more frequently.

Note 9: The argument used in the last steps of proving the bound on the upper tail of the loss estimates had the following logic: Outside of $\cE_i$, some desired relation $R_i$ holds. If the probability of $\cE_i$ is bounded by $\delta$, then outside of $\cup_{i\in [m]} \cE_i$, all of $R_1,\dots, R_m$ hold true. Thus, with probability at least $1-m\delta$, $R_1,\dots, R_m$ are simultaneously true. This is called the “union bound argument”. In the future we will routinely use this argument without naming the error events $\cE_i$ and skipping the details.

Note 10: It was mentioned that the size of fluctuations of the sum of a sequence of $(\cF_t)_t$-adapted random variables is governed by the sum of predictable variations. This is best accounted for in the tail inequalities named after Bernstein and Freedman. The relevant paper of Freedman can be found here.

# High probability lower bounds

In the post on adversarial bandits we proved two high probability upper bounds on the regret of Exp3-IX. Specifically, we showed:

Theorem: There exists a policy $\pi$ such that for all $\delta \in (0,1)$ for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1-\delta$

\begin{align}

\label{eq:high-prob}

\hat R_n(\pi,\nu) = O\left(\sqrt{Kn \log(K)} + \sqrt{\frac{Kn}{\log(K)}} \log\left(\frac{1}{\delta}\right)\right)\,.

\end{align}

We also gave a version of the algorithm that depended on $\delta \in (0,1)$:

Theorem: For all $\delta \in (0,1)$ there exists a policy $\pi$ such that for any adversarial environment $\nu\in [0,1]^{nK}$, with probability at least $1- \delta$,

\begin{align}

\label{eq:high-prob2}

\hat R_n(\pi,\nu) = O\,\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,.

\end{align}

The key difference between these results is the order of quantifiers. In the first we have a single algorithm and a high-probability guarantee that holds simultaneously for any confidence level. For the second result the confidence level must be specified in advance. The price for using the generic algorithm appears to be $\sqrt{\log(1/\delta)}$, which is usually quite small but not totally insignificant. The purpose of this post is to show that both bounds are tight up to constant factors, which implies that algorithms knowing the confidence level in advance really do have an advantage in terms of the achievable regret.

One reason why choosing the confidence level in advance is not ideal is that the resulting high-probability bound cannot be integrated to prove a bound in expectation. For algorithms satisfying (\ref{eq:high-prob}) the expected regret can be bounded by

\begin{align}

R_n \leq \int^\infty_0 \Prob{\hat R_n \geq x} dx = O(\sqrt{Kn \log(K)})\,.

\label{eq:integratedbound}

\end{align}

On the other hand, if the high-probability bound only holds for a single $\delta$ as in \eqref{eq:high-prob2}, then it seems hard to do much better than

\begin{align*}

R_n \leq n \delta + O\left(\sqrt{Kn \log\left(\frac{K}{\delta}\right)}\right)\,,

\end{align*}

which with the best choice of $\delta$ leads to a bound where an extra $\log(n)$ factor appears compared to \eqref{eq:integratedbound}. In fact, it turns out that this argument cannot be strengthened and algorithms with the strong high-probability regret cannot be near-optimal in expectation.
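As a rough worked instance of the $\delta$-dependent bound above (the choice $\delta = 1/n$ is our illustrative assumption, not necessarily the exact optimizer), substituting gives

\begin{align*}
R_n \leq n \cdot \frac{1}{n} + O\left(\sqrt{Kn \log(Kn)}\right)
= O\left(\sqrt{Kn \left(\log(K) + \log(n)\right)}\right)\,,
\end{align*}

so that $\log(n)$ enters under the square root next to $\log(K)$, whereas \eqref{eq:integratedbound} has only $\log(K)$ there.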

Our approach for proving this fact is very similar to what we did for the minimax lower bounds for stochastic bandits in a previous post. There are two differences between the adversarial setting and the stochastic setting that force us to work a little harder. The first is that for the adversarial-bandit upper bounds we have assumed the rewards are bounded in $[0,1]$, which was necessary in order to say anything at all. This means that our lower bounds should also satisfy this requirement, while in the stochastic lower bounds we used Gaussian rewards with an unbounded range. The second difference comes from the fact that rather than for the regret, the stochastic lower bounds were given for what is known as the pseudo-regret in the adversarial framework and which reads

\begin{align*}

\bar R_n \doteq \max_i \EE{ \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_t }\,.

\end{align*}

In the stochastic setting, we have $\bar R_n = \sum_{i=1}^K \E[T_i(n)] \Delta_i$ and thus bounds on the pseudo-regret are possible by lower bounding the number of times an algorithm chooses sub-optimal arms in expectation. This is not enough to bound the regret, which depends also on the actual samples.

# A lower bound on tail probabilities of pseudo-regret in stochastic bandits

Before we overcome these technicalities we describe the simple intuition by returning to the stochastic setting, using the pseudo-regret and relaxing the assumption that the rewards are bounded. It is important to remember that $\bar R_n$ is not a random variable because all the randomness is integrated away by the expectation. This means that it does not make sense to talk about high-probability results for $\bar R_n$, so we introduce another quantity,

\begin{align*}

\tilde R_n = \sum_{i=1}^K T_i(n) \Delta_i\,,

\end{align*}

which is a random variable through the pull counts $T_i(n)$ and which, for lack of a better name, we call the *random pseudo-regret*. For the subsequent results we let $\cE$ denote the set of Gaussian bandits with sub-optimality gaps bounded by one.

Theorem: Fix the horizon $n>0$, number of arms $K>1$, a constant $C>0$ and a policy $\pi$. Assume that for any bandit environment $\nu \in \cE$,

\begin{align*}

R_n(\pi,\nu) \leq C \sqrt{(K-1)n}\,.

\end{align*}

Let $\delta \in (0,1)$. Then, there exists a bandit $\nu$ in $\cE$ such that

\begin{align*}

\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.

\end{align*}

It follows that if the result can be transferred to the adversarial setting, there will be little or no room for improving \eqref{eq:high-prob}.

**Proof**: Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and define

\begin{align*}

\mu_i = \begin{cases}

\Delta, & \text{if } i = 1\,; \\

0, & \text{otherwise}\,,

\end{cases}

\end{align*}

As usual, let $R_n = R_n(\pi,\nu)$ for $\nu = (\cN(\mu_i,1))_{i\in [K]}$. Let $\PP=\PP_{\nu,\pi}$ and $\E=\E_{\nu,\pi}$. Let $i = \argmin_{j>1} \E[T_j(n)]$. Then, thanks to

\begin{align*}

C \sqrt{(K-1)n} \ge R_n = \Delta \sum_{j>1} \E[T_j(n)] \ge \Delta (K-1)\min_{j>1} \E[T_j(n)]

\end{align*}

we find that

\begin{align}

\E[T_i(n)] \leq \frac{C}{\Delta} \sqrt{\frac{n}{K-1}}\,.

\label{eq:hpproofarmusage}

\end{align}

Define $\mu' \in \R^K$ by

\begin{align*}

\mu'_j = \begin{cases}

\mu_j\,, & \text{if } j \neq i; \\

2\Delta\,, & \text{otherwise}

\end{cases}

\end{align*}

and let $\nu'=(\cN(\mu_j',1))_{j\in [K]}$ be the Gaussian bandit with means $\mu'$ and abbreviate $\PP'=\PP_{\nu',\pi}$ and $\E'=\E_{\nu',\pi}$. Thus in $\nu'$ action $i$ is better than any other action by at least $\Delta$. Let $\tilde R_n = \sum_{j=1}^K T_j(n) \Delta_j$ and $\tilde R_n' = \sum_{j=1}^K T_j(n) \Delta_j'$ be the random pseudo-regret in $\nu$ and $\nu'$ respectively, where $\Delta_j = \max_k \mu_k-\mu_j = \one{j\ne 1}\Delta$ and $\Delta_j'=\max_k \mu_k'-\mu_j' \ge \one{i\ne j} \Delta$. Hence,

\begin{align*}

\tilde R_n & \ge T_i(n) \Delta_i \ge \one{T_i(n)\ge n/2} \frac{\Delta n}{2}\,, \qquad\text{ and }\\

\tilde R_n' & \ge \Delta \sum_{j\ne i} T_j(n) = \Delta (n-T_i(n)) \ge \one{T_i(n) < n/2} \frac{\Delta n}{2}\,.
\end{align*}
Hence, $T_i(n)\ge n/2 \Rightarrow \tilde R_n \ge \frac{\Delta n}{2}$ and $T_i(n) < n/2 \Rightarrow \tilde R_n' \ge \frac{\Delta n}{2}$, implying that
\begin{align}
\max\left(\Prob{ \tilde R_n \ge \frac{\Delta n}{2} },
\PP'\left(\tilde R_n' \ge \frac{\Delta n}{2}
\right)\right)
\ge \frac12 \left( \Prob{T_i(n) \ge n/2} +\PP'\left(T_i(n) < n/2 \right) \right)\,.
\label{eq:hprlb}
\end{align}
By the high probability Pinsker lemma, the divergence decomposition identity (earlier we called this the information processing lemma) and \eqref{eq:hpproofarmusage} we have

\begin{align}

\Prob{T_i(n) \geq n/2} + \mathbb P'\left(T_i(n) < n/2\right)

&\geq \frac{1}{2} \exp\left(-\KL(\mathbb P, \mathbb P')\right) \nonumber \\

&= \frac{1}{2} \exp\left(-2\E[T_i(n)] \Delta^2\right) \nonumber \\

&\geq \frac{1}{2} \exp\left(-2C \Delta\sqrt{\frac{n}{K-1}}\right)\,.

\label{eq:hppinskerlb}

\end{align}

To enforce that the right-hand side of the above display is at least $2\delta$, we choose

\begin{align*}

\Delta = \min\set{\frac{1}{2},\,\frac{1}{2C} \sqrt{\frac{K-1}{n}} \log\left(\frac{1}{4\delta}\right)}\,.

\end{align*}

Putting \eqref{eq:hprlb} and \eqref{eq:hppinskerlb} together we find that either

\begin{align*}

&\Prob{\tilde R_n \geq \frac{1}{4}\min\set{n,\, \frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}} \geq \delta \\

\text{or}\qquad &

\mathbb P'\left(\tilde R'_n \geq \frac{1}{4}\min\set{n,\,\frac{1}{C} \sqrt{(K-1)n} \log\left(\frac{1}{4\delta}\right)}\right) \geq \delta\,.

\end{align*}

QED.
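The tuning of $\Delta$ in the proof can be traced numerically. The sketch below (with hypothetical function names, and assuming $\delta < 1/4$ so that $\log(1/(4\delta))$ is positive) computes the gap $\Delta$, the threshold $\Delta n/2$ and the right-hand side of \eqref{eq:hppinskerlb}.

```python
import math


def tail_lower_bound(n, K, C, delta):
    """Return (gap, threshold) as chosen in the proof; assumes 0 < delta < 1/4."""
    log_term = math.log(1.0 / (4.0 * delta))
    gap = min(0.5, math.sqrt((K - 1) / n) * log_term / (2.0 * C))
    threshold = gap * n / 2.0  # equals (1/4) * min{n, sqrt((K-1) n) * log_term / C}
    return gap, threshold


def pinsker_rhs(n, K, C, gap):
    """Right-hand side of the Pinsker-type bound: 0.5 * exp(-2 C gap sqrt(n / (K-1)))."""
    return 0.5 * math.exp(-2.0 * C * gap * math.sqrt(n / (K - 1)))
```

With this choice of the gap, `pinsker_rhs` is at least $2\delta$ (up to rounding), which is the condition the proof enforces.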

From this theorem we can derive two useful corollaries.

Corollary: Fix $n>0$, $K>1$. For any policy $\pi$ and $\delta \in (0,1)$ small enough that

\begin{align}

n\delta \leq \sqrt{n (K-1) \log\left(\frac{1}{4\delta}\right)} \label{eq:hplbdelta}

\end{align}

there exists a bandit problem $\nu\in\cE$ such that

\begin{align}

\Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4}\min\set{n,\, \sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}}} \geq \delta\,.

\label{eq:hplbsmalldelta}

\end{align}

**Proof**: We prove the result by contradiction. Assume that the conclusion does not hold for $\pi$. Take $\delta$ that satisfies \eqref{eq:hplbdelta}. Then, for any bandit problem $\nu\in \cE$ the expected regret of $\pi$ is bounded by

\begin{align*}

R_n(\pi,\nu) \leq n\delta + \frac{1}{4}\sqrt{\frac{n(K-1)}{2} \log\left(\frac{1}{4\delta}\right)}

\leq \sqrt{2n(K-1) \log\left(\frac{1}{4\delta}\right)}\,.

\end{align*}

Therefore $\pi$ satisfies the conditions of the previous theorem with $C =\sqrt{2 \log(\frac{1}{4\delta})}$, which implies that there exists some bandit problem $\nu\in\cE$ such that \eqref{eq:hplbsmalldelta} holds, contradicting our assumption.

QED

Corollary: Fix any $K>1$, $p \in (0,1)$ and $C > 0$. Then, for any policy $\pi$ there exists $n>0$ large enough, $\delta\in(0,1)$ small enough and a bandit environment $\nu\in \cE$ such that

\begin{align*}

\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} \geq \delta\,.

\end{align*}

**Proof**: Again, we proceed by contradiction. Suppose that a policy $\pi$ exists for which the conclusion does not hold. Then, for any $n>0$ and environment $\nu\in \cE$,

\begin{align}

\Prob{\tilde R_n(\pi,\nu) \geq C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)} < \delta
\label{eq:hpexplbp}
\end{align}
and therefore, for any $n>0$, the expected $n$-round regret of $\pi$ on $\nu$ is bounded by

\begin{align*}

R_n(\pi,\nu)

\leq \int^\infty_0 \Prob{ \tilde R_n(\pi,\nu) \geq x} dx

\leq C \sqrt{n(K-1)} \int^\infty_0 \exp\left(-x^{1/p}\right) dx

\leq C \sqrt{n(K-1)}\,.

\end{align*}

Therefore, by the previous theorem, for any $n>0$, $\delta\in (0,1)$ there exists a bandit $\nu_{n,\delta}\in \cE$ such that

\begin{align*}

\Prob{\tilde R_n(\pi,\nu_{n,\delta}) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \geq \delta\,.

\end{align*}

For $\delta$ small enough, $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right) \ge 4C \sqrt{(K-1)n} \log^p\left(\frac{1}{\delta}\right)$ (recall that $p<1$), and then choosing $n$ large enough so that $\frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)\le n$, we find that on the environment $\nu=\nu_{n,\delta}$,

\begin{align*}

\delta &\le \Prob{\tilde R_n(\pi,\nu) \geq \frac{1}{4} \min\set{n, \frac{1}{C} \sqrt{n(K-1)} \log\left(\frac{1}{4\delta}\right)}} \\

& \le \Prob{ \tilde R_n(\pi,\nu) \geq C\sqrt{(K-1)n} \log^p\left(\frac{1}{\delta} \right) }\,,

\end{align*}

contradicting \eqref{eq:hpexplbp}.

QED

# A lower bound on tail probabilities of regret in adversarial bandits

So how do we transfer this argument to the case where the rewards are bounded and the regret is used, rather than the pseudo-regret? For the first we can simply shift the means to be close to $1/2$ and truncate or “clip” the rewards that (by misfortune) end up outside the allowed range. To deal with the regret there are two options. Either one adds an additional step to show that the regret and the pseudo-regret concentrate sufficiently fast (which they do), or one *correlates the losses* across the actions. The latter is the strategy that we will follow here.

We start with an observation. Our goal is to show that there exists a reward sequence $x=(x_1,\dots,x_n)\in [0,1]^{nK}$ such that the regret $\hat R_n=\max_i \sum_t (x_{ti} - x_{t,A_t})$ is above some threshold $u>0$ with a probability exceeding a prespecified value $\delta\in (0,1)$. For this we want to argue that it suffices to show this when the rewards are randomly chosen. Similarly to the stochastic case we define the “extended” canonical bandit probability space. Since the regret in adversarial bandits depends on non-observed rewards, the outcome space of the extended canonical probability space is $\Omega_n = \R^{nK}\times [K]^n$ and now $X_t: \Omega_n \to \R^K$ and $A_t: \Omega_n \to [K]$ are $X_t(x,a) = x_t$ and $A_t(x,a) = a_t$ where we use the convention that $x= (x_1,\dots,x_n)$ and $a=(a_1,\dots,a_n)$. We also let $\hat R_n = \max_i \sum_{t=1}^n X_{ti} - \sum_{t=1}^n X_{t,A_t}$ and define $\PP_{Q,\pi}$ to be the joint distribution of $(X_1,\dots,X_n,A_1,\dots,A_n)$ arising from the interaction of $\pi$ with $X\sim Q$. Finally, as we often need it, for a fixed $\nu\in \R^{nK}$ we abbreviate $\PP_{\delta_\nu,\pi}$ to $\PP_{\nu,\pi}$ where $\delta_\nu$ is the Dirac (probability) measure over $\R^{nK}$ (i.e., $\delta_{\nu}(U) = \one{\nu \in U}$ for $U\subset \R^{nK}$ Borel).

Lemma (Randomization device): For any $Q$ probability measure over $\R^{nK}$, any policy $\pi$, $u\in \R$ and $\delta\in (0,1)$,

\begin{align}\label{eq:pqpidelta}

\PP_{Q,\pi}( \hat R_n \geq u ) \geq \delta \implies

\exists \nu \in\mathrm{support}(Q) \text{ such that } \PP_{\nu,\pi}(\hat R_n \geq u) \geq \delta\,.

\end{align}

The lemma is proved by noting that $\PP_{Q,\pi}$ can be disintegrated into the “product” of $Q$ and $\{\PP_{\nu,\pi}\}_{\nu}$. The proof is given at the end of the post.

Given this machinery, let us get into the proof. Fix a policy $\pi$, $n>0$, $K>1$ and a $\delta\in (0,1)$. Our goal is to find some $u>0$ and a reward sequence $x\in [0,1]^{nK}$ such that the random regret of $\pi$ while interacting with $x$ is above $u$ with probability exceeding $\delta$. For this, we will define two reward distributions $Q$ and $Q'$, and show for (at least) one of $\PP_{Q,\pi}$ or $\PP_{Q',\pi}$ that the probability of $\hat R_n \ge u$ exceeds $\delta$.

Instead of the canonical probability models we will find it more convenient to work with two sequences $(X_t,A_t)_t$ and $(X_t',A_t')_t$ of reward-action pairs defined over a common probability space. These are constructed as follows: We let $(\eta_t)_t$ be an i.i.d. sequence of $\mathcal N(0,\sigma^2)$ Gaussian variables and then let

\begin{align*}

X_{tj} = \clip( \mu_j + \eta_t )\,, \qquad X_{tj}' = \clip( \mu_j' + \eta_t)\,, \qquad (t\in [n],j\in [K])\,,

\end{align*}

where $\clip(x) = \max(\min(x,1),0)$ clips its argument to $[0,1]$, and for some $\Delta\in (0,1/4]$ to be chosen later,

\begin{align*}

\mu_j = \frac12 + \one{j=1} \Delta, \qquad (j\in [K])\,.

\end{align*}

The “means” $(\mu_j')_j$ will also be chosen later. Note that apart from clipping, $(X_{ti})_t$ (and also $(X_{ti}')_t$) is merely a shifted version of $(\eta_t)_t$. In particular, thanks to $\Delta>0$, $X_{t1}\ge X_{tj}$ for any $t,j$. Moreover, $X_{t1}$ exceeds $X_{tj}$ by $\Delta$ when neither of them is clipped:

\begin{align}

X_{t1}\ge X_{tj} + \Delta\, \one{\eta_t\in [-1/2,1/2-\Delta]}\,, \quad t\in [n], j\ne 1\,.

\label{eq:hplb_rewardgapxt}

\end{align}

Now, define $(A_t)_t$ to be the random actions that arise from the interaction of $\pi$ and $(X_t)_t$ and let $i = \argmin_{j>1} \EE{ T_j(n) }$ where $T_i(n) = \sum_{t=1}^n \one{A_t=i}$. As before, $\EE{T_i(n)}\le n/(K-1)$. Choose

\begin{align*}

\mu_j' = \mu_j + \one{j=i}\, 2\Delta\,, \quad j\in [K]\,,

\end{align*}

so that $X_{ti}'\ge X_{tj}'$ for $j\in [K]$ and furthermore

\begin{align}

X_{ti}'\ge X_{tj}' + \Delta\, \one{\eta_t\in [-1/2,1/2-2\Delta]}\,, \quad t\in [n], j\ne i\,.

\label{eq:hplb_rewardgapxtp}

\end{align}
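The construction can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `correlated_rewards` that is not part of the post: it draws the shared noise $(\eta_t)_t$, builds the clipped rewards for both environments, and verifies the two gap inequalities on a sample. Arms are 0-indexed, so arm `0` plays the role of arm $1$, and the boosted arm (which the proof picks as the least-played suboptimal arm) is simply fixed by hand here.

```python
import random


def clip(x: float) -> float:
    """Clip x to the interval [0, 1]."""
    return max(min(x, 1.0), 0.0)


def correlated_rewards(n, K, gap, sigma, boosted_arm=None, seed=0):
    """Return (rewards, noise) with rewards[t][j] = clip(mu_j + eta_t).

    The same noise eta_t is shared by all arms in round t.
    If boosted_arm is given, its mean is raised by 2 * gap (the primed environment).
    """
    rng = random.Random(seed)
    mu = [0.5 + (gap if j == 0 else 0.0) for j in range(K)]
    if boosted_arm is not None:
        mu[boosted_arm] += 2 * gap
    noise = [rng.gauss(0.0, sigma) for _ in range(n)]
    rewards = [[clip(m + eta) for m in mu] for eta in noise]
    return rewards, noise


n, K, gap, sigma = 1000, 5, 0.1, 0.1
i = 2  # stand-in for the arm chosen by the argmin in the text (an assumption)
X, eta = correlated_rewards(n, K, gap, sigma)
Xp, eta_p = correlated_rewards(n, K, gap, sigma, boosted_arm=i)  # same seed, same noise

tol = 1e-12  # slack for floating-point rounding
gap_holds = all(
    X[t][0] >= X[t][j] + gap * (-0.5 <= eta[t] <= 0.5 - gap) - tol
    for t in range(n) for j in range(1, K)
)
gap_holds_primed = all(
    Xp[t][i] >= Xp[t][j] + gap * (-0.5 <= eta_p[t] <= 0.5 - 2 * gap) - tol
    for t in range(n) for j in range(K) if j != i
)
print(gap_holds, gap_holds_primed)
```

Because the noise is shared across arms, the ordering of the arms is the same in every round, which is exactly what the correlation is meant to achieve.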

Denote by $\hat R_n = \max_j \sum_t X_{tj} - \sum_t X_{t,A_t}$ the random regret of $\pi$ when interacting with $X = (X_1,\dots,X_n)$ and let $\hat R_n' = \max_j \sum_t X'_{tj} - \sum_t X'_{t,A_t'}$ be the random regret of $\pi$ when interacting with $X' = (X_1',\dots,X_n')$, where $(A_t')_t$ are the random actions that arise from the interaction of $\pi$ and $(X_t')_t$.

By the randomization device lemma, it suffices to prove that either $\Prob{ \hat R_n \ge u }\ge \delta$ or $\Prob{ \hat R_n' \ge u} \ge \delta$.

By our earlier remarks, $\hat R_n = \sum_t X_{t1} - \sum_t X_{t,A_t}$ and $\hat R_n' = \sum_t X_{ti}' - \sum_t X'_{t,A_t'}$. Define $U_t =\one{\eta_t\in [-1/2,1/2-\Delta]}$, $U_t'=\one{\eta_t\in [-1/2,1/2-2\Delta]}$, $A_{tj} = \one{A_t=j}$ and $A_{tj}' = \one{A_t'=j}$. From \eqref{eq:hplb_rewardgapxt} we see that

\begin{align*}

\hat R_n \ge \Delta\, \sum_t \one{A_t\ne 1} U_t =\Delta\, \sum_t (1-A_{t1}) U_t \ge \Delta\,(U - T_1(n)) \ge \Delta\,(U + T_i(n) - n)\,,

\end{align*}

where we also defined $U = \sum_t U_t$ and used that $T_1(n)+T_i(n)\le n$. Note that $U_t=1$ indicates that $(X_{tj})_j$ are “unclipped”. Similarly, from \eqref{eq:hplb_rewardgapxtp} we see that

\begin{align*}

\hat R_n' \ge \Delta \, \sum_t \one{A_t'\ne i} U_t' =\Delta\, \sum_t (1-A_{ti}') U_t' \ge \Delta\,( U' - T_i'(n)) \,,

\end{align*}

where $T_i'(n)=\sum_t A_{ti}'$ and $U' = \sum_t U_t'$. Based on the lower bounds on $\hat R_n$ and $\hat R_n'$ we thus see that if $T_i(n)\ge n/2$ and $U\ge 3n/4$ then $\hat R_n \ge u\doteq \frac{n\Delta}{4}$ and if $T_i'(n) < n/2$ and $U'\ge 3n/4$ then $\hat R_n' \ge u$ holds, too. Thus, from union bounds,
\begin{align*}
\Probng{ \hat R_n \ge u } &\ge \Prob{ T_i(n)\ge n/2, U\ge 3n/4 } \ge \Prob{ T_i(n)\ge n/2 } - \Prob{U < 3n/4}\,,\\
\Probng{ \hat R_n' \ge u } &\ge \Prob{ T_i'(n) < n/2, U'\ge 3n/4 } \ge \Prob{ T_i'(n) < n/2 } - \Prob{U' < 3n/4}\,.
\end{align*}
Noticing that $U'\le U$ and hence $\Prob{U<3n/4}\le \Prob{U' <3n/4}$, we get
\begin{align*}
\max(\Probng{ \hat R_n \ge u },\Probng{ \hat R_n' \ge u })
& \ge \frac12 \Bigl(\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }\Bigr) - \Prob{U' < 3n/4}\,.
\end{align*}
The sum $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ will be lower bounded with the help of the high-probability Pinsker inequality. For an upper bound on $\Prob{U' < 3n/4}$, we have the following technical lemma:

Lemma (Control of the number of unclipped rounds): Assume that $\Delta\le 1/8$. If $n \ge 32 \log(2/\delta)$ and $\sigma\le 1/10$ then $\Prob{U'<3n/4} \le \delta/2$.

The proof is based on bounding the tail probability of the mean of the i.i.d. Bernoulli variables $(U_t')_t$ using Hoeffding’s inequality and is given later. Intuitively, it is clear that by making $\sigma^2$ small, the number of times $\eta_t$ falls inside $[-1/2,1/4]\subset [-1/2,1/2-2\Delta]$ can be made arbitrarily high with arbitrarily high probability.

Our goal now is to lower bound $\Prob{ T_i(n)\ge n/2 } + \Prob{ T_i'(n) < n/2 }$ by $3\delta$. As suggested before, we aim to use the high-probability Pinsker inequality. One difficulty that we face is that the events $\{T_i(n)\ge n/2\}$ and $\{T_i'(n) < n/2\}$ may not be complementary as they are defined in terms of potentially distinct sets of random variables. This will be overcome by rewriting the above probabilities using the canonical bandit probability spaces. In fact, we will use the non-extended version of these probability spaces (as defined earlier in the context of stochastic bandits). The reason for this is that after Pinsker, we plan to apply the divergence decomposition identity, which decomposes the divergence between distributions of action-reward sequences.

To properly write things let $Q_j$ denote the distribution of $X_{tj}$ and similarly let $Q_j'$ be the distribution of $X_{tj}'$. These are well defined (why?). Define the stochastic bandits $\beta=(Q_1,\dots,Q_K)$ and $\beta'=(Q'_1,\dots,Q'_K)$. Let $\Omega_n = ([K]\times \R)^n$ and let $\tilde Y_t,\tilde A_t:\Omega_n \to \R$ be the usual coordinate projection functions: $\tilde Y_t(a_1,y_1,\dots,a_n,y_n) = y_t$ and $\tilde A_t(a_1,y_1,\dots,a_n,y_n)=a_t$. Also, let $\tilde T_i(n) = \sum_{t=1}^n \one{\tilde A_t=i}$. Recall that $\PP_{\beta,\pi}$ denotes the probability measure over $\Omega_n$ that arises from the interaction of $\pi$ and $\beta$ (ditto for $\PP_{\beta',\pi}$). Now, since $T_i(n)$ is only a function of $(A_1,X_{1,A_1},\dots,A_n,X_{n,A_n})$ whose probability distribution is exactly $\PP_{\beta,\pi}$, we have

\begin{align*}

\Prob{ T_i(n) \ge n/2 } = \PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 )\,.

\end{align*}

Similarly,

\begin{align*}

\Prob{ T_i'(n) < n/2 } = \PP_{\beta',\pi}( \tilde T_i(n) < n/2 )\,.
\end{align*}
Now, by the high-probability Pinsker inequality and the divergence decomposition lemma,
\begin{align*}
\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }
& =
\PP_{\beta,\pi}( \tilde T_i(n) \ge n/2 ) + \PP_{\beta',\pi}( \tilde T_i(n) < n/2 ) \\
& \ge
\frac12 \exp\left(- \KL( \PP_{\beta,\pi}, \PP_{\beta',\pi} ) \right) \\
& =
\frac12 \exp\left(- \E_{\beta,\pi}[\tilde T_i(n)] \KL( Q_i, Q_i' ) \right) \\
& \ge
\frac12 \exp\left(- \EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) ) \right)\,,
\end{align*}
where in the last equality we used that $Q_j=Q_j'$ unless $j=i$, while in the last step we used $\E_{\beta,\pi}[\tilde T_i(n)] = \EE{ T_i(n) }$ and also that $\KL( Q_i, Q_i' ) \le \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )$. From where does the last inequality come, one might ask. The answer is the truncation, which always *reduces information*. More precisely, let $P$ and $Q$ be probability measures on the same measurable space $(\Omega, \cF)$. Let $X:\Omega \to \R$ be a random variable and $P_X$ and $Q_X$ be the laws of $X$ under $P$ and $Q$ respectively. Then $\KL(P_X, Q_X) \leq \KL(P, Q)$.

Now, by the choice of $i$,

\begin{align*}

\EE{ T_i(n) } \KL( \mathcal N(\tfrac12,\sigma^2), \mathcal N(\tfrac12 + 2\Delta,\sigma^2) )

\le \frac{n}{K-1} \frac{2\Delta^2}{\sigma^2}\,.

\end{align*}

Plugging this into the previous display we get that if

\begin{align*}

\Delta = \sigma \sqrt{\frac{K-1}{2n} \log\frac{1}{6\delta}}

\end{align*}

then $\Prob{ T_i(n) \ge n/2 } + \Prob{ T_i'(n) < n/2 }\ge 3\delta$ and thus $\max(\Prob{ \hat R_n \ge u },\Prob{ \hat R_n'\ge u} )\ge \delta$. Recalling the definition $u = n\Delta/4$ and choosing $\sigma=1/10$ gives the following result:

Theorem (High probability lower bound for adversarial bandits): Let $K>1$, $n>0$ and $\delta\in (0,1)$ such that

\begin{align*}

n\ge \max\left( 32 \log \frac2{\delta}, (0.8)^2 \frac{K-1}{2} \log \frac{1}{6\delta}\right)

\end{align*}

holds. Then, for any bandit policy $\pi$ there exists a reward sequence $\nu = (x_1,\dots,x_n)\in [0,1]^{nK}$ such that if $\hat R_n$ is the random regret of $\pi$ when interacting with the environment $\nu$ then

\begin{align*}

\Prob{ \hat R_n \ge 0.025\sqrt{\frac{n(K-1)}{2}\log \frac{1}{6\delta}} } \ge \delta\,.

\end{align*}
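To see how the constants fit together, here is a small numerical sketch. The helper names `lower_bound_quantities` and `min_horizon` are made up for this illustration, and the values of $n$, $K$, $\delta$ are arbitrary. It recomputes $\Delta$, the threshold $u=n\Delta/4$ and the Pinsker lower bound $\frac12\exp(-\frac{n}{K-1}\frac{2\Delta^2}{\sigma^2})$, which should come out as $3\delta$. Note that the formulas only make sense for $\delta<1/6$.

```python
import math


def lower_bound_quantities(n: int, K: int, delta: float, sigma: float = 0.1):
    """Return (gap, u, pinsker) for the construction in the proof.

    gap     : Delta = sigma * sqrt((K-1)/(2n) * log(1/(6 delta)))
    u       : regret threshold n * Delta / 4
    pinsker : (1/2) * exp(-(n/(K-1)) * 2 * Delta^2 / sigma^2)
    """
    log_term = math.log(1.0 / (6.0 * delta))  # positive only if delta < 1/6
    gap = sigma * math.sqrt((K - 1) / (2.0 * n) * log_term)
    u = n * gap / 4.0
    pinsker = 0.5 * math.exp(-(n / (K - 1)) * 2.0 * gap ** 2 / sigma ** 2)
    return gap, u, pinsker


def min_horizon(K: int, delta: float) -> float:
    """Smallest n allowed by the condition of the theorem."""
    return max(32.0 * math.log(2.0 / delta),
               0.8 ** 2 * (K - 1) / 2.0 * math.log(1.0 / (6.0 * delta)))


n, K, delta = 10_000, 10, 0.01
gap, u, pinsker = lower_bound_quantities(n, K, delta)
theorem_threshold = 0.025 * math.sqrt(n * (K - 1) / 2.0 * math.log(1.0 / (6.0 * delta)))
print(f"Delta = {gap:.5f}, u = {u:.3f}, Pinsker bound = {pinsker:.4f}")
print(f"theorem threshold = {theorem_threshold:.3f}, minimal n = {min_horizon(K, delta):.1f}")
```

The second term in the condition on $n$ is exactly what is needed for $\Delta\le 1/8$ when $\sigma = 1/10$, which is the assumption of the lemma on unclipped rounds.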

So what can one take away from this post? The main thing is to recognize that the upper bounds we proved in the previous post cannot be improved very much, at least in this worst case sense. This includes the important difference between the high-probability regret that is achievable when the confidence level $\delta$ is chosen in advance and what is possible if a single strategy must satisfy a high-probability regret guarantee for all confidence levels simultaneously.

Besides this result we also introduced some new techniques that will be revisited in the future, especially the randomization device lemma. The advantage of using clipped and correlated Gaussian rewards is that it ensures the same arm is always optimal, no matter how the noise behaves.

# Technicalities

The purpose of this section is to lay to rest the two technical results required in the main body. The first is a proof of the lemma which gives us the randomization technique or “device”, and afterwards comes the proof of the lemma that controls the number of unclipped rounds.

**Proof of the randomization device lemma**

The argument underlying this goes as follows: If $A=(A_1,\dots,A_n)\in [K]^n$ are the actions of a stochastic policy $\pi=(\pi_1,\dots,\pi_n)$ when interacting with the environment where the rewards $X=(X_1,\dots,X_n)\in \R^{nK}$ are drawn from $Q$ then for $t=1,\dots,n$, $A_t\sim \pi_t(\cdot|A_1,X_{1,A_1},\dots,A_{t-1},X_{t-1,A_{t-1}})$ and thus the distribution $\PP_{Q,\pi}$ of $(X,A)$ satisfies

\begin{align*}

d\PP_{Q,\pi}(x,a)

&= \pi(a|x) d\rho^{\otimes n}(a) dQ(x) \,,

\end{align*}

where $a=(a_1,\dots,a_n)\in [K]^n$, $x = (x_1,\dots,x_n)\in \R^{nK}$,

\begin{align*}

\pi(a|x)\doteq

\pi_1(a_1) \pi_2(a_2|a_1,x_{1,a_1}) \cdots \pi_n(a_n|a_1,x_{1,a_1}, \dots,a_{n-1},x_{n-1,a_{n-1}})

\end{align*}

and $\rho^{\otimes n}$ is the $n$-fold product of $\rho$ with itself, where $\rho$ is the counting measure on $[K]$. Letting $\delta_x$ be the Dirac (probability) measure on $\R^{nK}$ concentrated at $x$ (i.e., $\delta_x(U) = \one{x\in U}$), we have that $\PP_{Q,\pi}$ can be *disintegrated* into $Q$ and $\{\PP_{\delta_x,\pi}\}_x$. In particular, a direct calculation verifies that

\begin{align}

d\PP_{Q,\pi}(x,a) = \int_{y\in \R^{nK}} dQ(y) \, d\PP_{\delta_y,\pi}(x,a) \,.

\label{eq:disintegration}

\end{align}

Let $(X_t,A_t)$ be the reward and action of round $t$ in the extended canonical bandit probability space and $\hat R_n$ the random regret defined in terms of these random variables. For any Borel $U\subset \R$,

\begin{align*}

\PP_{Q,\pi}( \hat R_n \in U )

&= \int \one{\hat R_n(x,a)\in U} d\PP_{Q,\pi}(x,a) \\

&= \int_{\R^{nK}} dQ(y) \left( \int_{\R^{nK}\times [K]^n} \one{\hat R_n(x,a)\in U} d\PP_{\delta_y,\pi}(x,a) \right)\\

&= \int_{\R^{nK}} dQ(y) \PP_{\delta_y,\pi}( \hat R_n\in U )\,,

\end{align*}

where the second equality uses \eqref{eq:disintegration} and Fubini. From the above equality it is obvious that it is not possible that $\PP_{Q,\pi}( \hat R_n \in U )\ge \delta$ while for all $y\in \mathrm{support}(Q)$, $\PP_{\delta_y,\pi}( \hat R_n\in U )<\delta$, thus finishing the proof.
QED
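The last step is just the observation that an average cannot exceed the largest of the averaged values. As a toy illustration (the three reward sequences and all the numbers below are invented for the example, with $Q$ supported on finitely many points):

```python
def mixture_probability(q: dict, p: dict) -> float:
    """Compute sum_y Q(y) * P_{delta_y, pi}(regret >= u) for a finitely supported Q.

    q[y] is the probability of reward sequence y under Q and
    p[y] is the probability of the event {regret >= u} when the rewards are fixed to y.
    """
    return sum(q[y] * p[y] for y in q)


# Made-up toy values: three candidate reward sequences.
q = {"y1": 0.5, "y2": 0.3, "y3": 0.2}
p = {"y1": 0.10, "y2": 0.40, "y3": 0.05}

mix = mixture_probability(q, p)
best = max(p[y] for y in q if q[y] > 0)  # maximum over the support of Q
print(mix, best)
```

Whenever the mixture probability is at least $\delta$, the best element of the support also has probability at least $\delta$, which is the content of the lemma.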

**Proof of the lemma controlling the number of unclipped rounds**: First note that $U'\le U$ and thus $\Prob{U < 3n/4}\le \Prob{U' < 3n/4}$; hence it suffices to control the latter.
Since $\Delta\le 1/8$ and $\eta_t$ is a Gaussian with zero mean and variance $\sigma^2$, and in particular $\eta_t$ is $\sigma^2$-subgaussian, we have
\begin{align*}
1 - p = \Prob{U_t' = 0}
& \leq \Prob{ |\eta_t| > 1/2-2\Delta }

\leq 2 \exp\left(-\frac{\left(1/2 - 2\Delta\right)^2}{2\sigma^2}\right)\\

& \leq 2 \exp\left(-\frac{1}{2 (4)^2 \sigma^2}\right)

\le \frac{1}{8}\,,

\end{align*}

where the last inequality follows whenever $\sigma^2 \le \frac{1}{32 \log 16}$ which is larger than $0.01$. Therefore $p \ge 7/8$ and

\begin{align*}

\Prob{\sum_{t=1}^n U_t' < \frac{3n}{4}}
&= \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) < -(p-\frac{3}{4}) } \\
&\le \Prob{\frac{1}{n} \sum_{t=1}^n ( U_t' - p) \le - \frac{1}{8}} \\
&\leq \exp\left(-\frac{n}{32}\right) \leq \frac{\delta}{2}\,,
\end{align*}
where the second last inequality uses Hoeffding’s bound together with the fact that $U_t'-p$ is $1/4$-subgaussian, and the last holds by our assumption on $n$.

QED
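The constants in this proof are easy to sanity-check numerically. The sketch below (an illustration only; the choice $\delta=0.05$ is arbitrary) compares the exact Gaussian probability of a clipped round with the subgaussian bound for $\sigma=1/10$ and $\Delta=1/8$, checks that $\frac{1}{32\log 16}>0.01$, and confirms that $n\ge 32\log(2/\delta)$ makes the Hoeffding bound at most $\delta/2$.

```python
import math


def gaussian_upper_tail(x: float, sigma: float) -> float:
    """P(eta > x) for eta ~ N(0, sigma^2), computed via the complementary error function."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))


sigma, gap = 0.1, 1.0 / 8.0

# Exact probability that eta_t falls outside [-1/2, 1/2 - 2*gap].
exact = gaussian_upper_tail(0.5, sigma) + gaussian_upper_tail(0.5 - 2 * gap, sigma)
# Subgaussian bound used in the proof.
subgaussian = 2.0 * math.exp(-((0.5 - 2 * gap) ** 2) / (2.0 * sigma ** 2))
# Largest variance for which the bound is at most 1/8.
sigma_sq_threshold = 1.0 / (32.0 * math.log(16.0))

delta = 0.05
n = math.ceil(32.0 * math.log(2.0 / delta))
hoeffding = math.exp(-n / 32.0)

print(f"exact = {exact:.5f}, subgaussian bound = {subgaussian:.5f}")
print(f"variance threshold = {sigma_sq_threshold:.5f}")
print(f"n = {n}, Hoeffding bound = {hoeffding:.5f}, delta/2 = {delta / 2}")
```

The exact clipping probability is much smaller than $1/8$, so the lemma has plenty of slack; the subgaussian bound is used only because it keeps the argument short.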

# Notes

Note 1: It so happens that the counter-example construction we used means that the same arm has the best reward in every round (not just the best mean). It is perhaps a little surprising that algorithms cannot exploit this fact, in contrast to the experts setting where this knowledge enormously improves the achievable regret.

Note 2: Adaptivity is all the rage right now. Can you design an adversarial bandit algorithm that exploits “easy data” when available? For example, if the rewards don’t change much over time, or lie in a small range. There are still a lot of open questions in this area. The paper referenced below gives lower bounds for some of these situations.

# References

Sebastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. 2016

# Contextual Bandits and the Exp4 Algorithm

In most bandit problems there is likely to be some additional information available at the beginning of rounds and often this information can potentially help with the action choices. For example, in a web article recommendation system, where the goal is to keep the visitors engaged with the website, *contextual information* about the visitor of the website, the time of day, information on what is trendy, etc., can likely improve the choice of the article to be put on the “front-page”. For example, a science-oriented article is more likely to grab the attention of a science geek, and a baseball fan may care little about European soccer.

If we used a standard bandit algorithm (like Exp3, Exp3-IX, or UCB), the one-size-fits-all approach implicitly taken by these algorithms, which aim only at finding the single most eye-catching article, is likely to disappoint an unnecessarily large portion of the site’s visitors. In situations like this, since the benchmark that the bandit algorithms aim to approach performs poorly by omitting available information, it is better to change the problem and redefine the benchmark! It is important to realize though that there is an inherent difficulty here:

Competing with a poor benchmark does not make sense since even an algorithm that perfectly matches the benchmark will perform poorly. At the same time, competing with a better benchmark can be harder from a learning point of view and in a specific scenario the gain from a better benchmark may very well be offset by the fact that algorithms that compete with stronger benchmark have to search in a larger space of possibilities.

The tradeoff just described is fundamental to all machine learning problems. In statistical estimation, the analogue tradeoff is known as the *bias-variance tradeoff*.

We will not attempt to answer the question of how to resolve this tradeoff in this post because first we need to see ways of effectively competing with improved benchmarks. To begin, let’s talk about possible improvements to the benchmarks.

# Contextual bandits: One bandit per context

In a contextual bandit problem everything works the same as in a bandit problem, except that the learner receives a context at the beginning of the round, before it needs to select its action. The promise, as discussed, is that perhaps specializing the action taken to the context can help to collect more reward.

Assuming that the set $\cC$ of all possible contexts is finite, one simple approach then is to set up a separate bandit algorithm, such as Exp3, for each context. Indeed, if we do this then the collection of bandit algorithms should be able to compete with the best context-dependent action. In particular, the best total reward that we can achieve in $n$ rounds if we are allowed to adjust the action to the context is

\begin{align*}

S_n = \sum_{c\in \cC} \max_{k\in [K]} \sum_{t: c_t=c} x_{t,k}\,,

\end{align*}

where $c_t\in \cC$ is the context received at the beginning of round $t$. For future reference note that we can also write

\begin{align}

S_n = \max_{\phi: \cC \to [K]} \sum_{t=1}^n x_{t,\phi(c_t)}\,.

\label{eq:maxrewardunrestricted}

\end{align}

Then, the expected regret of a learner who receives a reward $X_t$ in round $t$ is $R_n=\EE{S_n - \sum_t X_t}$, which satisfies

\begin{align*}

R_n = \sum_{c\in \cC} \EE{\max_{k\in [K]} \sum_{t: c_t=c} (x_{t,k}-X_t)}\,.

\end{align*}

That is, $R_n$ is just the sum of the regrets suffered by the bandits assigned to the individual contexts. Let $T^c(n)$ be the number of times context $c\in \cC$ is seen during the first $n$ rounds and let $R^c(s)$ be the regret of the instance of Exp3 associated with $c$ at the end of the round when this instance is used $s$ times. With this notation we can thus write

\begin{align*}

R_n = \sum_{c\in \cC} \EE{ R^c(T^c(n)) }\,.

\end{align*}

Since $T^c(n)$ may vary from context to context and the distribution of $T^c(n)$ may be uneven, it would be wasteful to use a version of Exp3 that is tuned to achieve a small regret after a fixed number of rounds as such a version of Exp3 may suffer a larger regret when the actual number of rounds $T^c(n)$ is vastly different from the anticipated horizon. Luckily, the single parameter $\eta$ of Exp3 can be chosen in a way that does not depend on the anticipated horizon without losing much on the regret. In particular, as we hinted before, such an *anytime* version of Exp3 can be created if we let the $\eta$ parameter of Exp3 depend on the round index. Specifically, when an Exp3 instance is used the $s$th time, if we set $\eta$ to $\eta_s = \sqrt{\log(K)/(sK)}$, then one can show that $R^c(s) \le 2\sqrt{sK \log(K)}$ will hold for any $s\ge 1$.

Plugging this into the previous display, we get

\begin{align}

R_n \le 2 \sqrt{K \log(K)} \, \sum_{c\in \cC} \sqrt{ T_c(n) }\,.

\label{eq:exp3perc}

\end{align}

How big the right-hand side is depends on how skewed the context distribution $(T_c(n)/n)_{c\in \cC}$ is.

The best case is when only one context is seen, in which case $\sum_{c\in \cC} \sqrt{ T_c(n) }=\sqrt{n}$. Note that in this case the benchmark won’t improve either, but nevertheless the algorithm that keeps one Exp3 instance for each context does not lose anything compared to the algorithm that would run a single Exp3 instance, trying to compete with the single best action. This is good. The worst case is when all contexts are seen equally often, in which case $T_c(n) = n/|\cC|$ (assume for simplicity that this is an integer). In this case, $\eqref{eq:exp3perc}$ becomes

\begin{align}

R_n \le 2 \sqrt{ K \log(K) |\cC| n }\,.

\label{eq:exp3perc2}

\end{align}

It is instructive to consider what this means for the total reward:

\begin{align*}

\EE{\sum_{t=1}^n X_t} \ge S_n - 2 \sqrt{ K \log(K) |\cC| n }\,.

\end{align*}

While the benchmark $S_n$ may have improved, the second term will always exceed the first one for the first $4 K \log(K) |\cC|$ rounds! Thus, the guarantee on the total reward will be vacuous for all earlier time steps! When $|\cC|$ is large, we conclude that for a long time, the per-instance Exp3 algorithm may have a much worse total reward than if we just ran a single instance of Exp3, demonstrating the tradeoff mentioned above.
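A quick numerical sketch makes the tradeoff concrete. The helper names below are made up for this illustration and the values of $n$, $K$ and $|\cC|$ are arbitrary. The code evaluates the bound $2\sqrt{K\log K}\sum_c \sqrt{T_c(n)}$ for a single context and for uniformly spread contexts, and computes the horizon $4K\log(K)|\cC|$ below which the guarantee on the total reward is vacuous.

```python
import math


def per_context_bound(counts, K: int) -> float:
    """Regret bound 2 * sqrt(K log K) * sum_c sqrt(T_c(n)) for one Exp3 instance per context."""
    return 2.0 * math.sqrt(K * math.log(K)) * sum(math.sqrt(c) for c in counts)


def vacuous_horizon(K: int, num_contexts: int) -> float:
    """Number of rounds n below which 2 * sqrt(K log(K) |C| n) >= n."""
    return 4.0 * K * math.log(K) * num_contexts


n, K, num_contexts = 10_000, 10, 100

single = per_context_bound([n], K)                                   # only one context is ever seen
uniform = per_context_bound([n // num_contexts] * num_contexts, K)   # all contexts equally frequent
closed_form = 2.0 * math.sqrt(K * math.log(K) * num_contexts * n)

print(f"single context: {single:.1f}")
print(f"uniform contexts: {uniform:.1f} (closed form {closed_form:.1f})")
print(f"bound is vacuous for n below {vacuous_horizon(K, num_contexts):.0f}")
```

With these values the uniform-context bound is $\sqrt{|\cC|}=10$ times the single-context bound, which is the price paid for the richer benchmark.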

# Bandits with expert advice

For large context sets, using one bandit algorithm per context will almost always be a poor choice because the additional precision is wasted unless the amount of data is enormous. Very often the contexts themselves have some kind of internal structure that we may hope to exploit. There are many different kinds of structure. For example, we expect that a person who is deeply into computer science may share common interests with people who are deeply into math, and people who are into sport acrobatics may enjoy gymnastics and vice versa. This gives the idea to group the contexts in some way, to reduce their number, and then assign a bandit to the individual groups. Assume that we chose a suitable partitioning of contexts, which we denote by $\cP \subset 2^{\cC}$. Thus, the elements of $\cP$ are disjoint subsets of $\cC$ such that jointly they cover $\cC$: $\cup \cP = \cC$. Then, the maximum total reward that can be achieved if for every element $P$ of the partitioning $\cP$ we can select a single action is

\begin{align*}

S_n = \sum_{P\in \cP} \max_{k\in [K]} \sum_{t: c_t\in P} x_{t,k}\,.

\end{align*}

It may be worthwhile to put this into an alternate form. Defining $\Phi(\cP) = \{\phi:\cC\to[K]\,;\, \forall c,c'\in \cC \text{ s.t. } c,c'\in P \text{ for some } P\in \cP, \phi(c) = \phi(c')\}$ to be the set of functions that map contexts in the same part of the partition to the same action (the astute reader may recognize $\Phi(\cP)$ as the set of all $\sigma(\cP)$-measurable functions from $\cC$ to $[K]$), we can rewrite $S_n$ as

\begin{align*}

S_n = \max_{\phi\in \Phi(\cP)} \sum_{t=1}^n x_{t,\phi(c_t)}\,.

\end{align*}

Compared to $\eqref{eq:maxrewardunrestricted}$ we see that what changed is that the set of functions that we are maximizing over became smaller. The reader may readily verify that the regret of the composite algorithm where an Exp3 with a varying parameter sequence as described above is used for each element of $\cP$ is exactly of the form $\eqref{eq:exp3perc}$ except that $c\in \cC$ has to be changed to $P\in \cP$ and the definition of $T_c$ also needs to be adjusted accordingly.
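The equivalence of the two forms of $S_n$ can be checked by brute force on a small example. This is a sketch under made-up assumptions: the context names, the partition and the random rewards are all invented, and the function names are hypothetical.

```python
import itertools
import random


def best_reward_per_cell(partition, contexts, x, K):
    """Sum over cells P of max_k sum_{t: c_t in P} x[t][k]."""
    total = 0.0
    for cell in partition:
        total += max(
            sum(x[t][k] for t, c in enumerate(contexts) if c in cell)
            for k in range(K)
        )
    return total


def best_reward_over_functions(partition, contexts, x, K):
    """Max over functions phi that are constant on each cell of sum_t x[t][phi(c_t)]."""
    best = float("-inf")
    for actions in itertools.product(range(K), repeat=len(partition)):
        phi = {c: a for cell, a in zip(partition, actions) for c in cell}
        best = max(best, sum(x[t][phi[c]] for t, c in enumerate(contexts)))
    return best


rng = random.Random(1)
K, n = 3, 50
partition = [{"cs", "math"}, {"acrobatics", "gymnastics"}]
all_contexts = sorted(c for cell in partition for c in cell)
contexts = [rng.choice(all_contexts) for _ in range(n)]
x = [[rng.random() for _ in range(K)] for _ in range(n)]

s_cells = best_reward_per_cell(partition, contexts, x, K)
s_functions = best_reward_over_functions(partition, contexts, x, K)
print(s_cells, s_functions)
```

The brute-force search enumerates $K^{|\cP|}$ functions, so it is only meant as a check on tiny instances.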

Of course, $\Phi(\cP)$ is not the only set of functions that comes to mind. We may of course consider different partitions. Or we may bring in extra structure of $\cC$. Why not, for example, use a similarity function on $\cC$ and then consider all functions which tend to assign identical actions to contexts that are more similar? For example, if $s:\cC\times \cC \to [0,1]$ is a *similarity function*, we may consider all functions $\phi: \cC\to[K]$ such that the average dissimilarity,

\begin{align*}

\frac{1}{|\cC|^2} \sum_{c,d\in \cC} (1-s(c,d)) \one{ \phi(c)\ne \phi(d) },

\end{align*}

is below a user-tuned threshold $\theta\in (0,1)$.

Another option is to run your favorite supervised learning method, training on some batch data to find a few predictors $\phi_1,\dots,\phi_M: \cC \to [K]$ (in machine learning terms, these would be called classifiers since the range space is finite). Then we could use a bandit algorithm to compete with the “best” of these in an online fashion. This has the advantage that the offline training procedure can bring in the power of batch data and the whole army of supervised learning, without relying on potentially inaccurate evaluation methods that aim to pick the best of the pack. And why pick if one does not need to?

The possibilities are endless, but in any case, we would end up with a set of functions $\Phi$ with the goal of competing with the best of them. This gives the idea that perhaps we should think more generally about some subset $\Phi$ of functions without necessarily considering the internal structure of $\Phi$. This is the viewpoint that we will take. In fact, we will bring this one or two steps further, leading to what is called *bandits with expert advice*.

In this model, there are $M$ experts that we wish to compete with. At the beginning of each round, the experts announce their predictions of which actions are the most promising. For the sake of generality, we allow the experts to report not only a single prediction, but a probability distribution over the actions. The interpretation of this probability distribution is that the expert, if the decision was left to it, would choose the action for the round at random from the probability distribution that it reported. As discussed before, in an adversarial setting, it is natural to consider randomized algorithms, hence one should not be too surprised that the experts are also allowed to randomize. In any case, this can only increase generality. For reasons that will become clear later, it will be useful to collect the advice of the $M$ experts for round $t$ into an $M\times K$ matrix $E^{(t)}$ such that the $m$th row of $E^{(t)}$, $E_m^{(t)}$, is the probability distribution that expert $m$ recommends for round $t$. Denoting by $E_{mi}^{(t)}$ the $i$th entry of the row vector $E_m^{(t)}$, we thus have $\sum_i E_{mi}^{(t)}=1$ and $E_{mi}^{(t)}\ge 0$ for every $(m,t,i)\in [M]\times \N \times [K]$. After receiving $E^{(t)}$, the learner then chooses $A_t\in [K]$ and as before observes the reward $X_t = x_{t,A_t}$, where $x_t=(x_{ti})_i$ is the $K$-dimensional vector of rewards of the individual actions. For a real-world application, see the figure below.

The regret of the learner is with respect to the total expected reward of the best *expert*:

\begin{align}

R_n = \EE{ \max_m \sum_{t=1}^n E_{m}^{(t)} x_{t} - \sum_{t=1}^n X_t }\,.

\label{eq:expertregret}

\end{align}

Below, we may allow the expert advice $E^{(t)}$ of round $t$ to depend on all the information up to the beginning of round $t$. While this does allow “learning” experts, the regret definition above is not really meaningful if the experts learn from the feedback $(A_t,X_t)$. For dealing with learning experts, it is more appropriate to measure regret as done in *reinforcement learning* where an agent controls the state of the environment, but the agent’s reward is compared to the best total reward that any other (simple) policy would incur, regardless of the “trajectory” of the agent. We will discuss reinforcement learning in some later post.

# Can it go higher? Exp4

Exp4 is actually not just an increased version number, but it stands for **E**xponential weighting for **E**xploration and **E**xploitation with **E**xperts. The idea of the algorithm is very simple: Since exponential weighting worked so well in the standard bandit problem, we should adapt it to the problem at hand. However, since now the goal is to compete with the best expert in hindsight, it is not the actions that we should score, but the experts. Hence, the algorithm will keep a probability distribution, which we will denote by $Q_t$, over the experts and use this to come up with the next action. Once the action is chosen, we can use our favorite reward estimation procedure to estimate the rewards for all the actions, which can then be used to estimate how much total reward the individual experts would have made so far, which in the end can be used to update $Q_t$.

Formally, the procedure is as follows: First, the distribution $Q_1$ is initialized to the uniform distribution $(1/M,\dots,1/M)$ (the $Q_t$ are treated as row vectors). Then, some values of $\eta,\gamma\ge 0$ are selected.

In round $t=1,2,\dots$, the following things happen:

- The advice $E^{(t)}$ is received
- Choose the action $A_t\sim P_{t,\cdot}$ at random, where $P_t = Q_t E^{(t)}$
- The reward $X_t = x_{t,A_t}$ is received
- The rewards of all the actions are estimated; say: $\hat X_{ti} = 1- \frac{\one{A_t=i}}{P_{ti}+\gamma} (1-X_t)$
- Propagate the rewards to the experts: $\tilde X_t = E^{(t)} \hat X_t$
- The distribution $Q_t$ is updated using exponential weighting:

$Q_{t+1,i} = \frac{\exp( \eta \tilde X_{ti} ) Q_{ti}}{\sum_j \exp( \eta \tilde X_{tj}) Q_{tj} }$, $i\in [M]$

Note that $A_t$ can be chosen in two steps, first sampling $M_t$ from $Q_{t}$ and then choosing $A_t$ from $E^{(t)}_{M_t,\cdot}$. The reader can verify that (given the past) the probability distribution of the so-selected action is also $P_{t}$.
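The round-by-round procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the interfaces `advice_fn` and `reward_fn`, the toy experts, and the values of $\eta$ and $\gamma$ are all assumptions made for the example. Subtracting the largest expert estimate before exponentiating does not change $Q_{t+1}$ and only guards against numerical overflow.

```python
import math
import random


def exp4(advice_fn, reward_fn, n, M, K, eta, gamma=0.0, seed=0):
    """Run Exp4 (gamma = 0) or its IX variant (gamma > 0) for n rounds.

    advice_fn(t)    -> M x K list of rows, each row a probability vector (hypothetical interface)
    reward_fn(t, a) -> reward in [0, 1] of action a in round t (hypothetical interface)
    Returns (actions, rewards, final distribution Q over experts).
    """
    rng = random.Random(seed)
    Q = [1.0 / M] * M  # Q_1 is uniform over the experts
    actions, rewards = [], []
    for t in range(n):
        E = advice_fn(t)
        # P_t = Q_t E^{(t)}: the distribution over actions induced by the experts
        P = [sum(Q[m] * E[m][i] for m in range(M)) for i in range(K)]
        a = rng.choices(range(K), weights=P)[0]
        x = reward_fn(t, a)
        # Reward estimates: 1 - 1{A_t = i} (1 - X_t) / (P_ti + gamma)
        X_hat = [1.0 - (1.0 - x) / (P[i] + gamma) if i == a else 1.0 for i in range(K)]
        # Propagate the estimates to the experts: X_tilde = E^{(t)} X_hat
        X_tilde = [sum(E[m][i] * X_hat[i] for i in range(K)) for m in range(M)]
        # Exponential weighting update of Q
        top = max(X_tilde)
        w = [Q[m] * math.exp(eta * (X_tilde[m] - top)) for m in range(M)]
        total = sum(w)
        Q = [v / total for v in w]
        actions.append(a)
        rewards.append(x)
    return actions, rewards, Q


# Toy usage: expert 0 always recommends action 0 (reward 1), expert 1 always action 1 (reward 0).
def constant_advice(t):
    return [[1.0, 0.0], [0.0, 1.0]]


def toy_reward(t, a):
    return 1.0 if a == 0 else 0.0


actions, rewards, Q = exp4(constant_advice, toy_reward, n=500, M=2, K=2, eta=0.1, gamma=0.01)
print(Q, sum(rewards))
```

In this toy problem the weight of the good expert can only grow, since the estimated reward of the bad expert drops below $1$ whenever its recommended action is played.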

# A bound on the expected regret of Exp4

The algorithm when we set $\gamma=0$ is actually what is most commonly known as Exp4 and the algorithm when $\gamma>0$ is the “IX” version of Exp4. As in the case of Exp3, setting $\gamma>0$ helps concentrating the regret. Here, for the sake of brevity we only consider the expected regret of Exp4; the analysis of the tail properties of Exp4-IX with appropriately tuned $\eta,\gamma$ is left as an exercise for the reader.

To bound the expected regret of Exp4, we apply the analysis of Exp3. In particular, the following lemma can be extracted from our earlier analysis of Exp3:

Lemma (regret of exponential weighting): Let $(\hat X_{ti})_{ti}$ and $(P_{ti})_{ti}$ satisfy for all $t\in [n]$ and $i\in [K]$ the relations $\hat X_{ti}\le 1$ and

\begin{align*}

P_{ti} = \frac{\exp( \eta \sum_{s=1}^{t-1} \hat X_{si} )}{\sum_j\exp( \eta \sum_{s=1}^{t-1} \hat X_{sj} )}\,.

\end{align*}

Then, for any $i\in [K]$,

\begin{align*}

\sum_{t=1}^n \hat X_{ti} - \sum_{t=1}^n \sum_{j=1}^K P_{tj} \hat X_{tj} \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t,j} P_{tj} (1-\hat X_{tj})^2\,.

\end{align*}

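Applying this lemma with the experts playing the role of the actions (that is, with $Q_t$, $\tilde X_t$ and $M$ in place of $P_t$, $\hat X_t$ and $K$), we immediately get that for any $m\in [M]$,

\begin{align*}

\sum_{t=1}^n \tilde X_{tm} - \sum_{t=1}^n \sum_{m'} Q_{t,m'} \tilde X_{tm'} \le \frac{\log(M)}{\eta}

+ \frac{\eta}{2}\, \sum_{t,m'} Q_{t,m'} (1-\tilde X_{tm'})^2\,.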

\end{align*}

Note that by our earlier argument $\EE{\hat X_{ti}} = x_{ti}$ and hence the expected value of the left-hand side of the above display is the regret $R_n$ defined in $\eqref{eq:expertregret}$. Hence, to derive a regret bound it remains to bound the expectation of the right-hand side. For this, as before, we will find it useful to introduce the loss estimates $\hat Y_{ti} = 1-\hat X_{ti}$, the losses $y_{ti} = 1-x_{ti}$, and also the loss estimates of the experts: $\tilde Y_{tm} = 1-\tilde X_{tm}$, $m\in [M]$. Note that $\tilde Y_t = E^{(t)} \hat Y_t$, thanks to $E^{(t)} \mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is the vector whose components are all equal to one. Defining $A_{ti} = \one{A_t=i}$ and recalling that $\gamma=0$, we have $\hat Y_{ti} = \frac{A_{ti}}{P_{ti}} y_{ti}$.

Hence, using $\E_{t-1}[\cdot]$ to denote expectation conditioned on the past $(E^{(1)},A_1,X_1,\dots,E^{(t-1)},A_{t-1},X_{t-1},E^{(t)})$, we have

\begin{align*}

\E_{t-1}[ \tilde Y_{tm}^2 ] =\E_{t-1}\left[ \left(\frac{E^{(t)}_{m,A_t} y_{t,A_t}}{P_{t,A_t}}\right)^2\right]

= \sum_i \frac{(E^{(t)}_{m,i})^2 y_{t,i}^2}{P_{ti}} \le \sum_i \frac{(E^{(t)}_{m,i})^2 }{P_{ti}}\,.

\end{align*}

Therefore, \(\require{cancel}\)

\begin{align*}

\E_{t-1}\left[ \sum_m Q_{tm} \tilde Y_{tm}^2 \right]

&\le \sum_m Q_{tm} \sum_i \frac{(E^{(t)}_{m,i})^2 }{P_{ti}}

\le \sum_{i} \left(\max_{m'} E_{m',i}^{(t)}\right) \frac{ \cancel{\sum_m Q_{tm} E^{(t)}_{m,i}} }{\cancel{P_{ti}}} \,.

\end{align*}

Defining

\begin{align*}

E^*_n = \sum_{t,i} \left(\max_{m'} E_{m',i}^{(t)}\right)\,,

\end{align*}

we get the following result:

Theorem (Regret of Exp4): If $\eta$ is chosen appropriately, the regret of Exp4 defined in $\eqref{eq:expertregret}$ satisfies

\begin{align}

R_n \le \EE{\sqrt{ 2 \log(M) E^*_n }}\,.

\label{eq:exp4bound}

\end{align}

Note that to get this result, one should set $\eta = \sqrt{(2\log M)/E^*_n}$, which is an infeasible choice when $E^{(1)},\dots,E^{(n)}$ are not available in advance, as is typically the case. Hence, in a practical implementation, one sets $\eta_t = \sqrt{\log M/E^*_t}$, which is a decreasing sequence (the factor of $2$ inside the square root will be lost due to some technical reasons). Then, as discussed before, a regret bound of the above form with a slightly larger constant will hold.

To assess the quality of the bound in the above theorem we consider a few upper bounds on $\eqref{eq:exp4bound}$. First, the expression in \eqref{eq:exp4bound} is the smallest when all experts agree: $E_m^{(t)} = E_{m'}^{(t)}$ for all $t,m,m'$. In this case $E_n^*= n$ and the right-hand side of $\eqref{eq:exp4bound}$ becomes $\sqrt{ 2 \log(M) n }$. We see that the price we pay for not knowing in advance that the experts will agree is $O(\sqrt{\log(M)})$. This case also highlights that in a way $E_n^*$ measures the amount of agreement between the experts.

In general, we cannot know whether the experts will agree. In any case, from $E_{m,i}^{(t)}\le 1$, we get $E_n^*\le Kn$, leading to

\begin{align}

R_n \leq \sqrt{ 2n K \log(M) }\,.

\label{eq:fewactions}

\end{align}

This shows that the price paid for adding experts is relatively low, as long as we have few actions.

How about the opposite case when the number of experts is low? Since each row of $E^{(t)}$ sums to one, $\sum_i \max_m E_{mi}^{(t)} \le \sum_i \sum_m E_{mi}^{(t)} = M$. Hence $E_n^* \le M n$ and thus

\begin{align*}

R_n \leq \sqrt{ 2n M \log(M) }\,.

\end{align*}

This shows that the number of actions can be very high, yet the regret will be low if we have only a few experts. This should make intuitive sense.

The above two bounds can also be summarized as (as they hold simultaneously):

\begin{align*}

R_n \leq \sqrt{ 2n (M\wedge K) \log(M) }\,.

\end{align*}

In a way, Exp4 adapts to whether the number of experts, or the number of actions is small (and in fact adapts to the alignment between the experts).
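To get a feel for how $E_n^*$ interpolates between these cases, the following sketch computes it for synthetic advice. The advice matrices are randomly generated purely for illustration, and the helper names `agreement_term` and `exp4_bound` are assumptions introduced here, not part of the algorithm.

```python
import numpy as np

def agreement_term(advice):
    """Return E*_n = sum over t and i of max_m E^(t)_{m,i}; advice has shape (n, M, K)."""
    return float(advice.max(axis=1).sum())

def exp4_bound(advice):
    """Right-hand side sqrt(2 log(M) E*_n) of the regret bound."""
    M = advice.shape[1]
    return float(np.sqrt(2.0 * np.log(M) * agreement_term(advice)))

rng = np.random.default_rng(0)
n, M, K = 100, 4, 10
# All experts give the same advice in every round, so E*_n = n.
agreeing = np.tile(rng.dirichlet(np.ones(K), size=(n, 1)), (1, M, 1))
# Independent random advice: E*_n lies between n and n * min(M, K).
random_advice = rng.dirichlet(np.ones(K), size=(n, M))
print(agreement_term(agreeing), agreement_term(random_advice), n * min(M, K))
print(exp4_bound(agreeing), exp4_bound(random_advice))
```

The more the experts disagree, the closer $E_n^*$ gets to $n (M\wedge K)$ and the weaker the bound becomes.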

How does this bound compare to our earlier bounds, e.g., $\eqref{eq:exp3perc2}$? The number $M$ of all possible maps of $\cC$ to $[K]$ is $|[K]^{\cC}| = K^{|\cC|}$. Taking $\log$, we see that $\log(M) =|\cC| \log(K)$. From $\eqref{eq:fewactions}$, we conclude that $R_n \le \sqrt{ 2n K |\cC| \log(K) }$, which is the same as $\eqref{eq:exp3perc2}$, up to a constant factor, which would be lost anyways if we properly bounded the regret of Exp4 that uses changing parameters. While we get back the earlier bound, we won’t be able to implement the Exp4 algorithm for even moderate context and action sizes. In particular, the memory requirement of Exp4 is $O(K^{|\cC|})$, while the memory requirement of the specialized method where an Exp3 algorithm is used with every instance is $O(|\cC| K)$. The setting when Exp4 is practically useful is when $M$ is small and $\cC$ is large.

# Conclusions

In this post we have introduced the contextual bandit setting and the Exp4 category of strategies. Perhaps the most important point of this post beyond the algorithms is to understand that there are trade-offs between having a larger comparison class and a more meaningful definition of the regret that this entails.

There are many points we have not developed in detail. One is high probability bounds, which we saw in the previous post for Exp3-IX and can also be derived here. We also have not mentioned lower bounds. As it turns out, the results we have given are essentially tight, at least in some instances, and we hope to return to this topic in future posts. Finally there is the “hot-topic” of adaptation: for example, what is gained in terms of the regret if the experts are often in agreement?

# References

- For a history of contextual bandits see: Tewari and Murphy. From Ads to Interventions: Contextual Bandits in Mobile Health, 2016
- The Exp4 algorithm was introduced by Auer et al. The Non-stochastic Multiarmed Bandit Problem, 2002.
- For tighter bounds when the experts agree to some extent: McMahan and Streeter. Tighter Bounds for Multi-Armed Bandits with Expert Advice, 2009.
- For the Exp4-IX: Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits, 2015.
- Another approach for high-probability bounds generalizing Exp3.P: Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2011

# Stochastic Linear Bandits and UCB

Recall that in the adversarial contextual $K$-action bandit problem, at the beginning of each round $t$ a context $c_t\in \Ctx$ is observed. The idea is that the context $c_t$ may help the learner to choose a better action. This led us to change the benchmark in the definition of regret. In this post we start with reviewing how contextual bandit problems can be defined in the stochastic setting. We use this setting to motivate the introduction of stochastic linear bandits, a fascinatingly rich model with much structure and which will be the topic of a few of the next posts. Besides defining stochastic linear bandits we also cover how UCB can be generalized to this setting.

# Stochastic contextual bandits

In the standard $K$-action stochastic contextual bandit problem at the beginning of round $t$ the learner observes a context $C_t\in \Ctx$. The context may or may not be random. Next, the learner chooses its action $A_t\in [K]$ based on the information available. So far there is no difference to the adversarial setting. The difference comes from the assumption that the reward $X_t$ which is incurred satisfies

\begin{align*}

X_t = r(C_t,A_t) + \eta_t\,,

\end{align*}

where $r:\Ctx\times [K]\to \R$ is the so-called *reward function*, which is unknown to the learner, while $\eta_t$ is random noise.

In particular, the assumption on the noise is as follows: Let

\begin{align*}

\cF_t = \sigma(C_1,A_1,X_1,\dots,C_{t-1},A_{t-1},X_{t-1},C_t,A_t)

\end{align*}

be the $\sigma$-field summarizing the information available just before $X_t$ is observed. Then, given the past, we assume that $\eta_t$ is conditionally $1$-subgaussian:

\begin{align*}

\EE{ \exp( \lambda \eta_t ) | \cF_t } \le \exp(\frac{\lambda^2}{2})\,.

\end{align*}

(The constant $1$ is chosen to minimize the number of symbols; there is no difficulty considering the more general case of conditionally $\sigma^2$-subgaussian noise.) As discussed beforehand, subgaussian random variables have zero mean, hence the above assumption also implies that $\EE{\eta_t|\cF_t}=0$, or $\EE{X_t|\cF_t}=r(C_t,A_t)$. In words, for any given $(c,a)\in \Ctx\times [K]$ context-action pair, $r(c,a)$ gives the *mean reward* of action $a$ in context $c$.

If $r$ was given then the learner wishing to maximize the total expected reward would choose the action $A_t^* = \argmax_{a\in [K]} r(C_t,a)$ in round $t$ (if multiple maximizers exist, choose one). The loss due to the lack of knowledge of $r$ makes the learner incur the (expected) regret

\begin{align*}

R_n = \EE{ \sum_{t=1}^n \max_{a\in[K]} r(C_t,a) - \sum_{t=1}^n X_t }\,.

\end{align*}

# Towards linear bandits

To act eventually optimally, the learner may estimate $r(c,a)$ for each $(c,a)\in \Ctx\times [K]$ pair. Similarly to what happens in the adversarial setting, this is ineffective when the number of context-action pairs is large, because in this case the regret can be very high for a long time. In particular, just like in the adversarial case, the worst-case regret over all possible contextual problems with $M$ contexts and mean rewards in $[0,1]$ is at least $\Omega( \sqrt{nMK} )$. One refinement of the bound is to replace $M$ by the number of effective contexts, i.e., contexts that appear frequently. But if the context contains a lot of detailed information, this may still be very high. However, if the reward function enjoys additional structure, this worst case argument will fail. The additional structure may come in many different forms. “Smoothness”, which was also mentioned previously when we discussed adversarial contextual bandits, is one example.

An alternative (but related) assumption uses the linear structure of the set $\R^{\Ctx\times [K]}$ of all possible reward functions (recall that real-valued functions form a vector space over the reals). The so-called **linearity assumption** postulates that $r$ belongs to a *low-dimensional* linear subspace $\cS$ of $\cV \doteq \R^{\Ctx\times [K]}$.

A somewhat finer condition is to assume a specific “parameterization” of the subspace $\cS$. This works as follows: It is assumed that the learner is given access to a map $\psi: \Ctx \times [K] \to \R^d$ and that with some unknown parameter vector $\theta_*\in \R^d$,

\begin{align*}

r(c,a) = \theta_*^\top \psi(c,a)\,, \qquad \forall (c,a)\in \Ctx\times [K]\,.

\end{align*}

The map $\psi$, in line with the machine learning literature, is called a **feature-map**. Assuming the knowledge of $\psi$ is equivalent to knowing the linear subspace $\cS$. The refinement of the subspace condition comes from extra assumptions that one puts on $\theta_*$, such as that the magnitude of $\theta_*$ as measured in a certain norm $\norm{\cdot}$ is “small”, or that even $\norm{\theta_*}\le B$ with $B$ known, or, more generally, that $\theta_*\in \Theta$ for some known (small) set $\Theta\subset \R^d$ or a mix of these.

The idea of **feature maps** is best illustrated with an example: If the context denotes the visitor of a website selling books, the actions are books to recommend, and the reward is the revenue on a book sold, then the features could indicate the interests of the visitors as well as the domain and topic of the books. If the visitors and books are assigned to finitely many categories, indicator variables of all possible combinations of these categories could be used to create the feature map. Of course, many other possibilities exist. One such possibility is to train a neural network (deep or not) on historical data to predict the revenue and use the nonlinear map that we obtain by removing the last layer of the neural network. The subspace $\Psi$ spanned by the **feature vectors** $\{\psi(c,a)\}_{c,a}$ in $\R^d$ is called the **feature-space**.
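As a small illustration of the indicator construction, here is a sketch in Python. The numbers of categories and the revenue table are made up, and the sketch assumes that each visitor and each book has already been mapped to a category index, so `psi` takes category indices rather than raw contexts and actions.

```python
import numpy as np

N_VISITOR_CATS, N_BOOK_CATS = 3, 4   # hypothetical numbers of categories

def psi(visitor_cat, book_cat):
    """Indicator feature vector of a (visitor category, book category) pair."""
    features = np.zeros(N_VISITOR_CATS * N_BOOK_CATS)
    features[visitor_cat * N_BOOK_CATS + book_cat] = 1.0
    return features

# With indicator features, theta_* is just the table of mean revenues, flattened.
mean_revenue = np.arange(12, dtype=float).reshape(N_VISITOR_CATS, N_BOOK_CATS)
theta_star = mean_revenue.ravel()

# Mean reward of recommending a book of category 1 to a visitor of category 2.
r = theta_star @ psi(2, 1)
```

Here $d$ equals the number of category pairs, so linearity holds trivially. The assumption becomes restrictive, and useful, once $d$ is much smaller than the number of context-action pairs.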

An assumption on $\norm{\theta_*}$ encodes **smoothness** of $r$. In particular, from Hölder’s inequality,

\begin{align*}

|r(c,a)-r(c',a')| \le \norm{\theta_*} \norm{\psi(c,a)-\psi(c',a')}_*\,,

\end{align*}

where $\norm{\cdot}_*$ denotes the dual of $\norm{\cdot}$. This is how $\psi$ implicitly encodes “smoothness”. Restrictions on $\norm{\theta_*}$ have a similar effect to assuming that the dimensionality $d$ of the subspace $\cS$ is small. In fact, one may push this to the extreme and allow $d$ to be infinite, which can buy tremendous flexibility. With this much flexibility the linearity assumption perhaps feels less limiting.

Another assumption, which is similar to yet different from the previously mentioned ones, is to assume that $\theta_*$ is **sparse**. By this we mean that many entries of $\theta_*$ are zero. Sometimes this is expressed by saying that the $0$-norm of $\theta_*$ is small. Here, the zero-norm, $\norm{\theta_*}_0$, just counts the number of nonzero entries of $\theta_*$: $\norm{\theta_*}_0 = |\{ i \,:\, \theta_{*,i}\ne 0 \}|$. The point of the sparsity assumption is to remove the burden from the designer of the feature map to leave out components that are unnecessary to get a good approximation of $r$: The designer may then include any component that may be relevant in predicting the reward, enforcing the selection of the “relevant” features by imposing a constraint on the number of nonzero entries of $\theta_*$.

# Stochastic linear bandits

**Stochastic linear bandits** arise from realizing that when the reward is linear in the feature vectors, the identity of the actions becomes secondary and we rather let the algorithms choose the feature vectors directly: the identity of the actions adds no information or structure to the problem. This results in the following model: In round $t$, the learner is given the decision set $\cD_t\subset \R^d$ that it has to choose an action from. If the learner chooses $A_t\in \cD_t$, it incurs the reward

\begin{align*}

X_t = \ip{A_t,\theta_*} + \eta_t\,,

\end{align*}

where $\eta_t$ is $1$-subgaussian given $\cD_1,A_1,X_1,\dots,\cD_{t-1},A_{t-1},X_{t-1},\cD_t$ and $A_t$. The regret suffered is

\begin{align*}

R_n = \EE{ \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \sum_{t=1}^n X_t }\,.

\end{align*}

With $\cD_t = \{ \psi(C_t,a) \,:\, a\in [K] \}$ the model reproduces contextual bandits. When $\cD_t = \{e_1,\dots,e_d\}$ (where $e_1,\dots,e_d$ are the unit vectors in the standard Euclidean basis), stochastic linear bandits reproduce finite action stochastic bandits. Linear stochastic bandits also arise naturally with **combinatorial action sets**, i.e., when $\cD \subset \{0,1\}^d$: Many combinatorial problems (such as matching, least-cost problems in directed graphs, choosing spanning trees, etc.) can be written as linear optimization over some combinatorial set $\cD$ obtained from considering incidence vectors often associated with some graph. We hope to cover some of these fun problems later.

# UCB for linear bandits

The UCB algorithm is a very attractive algorithm for finite-action stochastic bandits: It is near-minimax optimal and is also almost instance optimal for any finite horizon and even asymptotically. It is thus quite natural to attempt to generalize UCB to the linear settings.

The generalization is based on the view that UCB implements the optimism in the face of uncertainty principle, according to which one should choose the actions as if the environment (in our case the linear bandit environment) was as nice as plausibly possible. In finite-action stochastic bandit problems the principle dictates to choose the action with the largest upper confidence bound. In the case of linear bandit problems this still holds, but now to calculate the upper confidence bounds one should also take into account the information conveyed by all the rewards observed, because all the data $(A_1,X_1,\dots,A_{t-1},X_{t-1})$ is now connected through the unknown parameter vector.

One idea is to construct a “confidence set” $\cC_{t}$ based on $(A_1,X_1,\dots,A_{t-1},X_{t-1})$ that contains the unknown parameter vector $\theta_*$ with high probability. Leaving details of how the confidence set is constructed aside for a moment but assuming that the confidence set indeed contains $\theta_*$, for any given action $a\in \R^d$,

\begin{align}

\mathrm{UCB}_t(a) = \max_{\theta\in \cC_t} \ip{a,\theta}

\label{eq:UCBCt}

\end{align}

will be an upper bound on the mean payoff of $a$. The UCB algorithm that uses the confidence set $\cC_t$ at time $t$ then selects

\begin{align}

A_t = \argmax_{a\in \cD_t} \mathrm{UCB}_t(a)\,.

\label{eq:UCBgencDt}

\end{align}

In fact, the last equation is what we will take as the definition of UCB regardless of how the $\UCB_t(\cdot)$ values are defined. Of course the naming is justified only when the UCB values are indeed upper bounds on the mean payoffs of the actions.

Depending on the authors, UCB applied to linear bandits is known by many names, including but not limited to LinRel (after perhaps **Lin**ear **R**einforcement **L**earning), LinUCB (an obvious choice), and OFUL (**O**ptimism in the **F**ace of **U**ncertainty for **L**inear bandits), just to mention a few.

## Computation

Note that as long as $\cD_t$ has a few vectors in it, and the linear optimization problem \eqref{eq:UCBCt} can be efficiently solved (such as when $\cC_t$ is convex), the computation is efficient. To discuss the computation cost further note that the computation of $A_t$ can also be written as

\begin{align}

(A_t,\tilde \theta_t) = \argmax_{(a,\theta)\in \cD_t\times \cC_t} \ip{a,\theta}\,.

\label{eq:UCBgenjoint}

\end{align}

This is a bilinear optimization problem over the set $\cD_t \times \cC_t$. In general, nothing much can be said about the computational efficiency of solving this problem. One special case when a solution can be found efficiently is when *(i)* the linear optimization problem $\max_{a\in \cD_t} \ip{a,\theta}$ can be efficiently solved for any $\theta$ (this holds in many combinatorial settings) and *(ii)* $\cC_t$ is the convex hull of a handful of vertices: $\cC_t = \mathrm{co}(c_{t1},\dots,c_{tp})$. Indeed, in this case for any $a\in \cD_t$, the maximum of $\ip{a,\theta}$ over $\theta\in \cC_t$ is attained at one of $c_{t1},\dots,c_{tp}$. Hence, the solution to \eqref{eq:UCBgenjoint} can be obtained by solving $\max( \max_{a\in \cD_t} \ip{a,c_{t1}}, \dots, \max_{a\in \cD_t} \ip{a,c_{tp}} )$. A special case of this is when $\cC_t$ is the skewed and shifted $\ell^1$-ball, i.e. when for some nonsingular matrix $A$ and vector $\theta_0\in \R^d$, $\cC_t = \{ \theta\,:\, \norm{A(\theta - \theta_0)}_1 \le \beta_t \}$. Note that in this case $p=2d$.
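The vertex enumeration just described can be sketched as follows for a finite decision set. The function names `ucb_polytope` and `l1_ball_vertices` are hypothetical, and the sketch assumes that $\cD_t$ is small enough to be stored as the rows of a matrix. The vertices of the skewed and shifted $\ell^1$-ball are $\theta_0 \pm \beta_t A^{-1} e_i$, $i\in [d]$.

```python
import numpy as np

def ucb_polytope(actions, vertices):
    """Solve the joint maximization of <a, theta> over D_t x co(vertices).

    actions: shape (k, d), the finite decision set D_t.
    vertices: shape (p, d), the vertices c_{t1}, ..., c_{tp} of C_t.
    Returns the index of A_t, the optimistic parameter, and all UCB values.
    """
    values = actions @ vertices.T        # values[j, l] = <a_j, c_l>
    ucb = values.max(axis=1)             # UCB_t(a_j) is attained at a vertex
    best = int(ucb.argmax())
    return best, vertices[values[best].argmax()], ucb

def l1_ball_vertices(A, theta0, beta):
    """The 2d vertices of {theta : ||A (theta - theta0)||_1 <= beta}."""
    cols = beta * np.linalg.inv(A).T     # row i is beta * A^{-1} e_i
    return np.vstack([theta0 + cols, theta0 - cols])
```

The cost is $O(kpd)$ for $k$ actions and $p$ vertices. When $\cD_t$ is combinatorial, the inner maximum over actions would instead be delegated to a linear optimization oracle, once per vertex.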

Another notable case is when $\cC_t$ is an ellipsoid. To minimize clutter in writing the definition of an ellipsoid, let us introduce the notation $\norm{x}^2_V = x^\top V x$, defined for any $d\times d$ positive definite matrix $V$. The notation is justified since $\norm{\cdot}_{V}$ is indeed a norm. We will call $\norm{x}_V$ the $V$-norm of $x$. With this, choosing some center $\hat \theta\in \R^d$, a positive definite matrix $V\succ 0$ and “radius” $\beta$, an ellipsoidal confidence set takes the form $\cC_t = \{\theta\in \R^d \,:\, \norm{\theta-\hat \theta}^2_{V}\le \beta \}$ (in general, $\hat \theta$, $V$ and $\beta$ will be dependent on past observations, and the reason for not absorbing $\beta$ into the definition of $V$ will become clear later).

When $\cC_t$ is of this ellipsoidal form, the UCB values are particularly simple. To see this, first we rewrite $\cC_t$ into an alternate, equivalent form. Defining $B_2 = \{x\in \R^d\,:\, \norm{x}_2\le 1\}$ to be the unit ball with respect to the Euclidean norm, it is easy to see that $\cC_t = \hat \theta + \beta^{1/2} V^{-1/2} B_2$. Using this, a short direct calculation gives

\begin{align}\label{eq:linucb}

\mathrm{UCB}_t(a) = \ip{a,\hat \theta} + \beta^{1/2} \norm{a}_{V^{-1}}\,.

\end{align}
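For completeness, the direct calculation can be spelled out as follows. It only uses that $V^{-1/2}$ is symmetric and that the maximum of $\ip{v,u}$ over $u\in B_2$ is $\norm{v}_2$:

\begin{align*}

\mathrm{UCB}_t(a)
= \max_{u\in B_2} \ip{a,\hat \theta + \beta^{1/2} V^{-1/2} u}
= \ip{a,\hat \theta} + \beta^{1/2} \max_{u\in B_2} \ip{V^{-1/2} a, u}
= \ip{a,\hat \theta} + \beta^{1/2} \norm{V^{-1/2} a}_2\,,

\end{align*}

and $\norm{V^{-1/2} a}_2^2 = a^\top V^{-1} a = \norm{a}_{V^{-1}}^2$.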

Note the similarity to the standard finite-action UCB algorithm: Interpreting $\hat \theta$ as the estimate of $\theta_*$, $\ip{a,\hat \theta}$ can be seen as the estimate of the mean reward of $a$, while $\beta^{1/2} \norm{a}_{V^{-1}}$ is a bonus term. As we shall see later the confidence values could be defined by setting $\hat \theta$ to be the **least-squares estimator** (LSE) of $\theta_*$ with perhaps an appropriate regularization, while $V$ could be set to $V_t$, the “regularized Grammian” matrix defined using

\begin{align}

V_t=V_0 + \sum_{s=1}^{t} A_s A_s^\top, \qquad t\in [n]\,,

\label{eq:linbanditreggrammian}

\end{align}

where $V_0\succ 0$ is a fixed positive definite matrix, which is often set to $\lambda I$ with some tuning parameter $\lambda>0$.

When $\cD_t = \{e_1,\dots,e_d\}$ for all $t\in [n]$, $\hat \theta_i$ becomes the empirical mean of action $e_i$ (which could in this case be meaningfully called the $i$th action), while with $V_0=0$ the matrix $V_t$ is diagonal with its $i$th diagonal entry being the number of times action $e_i$ is used up to and including round $t$.
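Putting the pieces together, a minimal sketch of UCB with an ellipsoidal confidence set might look as follows. It assumes $V_0=\lambda I$, uses the regularized least-squares estimate $\hat\theta = V^{-1}\sum_s A_s X_s$, and treats $\beta$ as a constant supplied by the user. How $\beta$ should actually be chosen is not addressed by this sketch, and the class and method names are ours.

```python
import numpy as np

class LinUCB:
    """Sketch of UCB with an ellipsoidal confidence set (beta is a placeholder)."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)      # V_0 = lambda * I
        self.b = np.zeros(d)          # running sum of A_s X_s
        self.beta = beta

    def select(self, actions):
        """Index of the row of `actions` (the set D_t) with the largest UCB value."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b    # regularized least-squares estimate
        # widths[i] = ||a_i||_{V^{-1}} for each action a_i
        widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
        return int(np.argmax(actions @ theta_hat + np.sqrt(self.beta) * widths))

    def update(self, a, x):
        """Incorporate the played action a and the observed reward x."""
        self.V += np.outer(a, a)
        self.b += x * a
```

With `actions = np.eye(d)` the bonus of action $e_i$ is $\sqrt{\beta/(\lambda + T_i)}$, where $T_i$ is the number of times the action has been played, which mirrors the bonus of the finite-action algorithm.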

Soon we will see why the choice of $(V_t)_t$ is natural. The impatient reader may check that in the special case when $\eta_t$ is an i.i.d. zero-mean Gaussian sequence and the vectors $(A_t)_t$ are deterministically chosen (which is clearly far from what happens in UCB), $V_t$ with $V_0=0$ comes up naturally when defining a confidence set for $\theta_*$. The behavior of UCB is illustrated on the figure below.

# A Generic Bound on the Regret of UCB

In this section we bound the regret of UCB as a function of the width of the confidence intervals used by UCB without explicitly specifying how the confidence bounds are constructed, though we make a specific assumption about the form of the width. Later we will specialize this result to two different settings. In particular, our assumptions are as follows:

- Bounded scalar mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$;
- Bounded actions: for any $a\in \cup_t \cD_t$, $\norm{a}_2 \le L$;
- Honest confidence intervals: With probability $1-\delta$, for all $t\in [n]$, $a\in \cD_t$, $\ip{a,\theta_*}\in [\UCB_t(a)-2\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}},\UCB_t(a)]$ where $(V_{t})_t$ are given by \eqref{eq:linbanditreggrammian}.

Note that the half-width $\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}}$ used in the assumption is the same as one gets when a confidence set $\cC_t$ is used which satisfies

\begin{align}

\cC_t \subset

\cE_t \doteq \{ \theta \in \R^d \,:\,

\norm{\theta-\hat \theta_{t-1}}_{ V_{t-1}}^2 \le \beta_{t-1} \}\,

\label{eq:ellconf}

\end{align}

with some $\hat \theta_{t-1}\in \R^d$.

Our first main result is as follows:

Theorem (Regret of UCB for Linear Bandits): Let the conditions listed above hold. Further, assume that $(\beta_t)_t$ is nondecreasing and $\beta_{n-1}\ge 1$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \ip{A_t,\theta_*}$ of UCB as defined by \eqref{eq:UCBgenjoint} satisfies

\begin{align*}

\hat R_n

\le \sqrt{ 8 n \beta_{n-1} \, \log \frac{\det V_{n}}{ \det V_0 } }

\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{\trace(V_0)+n L^2}{ d\det^{1/d} V_0 } }\,.

\end{align*}

Note that the condition on $(\beta_t)_t$ is not really restrictive because we can always arrange for it to hold at the price of increasing $\beta_1,\dots,\beta_n$. We also see from the result that to get an $\tilde O(\sqrt{n})$ regret bound one needs to show that $\beta_n$ has a polylogarithmic growth (here, $\tilde O(f(n))$ means $O( \log^p(n) f(n) )$ with some $p>0$). We can also get a bound on the (expected) regret $R_n$ if $\delta\le c/\sqrt{n}$ by combining the bound of the theorem with $\hat R_n \le 2n$, which follows from our assumption that the magnitude of the immediate reward is bounded by one.
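To spell out the last remark, here is a short worked calculation. It assumes that $\beta_{n-1}$ is deterministic, so that the second bound of the theorem, which we denote by $B_n$ (a symbol introduced only for this illustration), is a constant. Let $G$ be the event on which the bound of the theorem holds, so that its complement $G^c$ has probability at most $\delta$. Since the noise has zero mean, $R_n = \EE{\hat R_n}$, and since $0\le \hat R_n \le 2n$,

\begin{align*}

R_n = \EE{\hat R_n \one{G}} + \EE{\hat R_n \one{G^c}} \le B_n + 2n\delta \le B_n + 2c\sqrt{n}\,.

\end{align*}

The extra term is of order $\sqrt{n}$, so it does not change the rate.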

**Proof**: It suffices to prove the bound on the event when the confidence intervals contain the expected rewards of all the actions available in all the respective rounds. Hence, in the remainder of the proof we will assume that this holds. Let $r_t = \max_{a\in \cD_t} \ip{a,\theta_*} – \ip{A_t,\theta_*}$ be the immediate pseudo-regret suffered in round $t\in [n]$.

Let $A_t^* = \argmax_{a\in \cD_t} \ip{a,\theta_*}$ be an optimal action for round $t$. Thanks to the choice of $A_t$, $\ip{A_t^*,\theta_*}\le \UCB_t(A_t^*) \le \UCB_t(A_t)$. By assumption $\ip{A_t,\theta_*}\ge \UCB_t(A_t)-2\beta_{t-1}^{1/2} \norm{A_t}_{V_{t-1}^{-1}}$. Combining these inequalities we get

\begin{align*}

r_t \le 2 \sqrt{\beta_{t-1}} \norm{A_t}_{V_{t-1}^{-1}}\,.

\end{align*}

By the assumption that the absolute value of the mean reward is bounded by one we also get that $r_t\le 2$. This combined with $\beta_{n-1} \ge \max(\beta_{t-1},1)$ gives

\begin{align*}

r_t

& \le 2 \wedge 2 \sqrt{\beta_{t-1}} \norm{A_t}_{V_{t-1}^{-1}}

\le 2 \sqrt{\beta_{n-1}} (1 \wedge \norm{A_t}_{V_{t-1}^{-1}})\,,

\end{align*}

where, as before, $a\wedge b = \min(a,b)$.

Jensen’s inequality, or Cauchy-Schwarz, gives $\hat R_n = \sum_t r_t \le \sqrt{n \sum_t r_t^2 }$. So it remains to bound $\sum_t r_t^2$. This is done by applying the following technical lemma:

Lemma (Elliptical Potential): Let $x_1,\dots,x_n\in \R^d$, $V_t = V_0 + \sum_{s=1}^t x_s x_s^\top$, $t\in [n]$, $v_0 = \trace(V_0)$ and $L\ge \max_t \norm{x_t}_2$. Then,

\begin{align*}

\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \log \frac{\det V_{n}}{\det V_0} \le 2d \log \frac{v_0+n L^2}{d\det^{1/d} V_0}\,.

\end{align*}

The idea underlying the proof of this lemma is to first use that $u\wedge 1 \le 2 \ln(1+u)$ and then use an algebraic identity to prove that the sum of the logarithmic terms is equal to $\log \frac{\det V_n}{\det V_0}$. Then one can bound $\log \det V_{n}$ by noting that $\det V_{n}$ is the product of the eigenvalues of $V_{n}$, while $\trace V_{n}$ is the sum of the eigenvalues and then using the AM-GM inequality. The full proof is given at the end of the post.

Applying the bound and putting everything together gives the bound stated in the theorem.

QED.

In the next post we will look into how to construct tight ellipsoidal confidence sets. The theorem just proven will then be used to derive a regret bound.

# Proof of the Elliptical Potential Lemma

Recall that we want to prove that

\begin{align*}

\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \log \frac{\det V_{n}}{\det V_0} \le 2d \log \frac{v_0+n L^2}{d\det^{1/d} V_0}\,,

\end{align*}

where $x_1,\dots,x_n\in \R^d$, $V_t = V_0+\sum_{s=1}^t x_s x_s^\top$, $t\in [n]$, $v_0=\trace V_0$, and $\max_t \norm{x_t}_2\le L$.

Using that for any $u\ge 0$, $u \wedge 1 \le 2 \ln(1+u)$, we get

\begin{align*}

\sum_{t=1}^n 1 \wedge \norm{x_t}_{V_{t-1}^{-1}}^2 \le 2 \sum_t \log(1+\norm{x_t}_{V_{t-1}^{-1}}^2)\,.

\end{align*}

Now, we argue that the sum on the right-hand side equals $\log \frac{\det V_{n}}{\det V_0}$. For $t \ge 1$ we have $V_{t} = V_{t-1} + x_t x_t^\top = V_{t-1}^{1/2}(I+V_{t-1}^{-1/2} x_t x_t^\top V_{t-1}^{-1/2}) V_{t-1}^{1/2}$. Hence, $\det V_t = \det(V_{t-1}) \det (I+V_{t-1}^{-1/2} x_t x_t^\top V_{t-1}^{-1/2})$. Now, it is easy to check that the only eigenvalues of $I + y y^\top$ are $1+\norm{y}_2^2$ and $1$ (this is because this matrix is symmetric, hence its eigenvectors are orthogonal to each other, and $y$ is an eigenvector with eigenvalue $1+\norm{y}_2^2$), so $\det(I+yy^\top) = 1+\norm{y}_2^2$. Putting things together and using $\norm{V_{t-1}^{-1/2} x_t}_2^2 = \norm{x_t}^2_{V_{t-1}^{-1}}$, we see that $\det V_{n} = \det (V_0) \prod_{t=1}^{n} (1+ \norm{x_t}^2_{V_{t-1}^{-1}})$, which, after taking logarithms, gives the first inequality that we wanted to prove. To get the second inequality note that by the AM-GM inequality, $\det V_n = \prod_{i=1}^d \lambda_i \le (\frac1d \trace V_n)^d \le ((v_0+n L^2)/d)^d$, where $\lambda_1,\dots,\lambda_d$ are the eigenvalues of $V_n$.
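The steps of the proof are easy to check numerically. The sketch below uses randomly generated vectors and $V_0=\lambda I$ (both arbitrary choices made for illustration) and verifies the determinant identity, the bound on the elliptical potential by $2\log(\det V_n/\det V_0)$, and the trace bound $\log(\det V_n/\det V_0) \le d \log\frac{v_0+nL^2}{d \det^{1/d} V_0}$ that follows from the AM-GM step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 200, 1.0
xs = rng.normal(size=(n, d))
L = np.linalg.norm(xs, axis=1).max()

V0 = lam * np.eye(d)
V = V0.copy()
potential, log_terms = 0.0, 0.0
for x in xs:
    q = x @ np.linalg.solve(V, x)    # squared V_{t-1}^{-1}-norm of x_t
    potential += min(1.0, q)
    log_terms += np.log1p(q)
    V += np.outer(x, x)              # V_t = V_{t-1} + x_t x_t^T

log_det_ratio = np.linalg.slogdet(V)[1] - np.linalg.slogdet(V0)[1]
trace_bound = d * np.log((np.trace(V0) + n * L**2) / (d * np.linalg.det(V0) ** (1 / d)))
print(potential, 2 * log_det_ratio, 2 * trace_bound)
```

The gap between the last two printed numbers shows how much is lost by replacing the determinant with the worst-case trace bound.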

# References

The literature of linear bandits is rather large. Here we restrict ourselves to the works most relevant to this post.

Stochastic linear bandits were introduced into the machine learning literature by Abe and Long:

- Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.

(An earlier version of this paper appeared in 1999.)

The first paper to consider UCB-style algorithms is by Peter Auer:

- Peter Auer: Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research. 3:397-422, 2002.

This paper considered the case when the number of actions is finite. The core ideas of the analysis of optimistic algorithms (and more) are already present in this paper.

An algorithm that is based on a confidence ellipsoid is described in the paper by Varsha Dani, Thomas Hayes and Sham Kakade:

- V. Dani, T. Hayes and S. Kakade: Stochastic Linear Optimization under Bandit Feedback, COLT-2008.

The regret analysis presented here, just like our discussion of the computational questions (e.g., the use of $\ell^1$-confidence balls) is largely based on this paper. The paper also stresses that an expected regret of $\tilde O( d \sqrt{n} )$ can be achieved regardless of the shape of the decision sets $\cD_t$ as long as the immediate reward belongs to a bounded interval (this also holds for the result presented here). They also give a lower bound for the expected regret which shows that in the worst case the regret must scale with $\Omega(\min(n,d \sqrt{n}))$. In a later post we will give an alternate construction.

A variant of SupLinRel that is based on ridge regression (as opposed to LinRel, which is based on truncating the smallest eigenvalue of the Grammian) is described in

- Wei Chu, Lihong Li, Lev Reyzin, Robert E. Schapire: Contextual Bandits with Linear Payoff Functions, AISTATS, pp. 208-214, 2011.

The authors of this paper call the UCB algorithm described in this post LinUCB, while the previous paper calls an essentially identical algorithm OFUL (after optimism in the face of uncertainty for linear bandits).

The paper by Paat Rusmevichientong and John N. Tsitsiklis considers both optimistic and explore-then-commit strategies (which they call PEGE):

- Paat Rusmevichientong, John N. Tsitsiklis: Linearly Parameterized Bandits, MOR, 35:395-411

PEGE is shown to be optimal up to logarithmic factors for the unit ball.

The observation that explore-then-commit works for the unit ball and in general for sets with a smooth boundary was independently made in

- Yasin Abbasi-Yadkori, Andras Antos and Csaba Szepesvari: Forced-Exploration Based Algorithms for Playing in Stochastic Linear Bandits, COLT Workshop on Online Learning with Limited Feedback, 2009

and was also included in the Masters thesis of Yasin Abbasi-Yadkori.

# Notes

Note: There is an entire line of research studying bandits with smooth reward functions, contextual or not.

Note: Nonlinear structured bandits, where the payoff function belongs to a known set, have also been studied.

Note: It was mentioned that the feature map $\psi$ may map its arguments to an infinite dimensional space. The question then is whether computationally efficient methods exist at all. The answer is yes when $\Psi$ is equipped with an inner product $\ip{\cdot,\cdot}$ such that for any context-action pairs $(c,a),(c',a')$ we have $\ip{\psi(c,a),\psi(c',a')} = \kappa( (c,a), (c',a') )$ for some **kernel function** $\kappa: (\cC\times [K]) \times (\cC\times [K]) \to \R$ that can quickly be evaluated for any arguments of interest. Learning algorithms that use such an implicit calculation of inner products by means of evaluating some “kernel function” are said to use the “kernel trick”. Such methods in fact never calculate $\psi$, which is only implicitly defined in terms of the kernel function $\kappa$. The choice of the kernel function is in this way analogous to choosing a map $\psi$.

# Lower Bounds for Stochastic Linear Bandits

Lower bounds for linear bandits turn out to be more nuanced than the finite-armed case. The big difference is that for linear bandits the shape of the action-set plays a role in the form of the regret, not just the distribution of the noise. This should not come as a big surprise because the stochastic finite-armed bandit problem can be modeled as a linear bandit with actions being the standard basis vectors, $\cA = \set{e_1,\ldots,e_K}$. In this case the actions are orthogonal, which means that samples from one action do not give information about the rewards for other actions. Other action sets such as the sphere ($\cA = S^d = \{x \in \R^d : \norm{x}_2 = 1\}$) do not share this property. For example, if $d = 2$ and $\cA = S^d$ and an algorithm chooses actions $e_1 = (1,0)$ and $e_2 = (0,1)$ many times, then it can deduce the reward it would obtain from choosing any other action.
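The information sharing on the sphere can be sketched in a few lines. The parameter `theta` and the angle below are hypothetical values chosen for illustration, and noise is ignored: the sketch only shows that the mean rewards of $e_1$ and $e_2$ determine the mean reward of every other action by linearity.

```python
import math

theta = (0.3, -0.7)  # hypothetical parameter, unknown to the learner

def mean_reward(x, theta):
    """Mean reward <x, theta> of action x."""
    return sum(xi * ti for xi, ti in zip(x, theta))

# Mean rewards of the two basis actions (what many plays of e_1 and e_2 reveal).
r1 = mean_reward((1.0, 0.0), theta)
r2 = mean_reward((0.0, 1.0), theta)

# Any other point on the unit circle is a linear combination of e_1 and e_2,
# so its mean reward follows without ever playing it.
angle = 1.1
x = (math.cos(angle), math.sin(angle))
predicted = x[0] * r1 + x[1] * r2
```

With the standard basis as the action set no such deduction is possible, because no action is a combination of the others.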

We will prove a variety of lower bounds under different assumptions. The first three have a worst-case flavor showing what is (not) achievable in general, or under a sparsity constraint, or if the realizable assumption is not satisfied. All of these are proven using the same information-theoretic tools that we have seen for previous results, combined with careful choices of action sets and environment classes. The difficulty is always in guessing what is the worst case, which is followed by simply turning the cranks on the usual machinery.

Besides the worst-case results we also give an optimal asymptotic lower bound for finite action sets that generalizes the asymptotic lower bound for finite-armed stochastic bandits given in a previous post. The proof of this result is somewhat more technical, but follows the same general flavor as the previous asymptotic lower bounds.

We use a simple model with Gaussian noise. For action $A_t \in \cA \subseteq \R^d$ the reward is $X_t = \shortinner{A_t, \theta} + \eta_t$ where $\eta_t\sim \mathcal N(0,1)$ is the standard Gaussian noise term and $\theta \in \Theta \subset \R^d$. The regret of a strategy is:

\begin{align*}

R_n(\cA, \theta) = \E\left[\sum_{t=1}^n \shortinner{x^* - A_t, \theta}\right]\,,

\end{align*}

where $x^* = \argmax_{x \in \cA} \shortinner{x, \theta}$ is the optimal action. Note that the arguments to the regret function differ from those used in some previous posts. In general we include the quantities of interest, which in this case (with a fixed strategy and noise model) are the action set and the unknown parameter. As for finite-armed bandits we define the sub-optimality gap of arm $x \in \cA$ by $\Delta_x = \max_{y \in \cA} \shortinner{y - x, \theta}$ and $\Delta_{\min} = \inf \set{\Delta_x : x \in \cA \text{ and } \Delta_x > 0}$. Note that the latter quantity can be zero if $\cA$ is infinitely large, but is non-zero if there are finitely many arms and the problem is non-trivial (there exists a sub-optimal arm). If $A$ is a matrix, then $\lambda_{\min}(A)$ is its smallest eigenvalue. We also recall the notation used for finite-armed bandits by defining $T_x(t) = \sum_{s=1}^t \one{A_s = x}$.
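For a finite action set these quantities are easy to compute. The sketch below is only an illustration: the function names, the three actions, the parameter and the pull counts are all made up, and `regret_from_counts` uses the decomposition of the regret into $\sum_x \E[T_x(n)] \Delta_x$ that appears later in the asymptotic proof.

```python
def inner(u, v):
    """Euclidean inner product."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gaps(actions, theta):
    """Sub-optimality gap Delta_x for every action x (actions are tuples)."""
    best = max(inner(x, theta) for x in actions)
    return {x: best - inner(x, theta) for x in actions}

def min_gap(actions, theta):
    """Delta_min: the smallest strictly positive gap, or None if every action is optimal."""
    positive = [g for g in gaps(actions, theta).values() if g > 0]
    return min(positive) if positive else None

def regret_from_counts(actions, theta, counts):
    """Regret sum_x T_x(n) * Delta_x for given (expected) pull counts T_x(n)."""
    g = gaps(actions, theta)
    return sum(counts[x] * g[x] for x in actions)

# Hypothetical example: three actions in R^2.
actions = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
theta = (1.0, 0.0)
counts = {(1.0, 0.0): 90, (0.0, 1.0): 2, (0.5, 0.5): 8}
example_gaps = gaps(actions, theta)
example_min_gap = min_gap(actions, theta)
example_regret = regret_from_counts(actions, theta, counts)
```

In this example the first action is optimal, so only the ten pulls of the other two actions contribute to the regret.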

# Worst case bounds

Our worst case bound relies on a specific action set and shows that the $\tilde O(d \sqrt{n})$ upper bound for the linear version of UCB cannot be improved in general except (most likely) for the logarithmic factors.

Theorem

Let the action set be $\mathcal A = \set{-1, 1}^d$. Then for any strategy there exists a $\theta \in \Theta = \set{-\sqrt{1/n}, \sqrt{1/n}}^d$ such that

\begin{align*}

R_n(\cA, \theta) \geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.

\end{align*}

**Proof**

For $\theta \in \Theta$ we denote by $\mathbb{P}_\theta$ the measure on outcomes $A_1, X_1,\ldots,A_n,X_n$ induced by the interaction of the strategy and the bandit determined by $\theta$. By the relative entropy identity we have for $\theta, \theta' \in \Theta$ that

\begin{align}

\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{t=1}^n \E\left[\shortinner{A_t, \theta - \theta'}^2\right]\,.

\label{eq:linear-kl}

\end{align}

For $1 \leq i \leq d$ and $\theta \in \Theta$ define

\begin{align*}

p_{\theta,i} = \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2\right)\,.

\end{align*}

Now let $1 \leq i \leq d$ and $\theta \in \Theta$ be fixed and let $\theta' = \theta$ except for $\theta'_i = -\theta_i$. Then by the high probability version of Pinsker’s inequality and (\ref{eq:linear-kl}) we have

\begin{align}

\label{eq:sphere-kl}

p_{\theta,i} + p_{\theta',i}

&\geq \frac{1}{2} \exp\left(-\frac{1}{2}\sum_{t=1}^n \shortinner{A_t, \theta - \theta'}^2\right)

= \frac{1}{2} \exp\left(-2\right)\,.

\end{align}

Therefore using the notation $\sum_{\theta_{-i}}$ as an abbreviation for $\sum_{\theta_1,\ldots,\theta_{i-1},\theta_{i+1},\ldots,\theta_d \in \set{\pm \sqrt{1/n}}^{d-1}}$,

\begin{align*}

\sum_{\theta \in \Theta} 2^{-d} \sum_{i=1}^d p_{\theta,i}

&= \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \sum_{\theta_i \in \set{\pm \sqrt{1/n}}} p_{\theta,i} \\

&\geq \sum_{i=1}^d \sum_{\theta_{-i}} 2^{-d} \cdot \frac{1}{2}\exp\left(-2\right) \\

&= \frac{d}{4} \exp\left(-2\right)\,.

\end{align*}

Therefore there exists a $\theta \in \Theta$ such that

\begin{align*}

\sum_{i=1}^d p_{\theta,i} \geq \frac{d}{4} \exp\left(-2\right)\,.

\end{align*}

Let $x^* = \argmax_{x \in \cA} \shortinner{x, \theta}$. Then by the definition of $p_{\theta,i}$, the regret for this choice of $\theta$ is at least

\begin{align*}

R_n(\cA, \theta)

&= \sum_{t=1}^n \E_\theta\left[\shortinner{x^* - A_t, \theta}\right] \\

&= 2\sqrt{\frac{1}{n}} \sum_{i=1}^d \sum_{t=1}^n \mathbb{P}_\theta\left(\sign(A_{t,i}) \neq \sign(\theta_i)\right) \\

&\geq \sqrt{n} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \one{\sign(A_{t,i}) \neq \sign(\theta_i)} \geq n/2 \right) \\

&= \sqrt{n} \sum_{i=1}^d p_{\theta,i}

\geq \frac{\exp(-2)}{4} \cdot d \sqrt{n}\,.

\end{align*}

QED
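The constant $\exp(-2)$ in the proof comes from the fact that flipping the sign of one coordinate of $\theta$ changes the mean reward of every action by exactly $2\sqrt{1/n}$ in absolute value, so the relative entropy does not depend on the strategy. The following sketch checks this by brute force over the hypercube for small $d$. The function name and the values of $d$, $n$ and $i$ are illustrative choices.

```python
import itertools
import math

def kl_between_flipped(d, n, i):
    """Relative entropy between the bandits theta and theta' (theta with coordinate i flipped).

    Enumerates all actions in {-1, 1}^d and checks that n * <a, theta - theta'>^2
    is the same for every action, so the sum over rounds is strategy-independent.
    """
    scale = math.sqrt(1.0 / n)
    theta = [scale] * d
    theta_prime = list(theta)
    theta_prime[i] = -theta_prime[i]
    diff = [a - b for a, b in zip(theta, theta_prime)]
    scaled_squares = set()
    for action in itertools.product((-1, 1), repeat=d):
        sq = sum(a * c for a, c in zip(action, diff)) ** 2
        scaled_squares.add(round(n * sq, 9))
    if len(scaled_squares) != 1:
        raise ValueError("squared difference depends on the action")
    # KL = (1/2) * sum over n rounds of the per-round squared difference.
    return 0.5 * scaled_squares.pop()

kl_example = kl_between_flipped(d=4, n=100, i=2)
pinsker_bound = 0.5 * math.exp(-kl_example)
```

The value `pinsker_bound` is the right-hand side of the high probability Pinsker inequality used in the proof.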

# Sparse case

We now tackle the sparse case where the underlying parameter $\theta$ is assumed to have $\norm{\theta}_0 = \sum_{i=1}^d \one{\theta_i \neq 0} \leq p$ for some $p$ that is usually much smaller than $d$. An extreme case is when $p = 1$, which essentially reduces to the finite-armed bandit problem, where we have seen that the regret is at least $\Omega(\sqrt{dn})$ in the worst case. For this reason we cannot expect too much from sparsity. It turns out that the best one can hope for (in the worst case) is $\Omega(\sqrt{p d n})$, so again the lower bound nearly matches the upper bound for an existing algorithm.

Theorem

Let $2 \leq p\leq d$ be even and define the set of actions $\cA$ by

\begin{align*}

\mathcal A = \set{ x \in \set{0, 1}^d : \sum_{i=1}^d \one{x_i > 0} = \frac{p}{2}}.

\end{align*}

Then for any strategy there exists a $\theta$ with $\norm{\theta}_0 \leq p$ such that

\begin{align*}

R_n \geq \frac{\sqrt{2d p n}}{16} \exp(-1)\,.

\end{align*}

The assumption that $p$ is even is non-restrictive, since in case it is not even the following proof goes through for $p - 1$ and the regret only changes by a very small constant factor. The proof relies on a slightly different construction than the previous result, and is fractionally more complicated because of it.

**Proof**

Let $\epsilon = \sqrt{2d/(p n)}$ and $\theta$ be given by

\begin{align*}

\theta_i = \begin{cases}

\epsilon\,, & \text{if } i \leq p / 2\,; \\

0\,, & \text{otherwise}\,.

\end{cases}

\end{align*}

Given $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ define $\theta'$ by

\begin{align*}

\theta'_i = \begin{cases}

\epsilon\,, & \text{if } i \leq p / 2\,; \\

2\epsilon\,, & \text{if } i \in S \,; \\

0\,, & \text{otherwise}\,.

\end{cases}

\end{align*}

Let $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ be the measures on the sequence of observations when a fixed strategy interacts with the bandits induced by $\theta$ and $\theta'$ respectively. Then the usual computation shows that

\begin{align*}

\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 2\epsilon^2 \E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right]\,.

\end{align*}

By the pigeonhole principle we can choose an $S \subseteq \{p/2+1,\ldots,d\}$ with $|S| = p/2$ in such a way that

\begin{align*}

\E\left[\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} \neq 0}\right] \leq \frac{n p}{d}\,.

\end{align*}

Therefore using this $S$ and the high probability version of Pinsker’s inequality we have for any event $A$ that

\begin{align*}

\mathbb{P}_\theta(A) + \mathbb{P}_{\theta'}(A^c) \geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'})\right)

\geq \frac{1}{2} \exp\left(-\frac{np\epsilon^2}{2d}\right)

\geq \frac{1}{2} \exp\left(-1\right)

\end{align*}

Choosing $A = \set{\sum_{t=1}^n \sum_{i \in S} \one{A_{ti} > 0} \geq np/4}$ leads to

\begin{align*}

R_n(\mathcal A, \theta) &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta}(A) &

R_n(\mathcal A, \theta') &\geq \frac{n\epsilon p}{4} \mathbb{P}_{\theta'}(A^c)\,.

\end{align*}

Therefore

\begin{align*}

\max\set{R_n(\cA, \theta),\, R_n(\cA, \theta')}

\geq \frac{n\epsilon p}{8} \left(\mathbb{P}_{\theta}(A) + \mathbb{P}_{\theta'}(A^c)\right)

\geq \frac{\sqrt{2ndp}}{16} \exp(-1)\,.

\end{align*}

QED

# Unrealizable case

An important generalization of the linear model is the **unrealizable** case where the mean rewards are not assumed to follow a linear model exactly. Suppose that $\cA \subset \R^d$ and that the mean reward $\E[X_t|A_t = x] = \mu_x$ does not necessarily satisfy a linear model. It would be very pleasant to have an algorithm such that if $\mu_x = \shortinner{x, \theta}$ for all $x$, then

\begin{align*}

R_n(\cA, \mu) = \tilde O(d \sqrt{n})\,,

\end{align*}

while if there exists an $x \in \cA$ such that $\mu_x \neq \shortinner{x, \theta}$, then $R_n(\cA, \mu) = \tilde O(\sqrt{nK})$ recovers the UCB bound. That is, an algorithm that enjoys the bound of OFUL if the linear model is correct, but recovers the regret of UCB otherwise. Of course one could hope for something even stronger, for example that

\begin{align}

R_n(\cA, \mu) = \tilde O\left(\min\set{\sqrt{Kn},\, d\sqrt{n} + n\epsilon}\right)\,, \label{eq:hope}

\end{align}

where $\epsilon = \min_{\theta \in \R^d} \max_{x \in \cA} |\mu_x - \shortinner{x, \theta}|$ is called the **approximation error** of the class of linear models. Unfortunately it turns out that results of this kind are not achievable. To show this we will prove a generic bound for the classical finite-armed bandit problem, and then show how this implies a lower bound on the ability to be adaptive to a linear model if possible and have acceptable regret if not.

Theorem

Let $\cA = \set{e_1,\ldots,e_K}$ be the standard basis vectors. Now define sets $\Theta, \Theta' \subset \R^{K}$ by

\begin{align*}

\Theta &= \set{\theta \in [0,1]^K : \theta_i = 0 \text{ for } i > 1} \\

\Theta' &= \set{\theta \in [0,1]^K}\,.

\end{align*}

If $2(K-1) \leq V \leq \sqrt{n(K-1)\exp(-2)/8}$ and $\sup_{\theta \in \Theta} R_n(\cA, \theta) \leq V$, then

\begin{align*}

\sup_{\theta' \in \Theta'} R_n(\cA, \theta') \geq \frac{n(K-1)}{8V} \exp(-2)\,.

\end{align*}

**Proof**

Let $\theta \in \Theta$ be given by $\theta_1 = \Delta = (K-1)/V \leq 1/2$. Therefore

\begin{align*}

\sum_{i=2}^K \E[T_i(n)] \leq \frac{V}{\Delta}

\end{align*}

and so by the pigeonhole principle there exists an $i > 1$ such that

\begin{align*}

\E[T_i(n)] \leq \frac{V}{(K-1)\Delta} = \frac{1}{\Delta^2}\,.

\end{align*}

Then define $\theta' \in \Theta'$ by

\begin{align*}

\theta'_j = \begin{cases}

\Delta & \text{if } j = 1 \\

2\Delta & \text{if } j = i \\

0 & \text{otherwise}\,.

\end{cases}

\end{align*}

Then by the usual argument for any event $A$ we have

\begin{align*}

\mathbb{P}_\theta(A) + \mathbb{P}_{\theta'}(A^c)

\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}_\theta, \mathbb{P}_{\theta'})\right)

= \frac{1}{2} \exp\left(-2 \Delta^2 \E[T_i(n)]\right)

\geq \frac{1}{2} \exp\left(-2\right)\,.

\end{align*}

Therefore

\begin{align*}

R_n(\mathcal A, \theta) + R_n(\mathcal A, \theta')

\geq \frac{n\Delta}{4} \exp(-2) = \frac{n(K-1)}{4V} \exp(-2)

\end{align*}

Therefore by the assumption that $R_n(\mathcal A, \theta) \leq V \leq \sqrt{n(K-1) \exp(-2)/8}$ we have

\begin{align*}

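R_n(\mathcal A, \theta') \geq \frac{n(K-1)}{8V} \exp(-2)\,.

\end{align*}

Since $\theta' \in \Theta'$, this lower bounds $\sup_{\theta' \in \Theta'} R_n(\cA, \theta')$ as required. Put differently, the worst-case regret over $\Theta$ and the worst-case regret over $\Theta'$ cannot both be small: whenever the former is at most $V$, the product of $V$ and the latter is at least $\frac{n(K-1)}{8} \exp(-2)$.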

QED
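To get a feel for the trade-off in the theorem, the following sketch evaluates the lower bound for a given budget $V$ on the regret over $\Theta$. The function name and the values of $n$, $K$ and $V$ are illustrative assumptions, and the function returns `None` when the conditions on $V$ in the theorem are not met.

```python
import math

def unrealizable_lower_bound(n, K, V):
    """Lower bound n(K-1)exp(-2)/(8V) on the worst-case regret over Theta'.

    Only valid when 2(K-1) <= V <= sqrt(n(K-1)exp(-2)/8), as in the theorem;
    returns None otherwise.
    """
    upper = math.sqrt(n * (K - 1) * math.exp(-2) / 8)
    if not (2 * (K - 1) <= V <= upper):
        return None
    return n * (K - 1) * math.exp(-2) / (8 * V)

# Hypothetical numbers: a strategy whose regret over Theta is at most sqrt(n).
n, K = 10**10, 10001
V = math.sqrt(n)
bound = unrealizable_lower_bound(n, K, V)
ucb_scale = math.sqrt(n * K)  # the sqrt(nK) rate of UCB, ignoring logarithmic factors
```

For these numbers `bound` exceeds `ucb_scale`: a strategy that is this good on the realizable class must, on some instance in $\Theta'$, pay more than the $\sqrt{nK}$ scale. Because of the small constant $\exp(-2)/8$ the effect only becomes visible when $K$ is in the thousands.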

As promised we now relate this to the unrealizable linear bandits. Suppose that $d = 1$ (an absurd case) and that there are $K$ arms $\cA = \set{x_1, x_2,\ldots, x_{K}}$ where $x_1 = (1)$ and $x_i = (0)$ for $i > 1$. Clearly if the reward is really linear and $\theta > 0$, then the first arm is optimal, while all of the other arms have the same expected reward (of just $0$). Now simply add $K-1$ coordinates to each action so that the error in the 1-dimensional linear model can be modeled in a higher dimension and we have exactly the model used in the previous theorem. So $\cA = \set{e_1,e_2,\ldots,e_K}$. Then the theorem shows that (\ref{eq:hope}) is a pipe dream. If $R_n(\cA, \theta) = \tilde O(\sqrt{n})$ for all $\theta \in \Theta$ (the realizable case), then there exists a $\theta' \in \Theta'$ such that $R_n(\cA, \theta') = \tilde \Omega(K \sqrt{n})$. To our knowledge it is still an open question what is possible on this front. Our conjecture is that there is an algorithm for which

\begin{align*}

R_n(\cA, \theta) = \tilde O\left(\min\set{d\sqrt{n} + \epsilon n,\, \frac{K}{d}\sqrt{n}}\right)\,.

\end{align*}

In fact, it is not hard to design an algorithm that tries to achieve this bound by assuming the problem is realizable, but using some additional time to explore the remaining arms up to some accuracy to confirm the hypothesis. We hope to write a post on this in the future, but leave the claim as a conjecture for now.

# Asymptotic lower bounds

Like in the finite-armed case, the asymptotic result is proven only for *consistent* strategies. Recall that a strategy is consistent in some class if the regret is sub-polynomial for any bandit in that class.

Theorem

Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and suppose a strategy satisfies

\begin{align*}

\text{for all } \theta \in \R^d \text{ and } p > 0 \qquad R_n(\cA, \theta) = o(n^p)\,.

\end{align*}

Let $\theta \in \R^d$ be any parameter such that there is a unique optimal action and let $\bar G_n = \E_\theta \left[\sum_{t=1}^n A_t A_t^\top\right]$ be the expected Gram matrix when the strategy interacts with the bandit determined by $\theta$. Then $\liminf_{n\to\infty} \lambda_{\min}(\bar G_n) / \log(n) > 0$ (which implies that $\bar G_n$ is eventually non-singular). Furthermore, for any $x \in \cA$ it holds that:

\begin{align*}

\limsup_{n\to\infty} \log(n) \norm{x}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.

\end{align*}

The reader should recognize $\norm{x}_{\bar G_n^{-1}}^2$ as the key term in the width of the confidence interval for the least squares estimator. This is quite intuitive. The theorem is saying that any consistent algorithm must prove statistically that all sub-optimal arms are indeed sub-optimal. Before the proof of this result we give a corollary that characterizes the asymptotic regret that must be endured by any consistent strategy.

Corollary

Let $\cA \subset \R^d$ be a finite set that spans $\R^d$ and $\theta \in \R^d$ be such that there is a unique optimal action. Then for any consistent strategy

\begin{align*}

\liminf_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} \geq c(\cA, \theta)\,,

\end{align*}

where $c(\cA, \theta)$ is defined as

\begin{align*}

&c(\cA, \theta) = \inf_{\alpha \in [0,\infty)^{\cA}} \sum_{x \in \cA} \alpha(x) \Delta_x \\

&\quad\text{ subject to } \norm{x}_{H_\alpha^{-1}}^2 \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA \text{ with } \Delta_x > 0\,,

\end{align*}

where $H_\alpha = \sum_{x \in \cA} \alpha(x) x x^\top$.

**Proof of Theorem**

The proof of the first part is simply omitted (see the reference below for details). It follows along similar lines to what follows, essentially that if $\bar G_n$ is not “sufficiently large” in every direction, then some alternative parameter is not sufficiently identifiable. Let $\theta' \in \R^d$ be an alternative parameter and let $\mathbb{P}$ and $\mathbb{P}'$ be the measures on the sequence of outcomes $A_1,X_1,\ldots,A_n,X_n$ induced by the interaction between the strategy and the bandit determined by $\theta$ and $\theta'$ respectively. Then for any event $E$ we have

\begin{align}

\Prob{E} + \mathbb{P}'(E^c)

&\geq \frac{1}{2} \exp\left(-\KL(\mathbb{P}, \mathbb{P}')\right) \nonumber \\

&= \frac{1}{2} \exp\left(-\frac{1}{2} \E\left[\sum_{t=1}^n \inner{A_t, \theta - \theta'}^2\right]\right)

= \frac{1}{2} \exp\left(-\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2\right)\,. \label{eq:linear-asy-kl}

\end{align}

A simple re-arrangement shows that

\begin{align*}

\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2 \geq \log\left(\frac{1}{2 \Prob{E} + 2 \mathbb{P}'(E^c)}\right)\,.

\end{align*}

Now we follow the usual plan of choosing $\theta'$ to be close to $\theta$, but so that the optimal action in the bandit determined by $\theta'$ is not $x^*$. Let $x \in \cA$ be a sub-optimal action, $\epsilon \in (0, \Delta_{\min})$ and $H$ be a positive semi-definite matrix to be chosen later such that $\norm{x - x^*}_H^2 > 0$. Then define

\begin{align*}

\theta' = \theta + \frac{\Delta_x + \epsilon}{\norm{x - x^*}^2_H} H(x - x^*)\,,

\end{align*}

which is chosen so that

\begin{align*}

\shortinner{x - x^*, \theta'} = \shortinner{x - x^*, \theta} + \Delta_x + \epsilon = \epsilon\,.

\end{align*}

This means that $x^*$ is not the optimal action for bandit $\theta'$, and in fact is $\epsilon$-suboptimal. We abbreviate $R_n = R_n(\cA, \theta)$ and $R_n' = R_n(\cA, \theta')$. Then

\begin{align*}

R_n

&= \E_\theta\left[\sum_{x \in \cA} T_x(n) \Delta_x\right]

\geq \frac{n\Delta_{\min}}{2} \Prob{T_{x^*}(n) < n/2}
\geq \frac{n\epsilon}{2} \Prob{T_{x^*}(n) < n/2}\,.
\end{align*}
Similarly, $x^*$ is at least $\epsilon$-suboptimal in bandit $\theta'$ so that
\begin{align*}
R_n' \geq \frac{n\epsilon}{2} \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right)\,.
\end{align*}
Therefore
\begin{align}
\Prob{T_{x^*}(n) < n/2} + \mathbb{P}'\left(T_{x^*}(n) \geq n/2\right) \leq \frac{2}{n\epsilon} \left(R_n + R_n'\right)\,. \label{eq:regret-sum}
\end{align}
Note that this holds for practically any choice of $H$ as long as $\norm{x - x^*}_H > 0$. The logical next step is to select $H$ (which determines $\theta'$) in such a way that (\ref{eq:linear-asy-kl}) is as large as possible. The main difficulty is that this depends on $n$, so instead we aim to choose an $H$ so the quantity is large enough infinitely often. We start by just re-arranging things:

\begin{align*}

\frac{1}{2} \norm{\theta - \theta'}_{\bar G_n}^2

= \frac{(\Delta_x + \epsilon)^2}{2} \cdot \frac{\norm{x - x^*}_{H \bar G_n H}^2}{\norm{x-x^*}_H^4}

= \frac{(\Delta_x + \epsilon)^2}{2 \norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H)\,,

\end{align*}

where we introduced

\begin{align*}

\rho_n(H) = \frac{\norm{x - x^*}_{\bar G_n^{-1}}^2 \norm{x - x^*}_{H \bar G_n H}^2}{\norm{x - x^*}_H^4}\,.

\end{align*}

Therefore by choosing $E$ to be the event that $T_{x^*}(n) < n/2$ and using (\ref{eq:regret-sum}) and (\ref{eq:linear-asy-kl}) we have
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq \log\left(\frac{n \epsilon}{4R_n + 4R_n'}\right)\,,
\end{align*}
which after re-arrangement leads to
\begin{align*}
\frac{(\Delta_x + \epsilon)^2}{2\log(n)\norm{x - x^*}_{\bar G_n^{-1}}^2} \rho_n(H) \geq 1 - \frac{\log((4R_n + 4R_n')/\epsilon)}{\log(n)}\,.
\end{align*}
The definition of consistency means that $R_n$ and $R_n'$ are both sub-polynomial, which implies that the second term in the previous expression tends to zero for large $n$ and so by sending $\epsilon$ to zero we see that
\begin{align}
\label{eq:lin-lower-liminf} \liminf_{n\to\infty} \frac{\rho_n(H)}{\log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2} \geq \frac{2}{\Delta_x^2}\,.
\end{align}
We complete the result using proof by contradiction. Suppose that
\begin{align}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 > \frac{\Delta_x^2}{2}\,. \label{eq:linear-lower-ass}

\end{align}

Then there exists an $\epsilon > 0$ and infinite set $S \subseteq \N$ such that

\begin{align*}

\log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 \geq \frac{(\Delta_x + \epsilon)^2}{2} \quad \text{ for all } n \in S\,.

\end{align*}

Therefore by (\ref{eq:lin-lower-liminf}),

\begin{align*}

\liminf_{n \in S} \rho_n(H) > 1\,.

\end{align*}

We now choose $H$ to be a cluster point of the sequence $\{\bar G_n^{-1} / \norm{\bar G_n^{-1}}\}_{n \in S}$ where $\norm{\bar G_n^{-1}}$ is the spectral norm of the matrix $\bar G_n^{-1}$. Such a point must exist, since matrices in this sequence have unit spectral norm by definition, and the set of matrices with unit spectral norm is compact. We let $S' \subseteq S$ be a subset so that $\bar G_n^{-1} / \norm{\bar G_n^{-1}}$ converges to $H$ on $n \in S'$. We now check that $\norm{x - x^*}_H > 0$:

\begin{align*}

\norm{x - x^*}_H^2 = \lim_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}}}{\norm{\bar G_n^{-1}}}

> 0\,,

\end{align*}

where the last inequality follows from the assumption in (\ref{eq:linear-lower-ass}) and the first part of the theorem. Therefore

\begin{align*}

1 < \liminf_{n \in S} \rho_n(H) \leq \liminf_{n \in S'} \frac{\norm{x - x^*}^2_{\bar G_n^{-1}} \norm{x - x^*}^2_{H \bar G_n H}}{\norm{x - x^*}_H^4} = 1\,,
\end{align*}
which is a contradiction, and so we conclude that (\ref{eq:linear-lower-ass}) does not hold and so
\begin{align*}
\limsup_{n\to\infty} \log(n) \norm{x - x^*}_{\bar G_n^{-1}}^2 \leq \frac{\Delta_x^2}{2}\,.
\end{align*}
QED
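The last step of the proof rests on the fact that $\rho_n(H) = 1$ whenever $H$ is proportional to $\bar G_n^{-1}$. Here is a small numerical sanity check in two dimensions. The matrix `G`, the vector `v` (standing in for $x - x^*$) and the scaling factor are arbitrary illustrative values, and the helpers only handle $2 \times 2$ matrices.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(A):
    """Inverse of an invertible 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def quad(v, A):
    """Quadratic form v^T A v, the squared A-weighted norm of v."""
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

def rho(v, G, H):
    """rho(H) = ||v||^2_{G^{-1}} * ||v||^2_{H G H} / ||v||^4_H."""
    return quad(v, inverse(G)) * quad(v, matmul(matmul(H, G), H)) / quad(v, H) ** 2

G = [[5.0, 1.0], [1.0, 2.0]]   # stands in for the expected Gram matrix
v = (1.0, -3.0)                # stands in for x - x^*
H_scaled = [[0.25 * entry for entry in row] for row in inverse(G)]
rho_scaled = rho(v, G, H_scaled)                       # H proportional to G^{-1}
rho_identity = rho(v, G, [[1.0, 0.0], [0.0, 1.0]])     # some other choice of H
```

By the Cauchy-Schwarz inequality $\rho_n(H) \geq 1$ for every admissible $H$, so choosing $H$ proportional to $\bar G_n^{-1}$ makes $\rho_n(H)$ as small as it can be, which is what produces the contradiction.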

We leave the proof of the corollary as an exercise for the reader. Essentially though, any consistent algorithm must choose its actions so that in expectation

\begin{align*}

\norm{x - x^*}^2_{\bar G_n^{-1}} \leq (1 + o(1)) \frac{\Delta_x^2}{2 \log(n)}\,.

\end{align*}

Now since $x^*$ will be chosen linearly often it is easily shown for sub-optimal $x$ that $\lim_{n\to\infty} \norm{x - x^*}_{\bar G_n^{-1}} / \norm{x}_{\bar G_n^{-1}} = 1$. This leads to the required constraint on the actions of the algorithm, and the optimization problem in the corollary is derived by minimizing the regret subject to this constraint.

# Clouds looming for optimism

The theorem and its corollary have disturbing implications for strategies based on the principle of optimism in the face of uncertainty: they can never be asymptotically optimal! The reason is that these strategies do not choose actions for which they have collected enough statistics to prove they are sub-optimal, but in the linear setting it can still be worthwhile playing these actions in case they are very informative about *other* actions for which the statistics are not yet so clear. A problematic example appears in the simplest case where any information sharing between the arms occurs at all, namely when the dimension is $d = 2$ and there are $K = 3$ arms.

Specifically, let $\cA = \set{x_1, x_2, x_3}$ where $x_1 = e_1$ and $x_2 = e_2$ and $x_3 = (1-\epsilon, \gamma \epsilon)$ where $\gamma \geq 1$ and $\epsilon > 0$ is small. Let $\theta = (1, 0)$ so that the optimal action is $x^* = x_1$ and $\Delta_{x_2} = 1$ and $\Delta_{x_3} = \epsilon$. Clearly if $\epsilon$ is very small, then $x_1$ and $x_3$ point in nearly the same direction and so choosing only these arms does not quickly reveal which of $x_1$ or $x_3$ is optimal. On the other hand, $x_2$ and $x_1 - x_3$ point in very different directions and so choosing $x_2$ allows a learning agent to quickly identify that $x_1$ is in fact optimal.

We now show how the theorem and corollary demonstrate this. First we calculate what is the optimal solution to the optimization problem. Recall we are trying to minimize

\begin{align*}

\sum_{x \in \cA} \alpha(x) \Delta_x \qquad \text{subject to } \norm{x}^2_{H(\alpha)^{-1}} \leq \frac{\Delta_x^2}{2} \text{ for all } x \in \cA \text{ with } \Delta_x > 0\,,

\end{align*}

where $H(\alpha) = \sum_{x \in \cA} \alpha(x) x x^\top$. Clearly we should choose $\alpha(x_1)$ arbitrarily large, since $\Delta_{x_1} = 0$ means this costs nothing. A computation then shows that

\begin{align*}

\lim_{\alpha(x_1) \to \infty} H(\alpha)^{-1} =

\left[\begin{array}{cc}

0 & 0 \\

0 & \frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)}

\end{array}\right]\,.

\end{align*}

The constraints then mean that

\begin{align*}

&\frac{1}{\alpha(x_3)\epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_2}^2_{H(\alpha)^{-1}} \leq \frac{1}{2} \\

&\frac{\gamma^2 \epsilon^2}{\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2)} = \lim_{\alpha(x_1) \to \infty} \norm{x_3}^2_{H(\alpha)^{-1}} \leq \frac{\epsilon^2}{2}\,.

\end{align*}

Provided that $\gamma \geq 1$ the second constraint implies the first, so this reduces simply to the constraint that

\begin{align*}

\alpha(x_3) \epsilon^2 \gamma^2 + \alpha(x_2) \geq 2\gamma^2\,.

\end{align*}

Since we are minimizing $\alpha(x_2) + \epsilon \alpha(x_3)$ we can easily see that $\alpha(x_2) = 2\gamma^2$ and $\alpha(x_3) = 0$ provided that $2\gamma^2 \leq 2/\epsilon$. Therefore if $\epsilon$ is chosen sufficiently small relative to $\gamma$, then the optimal rate of the regret is $c(\cA, \theta) = 2\gamma^2$ and so there exists a strategy such that

\begin{align*}

\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = 2\gamma^2\,.

\end{align*}
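The two candidate solutions of the optimization problem can be compared numerically. The sketch below is an illustration under several assumptions: $\epsilon = 0.01$ and $\gamma = 2$ are arbitrary, the limit $\alpha(x_1) \to \infty$ is replaced by a large finite value `big`, and a small multiplicative `slack` is added so that the constraints hold strictly for finite $\alpha(x_1)$. The helpers only handle $2 \times 2$ matrices.

```python
def inverse(A):
    """Inverse of an invertible 2x2 matrix given as nested lists."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def quad(v, A):
    """Quadratic form v^T A v."""
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

def design(actions, alpha):
    """H(alpha) = sum_x alpha(x) x x^T."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x, a in zip(actions, alpha):
        for i in range(2):
            for j in range(2):
                H[i][j] += a * x[i] * x[j]
    return H

def feasible(actions, gaps, alpha):
    """Check ||x||^2_{H(alpha)^{-1}} <= Delta_x^2 / 2 for every sub-optimal x."""
    H_inv = inverse(design(actions, alpha))
    return all(quad(x, H_inv) <= g * g / 2
               for x, g in zip(actions, gaps) if g > 0)

def cost(gaps, alpha):
    """Objective sum_x alpha(x) Delta_x."""
    return sum(a * g for a, g in zip(alpha, gaps))

eps, gamma = 0.01, 2.0      # illustrative values with gamma^2 <= 1 / eps
slack, big = 0.01, 1e9      # strictness margin and stand-in for alpha(x_1) -> infinity
actions = [(1.0, 0.0), (0.0, 1.0), (1.0 - eps, gamma * eps)]
gaps = [0.0, 1.0, eps]
via_x2 = [big, 2 * gamma ** 2 * (1 + slack), 0.0]     # explore with x_2 only
via_x3 = [big, 0.0, 2 * (1 + slack) / eps ** 2]       # explore with x_3 only
```

Both allocations are feasible, but the one that explores with $x_2$ costs about $2\gamma^2$ while the one that explores with $x_3$ costs about $2/\epsilon$, which is far larger when $\epsilon$ is small.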

Now we argue that for $\gamma$ sufficiently large and $\epsilon$ arbitrarily small the regret for any consistent optimistic algorithm is at least

\begin{align*}

\limsup_{n\to\infty} \frac{R_n(\cA, \theta)}{\log(n)} = \Omega(1/\epsilon)\,,

\end{align*}

which can be arbitrarily worse than the optimal rate! So why is this so? Recall that optimistic algorithms choose

\begin{align*}

A_t = \argmax_{x \in \cA} \max_{\tilde \theta \in \cC_t} \inner{x, \tilde \theta}\,,

\end{align*}

where $\cC_t \subset \R^d$ is a confidence set that we assume contains the true $\theta$ with high probability. So far this does not greatly restrict the class of algorithms that we might call optimistic. We now assume that there exists a constant $c > 0$ such that

\begin{align*}

\cC_t \subseteq \set{\tilde \theta : \norm{\hat \theta_t - \tilde \theta}_{G_t} \leq c \sqrt{\log(n)}}\,.

\end{align*}

So now we ask how often can we expect the optimistic algorithm to choose action $x_2 = e_2$ in the example described above? Since we have assumed $\theta \in \cC_t$ with high probability we have that

\begin{align*}

\max_{\tilde \theta \in \cC_t} \shortinner{x_1, \tilde \theta} \geq 1\,.

\end{align*}

On the other hand, if $T_{x_2}(t-1) > 4c^2 \log(n)$, then

\begin{align*}

\max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta}

&= \max_{\tilde \theta \in \cC_t} \shortinner{x_2, \tilde \theta - \theta} \\

&\leq 2 c \sqrt{\norm{x_2}^2_{G_t^{-1}} \log(n)} \\

&\leq 2 c \sqrt{\frac{\log(n)}{T_{x_2}(t-1)}} \\

&< 1\,,
\end{align*}
which means that $x_2$ will not be chosen more than $1 + 4c^2 \log(n)$ times. So if $\gamma = \Omega(c^2)$, then the optimistic algorithm will not choose $x_2$ sufficiently often and a simple computation shows it must choose $x_3$ at least $\Omega(\log(n)/\epsilon^2)$ times and suffers regret of $\Omega(\log(n)/\epsilon)$. The key take-away from this is that optimistic algorithms do not choose actions that are statistically sub-optimal, but for linear bandits it can be optimal to choose these actions more often to gain information about *other actions*.
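The “simple computation” can be sketched as follows. This is a heuristic illustration only: it assumes the asymptotic constraint $\alpha(x_3)\epsilon^2\gamma^2 + \alpha(x_2) \geq 2\gamma^2$ from the example (all quantities in units of $\log(n)$), caps $\alpha(x_2)$ at a value `cap` playing the role of $4c^2$, and assumes $\epsilon \gamma^2 \leq 1$ so that $x_2$ is the cheaper source of information. The function name and the numbers are made up.

```python
def regret_rate_with_capped_x2(eps, gamma, cap):
    """Smallest alpha(x2) + eps * alpha(x3) subject to
    alpha(x3) * eps^2 * gamma^2 + alpha(x2) >= 2 * gamma^2 and alpha(x2) <= cap.

    Assumes eps * gamma^2 <= 1, in which case it is best to use x2 as much as allowed.
    """
    if eps * gamma ** 2 > 1:
        raise ValueError("sketch assumes eps * gamma^2 <= 1")
    alpha2 = min(cap, 2 * gamma ** 2)
    alpha3 = max(0.0, (2 * gamma ** 2 - alpha2) / (eps ** 2 * gamma ** 2))
    return alpha2 + eps * alpha3

eps, gamma, c = 0.01, 2.0, 1.0
unconstrained = regret_rate_with_capped_x2(eps, gamma, cap=float("inf"))
optimistic = regret_rate_with_capped_x2(eps, gamma, cap=4 * c ** 2)
```

With the cap in place the missing information has to come from $x_3$, and the resulting rate scales like $1/\epsilon$ rather than staying at $2\gamma^2$.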

# Notes

Note 1: The worst-case bound demonstrates the near-optimality of the OFUL algorithm for a specific action-set. It is an open question to characterize the optimal regret for a wide range of action-sets. We will return to these issues soon when we discuss adversarial linear bandits.

Note 2: There is an algorithm that achieves the asymptotic lower bound (see references below), but so far there is no algorithm that is simultaneously asymptotically optimal and (near) minimax optimal.

Note 3: The assumption that $x^*$ was unique in the asymptotic results can be relaxed at the price of a little more work, and simple (natural) modifications to the theorem statements.

# References

Worst-case lower bounds for stochastic bandits have appeared in a variety of places, all with roughly the same bound, but for different action sets. Our very simple proof is new, but takes inspiration mostly from the paper by Shamir.

- Dani, Hayes and Kakade. Stochastic Linear Optimization under Bandit Feedback, 2008.
- Rusmevichientong and Tsitsiklis. Linearly Parameterized Bandits, 2010.
- Shamir. On the Complexity of Bandit Linear Optimization, 2014.

The asymptotic lower bound (along with a strategy for which the upper bound matches) is by the authors.

- Lattimore and Szepesvari. The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits, 2016.

The example used to show optimistic approaches cannot achieve the optimal rate has been used before in the pure exploration setting where the goal is to simply find the best action, without the constraint that the regret should be small.

- Soare, Lazaric, Munos. Best-Arm Identification in Linear Bandits, 2015.

We should also mention that examples have been constructed before demonstrating the need for carefully balancing the trade-off between information and regret. For many examples, and a candidate design principle for addressing these issues see

- Russo & Van Roy. Learning to Optimize via Information-Directed Sampling, 2014.

The results for the unrealizable case are inspired by the work of one of the authors on the *Pareto regret frontier* for bandits, which characterizes what trade-offs are available when it is desirable to have a regret that is unusually small relative to some specific arms.

- Lattimore. The Pareto Regret Frontier for Bandits, 2015.

# Ellipsoidal Confidence Sets for Least-Squares Estimators

Continuing the previous post, here we give a construction for confidence bounds based on ellipsoidal confidence sets. We also put things together and show a bound on the regret of the UCB strategy that uses the constructed confidence bounds.

# Constructing the confidence bounds

To construct the confidence bounds we will construct appropriate confidence sets $\cC_t$, which will be based on least-squares, more precisely penalized least-squares estimates. In a later post we will show a different construction that improves the regret when the parameter vector is sparse. But first things first, let’s see how to construct those confidence bounds in the absence of additional knowledge.

Assume that we are at the end of stage $t$ when a bandit algorithm has chosen $A_1,\dots,A_t\in \R^d$ and received the respective payoffs $X_1,\dots,X_t$. The **penalized least-squares**, also known as the **ridge-regression** estimate of $\theta_*$, is defined as the minimizer of the penalized squared empirical loss,

\begin{align*}

L_{t}(\theta) = \sum_{s=1}^{t} (X_s - \ip{A_s,\theta})^2 + \lambda \norm{\theta}_2^2\,,

\end{align*}

where $\lambda\ge 0$ is the “penalty factor”. Choosing $\lambda>0$ helps because it ensures that the loss function has a unique minimizer even when $A_1,\dots,A_t$ do not span $\R^d$, which simplifies the math. By solving for $L_t'(\theta)=0$, the optimizer $\hat \theta_t \doteq \argmin_{\theta\in \R^d} L_t(\theta)$ of $L_t$ can be easily seen to satisfy

\begin{align*}

\hat \theta_t = V_t(\lambda)^{-1} \sum_{s=1}^t X_s A_s\,,

\end{align*}

where

\begin{align*}

V_t(\lambda) = \lambda I + \sum_{s=1}^t A_s A_s^\top\,.

\end{align*}
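The closed form above translates directly into code. The following is a minimal sketch in plain Python: the function names, the small Gaussian elimination routine and the toy data are illustrative choices and not part of the original construction. It builds $V_t(\lambda)$ and $\sum_s X_s A_s$ and then solves the linear system instead of forming the inverse explicitly.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, d))
        z[i] = (aug[i][d] - tail) / aug[i][i]
    return z

def ridge_estimate(actions, rewards, lam):
    """Return (theta_hat, V) with V = lam * I + sum_s A_s A_s^T."""
    d = len(actions[0])
    V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for a, x in zip(actions, rewards):
        for i in range(d):
            b[i] += x * a[i]
            for j in range(d):
                V[i][j] += a[i] * a[j]
    return solve(V, b), V

# Toy data: noiseless rewards generated by the hypothetical parameter (2, -1).
actions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rewards = [2.0, -1.0, 1.0]
theta_unpenalized, _ = ridge_estimate(actions, rewards, lam=0.0)
theta_ridge, V_ridge = ridge_estimate(actions, rewards, lam=1.0)
```

With $\lambda = 0$ the sketch only works when the actions span $\R^d$ (otherwise the elimination divides by zero), whereas any $\lambda > 0$ makes $V_t(\lambda)$ invertible at the price of shrinking the estimate towards zero.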

The matrix $V_t(0)$ is known as the **Grammian** underlying $\{A_s\}_{s\le t}$ and we will also refer to $V_t(\lambda)$ as the Grammian. For a fixed sequence $\{A_s\}_s$ and independent Gaussian noise, a confidence set is very easy to get just by looking at the definition of the least-squares estimate. To get some intuition this is exactly what we will do first.

## Building intuition: Fixed design regression and independent Gaussian noise

To get a sense of what a confidence set $\cC_{t+1}$ should look like we start with a simplified setting, where we make the following extra assumptions.

- Gaussian noise: $(\eta_s)_s$ is an i.i.d. sequence and in particular $\eta_s\sim \mathcal N(0,1)$;
- Nonsingular Grammian: $V \doteq V_t(0)$ is invertible;
- “Fixed design”: $A_1,\dots,A_t$ are deterministically chosen without the knowledge of $X_1,\dots,X_t$.

The distributional assumption on $(\eta_s)_s$ and the second assumption are for convenience. In particular, the second assumption lets us set $\lambda=0$, which we will indeed do. The independence of $(\eta_s)_s$ and the third assumption, on the other hand, are anything but innocent. In the absence of these we will be forced to use specific techniques.

A bit about notation: To emphasize that $A_1,\dots,A_t$ are chosen deterministically, we will use $a_s$ in place of $A_s$ (recall our convention that lowercase letters denote nonrandom, deterministic values). With this, we have $V = \sum_s a_s a_s^\top$ and $\hat \theta_t = V^{-1} \sum_s X_s a_s$.

Plugging in $X_s = \ip{a_s,\theta_*}+\eta_s$, $s=1,\dots,t$, into the expression of $\hat \theta_t$, we get

\begin{align}

% \hat \theta_t = V^{-1} \sum_{s=1}^t a_s a_s^\top \theta_* + V^{-1} \sum_{s=1}^t \eta_s a_s

\hat \theta_t -\theta_*

= V^{-1} Z\,,

\label{eq:lserror0}

\end{align}

where

\begin{align*}

Z = \sum_{s=1}^t \eta_s a_s\,.

\end{align*}
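
For completeness, here is the one-line calculation behind \eqref{eq:lserror0}, using only the definition $V=\sum_s a_s a_s^\top$ and the model $X_s = \ip{a_s,\theta_*}+\eta_s$:

\begin{align*}
\hat \theta_t
= V^{-1} \sum_{s=1}^t a_s \left( a_s^\top \theta_* + \eta_s \right)
= V^{-1} V \theta_* + V^{-1} \sum_{s=1}^t \eta_s a_s
= \theta_* + V^{-1} Z\,.
\end{align*}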

Noting that the distribution of a linear combination of independent Gaussian random variables is also Gaussian, we see that $Z$ is also normally distributed. In particular, from $\EE{Z}= 0$ and $\EE{ Z Z^\top } = V$ we immediately see that $Z \sim \mathcal N(0, V )$ (a Gaussian distribution is fully determined by its mean and covariance). From this it follows that

\begin{align}

V^{1/2} (\hat \theta_t -\theta_*) = V^{-1/2} Z \sim \mathcal N(0,I)\,,

\label{eq:standardnormal}

\end{align}

where $V^{1/2}$ is a square root of the symmetric matrix $V$ (i.e., $V^{1/2} V^{1/2} = V$).

To get a confidence set for $\theta_*$ we can then choose any $S\subset \R^d$ such that

\begin{align}

\frac{1}{\sqrt{(2\pi)^d}}\int_S \exp\left(-\frac{1}{2}\norm{x}_2^2\right) \,dx \ge 1-\delta\,.

\label{eq:lbanditsregion}

\end{align}

Indeed, for such a subset $S$, defining $\cC_{t+1} = \{\theta\in \R^d\,:\, V^{1/2} (\hat \theta_t -\theta) \in S \}$, we see that $\cC_{t+1}$ is a **$(1-\delta)$-level confidence set**:

\begin{align*}

\Prob{\theta_*\in \cC_{t+1}} = \Prob{ V^{1/2} (\hat \theta_t -\theta_*) \in S } \ge 1-\delta\,.

\end{align*}

(In particular, if \eqref{eq:lbanditsregion} holds with an equality, we also have an equality in the last display.)

How should the set $S$ be chosen? One natural choice is based on constraining the $2$-norm of $V^{1/2} (\hat \theta_t -\theta)$. This has the appeal that $S$ will be a Euclidean ball, which makes $\cC_{t+1}$ an ellipsoid. The details are as follows: Recalling that the distribution of the sum of the squares of $d$ independent $\mathcal N(0,1)$ random variables is the $\chi^2$ distribution with $d$ degrees of freedom (in short, $\chi^2_d$), from \eqref{eq:standardnormal} we get

\begin{align}

\norm{\hat \theta_t - \theta_*}_{V}^2 = \norm{ Z }_{V^{-1}}^2 \sim \chi^2_d\,.

\label{eq:lschi2}

\end{align}

Now, if $F(t)$ is the tail probability of the $\chi^2_d$ distribution: $F(t) = \Prob{ U> t}$ for $U\sim \chi_d^2$, it is easy to verify that

\begin{align*}

\cC_{t+1} = \left\{ \theta\in \R^d \,:\, \norm{\hat \theta_t - \theta}_{V}^2 \le t \right\}

\end{align*}

contains $\theta_*$ with probability $1-F(t)$. Hence, $\cC_{t+1}$ is a $(1-F(t))$-level confidence set for $\theta_*$. To get the value of $t$ given $F(t) = \delta$, we can either resort to numerical calculations, or use Chernoff’s method. After some calculation, this latter approach gives $t \le d + 2 \sqrt{ d \log(1/\delta) } + 2 \log(1/\delta)$, which implies that

\begin{align}

\cC_{t+1} = \left\{ \theta\in \R^d\,:\, \norm{ \hat \theta_t-\theta }_{V}^2 \le d + 2 \sqrt{ d \log(1/\delta) } + 2 \log(1/\delta) \right\}\,

\label{eq:confchi2}

\end{align}

is a $(1-\delta)$-level confidence set for $\theta_*$ (see Lemma 1 on page 1355 of a paper by Laurent and Massart).
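
To get a feel for how conservative the radius in \eqref{eq:confchi2} is, one can compare it with simulated draws from the $\chi^2_d$ distribution. The sketch below is only an illustration under the fixed-design Gaussian assumptions of this section; the function names `chi2_radius` and `miss_rate`, the number of runs and the seed are arbitrary choices made here.

```python
import math
import random


def chi2_radius(d, delta):
    """Squared radius d + 2*sqrt(d*log(1/delta)) + 2*log(1/delta)."""
    log_term = math.log(1.0 / delta)
    return d + 2.0 * math.sqrt(d * log_term) + 2.0 * log_term


def miss_rate(d, delta, runs=20000, seed=0):
    """Fraction of runs in which a chi-square(d) draw exceeds the radius."""
    rng = random.Random(seed)
    radius = chi2_radius(d, delta)
    misses = 0
    for _ in range(runs):
        # ||V^{1/2}(theta_hat - theta_*)||^2 is a sum of d squared N(0,1) draws.
        u = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if u > radius:
            misses += 1
    return misses / runs


# The empirical miss rate should not exceed delta (up to simulation noise).
print(chi2_radius(3, 0.1), miss_rate(3, 0.1))
```

Since the radius comes from an upper bound on the tail, the observed miss rate is typically well below $\delta$.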

## Martingale noise and Laplace’s method

We now start working towards gradually removing the extra assumptions. In particular, we first ask what happens when we only know that $\eta_1,\dots,\eta_t$ are conditionally $1$-subgaussian:

\begin{align}

\EE{ \exp(\lambda \eta_s) | \eta_1,\dots,\eta_{s-1} } \le \exp( \frac{\lambda^2}{2} )\,, \qquad s = 1,\dots, t\,.

\label{eq:condsgnoise}

\end{align}

Can we still get a confidence set, say, of the form \eqref{eq:confchi2}? Recall that previously to get this confidence set all we had to do was to upper bound the tail probabilities of the “normalized error” $\norm{Z}_{V^{-1}}^2$ (cf. \eqref{eq:lschi2}). How can we get this when we only know that $(\eta_s)_s$ are conditionally $1$-subgaussian?

Before diving into this let us briefly mention that \eqref{eq:condsgnoise} implies that $(\eta_s)_s$ is a $(\sigma(\eta_1,\dots,\eta_s))_s$-adapted **martingale difference process**:

Definition (martingale difference process): Let $(\cF_s)_s$ be a filtration over a probability space $(\Omega,\cF,\PP)$ (i.e., $\cF_{s} \subset \cF_{s+1}$ for all $s$ and also $\cF_s \subset \cF$). The sequence of random variables $(U_s)_s$ is an $(\cF_s)_s$-adapted martingale difference process if for all $s$, $\EE{U_s}$ exists and $U_s$ is $\cF_s$-measurable and $\EE{ U_s|\cF_{s-1}}=0$.

A collection of random variables is in general called a “random process”. Somewhat informally a martingale difference process is also called “martingale noise”. We see that in the linear bandit model, the noise process, $(\eta_s)_s$, is necessarily martingale noise with the filtration given by $\cF_s = \sigma(A_1,X_1,\dots,A_{s-1},X_{s-1},A_s)$. Note the inclusion of $A_s$ in the definition of $\cF_s$. The martingale noise assumption allows the noise $\eta_s$ impacting the feedback in round $s$ to depend on past choices, including the most recent action. This is actually essential if we have, for example, Bernoulli payoffs. If $(U_s)_s$ is an $(\cF_s)_s$-adapted martingale difference process, the partial sums $M_t = \sum_{s=1}^t U_s$ define an $(\cF_s)_s$-adapted **martingale**. When the filtration is clear from the context, the reference to it is often dropped.

Let us return to the construction of confidence sets. Since we want exponentially decaying tail probabilities, one is tempted to try Chernoff’s method, which, given a random variable $U$, yields $\Prob{U\ge u}\le \EE{\exp(\lambda U)}\exp(-\lambda u)$ for any $\lambda\ge 0$. When $U$ is $1$-subgaussian, $\EE{\exp(\lambda U)}$ can be conveniently bounded by $\exp(\lambda^2/2)$, after which we can choose $\lambda$ to get the tightest bound, minimizing the quadratic $\lambda^2/2-\lambda u$ over nonnegative values.

To make this work with $U= \norm{Z}_{V^{-1}}^2$, we need to bound $\EE{\exp(\lambda \norm{Z}_{V^{-1}}^2 )}$. Unfortunately, this turns out to be a daunting task! Can we still somehow use Chernoff’s method? Let us start with what we know: We know that there are (conditionally) $1$-subgaussian random variables that make up $Z$, namely $\eta_1,\dots,\eta_t$. Hence, we may try to see just how $\EE{ \exp( \lambda^\top Z ) }$ behaves for some $\lambda\in \R^d$. Note that we had to switch to a vector $\lambda$ since $Z$ is vector-valued. An easy calculation (using \eqref{eq:condsgnoise} first with $s=t$, then with $s=t-1$, etc.) gives

\begin{align*}

\EE{ \exp(\lambda^\top Z) } = \EE{ \exp( \sum_{s=1}^t (\lambda^\top a_s) \eta_s ) } \le \exp( \frac12 \sum_{s=1}^t (\lambda^\top a_s)^2) = \exp( \frac12 \lambda^\top V \lambda )\,.

\end{align*}
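
To spell out the chaining step behind this inequality, write $c_s = \lambda^\top a_s$ (a shorthand introduced only for this remark). Since $a_1,\dots,a_t$ are deterministic, conditioning on $\eta_1,\dots,\eta_{t-1}$ and applying \eqref{eq:condsgnoise} with $c_t$ in place of $\lambda$ gives

\begin{align*}
\EE{ \exp( \sum_{s=1}^t c_s \eta_s ) }
&= \EE{ \exp( \sum_{s=1}^{t-1} c_s \eta_s ) \, \EE{ \exp( c_t \eta_t ) | \eta_1,\dots,\eta_{t-1} } } \\
&\le \exp( \frac{c_t^2}{2} ) \, \EE{ \exp( \sum_{s=1}^{t-1} c_s \eta_s ) }\,.
\end{align*}

Repeating the same step for $s=t-1,\dots,1$ peels off one factor $\exp(c_s^2/2)$ at a time and yields the bound $\exp( \frac12 \sum_{s=1}^t c_s^2 )$.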

How convenient that $V$ appears on the right-hand side of this inequality! But does this have anything to do with $U=\norm{Z}_{V^{-1}}^2$?

Rewriting the above inequality as

\begin{align*}

\EE{ \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda) } \le 1\,,

\end{align*}

we may notice that

\begin{align*}

\max_\lambda \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda)

= \exp( \max_\lambda \lambda^\top Z -\frac12 \lambda^\top V \lambda ) = \exp(\frac12 \norm{Z}_{V^{-1}}^2)\,,

\end{align*}

where the last equality comes from solving $f'(\lambda)=0$ for $\lambda$ where $f(\lambda)=\lambda^\top Z -\frac12 \lambda^\top V \lambda$. It will be useful to explicitly write the expression of the optimal $\lambda$:

\begin{align*}

\lambda_* = V^{-1} Z\,.

\end{align*}
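
As a quick check of this maximization (assuming, as in this section, that $V$ is symmetric and positive definite, so that $f$ is strictly concave): the gradient is $\nabla f(\lambda) = Z - V\lambda$, which vanishes exactly at $\lambda_*$, and

\begin{align*}
f(\lambda_*) = Z^\top V^{-1} Z - \frac12 Z^\top V^{-1} V V^{-1} Z = \frac12 Z^\top V^{-1} Z = \frac12 \norm{Z}_{V^{-1}}^2\,.
\end{align*}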

It is worthwhile to point out that this argument uses a “linearization trick” for ratios which can be applied in all kinds of situations. Abstractly, the trick is to write a ratio as the maximum of an expression that depends on the square root of the numerator and on the denominator in a linear fashion: For $a\in \R$, $b> 0$, $\max_{x\in \R} ax-\frac12 bx^2 = \frac{a^2}{2b}$.

Let us summarize what we have so far. For this, introduce

\begin{align*}

M_\lambda = \exp(\lambda^\top Z -\frac12 \lambda^\top V \lambda)\,.

\end{align*}

Then, on the one hand, we have that $\EE{ M_{\lambda} }\le 1$, while on the other hand we have that $\max_{\lambda} M_{\lambda} = \exp(\frac12 \norm{Z}_{V^{-1}}^2)$. Combining this with Chernoff’s method we get

\begin{align*}

\Prob{ \frac12 \norm{Z}_{V^{-1}}^2 > u } = \Prob{ \max_\lambda M_{\lambda} > \exp(u) } \le \exp(-u) \EE{ \max_\lambda M_{\lambda} }\,.

\end{align*}

Thus, we are left with bounding $\EE{ \max_\lambda M_\lambda}$. Unfortunately, in general $\EE{ \max_\lambda M_{\lambda} }\ge\max_{\lambda}\EE{ M_{\lambda} }$, with the inequality typically being strict, so knowing that $\EE{ M_\lambda } \le 1$ holds for any fixed $\lambda$ is not useful on its own. We need to somehow argue that the expectation of the maximum is still not too large.

There are at least two possibilities, both having their own virtues. The first one is to replace the continuous maximum with a maximum over an appropriately selected finite subset of $\R^d$ and argue that the error introduced this way is small. This is known as the “covering argument” as we need to cover the “parameter space” sufficiently finely to argue that the approximation error is small. An alternative, perhaps lesser known but quite powerful approach, is based on Laplace’s method of approximating the maximum value of a function using an integral. The power of this is that it removes the need for bounding $\EE{ \max_\lambda M_\lambda }$. We will base our construction on this approach.

To understand how the integral approximation of a maximum works, let us briefly review Laplace’s method. It is best to do this in a simple case. Thus, assume that we are given a smooth function $f:[a,b]\to \R$ which has a unique maximum at $x_0\in (a,b)$. Laplace’s method to approximate $f(x_0)$ is to compute the integral

\begin{align*}

I_s \doteq \int_a^b \exp( s f(x) ) dx

\end{align*}

for some large value of $s>0$. The idea is that this behaves like a Gaussian integral. Indeed, writing $f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac12 f''(x_0) (x-x_0)^2 + R_3(x)$, since $x_0$ is a maximizer of $f$, we have $f'(x_0)=0$ and $-q\doteq f''(x_0)<0$. Under appropriate technical assumptions,

\begin{align*}
I_s \sim \int_a^b \exp( s f(x_0) ) \exp( -\frac{(x-x_0)^2}{2/(sq)} ) \,dx
\end{align*}

as $s\to \infty$. Now, as $s$ gets large, $\int_a^b \exp( -\frac{(x-x_0)^2}{2/(sq)} ) \,dx \sim \int_{-\infty}^{\infty} \exp( -\frac{(x-x_0)^2}{2/(sq)} ) \,dx = \sqrt{ \frac{2\pi}{sq} }$ and hence

\begin{align*}
I_s \sim \exp( s f(x_0) ) \sqrt{ \frac{2\pi}{sq} }\,.
\end{align*}

Intuitively, the dominant term in the integral $I_s$ is $\exp( s f(x_0) )$. It should also be clear that the fact that we integrate with respect to the Lebesgue measure does not matter much: We could have integrated with respect to any other measure as long as that measure puts a positive mass on the neighborhood of the maximizer. The method is illustrated on the figure shown below. The take-home message of this is that if we integrate the exponential of a function that has a pronounced maximum then we can expect that the integral will be close to the exponential function of the maximum.

Since $M_\lambda$ is already the exponential function of the expression to be maximized, this gives us the idea to replace $\max_\lambda M_\lambda$ with $\bar M = \int M_\lambda h(\lambda) d\lambda$, where $h$ is a probability density that will be conveniently chosen so that the integral can be calculated in closed form (this is not really a requirement of the method, but it just makes the argument shorter). The main benefit of replacing the maximum with an integral is of course that (from Fubini's theorem) we easily get

\begin{align}
\EE{ \bar M } = \int \EE{ M_\lambda } h(\lambda) d\lambda \le 1
\label{eq:barmintegral}
\end{align}

and thus, by Markov's inequality,

\begin{align}
\Prob{ \log(\bar M) \ge u } \le e^{-u}\,.
\label{eq:barmbound}
\end{align}

Thus, it remains to choose $h$ and calculate $\bar M$. When choosing $h$ we want two things: $h$ should put a large mass at the maximizer of $M_\lambda$ (which is $V^{-1}Z$), and either $\bar M$ should be available in closed form (with $\bar M \approx \max_\lambda M_\lambda$ in some sense), or a lower bound on $\bar M$ should be easy to obtain which is still close to $\max_\lambda M_\lambda$. Recalling the form of $M_\lambda$, we realize that if we choose $h$ to be the density of a Gaussian then the calculation of $\bar M$ will reduce to the calculation of a Gaussian integral, a convenient outcome since Gaussian integrals can be evaluated in closed form. In particular, setting $h$ to be the density of $\mathcal N(0,H^{-1})$, we find that

\begin{align*}
\bar M = \frac{1}{\sqrt{(2\pi)^d \det H^{-1}}} \int \exp( \lambda^\top Z - \frac12 \norm{\lambda}_{V}^2 - \frac12 \norm{\lambda}_{H}^2 ) d\lambda\,.
\end{align*}

Completing the square we get

\begin{align*}
\lambda^\top Z - \frac12 \norm{\lambda}_{V}^2 - \frac12 \norm{\lambda}_{H}^2
= \frac12 \norm{Z}_{(H+V)^{-1}}^2 -\frac12 \norm{\lambda-(H+V)^{-1}Z}_{H+V}^2\,.
\end{align*}

Hence, a short calculation gives

\begin{align*}
\bar M = \left(\frac{\det (H)}{\det (H+V)}\right)^{1/2} \exp( \frac12 \norm{Z}_{(H+V)^{-1}}^2 ) \,,
\end{align*}

which, combined with \eqref{eq:barmbound}, gives

\begin{align}
\label{eq:selfnormalizedbound}
\Prob{ \frac12 \norm{Z}_{(H+V)^{-1}}^2 \ge u+ \frac12 \log \frac{\det (H+V)}{\det(H)} } \le e^{-u}\,.
\end{align}

Now choosing $H=V$, we have $\det(H+V) = 2^d \det(V)$ and $\norm{Z}_{(H+V)^{-1}}^2 = \norm{Z}_{(2V)^{-1}}^2 = Z^\top (2V)^{-1} Z = \frac12 \norm{Z}_{V^{-1}}^2$, giving

\begin{align*}
\Prob{ \norm{Z}_{V^{-1}}^2 \ge 2\log(2)d+ 4u } \le e^{-u}\,.
\end{align*}

Using the identity $\norm{\hat \theta_t - \theta_*}_{V}^2 = \norm{Z}_{V^{-1}}^2$ and setting $u=\log(1/\delta)$ we get that

\begin{align*}
\cC_{t+1} = \left\{ \theta\in \R^d \,:\, \norm{\hat\theta_t-\theta}_V^2 \le 2\log(2)d+4 \log(\tfrac1\delta) \right\}
\end{align*}

is a $(1-\delta)$-level confidence set for $\theta_*$. Compared with \eqref{eq:confchi2} (the confidence set that is based on approximating the tail of the chi-square distribution using Chernoff's method), we see that the two radii scale similarly as a function of $d$ and $\delta$, with the new confidence set losing a bit (though only by a constant factor) as $\delta\to 0$. In general, the radii are incomparable. That the two scale so similarly is quite remarkable given the generality gained.
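
The closed form for $\bar M$ can be sanity-checked numerically. The sketch below does this in dimension one, where the scalars `z`, `v` and `h` are hypothetical stand-ins for $Z$, $V$ and $H$; the integral $\int M_\lambda h(\lambda)\,d\lambda$ is approximated with a plain midpoint rule, and the integration range and step count are arbitrary choices made for this illustration.

```python
import math


def mixture_closed_form(z, v, h):
    """bar M = sqrt(h / (h + v)) * exp(z^2 / (2 (h + v))) in dimension one."""
    return math.sqrt(h / (h + v)) * math.exp(z * z / (2.0 * (h + v)))


def mixture_numeric(z, v, h, half_width=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of M_lambda * h(lambda)."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        lam = -half_width + (i + 0.5) * dx
        m_lam = math.exp(lam * z - 0.5 * v * lam * lam)
        # Density of N(0, 1/h), i.e. the one-dimensional N(0, H^{-1}).
        density = math.sqrt(h / (2.0 * math.pi)) * math.exp(-0.5 * h * lam * lam)
        total += m_lam * density * dx
    return total


def max_over_lambda(z, v):
    """max over lambda of M_lambda = exp(z^2 / (2 v)), attained at z / v."""
    return math.exp(z * z / (2.0 * v))


z, v, h = 1.5, 2.0, 2.0
print(mixture_closed_form(z, v, h), mixture_numeric(z, v, h), max_over_lambda(z, v))
```

The first two printed numbers should agree to many digits, while the third, the exact maximum, is larger: the mixture trades a controlled loss against the maximum for an expectation that is easy to bound.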

## Confidence sets for sequential designs

The approach just described generalizes almost without any changes to the case when $A_1,\dots,A_t$ are not fixed, but are sequentially chosen as it is done by UCB (this is what is known as a “sequential design” in statistics). The main difference is that in this case it is not possible to choose $H=V$ in the last step. We will also drop the assumption that $V_t(0)$ is invertible and hence use $V_t(\lambda)$ with $\lambda>0$ in place of $V = V_t(0)$. Because of this, we need to replace the identity \eqref{eq:lserror0} with

\begin{align}

%\hat \theta_t

% &= V_t^{-1}(\lambda) \sum_{s=1}^t A_s A_s^\top \theta_* + V_t^{-1}(\lambda) \sum_{s=1}^t \eta_s A_s \\

% & = V_t^{-1}(\lambda) \left(\lambda I + \sum_{s=1}^t A_s A_s^\top\right) \theta_*

% + V_t^{-1}(\lambda) \sum_{s=1}^t \eta_s A_s - \lambda V_t^{-1}(\lambda)\theta_*

% & = \theta_* + V_t^{-1}(\lambda) Z - \lambda V_t^{-1}(\lambda)\theta_*

\hat \theta_t -\theta_* = V_t^{-1}(\lambda) Z - \lambda V_t^{-1}(\lambda)\theta_*\,,

\label{eq:hthdeviation}

\end{align}

and thus

\begin{align*}

V_t^{1/2}(\lambda) (\hat \theta_t -\theta_*) = V_t^{-1/2}(\lambda) Z - \lambda V_t^{-1/2}(\lambda)\theta_*\,.

\end{align*}

Inequality \eqref{eq:selfnormalizedbound} still holds, though, as already noted, in this case the choice $H=V$ is not available because this would make the density $h$ random, which would undermine the equality in \eqref{eq:barmintegral} (one may try to condition on $V$ to bound $\EE{\bar M}$ but this introduces other problems). Hence, we will simply set $H=\lambda I$, which gives a high-probability bound on $\norm{ Z }_{V_t^{-1}(\lambda)}^2$ and eventually gives rise to the following theorem:

Theorem: Assuming that $(\eta_s)_s$ are conditionally $1$-subgaussian, for any $u\ge 0$,

\begin{align}

\Prob{ \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} \norm{\theta_*} + \sqrt{ 2 u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } } \le e^{-u}\,

\label{eq:ellipsoidbasic}

\end{align}

and in particular, assuming $\norm{\theta_*}\le S$,

\begin{align}

C_{t+1} = \left\{ \theta\in \R^d\,:\,

\norm{\hat \theta_t - \theta}_{V_t(\lambda)} \le \sqrt{\lambda} S + \sqrt{ 2 \log(\frac1\delta) + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}

\label{eq:ellipsoidconfset}

\end{align}

is a $(1-\delta)$-level confidence set: $\Prob{\theta_*\in C_{t+1}}\ge 1-\delta$.
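
In an implementation, the quantities appearing in \eqref{eq:ellipsoidconfset} can be maintained incrementally. The sketch below is one possible way to do this, not the construction used in any particular library: the class name `ConfidenceEllipsoid` is hypothetical, the inverse of $V_t(\lambda)$ is updated with the Sherman-Morrison formula, and the log-determinant ratio is updated with the matrix determinant lemma $\det(V + aa^\top) = \det(V)\,(1+\norm{a}_{V^{-1}}^2)$.

```python
import math


class ConfidenceEllipsoid:
    """Tracks theta_hat, V_t(lam), its inverse and log(det V_t(lam)/det(lam I))."""

    def __init__(self, d, lam, S, delta):
        self.d, self.lam, self.S, self.delta = d, lam, S, delta
        self.V = [[(lam if i == j else 0.0) for j in range(d)] for i in range(d)]
        self.Vinv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
        self.b = [0.0] * d        # sum_s X_s A_s
        self.log_det_ratio = 0.0  # log det V_t(lam) - log det(lam I)

    def update(self, a, x):
        """Incorporate action a (list of floats) and observed reward x."""
        d = self.d
        u = [sum(self.Vinv[i][j] * a[j] for j in range(d)) for i in range(d)]
        gain = 1.0 + sum(a[i] * u[i] for i in range(d))  # 1 + ||a||^2 in V^{-1} norm
        self.log_det_ratio += math.log(gain)             # matrix determinant lemma
        for i in range(d):
            for j in range(d):
                self.V[i][j] += a[i] * a[j]
                self.Vinv[i][j] -= u[i] * u[j] / gain    # Sherman-Morrison update
            self.b[i] += x * a[i]

    def theta_hat(self):
        d = self.d
        return [sum(self.Vinv[i][j] * self.b[j] for j in range(d)) for i in range(d)]

    def radius(self):
        """sqrt(lam) * S + sqrt(2 log(1/delta) + log det ratio)."""
        return math.sqrt(self.lam) * self.S + math.sqrt(
            2.0 * math.log(1.0 / self.delta) + self.log_det_ratio)

    def contains(self, theta):
        """True if the V_t(lam)-norm of theta_hat - theta is at most the radius."""
        d = self.d
        diff = [th - t for th, t in zip(self.theta_hat(), theta)]
        norm_sq = sum(diff[i] * self.V[i][j] * diff[j] for i in range(d) for j in range(d))
        return math.sqrt(max(norm_sq, 0.0)) <= self.radius()


ce = ConfidenceEllipsoid(d=2, lam=1.0, S=1.0, delta=0.1)
ce.update([1.0, 0.0], 2.0)
ce.update([0.0, 1.0], 3.0)
print(ce.theta_hat(), ce.radius())
```

Each update costs $O(d^2)$ operations, which avoids recomputing a matrix inverse and a determinant from scratch in every round.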

## Avoiding union bounds

For our results we need to ensure that $\theta_*$ is included in all of $C_1,C_2,\dots$. To ensure this one can use the union bound: In particular, we can replace $\delta$ used in the definition of $C_{t+1}$ by $\delta/(t(t+1))$. Then, the probability that at least one of the confidence sets fails to contain $\theta_*$ is at most $\delta \sum_{t=1}^\infty \frac{1}{t(t+1)} = \delta$. The effect of this is that the radius of the confidence ellipsoid used in round $t$ grows by an extra $O(\log(t))$ term under the square root, which results in looser bounds and a larger regret. Fortunately, this is actually easy to avoid by resorting to a stopping time argument due to Freedman.

For developing the argument fix some positive integer $n$. To explain the technique we need to make the time dependence of $Z$ explicit. Thus, for $t\in [n]$ we let $Z_t = \sum_{s=1}^t \eta_s A_s$. Define also $M_\lambda(t)\doteq\exp( \lambda^\top Z_t - \frac12 \lambda^\top V_t(0) \lambda )$. In constructing $C_{t+1}$, the core inequality in the previous proof was that $\EE{ M_\lambda(t) } \le 1$, which allowed us to conclude the same for $\bar M(t) \doteq \int h(\lambda) M_\lambda(t) d\lambda$ and thus via Chernoff’s method led to $\Prob{\bar M(t) \ge e^{u}}\le e^{-u}$. As briefly mentioned earlier, the proof of $\EE{ M_\lambda(t) } \le 1$ is based on chaining the inequalities

\begin{align}

\EE{ M_\lambda(s) | M_\lambda(s-1) } \le M_\lambda(s-1)

\label{eq:supermartingale}

\end{align}

for $s=t,t-1,\dots,1$, where we can define $M_\lambda(0)=1$. That $(M_\lambda(s))_s$ satisfies \eqref{eq:supermartingale} makes this sequence what is called a **supermartingale** adapted to the filtration $(\cF_s)_s \doteq (\sigma(A_1,X_1,\dots,A_s,X_s))_s$:

Definition (supermartingale): Let $(\cF_s)_{s\ge 0}$ be a filtration. A sequence of random variables, $(X_s)_{s\ge 0}$, is called an $(\cF_s)_s$-adapted supermartingale if $(X_s)_s$ is $(\cF_s)_s$ adapted (i.e., $X_s$ is $\cF_s$-measurable), the expectation of all the random variables is defined, and $\EE{X_s|\cF_{s-1}}\le X_{s-1}$ holds for $s=1,2,\dots$.

Integrating the above inequalities in \eqref{eq:supermartingale} with respect to $h(\lambda) d\lambda$ we immediately see that $(\bar M(s))_s$ is also an $(\cF_s)_s$ supermartingale with the filtration $(\cF_s)_s$ as before. A supermartingale process $(X_s)_s$ has the advantageous property that if we “stop it” at a random time $\tau\in [n]$ “without peeking into the future” then its mean still cannot increase: $\EE{X_\tau} \le \EE{X_1}$. When $\tau$ is a random time with this property, it is called a stopping time:

Definition (stopping time): Let $(\cF_s)_{s\in [n]}$ be a filtration. Then a random variable $\tau$ with values in $[n]$ is a stopping time given $(\cF_s)_s$ if for any $s\in [n]$, $\{\tau=s\}$ is an event of $\cF_s$.

Stopping times are also defined when $n=\infty$ but we will not need this generality here.

Let $\tau$ thus be an arbitrary stopping time given the filtration $(\cF_s)_s \doteq (\sigma(A_1,X_1,\dots,A_s,X_s))_s$. In accordance with our discussion, $\EE{ \bar M(\tau) } \le \EE{ \bar M(0) } = 1$. From this it immediately follows that \eqref{eq:ellipsoidbasic} holds even when $t$ is replaced by $\tau$.

To see how this can be used to our advantage it will be convenient to introduce the events

\begin{align*}

\cE_t = \left\{ \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} S + \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}\,, \qquad t=1,\dots,n\,.

\end{align*}

With this, \eqref{eq:ellipsoidbasic} takes the form $\Prob{\cE_t}\le e^{-u}$ and by our discussion, for any random index $\tau\in [n]$ which is a stopping time with respect to $(\cF_t)_t$, we also have $\Prob{\cE_{\tau}} \le e^{-u}$. Now, choose $\tau$ to be the smallest round index $t\in [n]$ such that $\cE_t$ holds, or $n$ when none of $\cE_1,\dots,\cE_n$ hold. Formally, if the probability space holding the random variables is $(\Omega,\cF,\PP)$, we define $\tau(\omega) = t$ if $\omega\not \in \cE_1,\dots,\cE_{t-1}$ and $\omega \in \cE_t$ for some $t\in [n]$ and we let $\tau(\omega)=n$ otherwise. Since $\cE_1,\dots,\cE_t$ are $\cF_t$ measurable, $\{\tau=t\}$ is also $\cF_t$ measurable. Thus $\tau$ is a stopping time with respect to $(\cF_t)_t$. Now, consider the event

\begin{align*}

\cE= \left\{ \exists t\in [n] \text{ s.t. } \norm{\hat \theta_t - \theta_*}_{V_t(\lambda)} \ge \sqrt{\lambda} S + \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} } \right\}\,.

\end{align*}

Clearly, if $\omega\in \cE$ then $\omega \in \cE_{\tau}$. Hence, $\cE\subset \cE_{\tau}$ and $\Prob{\cE}\le \Prob{\cE_{\tau}}\le e^{-u}$. Finally, since $n$ was arbitrary we also get that the upper limit on $t$ in the definition of $\cE$ can be removed. Setting $u=\log(1/\delta)$, this shows that the probability of the bad event that any of the confidence sets $C_{t+1}$ of the previous theorem fails to contain the parameter vector $\theta_*$ is also bounded by $\delta$:

Corollary (Uniform bound):

\begin{align*}

\Prob{ \exists t\ge 0 \text{ such that } \theta_*\not\in C_{t+1} } \le \delta\,.

\end{align*}

Recalling \eqref{eq:hthdeviation}, for future reference we now restate the conclusion of our calculations in a form concerning the tail of the process $(Z_t)_t \doteq (\sum_{s=1}^t \eta_s A_s)_t$:

Corollary (Uniform self-normalized tail bound on $(Z_t)_t$): For any $u\ge 0$,

\begin{align*}

\Prob{ \exists t\ge 0 \text{ such that }

\norm{Z_t}_{V_t^{-1}(\lambda)} \ge \sqrt{ 2u + \log \frac{\det V_t(\lambda)}{\det (\lambda I)} }

} \le e^{-u}\,.

\end{align*}
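
The uniform bound can be probed by simulation. The following sketch does so in dimension one with standard normal noise and a deliberately adaptive (hypothetical) rule for choosing the action based on the sign of $Z_{t-1}$; the function name `violation_rate`, the horizon, the number of runs and the seed are all arbitrary choices for this illustration. It estimates how often the inequality of the corollary holds for at least one $t\le n$.

```python
import math
import random


def violation_rate(n=100, lam=1.0, delta=0.1, runs=4000, seed=0):
    """Fraction of runs where the event of the corollary occurs for some t <= n."""
    rng = random.Random(seed)
    u = math.log(1.0 / delta)
    failures = 0
    for _ in range(runs):
        v, z = lam, 0.0  # v = V_t(lam), z = Z_t (both scalars since d = 1)
        for _ in range(n):
            # Sequential design: the action depends on the past through z.
            a = 1.0 if z >= 0.0 else 0.5
            eta = rng.gauss(0.0, 1.0)  # standard normal noise is 1-subgaussian
            z += eta * a
            v += a * a
            # In d = 1 the self-normalized norm is |z| / sqrt(v) and the
            # determinant ratio is v / lam.
            if abs(z) / math.sqrt(v) >= math.sqrt(2.0 * u + math.log(v / lam)):
                failures += 1
                break
    return failures / runs


print(violation_rate())
```

According to the corollary, the printed frequency should stay below $\delta = e^{-u}$ up to simulation noise, even though no union bound over $t$ was taken.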

## Putting things together: The regret of Ellipsoidal-UCB

We will call the version of UCB that uses the confidence set of the previous section the “Ellipsoidal-UCB”. To state a bound on the regret of Ellipsoidal-UCB, let us summarize the conditions we will need: Recall that $\cF_t = \sigma(A_1,X_1,\dots,A_{t-1},X_{t-1},A_t)$, $X_t = \ip{A_t,\theta_*}+\eta_t$ and $\cD_t\subset \R^d$ is the action set available at the beginning of round $t$. We will assume that the following hold true:

- $1$-subgaussian martingale noise: $\forall \lambda\in \R$, $s\in \N$, $\EE{\exp(\lambda \eta_s)|\cF_{s} } \le \exp(\frac{\lambda^2}{2})$.
- Bounded parameter vector: $\norm{\theta_*}\le S$
- Bounded actions: $\sup_{t} \sup_{a\in \cD_t} \norm{a}_2\le L$
- Bounded mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$

Combining our previous results gives the following theorem:

Theorem (Regret of Ellipsoidal-UCB): Assume that the conditions listed above hold. Let $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \ip{A_t,\theta_*}$ be the pseudo-regret of the Ellipsoidal-UCB algorithm that uses the confidence set \eqref{eq:ellipsoidconfset} in round $t+1$. With probability at least $1-\delta$, simultaneously for all $n\ge 1$,

\begin{align*}

\hat R_n

\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{d\lambda+n L^2}{ d\lambda } }\,,

\end{align*}

where

\begin{align*}

\beta_{n-1}^{1/2}

& = \sqrt{\lambda} S + \sqrt{ 2\log(\frac1\delta) + \log \frac{\det V_{n-1}(\lambda)}{\det (\lambda I)}} \\

& \le \sqrt{\lambda} S + \sqrt{ 2\log(\frac1\delta) \,+\,d \log\left( 1+n \frac{L^2}{ d\lambda }\right)} \,.

\end{align*}
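
To see what these expressions look like numerically, the sketch below evaluates the regret bound for given problem parameters. It is an illustration only: the function names are hypothetical, and the log-determinant term is replaced by the worst-case estimate $\log \frac{\det V_{n-1}(\lambda)}{\det(\lambda I)} \le d \log( 1 + \frac{n L^2}{d\lambda})$, which follows from the inequality of arithmetic and geometric means applied to the eigenvalues of $V_{n-1}(\lambda)$ together with $\norm{A_s}_2\le L$.

```python
import math


def beta_sqrt_upper(n, d, lam, L, S, delta):
    """Upper bound on beta_{n-1}^{1/2} using log det ratio <= d log(1 + n L^2/(d lam))."""
    log_det_bound = d * math.log(1.0 + n * L * L / (d * lam))
    return math.sqrt(lam) * S + math.sqrt(2.0 * math.log(1.0 / delta) + log_det_bound)


def regret_bound(n, d, lam=1.0, L=1.0, S=1.0, delta=0.01):
    """sqrt(8 d n beta log((d lam + n L^2) / (d lam))) with beta bounded as above."""
    beta = beta_sqrt_upper(n, d, lam, L, S, delta) ** 2
    log_term = math.log((d * lam + n * L * L) / (d * lam))
    return math.sqrt(8.0 * d * n * beta * log_term)


for horizon in (1000, 4000, 16000):
    print(horizon, regret_bound(horizon, d=5))
```

Quadrupling the horizon should slightly more than double the bound, reflecting the $\sqrt{n}$ growth up to logarithmic factors.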

Choosing $\delta = 1/n$, $\lambda = \mathrm{const}$ we get that $\beta_n^{1/2} = O(d^{1/2} \log^{1/2}(n/d))$ and thus the expected regret of Ellipsoidal-UCB, as a function of $d$ and $n$ satisfies

\begin{align*}

R_n

& = O( \beta_n^{1/2} \sqrt{ dn \log(n/d) } )

= O( d \log(n/d) \sqrt{ n } )\,.

\end{align*}

Note that this holds simultaneously for all $n\in \N$. We also see that (apart from logarithmic factors) the regret scales linearly with the dimension $d$, while it is completely free of the cardinality of the action set $\cD_t$. Later we will see that this is indeed unavoidable in general.

## Fixed action set

When the action set is fixed and small, it is better to construct the upper confidence bounds for the payoffs of the actions directly. This is best illustrated for the fixed-design case when $A_1,\dots,A_t$ are deterministic, the noise is i.i.d. standard normal and the Grammian is invertible even with $\lambda=0$. In this case, the confidence set for $\theta_*$ was given by \eqref{eq:confchi2}, whose squared radius is at most $\beta_t = 2d+8\log(1/\delta)$ (using $2\sqrt{d\log(1/\delta)}\le d+\log(1/\delta)$). By our earlier observation, the corresponding upper confidence bound is

\begin{align*}

\UCB_{t+1}(a) = \ip{a,\hat \theta_t} + (2d+8\log(1/\delta))^{1/2} \norm{a}_{V_t^{-1}}\,.

\end{align*}

Notice that the “radius” $\beta_t$ scales linearly with $d$, which then propagates into the UCB values. By our main theorem, this then propagates into the regret bound, making the regret scale *linearly* with $d$. It is easy to see that the linear dependence of $\beta_t$ on $d$ is an unavoidable consequence of using the confidence set construction which relied on the properties of the chi-square distribution with $d$ degrees of freedom. Unfortunately, this means that even when $\cD_t = \{e_1,\dots,e_d\}$, corresponding to the standard finite-action stochastic bandit case, the regret will scale linearly with $d$ (the number of actions), whereas we have seen earlier that the optimal scaling is $\sqrt{d}$. To get this scaling we thus see that we need to avoid a confidence set construction which is based on ellipsoids.

A simple construction which avoids this problem is as follows: Staying with the fixed design and independent Gaussian noise, recall that $\hat \theta_t - \theta_* = V^{-1} Z$ with $Z\sim \mathcal N(0,V)$ (cf. \eqref{eq:lserror0}). Fix $a\in \R^d$. Then, $\ip{a, \hat \theta_t - \theta_*} = a^\top V^{-1} Z \sim \mathcal N(0, \norm{a}_{V^{-1}}^2)$. Hence, by the subgaussian property of Gaussians, defining

\begin{align}\label{eq:lingaussianperarmucb}

\mathrm{UCB}_{t+1}(a) = \ip{a,\hat \theta_t} + \sqrt{ 2 \log(1/\delta) }\, \norm{a}_{V_t^{-1}} \,

\end{align}

we see that $\Prob{ \ip{a,\theta_*}>\mathrm{UCB}_{t+1}(a) } \le \delta$. Note that this bound indeed removed the extra $\sqrt{d}$ factor.
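
The difference between the two constructions is easy to see numerically. In the sketch below (hypothetical function names, fixed-design Gaussian setting only), both upper confidence bounds have the form $\ip{a,\hat\theta_t}$ plus a multiplier times $\norm{a}_{V^{-1}}$; the ellipsoid-based multiplier is taken to be the square root of the radius in \eqref{eq:confchi2}, while the per-arm multiplier is $\sqrt{2\log(1/\delta)}$.

```python
import math


def ellipsoid_width_multiplier(d, delta):
    """Square root of d + 2 sqrt(d log(1/delta)) + 2 log(1/delta)."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(d + 2.0 * math.sqrt(d * log_term) + 2.0 * log_term)


def per_arm_width_multiplier(delta):
    """Multiplier sqrt(2 log(1/delta)) of the per-arm bound; it does not depend on d."""
    return math.sqrt(2.0 * math.log(1.0 / delta))


def per_arm_ucb(a, theta_hat, Vinv, delta):
    """<a, theta_hat> + sqrt(2 log(1/delta)) * ||a||_{V^{-1}} for a fixed action a."""
    d = len(a)
    mean = sum(a[i] * theta_hat[i] for i in range(d))
    norm_sq = sum(a[i] * Vinv[i][j] * a[j] for i in range(d) for j in range(d))
    return mean + per_arm_width_multiplier(delta) * math.sqrt(norm_sq)


for dim in (1, 10, 100):
    print(dim, ellipsoid_width_multiplier(dim, 0.1), per_arm_width_multiplier(0.1))
```

The gap between the two multipliers grows roughly like $\sqrt{d}$, which is exactly the factor saved by the per-arm construction. Note that the per-arm bound holds for one fixed action at a time, so covering several actions simultaneously requires a union bound over them.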

Unfortunately, the generalization of this method to the sequential design case is not obvious. The difficulty comes from controlling $\ip{a, V^{-1} Z}$ without relying on the exact distributional properties of $Z$.

# Notes

An alternative to the $2$-norm based construction is to use $1$-norms. In the fixed design setting, under the independent Gaussian noise assumption, using Chernoff’s method this leads to

\begin{align}

\cC_{t+1} = \left\{ \theta\in \R^d\,:\, \norm{ V^{1/2}(\hat \theta_t-\theta) }_1 \le

\sqrt{2 \log(2) d^2 +2d \log(1/\delta) }

\right\}\,.

\label{eq:confl1}

\end{align}

This set, together with the one based on the $2$-norm (cf. \eqref{eq:confchi2}), is illustrated on the figure on the side, which shows the $1-\delta=0.9$-level confidence sets. For low dimensions, and not too small values of $\delta$, the $1$-norm based confidence set is fully included in the $2$-norm based confidence set as shown here. This happens of course due to the approximations used in deriving the radii of these sets. In general, the two confidence sets are incomparable.

Figure: Illustration of $2$-norm and $1$-norm based confidence sets in $2$ dimensions.

# References

As mentioned in the previous post, the first paper to consider UCB-style algorithms is by Peter Auer:

- Peter Auer: Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research. 3:397-422, 2002.

The setting of the paper is one where in every round the number of actions is bounded by a constant $K>0$. For this setting an algorithm (SupLinRel) is given and it is shown that its expected regret is at most $O(\sqrt{ d n \log^3(K n \log(n) ) })$, which is better by a factor of $\sqrt{d}$ than the bound derived here, but it also depends on $K$ (although only logarithmically) and is slightly worse than the bound shown here in its dependence on $n$, too. The dependence on $K$ can be traded for the dependence on $d$ by considering an $L/\sqrt{n}$-cover of the ball $\{ a\,:\, \norm{a}_2 \le L \}$, which gives $K = \Theta( (\sqrt{n}/L)^d )$ and a regret of $O(d^2 \sqrt{n \log^3(n)})$ for $n$ large, which is larger than the regret given here. Note that SupLinRel can also be run without actually discretizing the action set; only its confidence intervals have to be set based on the cardinality of the discretization (in particular, inflated by a factor of $\sqrt{d}$).

SupLinRel builds on LinRel, which, as we noted, is UCB with a specific upper confidence value. LinRel uses confidence bounds of the form \eqref{eq:lingaussianperarmucb} with a confidence parameter roughly of the size $\delta = 1/(n \log(n) K)$. This is possible because SupLinRel uses LinRel in a nontrivial way, “filtering” the data that LinRel works on.

In particular, SupLinRel uses a version of the “doubling trick”. The main idea is to keep at most $\log_2(n^{1/2})$ lists that hold mutually exclusive data. In every round SupLinRel starts with the list of index $s=1$ and feeds the data of this list to LinRel, which calculates UCB values and confidence widths based on the data it received. If all the calculated widths are small (below $\theta=n^{-1/2}$) then the action with the highest UCB value is selected and the data generated is thrown away. Otherwise, if any width is above $2^{-s}$ then the corresponding action is chosen and the data observed is added to the list with index $s+1$. If all the widths are below $2^{-s}$ then all actions which, based on the current confidence intervals calculated by LinRel, cannot be optimal are eliminated, $s$ is incremented and the process is repeated until an action is chosen. Overall the effect of this is that the lists grow, with lists of smaller index growing first until they are sufficiently rich for their desired target accuracy. Furthermore, the contents of a list are determined not by the data of that list but by the data on lists with smaller index. Because of this, the fixed-design confidence interval construction described here can be used, which ultimately saves the $O(\sqrt{d})$ factor. While apart from log-factors SupLinRel is unimprovable, in practice SupLinRel is excessively wasteful.

A confidence ellipsoid construction based on covering arguments can be found in the paper by Varsha Dani, Thomas Hayes and Sham Kakade:

- V. Dani, T. Hayes and S. Kakade: Stochastic Linear Optimization under Bandit Feedback, COLT-2008.

An analogous construction is given by

- Paat Rusmevichientong, John N. Tsitsiklis: Linearly Parameterized Bandits, MOR, 35:395-411

The confidence ellipsoid construction described in this post is based on

- Yasin Abbasi-Yadkori, Dávid Pál and Csaba Szepesvári: Improved Algorithms for Linear Stochastic Bandits, NIPS, pp. 2312-2320, 2011.

Laplace’s method is also called the “Method of Mixtures” in the probability literature and its use goes back to the 1970 work of Robbins and Siegmund. In practice, using Laplace’s method instead of the earlier ellipsoidal constructions that are based on covering arguments gives a very large improvement.

As mentioned earlier, a variant of SupLinRel that is based on ridge regression (as opposed to LinRel, which is based on truncating the smallest eigenvalue of the Grammian) is described in

- Wei Chu, Lihong Li, Lev Reyzin, Robert E. Schapire: Contextual Bandits with Linear Payoff Functions, AISTATS, pp. 208-214, 2011.

The algorithm, which is called SupLinUCB, uses the same construction as SupLinRel and enjoys the same regret.

For a fixed action set (i.e., when $\cD_t=\cD$), one can use an elimination based algorithm, which in every phase collects data by using a “spanning set” of the remaining actions. At the end of the phase, since the data collected in the phase only depends on data collected in previous phases, one can use Hoeffding’s bounds to construct UCB values for the actions. This is the idea underlying the “SpectralEliminator” algorithm in the paper

- Michal Valko, Remi Munos, Branislav Kveton and Tomas Kocak: Spectral Bandits for Smooth Graph Functions, ICML 2014.

# Sparse linear bandits

In the last two posts we considered stochastic linear bandits, when the actions are vectors in the $d$-dimensional Euclidean space. According to our previous calculations, under the condition that the expected rewards of all the actions lie in a fixed bounded range (say, $[-1,1]$), the regret of Ellipsoidal-UCB is $\tilde O(d \sqrt{n} )$ (where $\tilde O(\cdot)$ hides logarithmic factors). This is definitely an advance when the number of actions $K$ is much larger than $d$, but $d$ itself can be quite large in some applications. In particular, in contextual bandits $d$ would be the dimension of the feature space that the actions are mapped into. The previous result indicates that there is a high price to be paid for having more features. This is not what we want. Ideally, one should be able to “add features” without suffering much additional regret if the feature added does not contribute in a significant way. This can be captured by the notion of sparsity, which is the central theme of this post.

# Sparse linear stochastic bandits

The sparse linear stochastic bandit problem is the same as the stochastic linear bandit problem with a small difference. Just like in the standard setting, at the beginning of a round with index $t\in \N$ the learner receives a decision set $\cD_t\subset \R^d$. Let $A_t\in \cD_t$ be the action chosen by the learner based on the information available to it. The learner then receives the reward

\begin{align}

X_t = \ip{A_t,\theta_*} + \eta_t\,,

\label{eq:splinmodel}

\end{align}

where $(\eta_t)_t$ is zero-mean noise (subject to the usual subgaussianity assumption) and $\theta_*\in \R^d$ is an unknown vector. The specialty of the sparse setting is that the parameter vector $\theta_*$ is assumed to have many zero entries. Introducing the “$0$-norm”

\begin{align*}

\norm{\theta_*}_0 = \sum_{i=1}^d \one{ \theta_{*,i}\ne 0 }\,,

\end{align*}

the assumption is that

\begin{align*}

p \doteq \norm{\theta_*}_0 \ll d\,.

\end{align*}
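As a small illustration of the sparsity assumption, the following sketch counts the nonzero entries of a parameter vector. The helper name `zero_norm`, the tolerance argument and the example vector are our own choices for illustration and are not part of the setting above.

```python
import numpy as np

def zero_norm(theta, tol=0.0):
    """Number of entries of theta whose magnitude exceeds tol (the "0-norm")."""
    theta = np.asarray(theta, dtype=float)
    return int(np.sum(np.abs(theta) > tol))

# A hypothetical 1000-dimensional parameter with only three active features.
theta_star = np.zeros(1000)
theta_star[[3, 17, 256]] = [0.5, -0.2, 0.1]
p = zero_norm(theta_star)
print(p, theta_star.size)
```

Here $p=3$ while $d=1000$, which is the regime $p \ll d$ that the rest of the post is concerned with.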

Much ink has been spilled on what can be said about the speed of learning in linear models like (\ref{eq:splinmodel}) when $\{A_t\}_t$ are *passively generated* and the parameter vector is known to be sparse. Most results are phrased about recovering $\theta_*$, but there also exist a few results that quantify the speed at which good predictions can be made. The ideal outcome would be that the learning speed depends mostly on $p$, while the dependence on $d$ becomes less severe (e.g., the learning speed is a function of $p \log(d)$). Almost all the results come under the assumption that the vectors $\{A_t\}_t$ are “incoherent”, which means that the Grammian underlying $A_t$ should have good conditioning. The details are a bit more complicated, but the main point is that these conditions are exactly what the action vector resulting from a good bandit algorithm will **not** satisfy: Bandit algorithms want to choose the optimal action as often as possible, while for standard theory on fast learning under sparsity one needs all directions of the space to be thoroughly explored. We need some approach that does not rely on such strong assumptions.

# Elimination on the hypercube

As a warmup problem consider linear bandits when the decision set (of every round) is the $d$-dimensional hypercube: $\cD = [-1,1]^d$. To reduce clutter we will denote the “true” parameter vector by $\theta$. Thus, in every round $t$, $X_t = \ip{A_t,\theta}+\eta_t$. We assume that $\eta_t$ is conditionally $1$-subgaussian given the past:

\begin{align*}

\EE{ \exp(\lambda \eta_t ) | \cF_{t-1} } \le \exp( \frac{\lambda^2}{2} ) \quad \text{for all } \lambda \in \R\,.

\end{align*}

Here $\cF_{t-1} = \sigma( A_1,X_1,\dots,A_{t-1},X_{t-1}, A_t )$. Since conditional subgaussianity comes up frequently, we introduce a notation for it: When $X$ is $\sigma^2$-subgaussian given some $\sigma$-field $\cF$ we will write $X|\cF \sim \subG(\sigma^2)$. Our earlier statement that the sum of independent subgaussian random variables is subgaussian with a subgaussianity factor that is the sum of the individual factors also holds for conditionally subgaussian random variables.

We shall also assume that the mean rewards lie in the range $[-1,1]$. That is, $|\ip{a,\theta}|\le 1$ for all $a\in \cD$. Note that this is equivalent to $\norm{\theta}_1\le 1$.
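To see the equivalence, one can use that a linear function over the hypercube is maximized coordinate by coordinate, and that the hypercube is symmetric around the origin:

\begin{align*}
\max_{a\in [-1,1]^d} |\ip{a,\theta}| = \max_{a\in [-1,1]^d} \sum_{i=1}^d a_i \theta_i = \sum_{i=1}^d |\theta_i| = \norm{\theta}_1\,,
\end{align*}

where the maximum is attained by choosing $a_i = \sgn(\theta_i)$ for each $i$.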

Stochastic linear bandits on the hypercube is notable as it enjoys perfect “separability”: For each dimension $i\in [d]$, $A_{ti}$ can be selected regardless of the choice of $A_{tj}$ for the other dimensions $j\ne i$. It follows among other things that the optimal action can be computed dimension-wise. In particular, the optimal action for dimension $i$ is simply the sign of $\theta_i$. This is good and bad news. In the worst case, this structure means that each sign has to be learned independently. As we shall see later this implies that in the worst case the regret will be as large as $\Omega(d\sqrt{n})$ (our next post will expand on this). However, the separable structure is also good news: An algorithm can estimate $\theta_i$ for each dimension separately while paying absolutely no price for this experimentation when $\theta_i=0$ (because the choice of $A_{ti}$ does not restrict the choice of $A_{tj}$ with $j\ne i$). As we shall see, because of this we can design an algorithm whose regret scales with $O(\norm{\theta}_0\sqrt{n})$ even without knowing the value of $\norm{\theta}_0$.

The algorithm that achieves this is a randomized variant of Explore then Commit. For reasons that will become clear, we will call it “Selective Explore then Commit” or SETC. The algorithm works as follows: SETC keeps an interval-estimate $U_{ti}\subset \R$ of all components $\theta_i$. These will be constructed such that $\theta_i$ is contained in $\cap_{t\in [n]} U_{ti}$ with high probability (for simplicity we assume that a horizon $n$ is given to the algorithm; lifting this assumption is left as an exercise to the reader).

Assume for a moment that we are given intervals $(U_{ti})_{t}$ that contain $\theta_i$ with high probability. Consider the event when the intervals actually do contain $\theta_i$. If at some point $t$ it holds that $0\not \in U_{ti}$, then the sign of $\theta_i$ can be determined by examining on which side of $0$ the interval $U_{ti}$ falls, which allows us to set $A_{ti}$ optimally! In particular, if $U_{ti}\subset (0,\infty)$ then $\theta_i>0$ must hold and we will set $A_{ti}$ to $1$, while if $U_{ti}\subset (-\infty,0)$ then $\theta_i<0$ must hold and we set $A_{ti}$ to $-1$. On the other hand, if at some round $t$, $0\in U_{ti}$ then the sign of $\theta_i$ is uncertain and more information needs to be collected on $\theta_i$, i.e., one needs more exploration. In this case, to maximize the information gained, $A_{ti}$ can be chosen at random to be either $1$ or $-1$ with equal probabilities (a random variable that takes on values of $+1$ and $-1$ with equal probabilities is called a Rademacher random variable). This leads to new information from which $U_{ti}$ can be refined (when no exploration is needed, $U_{ti}$ is not updated: $U_{t+1,i} = U_{ti}$). The algorithm, as far as the choice of $A_{ti}$ is concerned, is summarized in the figure shown below.

To derive the form of $U_{t+1,i}$ consider $A_{ti} X_t$, the correlation of $A_{ti}$ and the reward observed. Expanding on the definition of $X_t$, using that $A_{ti}^2=1$, we see that

\begin{align*}

A_{ti} X_t = \theta_i + \underbrace{A_{ti} \sum_{j\ne i} A_{tj} \theta_j + A_{ti} \eta_t}_{Z_{ti}}\,.

\end{align*}

This suggests to set

\begin{align*}

\hat \theta_{ti} = \frac1t \sum_{s=1}^t A_{si} X_s\,,\quad

c_{ti} = 2\sqrt{ \frac{ \log(2n^2)}{t} }\, \text { and }\,\,

U_{t+1,i} = [\hat \theta_{ti} - c_{ti}, \hat \theta_{ti} + c_{ti} ]\,.

\end{align*}
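The description above can be turned into a short simulation. The following is a minimal sketch only: it assumes standard Gaussian noise, takes the initial intervals to be the whole real line (so every coordinate starts out exploring), and averages only over the rounds in which a coordinate was explored. The function name `run_setc` and the test instance are hypothetical choices made for illustration.

```python
import numpy as np

def run_setc(theta, n, rng):
    """Simulate SETC on the hypercube [-1, 1]^d with standard Gaussian noise.

    Returns the pseudo-regret and the per-coordinate exploration counts.
    """
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    explore = np.ones(d, dtype=bool)   # True while 0 is still inside the interval
    commit = np.zeros(d)               # committed sign once 0 has left the interval
    sums = np.zeros(d)                 # running sums of A_{si} X_s over exploration rounds
    counts = np.zeros(d)               # number of exploration rounds per coordinate
    log_term = np.log(2 * n**2)
    regret = 0.0
    for _ in range(n):
        # Rademacher exploration where the sign is unknown, committed sign elsewhere.
        rademacher = rng.choice([-1.0, 1.0], size=d)
        action = np.where(explore, rademacher, commit)
        reward = action @ theta + rng.standard_normal()
        regret += np.abs(theta).sum() - action @ theta
        # Update the estimates only for the coordinates explored in this round.
        sums[explore] += action[explore] * reward
        counts[explore] += 1
        idx = np.flatnonzero(explore)
        est = sums[idx] / counts[idx]
        width = 2.0 * np.sqrt(log_term / counts[idx])
        decided = np.abs(est) > width  # 0 lies outside [est - width, est + width]
        commit[idx[decided]] = np.sign(est[decided])
        explore[idx[decided]] = False
    return regret, counts

rng = np.random.default_rng(0)
theta = np.zeros(50)
theta[:2] = [0.6, -0.4]
regret, counts = run_setc(theta, n=5000, rng=rng)
print(f"pseudo-regret: {regret:.1f}")
print("exploration rounds for the two nonzero coordinates:", counts[:2])
```

In this sketch the coordinates with $\theta_i=0$ typically keep exploring until the end, which costs nothing, while the two nonzero coordinates stop exploring after a number of rounds that depends on $|\theta_i|$.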

To explain the choice of $c_{ti}$ we need a version of our subgaussian concentration inequality, which was stated for independent sequences of subgaussian random variables. In our case $(Z_{ti})_t$, as we shall soon see, are conditionally subgaussian. Using Chernoff’s method it is not hard to prove the following result:

Lemma (Hoeffding-Azuma): Let $\cF\doteq (\cF_t)_{0\le t\le n}$ be a filtration, $(Z_t)_{t\in [n]}$ be an $\cF$-adapted martingale difference sequence such that $Z_t$ is conditionally $\sigma^2$-subgaussian given $\cF_{t-1}$: $Z_t|\cF_{t-1} \sim \subG(\sigma^2)$. Then, $\sum_{t=1}^n Z_t$ is $n\sigma^2$ subgaussian and for any $\eps>0$,

\begin{align*}

\Prob{ \sum_{t=1}^n Z_t \ge \eps } \le \exp( - \frac{ \eps^2}{2n\sigma^2} )\,.

\end{align*}

**Proof**: Let $S_t = \sum_{s=1}^t Z_s$ with $S_0 = 0$. It suffices to show that $S_n$ is $n\sigma^2$-subgaussian. To show this, note that for $t\ge 1$, $\lambda\in \R$,

\begin{align*}

\EE{ \exp(\lambda S_t) } = \EE{\exp(\lambda S_{t-1}) \EE{ \exp(\lambda Z_t) | \cF_{t-1}} } \le \exp(\frac{\lambda^2\sigma^2}{2}) \EE{ \exp(\lambda S_{t-1}) }\,.

\end{align*}

Since $\exp(\lambda S_0)=1$, chaining the inequalities obtained we get the desired bound.

QED.

Sometimes, it is more convenient to use the “deviation” form of the above inequality which states that for any $\delta\in (0,1)$, with probability $1-\delta$,

\begin{align*}

\sum_{t=1}^n Z_t \le \sqrt{ 2 n \sigma^2 \log(1/\delta) }\,.

\end{align*}
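For completeness, here is one way to fill in the two routine steps that were skipped: the Chernoff optimization that turns the bound $\EE{\exp(\lambda S_n)} \le \exp(n\lambda^2\sigma^2/2)$ into the tail bound, and the inversion that gives the deviation form. With $S_n = \sum_{t=1}^n Z_t$,

\begin{align*}
\Prob{ S_n \ge \eps }
\le \inf_{\lambda>0} e^{-\lambda \eps}\, \EE{ \exp(\lambda S_n) }
\le \inf_{\lambda>0} \exp\left( -\lambda \eps + \frac{n\lambda^2\sigma^2}{2} \right)
= \exp\left( -\frac{\eps^2}{2n\sigma^2} \right)\,,
\end{align*}

where the infimum is attained at $\lambda = \eps/(n\sigma^2)$. Setting the right-hand side equal to $\delta$ and solving for $\eps$ gives $\eps = \sqrt{2n\sigma^2\log(1/\delta)}$, which is the deviation form.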

The Hoeffding-Azuma inequality together with some union bounds leads to the following lemma:

Lemma (Reliable Intervals): Let $F_i = \{ \exists t\in [n] \text{ s.t. } \theta_i \not \in U_{ti} \}$ be the “failure” event that some of the intervals $U_{ti}$ fails to include $\theta_i$. Then,

\begin{align*}

\Prob{F_i} \le \frac{1}{n}\,.

\end{align*}

The proof will be provided at the end of the section.

Let $R_{ni} = n|\theta_i|-\EE{\sum_{t=1}^n A_{ti}\theta_i }$ and let $R_n = \max_{a\in \cD} n \ip{a,\theta} - \EE{ \sum_{t=1}^n \ip{A_t,\theta}}$ be the regret of SETC. Then, $R_n = \sum_{i:\theta_i\ne 0} R_{ni}$. Hence, it suffices to bound $R_{ni}$ for $i$ such that $\theta_i\ne 0$. Now, when the confidence intervals for dimension $i$ do not fail, i.e., on $F_i^c$, the regret in dimension $i$ is at most $2|\theta_i| \tau_i$ where $\tau_i = \sum_{s=1}^n E_{si}$ is the number of exploration steps for dimension $i$ (here $E_{si}\in \{0,1\}$ indicates whether dimension $i$ is explored in round $s$). On the same event, we claim that $\tau_i\le 1 + 16 \log(2n^2)/|\theta_i|^2$ also holds. To see this it is enough to show that $\tau_i \le t$ for any $t>t_0\doteq 16 \log(2n^2) / |\theta_i|^2$. Pick such a $t$ and observe that $t>t_0$ is equivalent to $4 \sqrt{ \log(2n^2)/t} < | \theta_i|$. By definition, $c_{ti} = 2 \sqrt{ \log(2n^2)/(\tau_i\wedge t) }$. If $\tau_i\le t$ there is nothing to be proven. Hence, assume that $\tau_i>t$. Then, $\tau_i \wedge t = t$ and thus we get that $2c_{ti}<|\theta_i|$. Then, $|\hat \theta_{ti} | - c_{ti} \ge |\theta_i| - 2c_{ti}>0$, where the first inequality follows from $|\theta_i-\hat\theta_{ti}|\le c_{ti}$. Since $|\hat \theta_{ti}|-c_{ti}>0$, $0\not\in U_{t+1,i}$ and hence $E_{t+1,i}=\dots = E_{ni}=0$, which implies that $\tau_i\le t$. This contradicts $\tau_i>t$, finishing the proof. Putting things together we get

\begin{align*}

R_{ni}

& = \EE{ \one{F_i} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }

+ \EE{ \one{F_i^c} \sum_{t=1}^n (\sgn(\theta_i)-A_{ti}) \theta_i }\\

& \le 2n |\theta_i| \Prob{F_i} + \EE{ \one{F_i^c} 2|\theta_i| \tau_i } \\

& \le 2n|\theta_i| \Prob{F_i} + |\theta_i| (1+ \frac{16 \log(2n^2) }{|\theta_i|^2}) \\

&\le 3|\theta_i| + \frac{16 \log(2n^2) }{|\theta_i|}\,,

\end{align*}

yielding the following theorem:

Theorem (Regret of SETC): Let $\norm{\theta}_1\le 1$. Then, the regret $R_n$ of SETC satisfies

\begin{align*}

R_n \le 3 \norm{\theta}_1 + 16 \sum_{i:\theta_i\ne 0} \frac{\log(2n^2) }{|\theta_i|}\,.

\end{align*}

Since $R_{ni} \le 2 n |\theta_i|$ also holds, we immediately get a bound on the worst-case regret of SETC for $p$-sparse vectors:

Corollary (Worst-case regret of SETC): Let $n\ge 2$. The minimax regret of SETC $R_n^*$ over $\norm{\theta}_1\le 1$, $\norm{\theta}_0\le p$ satisfies

\begin{align*}

R_n^* \le 3p + 8 p \sqrt{ n \log(2n^2)} \,.

\end{align*}
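One way to see where the constants come from is to combine the two available bounds on $R_{ni}$ and use that $\min(a,b)\le \sqrt{ab}$ for $a,b\ge 0$. For any $i$ with $\theta_i\ne 0$,

\begin{align*}
R_{ni}
\le 3|\theta_i| + \min\left( 2n|\theta_i|, \frac{16 \log(2n^2)}{|\theta_i|} \right)
\le 3|\theta_i| + \sqrt{ 32\, n \log(2n^2) }
\le 3 + 8 \sqrt{ n \log(2n^2) }\,,
\end{align*}

where the last step used $|\theta_i|\le \norm{\theta}_1\le 1$ and $\sqrt{32}\le 8$. Summing over the at most $p$ nonzero components gives the bound of the corollary.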

Thus we see that SETC successfully trades the dimension $d$ for the sparsity index $p$ without ever needing the knowledge of $p$. This should be encouraging! Can this result be generalized beyond the hypercube? This is the question we investigate next.

However, we first finish the proof of the theorem by providing the proof of the lemma that claimed that the event $F_i$ has a small probability.

## Proof of $\Prob{F_i}\le 1/n$.

**Proof**: Setting $\cF_t = \sigma(A_1,X_1,\dots,A_t,X_t)$, we see that $(Z_{ti})_t$ is $(\cF_t)_t$-adapted (note that $\cF_{t-1}$ now excludes $A_t$!). Letting $S_{ti} = \sum_{j\ne i} A_{tj} \theta_j$, we can write $Z_{ti} = A_{ti} S_{ti} + A_{ti}\eta_t$. We check that $Z_{ti}|\cF_{t-1}\sim \subG(2)$. With repeated conditioning we calculate

\begin{align*}

\EE{ \exp(\lambda Z_{ti} )|\cF_{t-1} }

&= \EE{ \EE{ \exp(\lambda Z_{ti} )|\cF_{t-1},A_t} |\cF_{t-1} } \\

&= \EE{ \exp(\lambda A_{ti} S_{ti} ) \EE{ \exp(\lambda A_{ti} \eta_t )|\cF_{t-1},A_t} |\cF_{t-1} } \\

&\le \EE{ \exp(\lambda A_{ti}S_{ti} ) \exp(\frac{\lambda^2}{2}) |\cF_{t-1} } \\

&= \exp(\frac{\lambda^2}{2})

\EE{ \EE{\exp(\lambda A_{ti} S_{ti} )|\cF_{t-1},S_{ti}} |\cF_{t-1} } \\

&\le \exp(\frac{\lambda^2}{2})

\EE{ \exp( \frac{\lambda^2 S_{ti}^2}{2} ) |\cF_{t-1} } \\

&\le \exp(\lambda^2)\,,

\end{align*}

where the first inequality used that $\eta_t$ and $A_t$ are conditionally independent given $\cF_{t-1}$ and that $\eta_t|\cF_{t-1}\sim \subG(1)$, the second to last inequality used that $S_{ti}$ and $A_{ti}$ are conditionally independent given $\cF_{t-1}$ and that $A_{ti}|\cF_{t-1}\sim \subG(1)$, the last step used that $|S_{ti}|\le 1$. From this, we conclude that $Z_{ti}|\cF_{t-1} \sim \subG(2)$. Let $E_{si}$ be the indicator of whether dimension $i$ is “explored” in round $s$. That is, $E_{si}\in \{0,1\}$ and $E_{si}=1$ iff dimension $i$ is explored in round $s$. Note that for any $t\in [n]$,

\begin{align*}

\hat \theta_{ti} = \frac{\sum_{s=1}^t E_{si} A_{si} X_s}{\sum_{s=1}^t E_{si}}\,,

\qquad

c_{ti} = 2\sqrt{ \frac{ \log(2n^2) }{ \sum_{s=1}^t E_{si} } }\,.

\end{align*}

Then, writing $M_{pi}$ for the sum of the variables $Z_{si}$ over the first $p$ rounds in which dimension $i$ is explored (so that $\sum_{s=1}^t E_{si} Z_{si} = M_{pi}$ whenever $\sum_{s=1}^t E_{si}=p$, whatever the value of $t$ is), a union bound over the possible values of $p$ gives

\begin{align*}

\Prob{ \exists t\in [n] \,:\, \theta_i \not\in U_{ti} }

& \le \Prob{ \exists t\in [n] \,:\, |\hat\theta_{ti} - \theta_i| > c_{ti} } \\

& = \Prob{ \exists t\in [n] \,:\, \left| \sum_{s=1}^t E_{si} Z_{si} \right| > 2\sqrt{ (\sum_{s=1}^t E_{si}) \log(2n^2) } } \\

& \le \sum_{p=1}^n

\Prob{ \left| M_{pi} \right| > 2\sqrt{ p \log(2n^2) } } \\

& \le \sum_{p=1}^n

2 \exp\left( - \frac{4 p \log(2n^2)} { 2 (2p) } \right) = \sum_{p=1}^n \frac{1}{n^2} = \frac{1}{n}\,.

\end{align*}

Here the last inequality follows from applying the Hoeffding-Azuma inequality with $\sigma^2=2$ to both $M_{pi}$ and $-M_{pi}$.

QED.

# UCB with sparsity

Let us now tackle the question of how to exploit sparsity when there is no restriction on the action set $\cD_t$. The plan is to use UCB where the ellipsoidal confidence set used earlier is replaced by another confidence set, which is also ellipsoidal but has a smaller radius when the parameter vector is sparse.

Our starting point is the generic regret bound for UCB, which we replicate here for the reader’s convenience.

Let $A_t$ be the action chosen by UCB in round $t$: $A_t = \argmax_{a\in \cD_t} \UCB_t(a)$ where $\UCB_t(a)$ is the UCB index of action $a$ and $\cD_t\subset \R^d$ is the set of actions available in round $t$. Define

\begin{align}

V_t=V_0 + \sum_{s=1}^{t} A_s A_s^\top, \qquad t\in [n]\,,

\label{eq:sparselinbanditreggrammian}

\end{align}

where $V_0\succ 0$ is a fixed positive definite matrix, which is often set to $\lambda I$ with some tuning parameter $\lambda>0$. Let the following conditions hold:

- Bounded scalar mean reward: $|\ip{a,\theta_*}|\le 1$ for any $a\in \cup_t \cD_t$;
- Bounded actions: for any $a\in \cup_t \cD_t$, $\norm{a}_2 \le L$;
- Honest confidence intervals: With probability $1-\delta$, for all $t\in [n]$, $a\in \cD_t$, $\ip{a,\theta_*}\in [\UCB_t(a)-2\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}},\UCB_t(a)]$ where $(V_{t})_t$ are given by \eqref{eq:sparselinbanditreggrammian}.

Our earlier generic result for UCB for linear bandits was as follows:

Theorem (Regret of UCB for Linear Bandits): Let the conditions listed above hold. Further, assume that $(\beta_t)_t$ is nondecreasing and $\beta_n\ge 1$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \ip{A_t,\theta_*}$ of UCB satisfies

\begin{align*}

\hat R_n

% \le \sqrt{ 8 n \beta_{n-1} \, \log \frac{\det V_{n}}{ \det V_0 } }

\le \sqrt{ 8 d n \beta_{n-1} \, \log \frac{\trace(V_0)+n L^2}{ d\det^{1/d}(V_0) } }\,.

\end{align*}
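To make the quantities in the theorem concrete, here is a minimal sketch of how a UCB index with half-width $\sqrt{\beta}\,\norm{a}_{V^{-1}}$ can be computed for a finite action set. The function name `ucb_indices`, the center `theta_hat` and all numerical values are illustrative assumptions; in particular the sketch assumes that the index takes the ellipsoidal form "center plus half-width".

```python
import numpy as np

def ucb_indices(actions, theta_hat, V, beta):
    """UCB index <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}} for each row a of `actions`."""
    V_inv_a = np.linalg.solve(V, actions.T)              # columns are V^{-1} a
    sq_norms = np.einsum("ij,ji->i", actions, V_inv_a)   # a^T V^{-1} a for each action
    return actions @ theta_hat + np.sqrt(beta * sq_norms)

rng = np.random.default_rng(1)
d, K = 5, 20
actions = rng.uniform(-1, 1, size=(K, d))    # hypothetical decision set for this round
past = rng.uniform(-1, 1, size=(30, d))      # hypothetical past actions A_1, ..., A_{t-1}
V = np.eye(d) + past.T @ past                # Gram matrix with V_0 = I
theta_hat = 0.1 * rng.normal(size=d)         # placeholder center of the confidence set
beta = 4.0                                   # placeholder squared radius
idx = ucb_indices(actions, theta_hat, V, beta)
chosen = int(np.argmax(idx))
print(chosen, idx[chosen])
```

Solving the linear system instead of forming $V^{-1}$ explicitly is a design choice made here for numerical stability; for large $d$ one would typically maintain $V^{-1}$ incrementally.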

As noted earlier the half-width $\sqrt{\beta_{t-1}} \norm{a}_{V_{t-1}^{-1}}$ used in the assumption is the same as the one that we get when a confidence set $\cC_t$ is used which satisfies

\begin{align}

\cC_t \subset

\cE_t \doteq \{ \theta \in \R^d \,:\,

\norm{\theta-\hat \theta_{t-1}}_{ V_{t-1}}^2 \le \beta_{t-1} \}\,

\label{eq:ellconf}

\end{align}

with some $\hat \theta_{t-1}\in \R^d$. Our approach will be to identify some $\hat \theta_{t-1}$ and $\beta_{t-1}$ such that $\beta_{t-1}$ will scale with the sparsity of $\theta_*$.

In particular, we will prove the following result:

Theorem (Sparse Bandit UCB): There exists a strategy to compute the $\UCB_t(\cdot)$ indices such that the expected regret $R_n$ of the resulting policy satisfies $R_n = \tilde{O}(\sqrt{dpn})$.

Before presenting how this is done note that the best regret that we can hope to get is $\tilde{O}(\sqrt{ d n })$ with some additional terms depending on $p = \norm{\theta_*}_0$. This may be a bit disappointing. However, it is easy to see that the appearance of $d$ is the price one must pay for the increased generality of the result. In particular, we know that the regret for the standard $d$-action stochastic bandit problem is $\Omega(\sqrt{dn})$. Recall that $d$-action bandits can be represented as linear bandits with $\cD_t = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ is the standard Euclidean basis. Furthermore, checking the proof of the lower bound reveals that the proof uses $2$-sparse parameter vectors when the problem is re-represented as a linear bandit. Thus the appearance of $\sqrt{d}$, however unfortunate, is unavoidable in the general case. Later we will return to the question of lower bounds for sparse linear bandits.

## Online to confidence set conversion

The idea of the construction that we present here is to build a confidence set around a prediction method that enjoys strong learning guarantees in the sparse case. The construction is actually generic and can be applied beyond sparse problems.

The prediction problem considered is **online linear prediction** where the prediction error is measured in the squared loss. This is also known as the **online linear regression** problem. In this problem setting a learner interacts with a strategic environment in a sequential manner in discrete rounds. In round $t\in \N$ the learner is first presented with a vector $A_t\in \R^d$ that is chosen by the environment ($A_t$ is allowed to depend on past choices of the learner) and the learner needs to produce a real-valued prediction $\hat X_t$, which is then compared to the target $X_t\in \R$, which is also chosen by the environment. The learner’s goal is to produce predictions whose total loss is not much worse than the loss suffered by any of the linear predictors in some set. The loss in a given round is the squared difference. The regret of the learner against a linear predictor that uses the “weights” $\theta\in \R^d$ is

\begin{align}

\rho_n(\theta) \doteq \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \ip{A_t,\theta})^2

\label{eq:olrregretdef}

\end{align}

and we say that the learner enjoys a regret guarantee $B_n$ against some $\Theta \subset \R^d$ if no matter the environment’s strategy,

\begin{align*}

\sup_{\theta\in \Theta} \rho_n(\theta)\le B_n\,.

\end{align*}

The online learning literature in machine learning has a number of powerful algorithms for this learning problem with equally powerful regret guarantees. Later we will give a specific result for the sparse case.

Now take any learner for online linear regression and assume that the environment generates $X_t$ in a stochastic manner like in linear bandits:

\begin{align}

X_t = \ip{A_t,\theta_*} + \eta_t\,,

\label{eq:sparselbmodel234}

\end{align}

where $\eta_t|\cF_{t-1} \sim \subG(1)$ and $\cF_{t} = \sigma(A_1,X_1,\dots,A_t,X_t,A_{t+1})$. Observing that a confidence set for $\theta_*$ is nothing but a constraint on $\theta_*$ in terms of known quantities, combining \eqref{eq:olrregretdef} and \eqref{eq:sparselbmodel234} by elementary algebra we derive

\begin{align}

\sum_t (\hat X_t - \ip{A_t,\theta_*})^2 = \rho_n(\theta_*) + 2 \sum_t \eta_t (\hat X_t - \ip{A_t,\theta_*})\,.

\label{eq:spb_regret_identity}

\end{align}
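The identity above is purely algebraic and holds for any sequence of predictions, which is easy to confirm numerically. The sketch below uses arbitrary random predictions as a stand-in for the learner; the helper name `olr_regret` and the toy data are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
theta_star = np.array([0.5, 0.0, -0.3, 0.0])
A = rng.uniform(-1, 1, size=(n, d))          # rows play the role of A_1, ..., A_n
eta = rng.standard_normal(n)                 # noise
X = A @ theta_star + eta                     # targets generated by the linear model
X_hat = rng.uniform(-1, 1, size=n)           # arbitrary predictions, standing in for any learner

def olr_regret(X, X_hat, A, theta):
    """rho_n(theta): squared loss of the predictions minus that of the linear predictor theta."""
    return np.sum((X - X_hat) ** 2) - np.sum((X - A @ theta) ** 2)

lhs = np.sum((X_hat - A @ theta_star) ** 2)
rhs = olr_regret(X, X_hat, A, theta_star) + 2 * np.sum(eta * (X_hat - A @ theta_star))
print(lhs, rhs)
```

The two printed numbers agree up to floating point error, no matter how the predictions were produced.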

Let $Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \ip{A_s,\theta_*})$ with $Z_0=0$. Since $\hat X_t$ is chosen based on information available at the beginning of round $t$, $\hat X_t$ is $\cF_{t-1}$-measurable. Hence, $(Z_t - Z_{t-1})| \cF_{t-1} \sim \subG( \sigma_t^2 )$ where $\sigma_t^2 = (\hat X_t - \ip{A_t,\theta_*})^2$. Now define $Q_t = \sum_{s=1}^t \sigma_s^2$. The uniform self-normalized tail bound at the end of our previous post implies that for any $\lambda>0$ and any $\delta\in (0,1)$,

\begin{align*}

\Prob{ \exists t\ge 0 \text{ such that }

|Z_t| \ge \sqrt{ (\lambda+Q_t) \log \frac{(\lambda+Q_t)}{\delta^2\lambda} }

} \le \delta\,.

\end{align*}

Choose $\lambda=1$. On the event $\cE$ when $|Z_t|$ is bounded by $\sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }$, from \eqref{eq:spb_regret_identity} and $\rho_t(\theta_*)\le B_t$ we get

\begin{align*}

Q_t \le B_t + 2 \sqrt{ (1+Q_t) \log \frac{(1+Q_t)}{\delta^2} }\,.

\end{align*}

While both sides depend on $Q_t$, the left-hand side grows linearly, while the right-hand side grows sublinearly in $Q_t$. This means that the largest value of $Q_t$ that satisfies the above inequality is finite. A tedious calculation then shows this value must be less than

\begin{align}

\beta_t(\delta) \doteq 1 + 2 B_t + 32 \log\left( \frac{\sqrt{8}+\sqrt{1+B_t}}{\delta} \right)\,.

\label{eq:betadef}

\end{align}
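The "tedious calculation" can be sanity-checked numerically. The sketch below finds, by bisection, the largest $Q$ satisfying $Q \le B + 2\sqrt{(1+Q)\log((1+Q)/\delta^2)}$ and compares it with $\beta_t(\delta)$ for a few values of $B$ and $\delta$. The function names and the grid of test values are our own; this is a numerical illustration and not a substitute for the proof.

```python
import math

def beta(B, delta):
    """Radius beta_t(delta) = 1 + 2 B + 32 log((sqrt(8) + sqrt(1 + B)) / delta)."""
    return 1 + 2 * B + 32 * math.log((math.sqrt(8) + math.sqrt(1 + B)) / delta)

def gap(Q, B, delta):
    """Right-hand side minus left-hand side of Q <= B + 2 sqrt((1+Q) log((1+Q)/delta^2))."""
    return B + 2 * math.sqrt((1 + Q) * math.log((1 + Q) / delta**2)) - Q

def largest_feasible_Q(B, delta, iters=200):
    """Bisection for the largest Q satisfying the inequality (the gap is concave in Q)."""
    lo, hi = 0.0, 1.0
    while gap(hi, B, delta) > 0:     # grow the bracket until the inequality fails
        hi *= 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid, B, delta) > 0:
            lo = mid
        else:
            hi = mid
    return lo

for B in (0.0, 1.0, 10.0, 100.0, 1000.0):
    for delta in (0.5, 0.1, 0.01):
        q = largest_feasible_Q(B, delta)
        print(f"B={B:7.1f} delta={delta:4.2f}  Q*={q:9.2f}  beta={beta(B, delta):9.2f}")
```

On this grid the largest feasible $Q$ stays below $\beta_t(\delta)$ with some room to spare, which suggests that the constants in $\beta_t(\delta)$ are not tight.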

As a result on $\cE$, $Q_t \le \beta_t(\delta)$, implying the following result:

Theorem (Sparse confidence set): Assume that for some strategy for online linear regression with $\theta\in \Theta$, $\rho_t(\theta)\le B_t$. Let $(A_t,X_t,\hat X_t)_t$, $(\cF_t)_t$ be such that $A_t,\hat X_t$ are $\cF_{t-1}$ measurable, $X_t$ is $\cF_{t}$-measurable and for some $\theta_*\in \Theta$, $X_t = \ip{A_t,\theta_*}+\eta_t$ where $\eta_t|\cF_{t-1}\sim \subG(1)$. Fix $\delta\in (0,1)$ and define

\begin{align*}

C_{t+1} = \{ \theta\in \R^d \,:\, \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2 \le \beta_t(\delta) \}\,.

\end{align*}

Then, $\Prob{ \exists t\in \N \text{ s.t. } \theta_* \not\in C_{t+1} }\le \delta$.

To recap, our online linear regression based UCB (OLR-UCB) in round $t$ works as follows. For specificity, let us call $\pi$ the online linear regression method.

- Receive the action set $\cD_t\subset \R^d$.
- Choose $A_t = \argmax_{a\in \cD_t} \max_{\theta\in C_t} \ip{a,\theta}$.
- Feed $A_t$ to $\pi$ and get its prediction $\hat X_t$.
- Construct $C_{t+1}$ with the help of $A_t$ and $\hat X_t$.
- Receive the reward $X_t = \ip{A_t,\theta_*} + \eta_t$.
- Feed $X_t$ to $\pi$ as feedback.

The set $C_{t+1}$ is an ellipsoid with underlying Grammian $V_t = I + \sum_{s=1}^t A_s A_s^\top$ and center

\begin{align*}

\hat \theta_t = \argmin_{\theta\in \R^d} \norm{\theta}_2^2 + \sum_{s=1}^t (\hat X_s - \ip{A_s,\theta})^2\,.

\end{align*}

On the event when $\theta_*\in C_{t+1}$, $C_{t+1}\not=\emptyset$ and hence $\hat\theta_t\in C_{t+1}$. Thus, we can write

\begin{align*}

C_{t+1} =

\{ \theta\in \R^d \,:\,

\norm{\theta-\hat\theta_t}_{V_t}^2 + \norm{\hat\theta_t}_2^2

+ \sum_{s=1}^t (\hat X_s - \ip{A_s,\hat \theta_t})^2 \le \beta_t(\delta) \}\,.

\end{align*}

Thus,

\begin{align*}

C_{t+1} \subset

\{ \theta\in \R^d \,:\, \norm{\theta-\hat\theta_t}_{V_t}^2 \le \beta_t(\delta) \}\,.

\end{align*}
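The rewriting of the confidence set rests on the standard decomposition of a regularized least-squares objective around its minimizer. A quick numerical check of that decomposition, for the regularized objective that defines $\hat\theta_t$, can be sketched as follows. The data are random placeholders and the predictions `X_hat` stand in for the output of an arbitrary online learner.

```python
import numpy as np

rng = np.random.default_rng(3)
t, d = 50, 6
A = rng.uniform(-1, 1, size=(t, d))           # rows play the role of A_1, ..., A_t
X_hat = rng.uniform(-1, 1, size=t)            # placeholder predictions of the online learner

V = np.eye(d) + A.T @ A                       # Gram matrix V_t with V_0 = I
theta_hat = np.linalg.solve(V, A.T @ X_hat)   # minimizer of the regularized objective

def objective(theta):
    """||theta||_2^2 + sum_s (X_hat_s - <A_s, theta>)^2."""
    return theta @ theta + np.sum((X_hat - A @ theta) ** 2)

theta = rng.normal(size=d)                    # an arbitrary test point
diff = theta - theta_hat
lhs = objective(theta)
rhs = diff @ V @ diff + objective(theta_hat)
print(lhs, rhs)
```

The two values coincide up to rounding, so dropping the nonnegative term `objective(theta_hat)` can only enlarge the set, which is the inclusion used in the text.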

Hence, our general theorem applies and gives that the UCB algorithm which uses $C_{t+1}$ as its confidence set enjoys the following regret bound:

Theorem (UCB for sparse linear bandits): Fix some $\Theta\subset \R^d$ and assume that the strategy $\pi$ enjoys the regret bounds $(B_t)_t$ against $\Theta$. Then, with probability $1-\delta$, the pseudo-regret $\hat R_n = \sum_{t=1}^n \max_{a\in \cD_t} \ip{a,\theta_*} - \ip{A_t,\theta_*}$ of OLR-UCB satisfies

\begin{align*}

\hat R_n

\le \sqrt{ 8 d n \beta_{n-1}(\delta) \, \log\left( 1+ \tfrac{n L^2}{ d }\right) }\,,

\end{align*}

where $(\beta_{t})_t$ is given by \eqref{eq:betadef}.

Note that $\beta_n = \tilde{O}(B_n)$ and hence $\hat R_n = \tilde{O}( \sqrt{ d B_n n } )$.

To finish, let us quote a result for online sparse linear regression:

Theorem (Online sparse linear regression): There exists a strategy $\pi$ for the learner such that for any $\theta\in \R^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment such that $\max_{t\in [n]}\norm{A_t}_2\le L$ and $\max_{t\in [n]}|X_t|\le X$ satisfies

\begin{align*}

\rho_n(\theta) \le c X^2 \norm{\theta}_0

\left\{\log(e+n^{1/2}L) + C_n \log(1+\tfrac{\norm{\theta}_1}{\norm{\theta}_0})\right\}

+ (1+X^2)C_n\,,

\end{align*}

where $c>0$ is some universal constant and $C_n = 2+ \log_2 \log(e+n^{1/2}L)$.

The strategy is a variant of the exponential weights method except that the method is now adjusted so that the set of experts is now $\R^d$. An appropriate sparsity prior is used and when predicting an appropriate truncation strategy is used. The details of the procedure are less important at this stage for our purposes and are thus left out. A reference to the work containing the missing details will be given at the end of the post.

Note that $C_n = O(\log \log(n))$. Hence, dropping the dependence on $X$ and $L$, for $p>0$, $\sup_{\theta: \norm{\theta}_0\le p, \norm{\theta}_2\le L} \rho_n(\theta) = O(p \log(n))$. Note how strong this is: The guarantee holds no matter what strategy the environment uses!

Now, in sparse linear bandits with subgaussian noise, the noise $(\eta_t)_t$ is not necessarily bounded, and as a consequence the rewards $(X_t)_t$ are also not necessarily bounded. However, the subgaussian property implies that for each $t$, with probability $1-\delta$, $|\eta_t| \le \sqrt{2\log(2/\delta)}$. Choosing $\delta = 1/n^2$ and taking a union bound over the $n$ rounds, we thus see that for problems with bounded mean reward, $\max_{t\in [n]} |X_t| \le X \doteq 1+\sqrt{2\log(2n^2)}$ with probability at least $1-1/n$. Putting things together then yields the announced result:

The expected regret of OLR-UCB when using the strategy $\pi$ from above satisfies

\begin{align*}

R_n = \tilde{O}( \sqrt{ d p n } )\,.

\end{align*}
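A rough sketch of the bookkeeping behind this statement, under the assumptions that the confidence parameter is chosen as $\delta = 1/n$ (our choice for illustration) and that the per-round regret is at most $2$ because the mean rewards lie in $[-1,1]$: on the event where both the confidence sets and the bound on $\max_t |X_t|$ hold, the high-probability bound applies with $\beta_{n-1}(1/n) = \tilde O(B_{n-1}) = \tilde O(p)$, while the complementary event has probability at most $2/n$. Hence,

\begin{align*}
R_n \le \sqrt{ 8 d n \beta_{n-1}(1/n) \, \log\left( 1+ \tfrac{n L^2}{ d }\right) } + 2n \cdot \frac{2}{n}
= \tilde{O}( \sqrt{ d p n } )\,.
\end{align*}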

Two important notes are in order: Firstly, while here we focused on the sparse case the results and techniques apply to other settings. For example, we can also get alternative confidence sets from results in online learning even for the standard non-sparse case. Or one may consider additional or different structural assumptions on $\theta$ (e.g., $\theta$, when reshaped into a matrix, could have a low spectral norm). Secondly, when the online linear regression results are applied it is important to use the tightest possible, data-dependent regret bounds $B_n$. In online learning most regret bounds start as tight, data-dependent bounds, which are then loosened to get further insight into the structure of problems. For our application, naturally one should use the tightest available regret bounds (or one should attempt to modify the existing proofs to get tighter data-dependent bounds). The gains from using data-dependent bounds can be very significant.

Finally, we need to emphasize that the sparsity parameter $p$ must be known in advance and that no algorithm can enjoy a regret of $O(\sqrt{d p n})$ for all $p$ simultaneously. This will be seen shortly in a post focusing exclusively on lower bounds for stochastic linear bandits.

# Summary

In this post we considered sparse linear bandits in the stochastic setting. When the action set was the hypercube we presented a simple variant of the explore-then-commit strategy which selectively stops exploration in each dimension, independently of the others. We have shown that this strategy is able to adapt to the unknown sparsity of the parameter vector defining the linear bandit environment.

In the second part of the post we considered the more general setting when the action sets can change in time and they may lack the nice separable structure of the hypercube which made the first result possible. In this case we first argued that the dependence on the full dimensionality of the parameter vector is unavoidable. Then we constructed a method, OLR-UCB, which is a variant of UCB and which uses as a subroutine an online linear regression procedure. We showed that in the case of sparse linear bandits this gives a regret bound of $\tilde{O}(\sqrt{d p n })$. In our next post we will consider lower bounds for linear bandits and we will actually show that this bound is unimprovable in general.

# References

The SETC algorithm is from a paper of the authors:

- Tor Lattimore, Koby Crammer, Csaba Szepesvári:

Linear Multi-Resource Allocation with Semi-Bandit Feedback. NIPS 2015: 964-972

Look at Appendix E if you want to see how things are done in this paper.

The construction for the sparse case is from another paper co-authored by one of the authors:

- Yasin Abbasi-Yadkori, Dávid Pál, Csaba Szepesvári:

Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. AISTATS 2012: 1-9.

This paper has the few details that we have skipped in this blog.

Independently and simultaneously to this paper, Alexandra Carpentier and Remi Munos have also published a paper on sparse linear stochastic bandits:

- Alexandra Carpentier, Rémi Munos:

Bandit Theory meets Compressed Sensing for high dimensional Stochastic Linear Bandit. AISTATS 2012: 190-198

Their setting is considerably different from the one studied here. In particular, they rely on the action set being nicely rounded and containing zero, while, perhaps most importantly, the noise in their model affects the parameter vector: $X_t = \ip{A_t, \theta_*+\eta_t}$. Just like in the case of the hypercube this makes it possible to avoid the poor dependence on the dimension $d$: Their regret bounds take the form $R_n = O( p\sqrt{\log(d)n})$. We hope to return to discuss the differences and similarities between these two noise models in some later post.

The theorem on online sparse linear regression that we cited is due to Sébastien Gerchinovitz. The reference is

- Sébastien Gerchinovitz: Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research 14(1): 729-769 (2013)

The theorem cited here is Theorem 10 in the above paper.

A very recent paper by Alexander Rakhlin and Karthik Sridharan also discusses the relationship between online learning regret bounds and self-normalized tail bounds of the type given here:

- Alexander Rakhlin, Karthik Sridharan: On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Interestingly, what they show is that the relationship goes in both directions: Tail inequalities imply regret bounds and regret bounds imply tail inequalities.

I am told by Francesco Orabona that techniques similar to those used here for constructing confidence bounds have been used earlier in a series of papers by Claudio Gentile and friends. For completeness, here is the list for further exploration:

- Ofer Dekel, Claudio Gentile, Karthik Sridharan:

Robust Selective Sampling from Single and Multiple Teachers. COLT 2010: 346-358

- Ofer Dekel, Claudio Gentile, Karthik Sridharan:

Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research 13: 2655-2697 (2012)

- Koby Crammer, Claudio Gentile:

Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML 2011: 273-280

- Koby Crammer, Claudio Gentile:

Multiclass classification with bandit feedback using adaptive regularization. Machine Learning 90(3): 347-383 (2013)

- Claudio Gentile, Francesco Orabona:

On Multilabel Classification and Ranking with Partial Feedback. NIPS 2012: 1160-1168

- Claudio Gentile, Francesco Orabona:

On multilabel classification and ranking with bandit feedback. Journal of Machine Learning Research 15(1): 2451-2487 (2014)

# Adversarial linear bandits

In the next few posts we will consider adversarial linear bandits, which, up to a crude first approximation, can be thought of as the adversarial version of stochastic linear bandits. The discussion of the exact nature of the relationship between adversarial and stochastic linear bandits is postponed until a later post. In the present post we consider only the finite action case with $K$ actions.

# The model

The learner is given a finite action set $\cA\subset \R^d$ and the number of rounds $n$. An instance of the adversarial problem is a sequence of loss vectors $y_1,\dots,y_n$ where for each $t\in [n]$, $y_t\in \R^d$ (as usual in the adversarial setting, it is convenient to switch to losses). Our standing assumption will be that the scalar loss for any of the actions is in $[-1,1]$:

Assumption (bounded loss): The losses are such that for any $t\in [n]$ and action $a\in \cA$, $| \ip{a,y_t} | \le 1$.

As usual, in each round $t$ the learner selects an action $A_t \in \cA$ and receives and observes a loss $Y_t = \ip{A_t,y_t}$. The learner does not observe the loss vector $y_t$ (if the loss vector is observed, then we call it the *full information setting*, but this is a topic for another blog). The regret of the learner after $n$ rounds is

\begin{align*}

R_n = \EE{\sum_{t=1}^n Y_t} - \min_{a\in \cA} \sum_{t=1}^n \ip{a,y_t}\,.

\end{align*}

Clearly the finite-armed adversarial bandit framework discussed in a previous post is a special case of adversarial linear bandits corresponding to the choice $\cA = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ are the unit vectors of the $d$-dimensional standard Euclidean basis. The strategy will also be an adaptation of the exponential weights strategy that was so effective in the standard finite-action adversarial bandit problem.

# Strategy

As noted, we will adapt the exponential weights algorithm. Like in that setting we need a way to estimate the individual losses for each action, but now we make use of the linear structure to share information between the arms and decrease the variance of our estimators. Let $t\in [n]$ be the index of the current round. Assuming that the loss estimates for action $a\in \cA$ are $(\hat Y_s(a))_{s < t}$, the distribution that the exponential weights algorithm proposes is

\begin{align*}

\tilde P_t(a) = \frac{ \exp\left( -\eta \sum_{s=1}^{t-1} \hat Y_s(a) \right)}

{\sum_{a'\in \cA} \exp\left( -\eta \sum_{s=1}^{t-1} \hat Y_s(a') \right)}\,,

\qquad a\in \cA

\end{align*}

To control the variance of the loss estimates, it will be useful to mix this distribution with an **exploration distribution**, which we will denote by $\pi$. In particular, $\pi: \cA \to [0,1]$ with $\sum_a\pi(a)=1$. The mixture distribution is

\begin{align*}

P_t(a) = (1-\gamma) \tilde P_t(a) + \gamma \pi(a)\,,

\end{align*}

where $\gamma$ is a constant mixing factor to be chosen later. Then, in round $t$, the algorithm draws an action from $P_t$:

\begin{align*}

A_t \sim P_t(\cdot)\,.

\end{align*}

Recall that $Y_t = \ip{A_t,y_t}$ is the observed loss after taking action $A_t$. The next question is how to estimate $y_t(a) \doteq \ip{a,y_t}$. The idea will be to estimate $y_t$ with some vector $\hat Y_t\in \R^d$ and then let $\hat Y_t(a) = \ip{a,\hat Y_t}$.

It remains to construct $\hat Y_t$. For this we will use the least-squares estimator $\hat Y_t = R_t A_t Y_t$, where $R_t$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. In particular, letting $\E_t[X] = \EE{ X \,|\, A_1,\dots,A_{t-1} }$, we calculate

\begin{align*}

\E_t [\hat Y_t ] = R_t \E_t [ A_t A_t^\top ] y_t = R_t \underbrace{\left(\sum_a P_t(a) a a^\top\right)}_{Q_t} y_t\,.

\end{align*}

Hence, using $R_t = Q_t^{-1}$, we get $\E_t [ \hat Y_t ] = y_t$. There is one minor technical detail, which is that $Q_t$ should be non-singular. This means that both the action set $\cA$ and the support of $P_t$ must span $\R^d$. To ensure this is true we will simply assume that $\cA$ spans $\R^d$ and eventually we will choose $P_t$ in such a way that its support indeed spans $\R^d$. The assumption that $\cA$ spans $\R^d$ is non-restrictive, since if it does not we can simply adjust the coordinate system and reduce the dimension.

To summarize, in each round $t$ the algorithm estimates the unknown loss vector $y_t$ by $\hat Y_t$ using the least-squares estimator, which is then used to directly estimate the loss for each action:

\begin{align*}

\hat Y_t = Q_t^{-1} A_t Y_t\,,\qquad \hat Y_t(a) = \ip{a,\hat Y_t}\,.

\end{align*}

Given the loss estimates all that remains is to apply the exponential weights algorithm with an appropriate exploration distribution and we will be done. Of course, the devil is in the details, as we shall see.
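The sampling distribution and the estimator described above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the function names, the random action set and the uniform exploration distribution are hypothetical choices, not part of the algorithm's specification. The last lines compute the conditional expectation of $\hat Y_t$ exactly, by summing over all actions, to confirm unbiasedness.

```python
import numpy as np

def exp2_distribution(cum_loss_est, eta, gamma, pi):
    """P_t = (1 - gamma) * exponential weights + gamma * exploration distribution."""
    logits = -eta * cum_loss_est
    logits = logits - logits.max()      # shift for numerical stability only
    p_tilde = np.exp(logits)
    p_tilde /= p_tilde.sum()
    return (1.0 - gamma) * p_tilde + gamma * pi

def loss_estimate(actions, probs, chosen, observed_loss):
    """Least-squares estimate hat Y_t = Q_t^{-1} A_t Y_t."""
    Q = actions.T @ (probs[:, None] * actions)   # Q_t = sum_a P_t(a) a a^T
    return np.linalg.solve(Q, actions[chosen] * observed_loss)

# Hypothetical instance: K random actions in R^d, uniform exploration.
rng = np.random.default_rng(0)
K, d = 6, 3
actions = rng.normal(size=(K, d))
y = rng.normal(size=d)                  # loss vector, hidden from the learner
pi = np.full(K, 1.0 / K)
probs = exp2_distribution(np.zeros(K), eta=0.1, gamma=0.2, pi=pi)

# Exact conditional expectation of the estimate: sum_a P_t(a) Q_t^{-1} a <a, y>.
expected = sum(
    probs[k] * loss_estimate(actions, probs, k, actions[k] @ y) for k in range(K)
)
```

Since `expected` equals $Q_t^{-1} Q_t y_t$, it should agree with `y` up to floating-point error.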

# Regret analysis

By modifying our previous regret proof, one can show that if for each $t\in [n]$,

\begin{align}

\label{eq:exp2constraint}

\eta \hat Y_t(a) \ge -1, \qquad \forall a\in \cA\,,

\end{align}

then the regret is bounded by

\begin{align}

R_n \le \frac{\log K}{\eta} + 2\gamma n + \eta \sum_t \EE{ \sum_a P_t(a) \hat Y_t^2(a) }\,.

\label{eq:advlinbanditbasicregretbound}

\end{align}

Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the last term above) because there is no guarantee that the loss estimates will be upper bounded by one. To get a regret bound it remains to set $\gamma, \eta$ so that \eqref{eq:exp2constraint} is satisfied, and we also need to bound $\EE{ \sum_a P_t(a) \hat Y_t^2(a) }$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat Y_t^2(a)$. Since

\begin{align*}

\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2 A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,

\end{align*}

we have $M_t = \sum_a P_t(a) \hat Y_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t\le A_t^\top Q_t^{-1} A_t$ and so

\begin{align*}

\E_t[ M_t ] \le \trace \left(\sum_a P_t(a) a a^\top Q_t^{-1} \right) = d\,.

\end{align*}
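The equality with $d$ uses only the linearity and the cyclic property of the trace, together with the definition of $Q_t$ (which is determined by the history). Spelled out, as a short derivation from the definitions above:

\begin{align*}
\E_t[ A_t^\top Q_t^{-1} A_t ]
= \sum_a P_t(a)\, a^\top Q_t^{-1} a
= \sum_a P_t(a) \trace\left( a a^\top Q_t^{-1} \right)
= \trace\left( Q_t Q_t^{-1} \right)
= \trace(I) = d\,.
\end{align*}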

It remains to choose $\gamma$ and $\eta$. To begin, we strengthen \eqref{eq:exp2constraint} to $|\eta \hat Y_t(a) | \le 1$ and note that since $|Y_t| \leq 1$,

\begin{align*}

|\eta \hat Y_t(a) | = |\eta a^\top Q_t^{-1} A_t Y_t| \le \eta |a^\top Q_t^{-1} A_t|\,.

\end{align*}

Let $Q(\pi) = \sum_{\nu \in \cA} \pi(\nu) \nu \nu^\top$. Then $Q_t \succeq \gamma Q(\pi)$ and so by Cauchy-Schwarz we have

\begin{align*}

|a^\top Q_t^{-1} A_t| \leq \norm{a}_{Q_t^{-1}} \norm{A_t}_{Q_t^{-1}}

\leq \max_{\nu \in \cA} \nu^\top Q_t^{-1} \nu \leq \frac{1}{\gamma}\max_{\nu \in \cA} \nu^\top Q^{-1}(\pi) \nu\,,

\end{align*}

which implies that

\begin{align*}

|\eta \hat Y_t(a)| \leq \frac{\eta}{\gamma} \max_{\nu \in \cA} \nu^\top Q^{-1}(\pi)\nu\,.

\end{align*}

Since this quantity only depends on the action set and the choice of $\pi$, we can make it small by solving a *design problem*.

\begin{align}

\begin{split}

\text{Find a sampling distribution } \pi \text{ to minimize } \\

\max_{v \in \cA} v^\top Q^{-1}(\pi) v\,.

\end{split}

\label{eq:adlinbanditoptpbl}

\end{align}

If $D$ is the optimal value of the above minimization problem, the constraint \eqref{eq:exp2constraint} will hold whenever $\eta D \le \gamma$. This suggests choosing $\gamma = \eta D$, the smallest admissible value, since the bound in \eqref{eq:advlinbanditbasicregretbound} decreases as $\gamma$ decreases. Plugging into \eqref{eq:advlinbanditbasicregretbound}, we get

\begin{align*}

R_n \le \frac{\log K}{\eta} +\eta n(2D+d) = 2 \sqrt{ (2D+d) \log(K) n }\,,

\end{align*}

where for the last equality we chose $\eta = \sqrt{\log(K)/((2D+d)n)}$ to minimize the upper bound. Hence, it remains to calculate or bound $D$. In particular, in the next section we will show that $D\le d$ (in fact, equality also holds), which leads to the following result:

Theorem (Worst-case regret of adversarial linear bandits):

Let $R_n$ be the expected regret of the exponential weights algorithm as described above. Then, for an appropriate choice of $\pi$ and an appropriate shifting of $\cA$, if the bounded-loss assumption holds for the shifted action set,

\begin{align*}

R_n \le 2 \sqrt{3dn \log(K)}\,.

\end{align*}
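To see where the constant in the theorem comes from, the tuning can be written out as a small helper. This is a sketch under the assumption $D = d$ (justified in the next section); the function name `exp2_tuning` is ours. Note that $\gamma$ is a mixing weight, so the tuning is only meaningful when the resulting $\gamma$ is at most one, which holds once $n \ge d\log(K)/3$.

```python
import math

def exp2_tuning(n, d, K):
    """Learning rate, mixing weight and regret bound for the tuning in the text."""
    D = d                                   # optimal design value, assumed equal to d
    eta = math.sqrt(math.log(K) / (n * (2 * D + d)))
    gamma = eta * D                         # smallest gamma with eta * D <= gamma
    bound = 2.0 * math.sqrt((2 * D + d) * math.log(K) * n)
    return eta, gamma, bound

eta, gamma, bound = exp2_tuning(n=10_000, d=5, K=100)
```

With these values, $\log(K)/\eta + \eta n (2D+d)$ coincides with the returned `bound`, which is $2\sqrt{3dn\log(K)}$.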

## Optimal design, volume minimizing ellipsoids and John’s theorem

It remains to show that $D\le d$. For this we need to choose a norm and a distribution $\pi$ according to \eqref{eq:adlinbanditoptpbl}. Consider now the problem of choosing a distribution $\pi$ on $\cA$ to minimize

\begin{align}

g(\pi) \doteq \max_{v\in \cA} v^\top Q^{-1}(\pi) v\,.

\label{eq:gcrit}

\end{align}

This is known as the **$G$-optimal design problem** in statistics, studied by Kiefer and Wolfowitz in 1960. In statistics, more precisely, in optimal experiment design, $\pi$ would be called a “design”. A $G$-optimal design $\pi$ is the one that minimizes the maximum variance of least-squares predictions over the “design space” $\cA$ when the independent variables are chosen with frequencies proportional to the probabilities given by $\pi$ and the response follows a linear model with independent zero mean noise and constant variance. After all the work we have done for stochastic linear bandits, the reader hopefully does recognize that $v^\top Q^{-1}(\pi) v$ is indeed the variance of the prediction under a least-squares predictor.

Theorem (Kiefer-Wolfowitz, 1960): The following are equivalent:

- $\pi$ is a minimizer of $f(\pi)\doteq -\log \det Q(\pi)$;
- $\pi$ is a minimizer of $g$;
- $g(\pi) = d$.

So there we have it. $D = d$ (with equality!) and our proof of the theorem is complete. A disadvantage of this approach is that it is not very apparent from the above theorem what the optimal sampling distribution $\pi$ actually looks like. We now make an effort to clarify this, and to briefly discuss the computational issues.

Consider the first equivalent claim of the Kiefer-Wolfowitz result, which concerns the problem of minimizing $-\log \det Q(\pi)$, also known as the **$D$-optimal design problem** ($D$ after “determinant”). The criterion in $D$-optimal design can be given an information theoretic interpretation, but we will find it more useful to consider it from a geometric angle. For this, we will need to consider ellipsoids.

An **ellipsoid** $E$ with center zero (“central ellipsoid”) is simply the image of the unit ball $B_2^d$ under some nonsingular linear map $L$: $E = L B_2^d$ where for a set $S\subset \R^d$, $LS = \{Lx \,:x\in S\}$. Let $H= (LL^\top)^{-1}$. Then, for any vector $y\in E$, $L^{-1}y\in B_2^d$, or $\norm{L^{-1}y}_2^2 = \norm{y}_{H}^2 \le 1$. We shall denote the ellipsoid $LB_2^d$ by $E(0,H)$ (the $0$ signifies that the center of the ellipsoid is at zero). Now $\vol(E(0,H)) = |\det L| \vol(B_2^d) = \mathrm{const}(d)/\sqrt{\det H} = \exp(\mathrm{const}(d) - \frac12 \log \det H)$. Hence, minimizing $-\log \det Q(\pi)$ is equivalent to maximizing the volume of the ellipsoid $E(0,Q^{-1}(\pi))$. By convex duality, one can then show that this is exactly the same problem as minimizing the volume of the central ellipsoid $E(0,H)$ that contains $\cA$. Fritz John’s celebrated result, which concerns **minimum-volume enclosing ellipsoids** (MVEEs) with no restriction on their center, gives the missing piece:

John’s theorem (1948): Let $\cK\subset \R^d$ be convex and compact and assume that $\mathrm{span}(\cK) = \R^d$. Then there exists a unique MVEE of $\cK$. Furthermore, this MVEE is the unit ball $B_2^d$ if and only if there exist $m\in \N$ contact points (“the core set”) $u_1,\dots,u_m$ that belong to both $\cK$ and the surface of $B_2^d$ and there also exist positive reals $c_1,\dots, c_m$ such that

\begin{align}

\sum_i c_i u_i=0 \quad \text{ and } \quad \sum_i c_i u_i u_i^\top = I.

\label{eq:johncond}

\end{align}

Note that John’s theorem allows the center of the MVEE to be arbitrary. Again, by shifting the set $\cK$, we can always arrange for the center to be at zero. So how can we use this result? Let $\cK = \co(\cA)$ be the convex hull of $\cA$ and let $E = L B_2^d$ be its MVEE with some nonsingular $L$. Without loss of generality, we can shift $\cA$ so that the center of this MVEE is at zero (why? actually, here we lose a factor of two: think of the bounded loss condition).

To pick $\pi$, note that the MVEE of $L^{-1} \cK$ is $L^{-1} L B_2^d = B_2^d$. Hence, by the above result, there exist $u_1,\dots,u_m\in L^{-1}\cK \cap \partial B_2^d$ and positive reals $c_1,\dots,c_m$ such that \eqref{eq:johncond} holds. Note that $u_1,\dots,u_m$ must also be elements of $L^{-1} \cA$ (why?). Therefore, $Lu_1,\dots,L u_m\in \cA \cap \partial E$. Thanks to the cyclic property of the trace and the fact that $\norm{u_i}_2 = 1$, taking the trace of the second equality in \eqref{eq:johncond} we see that $\sum_i c_i = d$. Hence, we can choose $\pi(L u_i) = c_i/d$, $i\in [m]$ and set $\pi(a)=0$ for all $a\in \cA \setminus \{ L u_1,\dots, L u_m \}$. Therefore

\begin{align*}

Q(\pi) = \frac{1}{d} L \left(\sum_i c_i u_i u_i^\top \right) L^\top = \frac{1}{d} L L^\top\,

\end{align*}

and so

\begin{align}

\sup_{v \in \cA} v^\top Q^{-1}(\pi) v

\leq \sup_{v\in E=LB_2^d} v^\top Q^{-1}(\pi) v

= d \sup_{u: \norm{u}_2\le 1} (L u)^\top (L L^\top)^{-1} (L u) = d\,.

\label{eq:ellipsoiddesign}

\end{align}
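As a sanity check, here is a small worked example (our own illustration, not part of the original argument). Take $\cA = \{\pm e_1,\dots,\pm e_d\}$, so that $\cK = \co(\cA)$ is the cross-polytope. The unit ball contains $\cK$ and touches it at the $2d$ points $\pm e_i$. Giving each contact point the weight $1/2$, we have

\begin{align*}
\sum_{i=1}^d \tfrac12 \left( e_i + (-e_i) \right) = 0 \quad \text{ and } \quad \sum_{i=1}^d \tfrac12 \left( e_i e_i^\top + (-e_i)(-e_i)^\top \right) = I\,,
\end{align*}

so by John’s theorem the MVEE of $\cK$ is $B_2^d$ and we can take $L = I$. The weights sum to $d$, the resulting design is uniform, $\pi(\pm e_i) = 1/(2d)$, and

\begin{align*}
Q(\pi) = \frac{1}{d} I\,, \qquad \max_{v\in \cA} v^\top Q^{-1}(\pi) v = d \max_{i} \norm{e_i}_2^2 = d\,,
\end{align*}

in agreement with \eqref{eq:ellipsoiddesign} and with the Kiefer-Wolfowitz theorem.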

# Notes and departing thoughts

Note 1: In linear algebra, a **frame** of a vector space $V$ with an inner product can be seen as a generalization of the idea of a basis to sets which may be linearly dependent. In particular, given $0<A\le B<+\infty$, the vectors $v_1,\dots,v_m \in V$ are said to form an $(A,B)$-frame in $V$ if for any $x\in V$, $A\norm{x}^2 \le \sum_k |\ip{x,v_k}|^2 \le B \norm{x}^2$ where $\norm{x}^2 = \ip{x,x}$. Here, $A$ and $B$ are called the lower and upper frame bounds. When $\{v_1,\dots,v_m\}$ forms a frame, it must span $V$ (why?). However, $\span(\{v_1,\dots,v_m\}) = V$ is not a sufficient condition for $\{v_1,\dots,v_m\}$ to be a frame. The frame is called **tight** if $A=B$ and in this case the frame obeys a generalized Parseval’s identity. If $A = B = 1$ then the frame is called Parseval or normalized. A frame can be used almost as a basis. John’s theorem can be seen as asserting that for any convex body, after an appropriate affine transformation of the body, there exists a tight frame inside the transformed convex body. Indeed, if $u_1,\dots,u_m$ are the vectors whose existence is guaranteed by John’s theorem and $\{c_i\}$ are the corresponding coefficients, $v_i = \sqrt{c_i/d}\, u_i$ will form a tight frame inside $\cK$. Given a frame $\{v_1,\dots,v_m\}$ and a given point $x\in V$ if we define $\norm{x}^2 \doteq \sum_k |\ip{x,v_k}|^2$, this is indeed a norm. For the frame coming from John’s theorem, for any $x\in \cK$, $\norm{x}^2\le 1$. Frames are widely used in harmonic (e.g., wavelet) analysis.

Note 2: The prediction with expert advice problem can also be framed as a linear prediction problem with changing action sets. In this generalization in every round the learner is given $M$ vectors, $a_{1t},\dots,a_{Mt}\in \R^d$, where $a_{it}$ is the recommendation of expert $i$. The environment’s choice for the round is $y_t\in \R^d$. The loss suffered by expert $i$ is $\ip{a_{it},y_t}$. The goal is to compete with the best expert in hindsight by selecting in every round, based on past information, one of the experts. If $A_t\in \{a_{1t},\dots,a_{Mt}\}$ is the vector recommended by the expert selected, the regret is

\begin{align*}

R_n = \EE{ \sum_{t=1}^n \ip{A_t,y_t} } - \min_{i\in [M]} \sum_{t=1}^n \ip{ a_{it}, y_t }\,.

\end{align*}

The algorithm discussed can be easily adapted to this case. The only change needed is that in every round one needs to choose a new exploration distribution $\pi_t$ that is adapted to the current “action set” $\cA_t = \{ a_{1t},\dots,a_{Mt} \}$. The regret bound proved still holds for this case. The setting of Exp4 discussed earlier corresponds to the case when $\cA_t$ is a subset of the $d$-dimensional simplex. The price of the increased generality of the current result is that while Exp4 enjoyed (in the notation of the current post) a regret of $\sqrt{2n (M \wedge d) \log(M) }$, here we would only get $O(\sqrt{n d \log(M)})$ with slightly higher constants.

Note 3 (Computation): The computation of the MVEE is a convex problem when the center of the ellipsoid is fixed, and numerous algorithms with polynomial runtime guarantees have been developed for it. In general, the computation of the MVEE is hard. The approximate computation of optimal designs is also widely researched; one can use, e.g., the so-called Frank-Wolfe algorithm for this purpose, which is known as Wynn’s method in optimal experimental design. John has also shown (implicitly) that the cardinality of the core set is at most $d(d+3)/2$.
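To make the computational remark concrete, here is a minimal sketch of a Frank-Wolfe iteration for the design problem on a finite action set. The details are our assumptions rather than anything prescribed in the text: we start from the uniform design, move mass towards the action with the largest prediction variance, and use the classical exact line-search step $(g/d - 1)/(g-1)$ for the $\log\det$ objective. The function names, tolerance and iteration cap are hypothetical.

```python
import numpy as np

def design_values(actions, pi):
    """Return the vector of values a^T Q(pi)^{-1} a over all actions a."""
    Q = actions.T @ (pi[:, None] * actions)
    return np.einsum("ki,ki->k", actions, np.linalg.solve(Q, actions.T).T)

def approx_optimal_design(actions, tol=0.01, max_iter=10_000):
    """Frank-Wolfe iteration that aims for g(pi) <= (1 + tol) * d."""
    K, d = actions.shape
    pi = np.full(K, 1.0 / K)                  # uniform start keeps Q(pi) nonsingular
    for _ in range(max_iter):
        g = design_values(actions, pi)
        k = int(np.argmax(g))                 # action with the largest variance
        if g[k] <= (1.0 + tol) * d:
            break
        step = (g[k] / d - 1.0) / (g[k] - 1.0)  # exact line search for log det
        pi = (1.0 - step) * pi
        pi[k] += step
    return pi

# Hypothetical instance: 30 random actions in R^3.
rng = np.random.default_rng(1)
actions = rng.normal(size=(30, 3))
pi = approx_optimal_design(actions)
g_value = design_values(actions, pi).max()
```

By the Kiefer-Wolfowitz theorem `g_value` can never drop below $d$, so the gap between `g_value` and $d$ measures how far `pi` is from optimal.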

# References

Using John’s theorem for guiding exploration in linear bandits was proposed by

- Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade:

Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. COLT 2012: 41.1-41.14

where the exponential weights algorithm adapted to this setting is called Expanded Exp, or Exp2. Our proof departs from the proofs given there in some minor ways (we tried to make the argument a bit more pedagogical).

The review work

- Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems“, Foundations and Trends in Machine Learning: Vol. 5: No. 1, pp 1-122

also discussed Exp2.

To the best of our knowledge, the connection to optimal experimental design through the Kiefer-Wolfowitz theorem and the proof that relies solely on this result have not been pointed out in the literature before, though the connection between the Kiefer-Wolfowitz theorem and MVEEs is well known. It is discussed, for example, in the recent nice book of Michael J. Todd:

- Michael J. Todd: Minimum-Volume Ellipsoids: Theory and Algorithms, MOS-SIAM Series on Optimization, 2016.

This book also discusses algorithmic issues and the duality mentioned above.

The theorem of Kiefer and Wolfowitz is due to them:

- J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960

John’s theorem is due to Fritz John:

- F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays, presented to R. Courant on his 60th birthday, January 8, 1948, pages 187–204. Interscience, New York, 1948.

The duality mentioned in the text was proved by Sibson:

- R. Sibson. Discussion on the papers by Wynn and Laycock. Journal of the Royal Statistical Society, 34:181–183, 1972.

See also:

- F. Gürtuna. Duality of ellipsoidal approximations via semi-infinite programming. SIAM Journal on Optimization, 20:1421–1438, 2009.

Finally, one need not find the *exact* optimal solution to the design problem. An approximation up to reasonable accuracy is sufficient for small regret. The issues of finding a good exploration distribution efficiently (even in infinite action sets) are addressed in

# Adversarial linear bandits and the curious case of the unit ball

According to the main result of the previous post, given any finite action set $\cA$ with $K$ actions $a_1,\dots,a_K\in \R^d$, no matter how an adversary selects the loss vectors $y_1,\dots,y_n\in \R^d$, as long as the action losses $\ip{a_k,y_t}$ are in a known interval, the exponential weights algorithm with linear loss estimation and an exploration strategy that mixes the distribution of the exponential weights algorithm with a fixed exploration strategy chosen to maximize information given $\cA$, enjoys a regret of at most $O(\sqrt{d n \log(K)})$. When the action set is continuous, like the unit ball, one can of course discretize it by putting a fine enough net over it. Optimizing the fineness of the discretization level, for convex action sets with bounded volume, this discretization approach delivers a regret of $O(d \sqrt{ n \log(n) })$: a discretization accuracy of $\epsilon=1/\sqrt{n}$ is sufficient to guarantee that the additional regret due to the discretized action set is bounded by $\sqrt{n}$ at most, and the cardinality of the resulting action-set is $O((1/\epsilon)^d) = O(n^{d/2})$.
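The arithmetic behind this bound is short. Taking the size of the net to be $K = (c/\epsilon)^d$ for some constant $c>0$ (the exact constant is an assumption here and only affects lower-order terms) and $\epsilon = 1/\sqrt{n}$, we get

\begin{align*}
\log K = d \log(c \sqrt{n}) = \frac{d}{2}\log(n) + d \log(c)\,, \qquad
\sqrt{d n \log K} = \sqrt{ \tfrac{1}{2} d^2 n \log(n) + d^2 n\log(c) } = O\left( d\sqrt{n \log(n)} \right)\,.
\end{align*}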

Can we do better than this? As we shall see, the answer is yes, at least for the unit ball: In fact we will give an algorithm which avoids any kind of discretization (discretization is great for proving theoretical results, but the exponential-sized action-sets are impractical to work with), and in particular, the algorithm will be computationally efficient, while it will be shown to enjoy an improved regret bound of $O(\sqrt{dn})$. The algorithm given is based on the so-called mirror descent algorithm from optimization and online learning, which is a very powerful algorithm. In fact, the exponential weights algorithm is a special case of mirror descent.

# Online linear optimization and mirror descent

**Mirror descent** (MD) is an algorithm that originated from the convex optimization literature and has later been transported to the online learning literature. Here, we describe it in the online learning framework, more specifically, as an algorithm to solve online linear optimization problems.

In online linear optimization one is given a closed convex set $\cK\subset \R^d$ and the goal is to simultaneously compete with all elements of $\cK$ for all linear environments. An environment instance is just a sequence of vectors $g_1,\dots,g_n\in \R^d$. In round $t$, the learning algorithm chooses a point $x_t\in \cK$ from the constraint set (we assume a deterministic choice here), suffers the loss $\ip{g_t,x_t}$ and then observes $g_t\in \R^d$. The algorithm’s performance is measured using its regret. We will find it convenient to discuss the regret against some fixed competitor $x\in \cK$:

\begin{align*}

R_n(x) = \sum_{t=1}^n \ip{g_t,x_t} - \sum_{t=1}^n \ip{g_t,x}\,.

\end{align*}

Then, $R_n = \max_{x\in \cK} R_n(x)$ is the regret over $\cK$.

The basic version of MD has two extra parameters beyond $n$ and $\cK$: A learning rate $\eta>0$ and a convex, “distance generating function” $F: \cD \to \R$, where $\cD\subset \R^d$ is the domain of $F$. The function $F$ is also often called a “potential function”, or a “regularizer” for reasons that will become clear later.

In the first round MD predicts

\begin{align}

x_1 = \arg\min_{x\in \cK} F(x)\,.

\end{align}

In round $t\ge 1$, after $g_t$ is observed, the prediction $x_{t+1}$ for the next round is computed using

\begin{align}

x_{t+1} = \argmin_{x\in \cK \cap \cD} \eta \ip{g_t,x} + D_F(x,x_t)\,.

\label{eq:mdoneline}

\end{align}

Here, $D_F(x,x_t)$ is the so-called $F$-induced **Bregman divergence** of $x$ from $x_t$ defined by

\begin{align}

D_F(x,x_t) = F(x) - F(x_t) - \ip{ \nabla F(x_t), x-x_t }

\end{align}

with $\nabla F(x_t)$ denoting the derivative of $F$ at $x_t$ (i.e., $\nabla F: \mathrm{dom}(\nabla F) \to \R^d$). The minimization in \eqref{eq:mdoneline} is over $\cK\cap \cD$ because the domain of $D_F(\cdot,x_t)$ is the same as that of $F$.

To get a sense of the divergence function $D_F$, note that $D_F(\cdot,y)$ is the difference between $F(\cdot)$ and its first-order Taylor expansion, $F(y)+\ip{\nabla F(y),\cdot-y}$, about the point $y\in \mathrm{dom}(\nabla F)$. Since $F$ is convex, the linear approximation of $F$ is a lower bound on $F$ and so $D_F(\cdot,y)$ is nonnegative over its domain with $D_F(y,y)=0$ (see the **figure** below for an illustration). Furthermore, $D_F(\cdot,y)$ is convex as it is the sum of a convex and a linear function. Hence, \eqref{eq:mdoneline} is a convex optimization problem.
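Two standard instances may help to build intuition; the short sketch below evaluates the definition numerically. The helper name `bregman` and the test points are our own choices. For $F(x) = \frac12\norm{x}_2^2$ the divergence is half the squared Euclidean distance, while for the unnormalized negentropy $F(x) = \sum_i (x_i \log x_i - x_i)$ it is the generalized Kullback-Leibler divergence $\sum_i \left( x_i \log(x_i/y_i) - x_i + y_i \right)$.

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - grad_F(y) @ (x - y)

# Quadratic potential: the divergence is half the squared Euclidean distance.
def quad(x):
    return 0.5 * x @ x

def quad_grad(x):
    return x

# Unnormalized negentropy (positive orthant): the divergence is generalized KL.
def negent(x):
    return np.sum(x * np.log(x) - x)

def negent_grad(x):
    return np.log(x)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
d_quad = bregman(quad, quad_grad, x, y)
d_ent = bregman(negent, negent_grad, x, y)
```

Both values are nonnegative, as the convexity argument above predicts, and both vanish when the two arguments coincide.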

Since $D_F(x,y)$ is defined only when $y\in \mathrm{dom}(\nabla F)$, for the definition \eqref{eq:mdoneline} to make sense over the whole horizon we need $F$ to be differentiable at all of $x_1,\dots,x_{n-1}$. For our general regret bound we will thus assume that the following holds:

\begin{align}

x_t\in \mathrm{dom}(\nabla F)\,, \qquad \forall t\in [n+1]\,.

\label{eq:xtingraddomain}

\end{align}

Note that $x_n,x_{n+1} \in \mathrm{dom}(\nabla F)$ is not needed for MD to be well defined on the horizon $n$; we included these as they will be convenient for our analysis (here, $x_{n+1}$ is defined by \eqref{eq:mdoneline} as expected). A sufficient condition for \eqref{eq:xtingraddomain} to hold turns out to be that $F$ is a Legendre function (the operations in the definition of $F$ are defined in a componentwise manner).

## Mirror maps

The update rule of MD aims to achieve two goals: making the loss of $x_{t+1}$ under $g_t$ small, while at the same time staying close to the previous estimate, thus buying stability for the algorithm. When $F(x) = \frac12 \norm{x}_2^2$ with $\cD = \R^d$ and $\cK = \R^d$, we get $\nabla F(x) = x$, $D_F(x,x_t) = \frac12 \norm{x-x_t}_2^2$ and the update rule just corresponds to gradient descent:

\begin{align}

x_{t+1} = x_t - \eta g_t\,.

\label{eq:mdgd}

\end{align}

More generally, when $\cK$ is a compact nonempty convex set, the update rule corresponds to projected gradient descent: After computing the gradient update as in \eqref{eq:mdgd}, the resulting vector $\tilde x_{t+1}$, if it lies outside of $\cK$, needs to be projected back to $\cK$ by calculating the point of $\cK$ which is closest to $\tilde x_{t+1}$ in the Euclidean metric.
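For a concrete instance, the following sketch runs projected gradient descent when $\cK$ is a Euclidean ball, where the projection is a simple rescaling. The function names and the constant loss sequence are hypothetical and serve only to illustrate the update.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(gradients, eta, radius=1.0):
    """Return the iterates x_1, ..., x_n of MD with F(x) = 0.5 * ||x||_2^2."""
    x = np.zeros(gradients.shape[1])        # x_1 minimizes F over the ball
    iterates = []
    for g in gradients:
        iterates.append(x)
        x = project_ball(x - eta * g, radius)   # gradient step, then projection
    return np.array(iterates)

# Hypothetical loss sequence: the same vector in every round.
gradients = np.tile(np.array([1.0, 0.0]), (50, 1))
iterates = projected_gradient_descent(gradients, eta=0.1)
```

With a constant loss vector $g$, the iterates move in the direction $-g$ until they reach the boundary of the ball and then stay at $-g/\norm{g}_2$.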

This two-step computation works also with other appropriate distance generating functions $F$. In particular, when, for example, $F$ is Legendre, one can show that \eqref{eq:mdoneline} is equivalent to the following two-step method:

\begin{align}

\tilde{x}_{t+1} &= (\nabla F)^{-1}( \nabla F(x_t) - \eta g_t )\,, \label{eq:mdunnorm}\\

x_{t+1} &= \argmin_{x\in \cK\cap \cD} D_F(x,\tilde x_{t+1})\,.\label{eq:mdproj}

\end{align}

Here, $\tilde{x}_{t+1} = \argmin_{x\in \cD} \eta \ip{g_t,x} + D_F(x,x_t)$ is the “unconstrained” optimizer of the objective function minimized by MD. For this to be an unconstrained optimizer, what is needed is that $\nabla F$ blows up at the “boundary” of $\cD$ and that $\cD^\circ$, the interior of $\cD$ is nonempty and is the same as the domain of $\nabla F$, the defining properties of Legendre functions.

When $F$ is Legendre, one can show that $\tilde x_{t+1} \in \cD^\circ$. Thus, in this case differentiating the objective function, $\tilde x_{t+1}$ must satisfy

\begin{align}

0=\eta g_t + \nabla F(\tilde x_{t+1}) - \nabla F(x_t)\,.

\label{eq:uncopt}

\end{align}

from which one obtains \eqref{eq:mdunnorm}. The second step \eqref{eq:mdproj} takes the unconstrained optimizer and projects it to $\cK\cap \cD$. The update rules \eqref{eq:mdunnorm}-\eqref{eq:mdproj} explain the name of the algorithm: The algorithm operates in two spaces: The **primal space** where the predictions “live”, and the **dual space**, where the slopes, or gradients “live”. The name “dual” comes from the viewpoint that gradients, or derivatives are linear functionals over the primal space, which thus “live” in the space of linear functionals, which is also called the “dual”. With this language, the “mirror functions” $\nabla F$ and $(\nabla F)^{-1}$ map between subsets of the primal and dual spaces.

Back to the algorithm, from \eqref{eq:mdunnorm} we see that MD first takes the last prediction $x_t$ and maps it to the dual space through $\nabla F$, where it is combined with the observed loss vector $g_t$ (which has the nature of a gradient). This can be thought as the gradient update step. Next, the dual vector obtained is mapped back to the primal space via $(\nabla F)^{-1}$ to get $\tilde x_{t+1}$. Since this can lie outside of the constraint set, it is mapped back to it in the final step \eqref{eq:mdproj}. For the mathematically or computer-language minded, it should be gratifying that the loss vector $g_t$, a “dual type quantity” is combined with another “dual type quantity” rather than with a “primal quantity”, i.e., the “types” of these quantities are respected. In particular, the gradient update \eqref{eq:mdgd} gains new meaning through this: Here $x_t$ is really $\nabla F(x_t)$ for $F(x) = \frac12 \norm{x}_2^2$, which of course, happens to be equal to $x_t$. However, now we know that this is a mere coincidence!

However pleasing the thought of type-compatibility may be, it remains to be seen that the generality of MD will help. As we shall soon see the regret of MD depends on two quantities: the “diameter” of $\cK$ when measured via $F$, and the magnitudes of $g_t$ when measured in a metric compatible with $F$. One can then show that for many sets $\cK$, by choosing $F$ in an appropriate manner, MD becomes a **dimension-free** algorithm in the sense that its regret will be independent of the dimension $d$. This is a huge advantage when the dimension $d$ can be large, as is often the case in modern applications.

A specific example where choosing a non-Euclidean $F$ is beneficial is when $\cK$ is an $\ell^p$-ball. In this case, an appropriate $F$ gives dimension-free regret bounds, while using $F(x) = \frac12 \norm{x}_2^2$ does not share this quality (the reader is invited to check these after we state our general result). Another special case, mentioned earlier, is when $\cK$ is the simplex of $\R^d$, $F$ is the unnormalized negentropy, and $\norm{g_t}_\infty\le 1$ (so that $|\ip{x,g_t}|\le 1$ for any $t$ and $x\in \cK$). While in this case the regret will not be dimension-free, the dimension dependence as compared to standard gradient descent is reduced from a linear dependence to a logarithmic dependence, an exponential improvement. On top of this, the resulting algorithm, exponential weights, is trivial to implement, while a Euclidean projection to the simplex is more complex to implement.
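The claim that MD with the unnormalized negentropy on the simplex is exponential weights can be checked numerically. In the sketch below (function names and the random loss vectors are ours) the dual step uses $\nabla F(x) = \log x$ and $(\nabla F)^{-1}(u) = \exp(u)$, both applied componentwise, and the Bregman projection onto the simplex is a plain normalization, a standard fact for this choice of $F$.

```python
import numpy as np

def md_negentropy_step(x, g, eta):
    """One MD step on the simplex with F(x) = sum_i (x_i log x_i - x_i)."""
    x_tilde = np.exp(np.log(x) - eta * g)   # (grad F)^{-1}(grad F(x) - eta * g)
    return x_tilde / x_tilde.sum()          # Bregman projection = normalization

def exponential_weights(gradients, eta):
    """Closed form: weights proportional to exp(-eta * cumulative loss)."""
    w = np.exp(-eta * gradients.sum(axis=0))
    return w / w.sum()

# Hypothetical loss vectors with sup-norm at most one.
rng = np.random.default_rng(2)
gradients = rng.uniform(-1.0, 1.0, size=(20, 4))
eta = 0.3

x = np.full(4, 0.25)                        # x_1: the minimizer of F over the simplex
for g in gradients:
    x = md_negentropy_step(x, g, eta)

x_closed_form = exponential_weights(gradients, eta)
```

The iterate `x` produced by the two-step update and `x_closed_form` should agree up to floating-point error.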

## The regret of MD

We start with a simple result from convex analysis. As is well known, the minimizer $x^*$ of a differentiable function $\psi:\R^d \to \R$ must satisfy $\nabla \psi(x^*)=0$, which is known as the **first-order optimality condition** (first-order, because the first derivative of $\psi$ is taken). When $\psi$ is convex, this is both a sufficient and necessary condition (this is what we like about convex functions). The first-order optimality condition can also be generalized to constrained minima. In particular, if $\cK\subset \R^d$ is a non-empty convex set and $\psi$ is as before then

\begin{align}

x^*\in \argmin_{x\in \cK} \psi(x) \Leftrightarrow \forall x\in \cK: \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.

\label{eq:firstorderoptfull}

\end{align}

The necessity of this condition is easy to understand by a geometric reasoning as shown on the picture on the right: Since $x^*$ is a minimizer of $\psi$ over $\cK$, $-\nabla \psi(x^*)$ must be the outer normal of the supporting hyperplane $H_{x^*}$ of $\cK$ at $x^*$ otherwise $x^*$ could be moved by a small amount while staying inside $\cK$ and improving the value of $\psi$. Since $\cK$ is convex, it thus lies entirely on the side of $H_{x^*}$ that $\nabla \psi(x^*)$ points into. This is clearly equivalent to \eqref{eq:firstorderoptfull}. The sufficiency of the condition also follows from this geometric viewpoint as the reader may verify.

The above statement continues to hold with a small modification even when $\psi$ is not everywhere differentiable. In particular, in this case the equivalence \eqref{eq:firstorderoptfull} holds for any $x^*\in \mathrm{dom}(\nabla \psi)$ with the modification that on both sides of the equivalence, $\cK$ should be replaced by $\cK \cap \dom(\psi)$:

Proposition (first-order optimality condition): Let $\psi:\dom(\psi)\to\R$ be a convex function, $\cK\not=\emptyset$, $\cK\subset \R^d$ convex. Then for any $x^*\in \dom(\nabla \psi)$, it holds that

\begin{align}

x^*\in \argmin_{x\in \cK\cap \dom(\psi)} \psi(x)\\

\Leftrightarrow \forall x\in \cK\cap \dom(\psi): \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.

\label{eq:firstorderopt}

\end{align}

With this we are ready to start bounding the regret of MD. As the potential function $F$ will be kept fixed, to minimize clutter we abbreviate $D_F$ to $D$ in the expressions below.

In round $t\in [n]$, the instantaneous regret of MD against $x\in \cK$ is $\ip{g_t,x_t-x}$. As $x_{t+1} = \argmin_{x\in \cK \cap \cD} \psi_t(x)$ where $\psi_t:\cD \to \R$ is given by $\psi_t(x) = \eta \ip{g_t,x} + D(x,x_t)$, the above first-order optimality condition immediately gives the following result:

Proposition (Instantaneous regret of MD): Let $x\in \cK \cap \cD$. Assume that \eqref{eq:xtingraddomain} holds. Then, for any $t\in [n]$,

\begin{align}

\begin{split}

\ip{g_t,x_t-x} & \le \frac{D(x,x_t)-D(x,x_{t+1})}{\eta} \\

& \qquad + \ip{g_t,x_t-x_{t+1}} - \frac{D(x_{t+1},x_t)}{\eta}\,.

\end{split}

\label{eq:mdir1}

\end{align}

**Proof**: Fix $t\in [n]$. By \eqref{eq:xtingraddomain}, $x_{t+1} \in \dom(\nabla \psi_t)$. Hence, we can apply the first-order optimality proposition to $x_{t+1}$ to get that

\begin{align*}

\ip{ \nabla \psi_t(x_{t+1}),x-x_{t+1} } \ge 0\,.

\end{align*}

Plugging in the definition of $\psi_t$ and lengthy but simple algebraic manipulations give \eqref{eq:mdir1}.

QED.
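
For readers who want to see the omitted algebra, here is one way to fill it in. It is only a sketch and assumes that $D$ is the Bregman divergence $D(u,v) = F(u)-F(v)-\ip{\nabla F(v),u-v}$ as defined earlier, so that $\nabla \psi_t(x_{t+1}) = \eta g_t + \nabla F(x_{t+1}) - \nabla F(x_t)$. The optimality condition then reads

\begin{align*}
\eta \ip{g_t,x_{t+1}-x}
&\le \ip{\nabla F(x_{t+1})-\nabla F(x_t),x-x_{t+1}} \\
&= D(x,x_t)-D(x,x_{t+1})-D(x_{t+1},x_t)\,,
\end{align*}

where the equality follows by writing out the three divergences and cancelling the $F$ terms. Dividing by $\eta$ and adding $\ip{g_t,x_t-x_{t+1}}$ to both sides gives \eqref{eq:mdir1}.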

When summing up the instantaneous regret, the first term on the right-hand side of \eqref{eq:mdir1} telescopes and gives $\frac{D(x,x_1)-D(x,x_{n+1})}{\eta}$. Since $D$ is nonnegative, this can be upper bounded by $D(x,x_1)/\eta$. Further, thanks to the choice of $x_1$, using again the first-order optimality condition, we get that this is upper bounded by $(F(x)-F(x_1))/\eta$:

\begin{align}

\sum_{t=1}^n \frac{D(x,x_t)-D(x,x_{t+1})}{\eta}

\le \frac{F(x)-F(x_1)}{\eta}\,.

\end{align}

To bound the remaining terms in \eqref{eq:mdir1}, we may try to upper bound the inner product $\ip{g_t,x_t-x_{t+1}}$ here by bringing in a norm $\norm{\cdot}$ and its dual $\norm{\cdot}_*$:

\begin{align}

\ip{g_t,x_t-x_{t+1}} \le \norm{g_t}_* \norm{x_t-x_{t+1}}\,,

\label{eq:mdholder}

\end{align}

where we may recall that for a norm $\norm{\cdot}$ its dual $\norm{\cdot}_*$ is defined via $\norm{g}_* = \sup_{x: \norm{x}\le 1} \ip{g,x}$. To combine this with $-D(x_{t+1},x_t)$ it is beneficial if this can be lower bounded by some expression that involves the same norm. To get inspiration consider the case when $F(x) = \frac12 \norm{x}_2^2$. As noted earlier, in this case $D=D_F$ is given by $D_F(x,y) = \frac12 \norm{x-y}_2^2$. This gives the idea that perhaps we should assume that

\begin{align}

D_F(x,y) \ge \frac12 \norm{x-y}^2\,, \quad \forall (x,y)\in \dom(F)\times \dom(\nabla F)\,,

\label{eq:mddfsoc}

\end{align}

which is equivalent to requiring that $F$ is strongly convex w.r.t. $\norm{\cdot}$. To merge \eqref{eq:mdholder} and $-\frac{1}{2\eta} \norm{x_{t}-x_{t+1}}^2$, we may further upper bound the right-hand side of \eqref{eq:mdholder} via using $2ab \le a^2+b^2$. Choosing $a = \eta^{1/2} \norm{g_t}_*$ and $b = \eta^{-1/2} \norm{x_t-x_{t+1}}$, we get that

\begin{align}

\ip{g_t,x_t-x_{t+1}} - \frac{D(x_{t+1},x_t)}{\eta} \le \frac{\eta}{2} \norm{g_t}_*^2\,.

\end{align}
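
Spelled out, and assuming that the strong-convexity bound \eqref{eq:mddfsoc} is applied to the divergence term that appears in \eqref{eq:mdir1}, the chain of inequalities behind this step can be sketched as

\begin{align*}
\ip{g_t,x_t-x_{t+1}} - \frac{D(x_{t+1},x_t)}{\eta}
&\le \norm{g_t}_* \norm{x_t-x_{t+1}} - \frac{1}{2\eta}\norm{x_t-x_{t+1}}^2 \\
&\le \frac{\eta}{2}\norm{g_t}_*^2 + \frac{1}{2\eta}\norm{x_t-x_{t+1}}^2 - \frac{1}{2\eta}\norm{x_t-x_{t+1}}^2
= \frac{\eta}{2}\norm{g_t}_*^2\,,
\end{align*}

where the second line uses $ab \le (a^2+b^2)/2$ with $a = \eta^{1/2} \norm{g_t}_*$ and $b = \eta^{-1/2} \norm{x_t-x_{t+1}}$.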

For increased generality (which will prove to be useful later) we may let the $\norm{\cdot}$ be chosen as a function of the round index $t$, leading to the following bound:

Theorem (Regret of MD): Let $\eta>0$, $F$ be a convex function with domain $\cD$. Assume that \eqref{eq:xtingraddomain} holds. Further, for each $t\in [n]$, let $\norm{\cdot}_{(t)}$ be a norm such that

\begin{align}

D_F(x_{t+1},x_t) \ge \frac12 \norm{ x_{t+1} - x_t }^2_{(t)}

\label{eq:bregsoc}

\end{align}

and let $\norm{\cdot}_{(t)^*}$ be the dual norm of $\norm{\cdot}_{(t)}$. Then, the regret $R_n(x)$ of mirror descent against any $x\in \cK \cap \cD$ satisfies the bound

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \norm{g_t}_{(t)^*}^2\,.

\end{align}

While the theorem is presented for the case of a fixed sequence $\{g_t\}_t$, it is not hard to see that the same bound holds even if $g_t$ is chosen as a function of $x_1,g_1,\dots,x_{t-1},g_{t-1},x_t$, i.e., when the environment is “strategic”. This is entirely analogous to how the basic regret bound for exponential weights continues to hold in the face of strategically chosen losses.

As an example of how to use this theorem consider first the case of $\cK = B_2^d$ (the unit ball in $\R^d$ with respect to the $2$-norm) with $F(x) = \frac12 \norm{x}_2^2$. Note that $\cD = \dom(\nabla F) = \R^d$ and thus \eqref{eq:xtingraddomain} is automatically satisfied. We have $\diam_F(\cK) = \sup_{x\in \cK} F(x) - F(x_1)\le 1$. Choosing $\norm{\cdot}_{(t)} = \norm{\cdot}_2$, we have $\norm{\cdot}_{(t)^*} = \norm{\cdot}_2$. Assume that $\{g_t\}$ leads to bounded losses and in particular, $|\ip{x,g_t}|\le 1$ for all $t\in [n]$ and $x\in \cK$. Note that this holds if and only if $\norm{g_t}_2\le 1$. Thus we get that the regret of MD (which is just the projected gradient descent algorithm in this case) satisfies

\begin{align}

R_n(x) \le \frac1\eta + \frac{\eta n}{2} = \sqrt{2n}\,,

\end{align}

where for the last step we set $\eta = \sqrt{\frac{2}{n}}$.
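
The Euclidean example can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes that MD with $F(x)=\frac12\norm{x}_2^2$ on the unit ball is projected gradient descent started at $x_1=0$, and the loss sequence, seed and helper names (`project_to_unit_ball`, `md_euclidean_regret`, `random_loss`) are made up for illustration.

```python
import math
import random

def project_to_unit_ball(x):
    """Euclidean projection onto the unit 2-norm ball."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= 1.0 else [v / norm for v in x]

def md_euclidean_regret(losses):
    """Run MD with F(x) = 0.5*||x||_2^2 on the unit ball and return its regret."""
    n, d = len(losses), len(losses[0])
    eta = math.sqrt(2.0 / n)          # learning rate used in the text
    x = [0.0] * d                     # x_1 = argmin_{x in K} F(x) = 0
    total = 0.0
    for g in losses:
        total += sum(gi * xi for gi, xi in zip(g, x))
        x = project_to_unit_ball([xi - eta * gi for xi, gi in zip(x, g)])
    # The best fixed point in the ball is -S/||S|| with S the sum of the losses,
    # and its total loss is -||S||.
    s = [sum(g[i] for g in losses) for i in range(d)]
    best = -math.sqrt(sum(v * v for v in s))
    return total - best

def random_loss(d, rng):
    """A loss vector with 2-norm at most 1 (hypothetical test data)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / max(1.0, norm) for v in g]

rng = random.Random(0)                # arbitrary seed
n, d = 1000, 5
losses = [random_loss(d, rng) for _ in range(n)]
regret = md_euclidean_regret(losses)
bound = math.sqrt(2 * n)
print(f"regret = {regret:.3f}, bound = {bound:.3f}")
```

Since the bound holds for every loss sequence with $\norm{g_t}_2\le 1$, the printed regret should never exceed the printed bound, whatever the seed.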

As a second example consider the case when $\cK$ is the unit simplex of $\R^d$: $\cK = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$. Note that $\cK$ lies in a $(d-1)$-dimensional hyperplane of $\R^d$. In this case, we choose $F(x) = \sum_i x_i\log (x_i) - x_i$, the unnormalized negentropy function, with $\cD = [0,\infty)^d$. We find that $x_1 = (1/d,\dots,1/d)$ and, since $\sum_i x_i\log(x_i)\le 0$ on $\cK$, $\diam_F(\cK) = \sup_{x\in \cK} F(x) - F(x_1) = \log(d)$. Further, for $x,y\in \cK$, $D_F(x,y) = \KL(x,y)=\sum_i x_i \log(x_i/y_i)$ is the relative entropy of $x$ given $y$, which is known to satisfy

\begin{align}

D_F(x,y) \ge \frac12 \norm{x-y}_1^2\,

\label{eq:pinsker}

\end{align}

(this inequality is known as Pinsker’s inequality). Hence, we choose $\norm{\cdot}_{(t)} = \norm{\cdot}_1$ and thus $\norm{\cdot}_{(t)^*} = \norm{\cdot}_\infty$. In this case the assumption that the losses are properly normalized implies that $\norm{g_t}_\infty \le 1$. Putting things together and choosing $\eta = \sqrt{2\log(d)/n}$, we get that the regret of MD satisfies

\begin{align}

R_n(x) \le \frac{\log d}{\eta} + \frac{\eta n}{2} = \sqrt{2 \log(d) n }\,,

\end{align}

which matches the regret bound we derived earlier for the exponential weights algorithm. Curiously, MD with this choice is in fact the exponential weights algorithm. What is more, there is a one-to-one map of the steps of the two respective proofs, as the reader may verify.
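
The simplex example admits the same kind of sanity check. The sketch below assumes that MD with the unnormalized negentropy reduces to the multiplicative update of exponential weights, uses $\eta=\sqrt{2\log(d)/n}$, and draws hypothetical losses uniformly from $[-1,1]^d$; the function name `exp_weights_regret` and the seed are illustrative choices.

```python
import math
import random

def exp_weights_regret(losses, eta):
    """Regret of exponential weights (MD with negentropy on the simplex)."""
    d = len(losses[0])
    x = [1.0 / d] * d                          # x_1 is the uniform distribution
    total = 0.0
    for g in losses:
        total += sum(gi * xi for gi, xi in zip(g, x))
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]
    # Over the simplex, the best fixed point puts all its mass on the best coordinate.
    best = min(sum(g[i] for g in losses) for i in range(d))
    return total - best

rng = random.Random(0)                         # arbitrary seed
n, d = 2000, 10
losses = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
eta = math.sqrt(2.0 * math.log(d) / n)
regret = exp_weights_regret(losses, eta)
bound = math.sqrt(2.0 * math.log(d) * n)
print(f"regret = {regret:.3f}, bound = {bound:.3f}")
```

On random losses the regret is typically far below the bound, which is a worst-case guarantee.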

What would have happened if we used $F(x) = \frac12 \norm{x}_2^2$ with $\norm{\cdot}_{(t)} = \norm{\cdot}_2$ instead of the unnormalized negentropy function? While in this case $\diam_F(\cK) \le 1$, $\norm{g_t}_2^2$ could be as large as $d$ (e.g., $g_t = (1,\dots,1)$), which gives $R_n(x) \le \sqrt{2 d n }$. In fact, this is real: MD with this choice can suffer a much larger regret than MD with the unnormalized negentropy potential. Thus we see that at least in this case the increased generality of MD allows a nontrivial improvement of the regret.

The reader is also invited to check the regret bounds that follow with various potential functions when $\cK$ is the unit $\ell^p$ ball, i.e., $\cK = \{ x\,: \norm{x}_p \le 1 \}$ where $\norm{x}_p$ is defined through $\norm{x}_p^p = \sum_i |x_i|^p$, while the loss per round is still in $[-1,1]$. For example, what is the regret for $F(x) = \frac12 \norm{x}_2^2$, or $F(x) = \frac12 \norm{x}_p^2$? For the calculation it may be useful to know that the latter $F$ is $(p-1)$-strongly convex with respect to $\norm{\cdot}_p$, as the reader may also verify.

From these examples we see that two things matter when using MD with some potential function $F$ and a norm $\norm{\cdot}$: the magnitude of $\diam_F(\cK)$ and that of $G = \sup \{ \norm{g}_* \,: \sup_{x\in \cK} |\ip{g,x}| \le 1 \}$. If $\cK$ contains the unit ball underlying $\norm{\cdot}$, then for any $g$ with $\sup_{x\in \cK} |\ip{g,x}| \le 1$ we have $\norm{g}_* = \sup_{x:\norm{x}\le 1} |\ip{g,x}| \le \sup_{x\in \cK} |\ip{g,x}| \le 1$. Thus, for such sets $G\le 1$ regardless of the choice of $F$, and the choice of $F$ boils down to minimizing the diameter of $\cK$ under $F$ subject to the constraint \eqref{eq:bregsoc}.

Our next goal will be to design an algorithm based on MD when the action set $\cA$ is the $d$-dimensional unit ball $B_2^d$. It turns out that for this it will be useful to further upper bound the regret of MD. In particular, this will give an expression that is easier to control than the regret itself. For this, we will assume that $F:\cD\to \R$ is Legendre with a “large enough” domain.

Theorem (Regret for MD with Legendre potential): Let $F:\cD \to \R$ be Legendre such that $\cK \subset \bar \cD$. For any $x\in \cK$, the regret $R_n(x)$ of MD against $x$ satisfies

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_F(x_t,\tilde x_{t+1}) \,.

\end{align}

**Proof**: Note that \eqref{eq:xtingraddomain} is satisfied because $\cK \subset \bar \cD$ and $F$ is Legendre. We start from \eqref{eq:mdir1}. Earlier, we argued that the sum of the first terms on the right-hand side of this is bounded by $\frac{F(x)-F(x_1)}{\eta}$. Hence, it remains to bound the remaining terms. We claim that

\begin{align*}

\ip{g_t,x_t-x_{t+1}} - \frac{D_F(x_{t+1},x_t)}{\eta}

= \frac{1}{\eta}\left( D_F(x_t,\tilde x_{t+1}) - D_F(x_{t+1},\tilde x_{t+1}) \right)\,,

\end{align*}

from which the result follows since $D_F$ is nonnegative. The above identity can be shown using some algebra and the optimality property of $\tilde x_{t+1}$, namely \eqref{eq:uncopt}.

QED.

This result is useful because $D_F(x_t,\tilde x_{t+1})$ can also be written as the Bregman divergence of $\nabla F(\tilde x_{t+1})$ from $\nabla F(x_t)$ with respect to what is known as the Legendre-Fenchel dual $F^*$ of $F$, and oftentimes this divergence is easier to bound than the expression we had previously. The Legendre-Fenchel dual of $F$ is defined via $F^*(g) \doteq \sup_{x\in \cD} \ip{g,x} - F(x)$. Now, the identity mentioned states that for any $F$ Legendre and any $x,x' \in \dom(\nabla F)$,

\begin{align}

D_F(x,x') = D_{F^*}(\nabla F(x'), \nabla F(x))\,.

\label{eq:dualbregman}

\end{align}

In particular, this also means that

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_{F^*}(\nabla F(\tilde x_{t+1}), \nabla F( x_{t}) ) \,.

\end{align}

For calculations it is often easier to obtain $F^*$ from the identity $\nabla F^*(g) = (\nabla F)^{-1}(g)$, which holds for all $g\in \dom(\nabla F^*)$. This means that one can find $F^*$ by calculating the primitive function of $(\nabla F)^{-1}$.
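
As a concrete illustration of \eqref{eq:dualbregman}, the sketch below checks the identity numerically for the unnormalized negentropy, for which $\nabla F(x)=\log(x)$, $(\nabla F)^{-1}(g)=\exp(g)$ and hence $F^*(g)=\sum_i \exp(g_i)$. The helper `bregman` and the two test points are hypothetical choices made only for this example.

```python
import math

def F(x):
    """Unnormalized negentropy on (0, inf)^d."""
    return sum(v * math.log(v) - v for v in x)

def grad_F(x):
    return [math.log(v) for v in x]

def F_star(g):
    """Legendre-Fenchel dual of F: a primitive of (grad F)^{-1} = exp."""
    return sum(math.exp(v) for v in g)

def grad_F_star(g):
    return [math.exp(v) for v in g]

def bregman(f, grad_f, u, v):
    """Bregman divergence D_f(u, v) = f(u) - f(v) - <grad f(v), u - v>."""
    gv = grad_f(v)
    return f(u) - f(v) - sum(gi * (ui - vi) for gi, ui, vi in zip(gv, u, v))

x = [0.2, 0.5, 1.3]          # arbitrary points in dom(grad F)
x_prime = [0.7, 0.1, 0.4]

primal = bregman(F, grad_F, x, x_prime)
# Note the swapped order of the arguments on the dual side.
dual = bregman(F_star, grad_F_star, grad_F(x_prime), grad_F(x))
roundtrip = grad_F_star(grad_F(x))   # should recover x

print(primal, dual)
```

The two printed numbers should agree up to floating-point error, and `roundtrip` should reproduce `x`, reflecting $\nabla F^* = (\nabla F)^{-1}$.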

# Linear bandits on the unit ball

In accordance with the notation of the previous post, the sequence of losses will be denoted by $y_t$ (and not $g_t$ as above) and we shall assume that

\begin{align}

\sup_{t,a\in \cA} |\ip{y_t,a}| \le 1\,.

\label{eq:mdboundedloss}

\end{align}

Per our discussion this is equivalent to $\max_t \norm{y_t}_2\le 1$.

To use MD we will use randomized estimates of the losses $\{y_t\}$. In particular, in round $t$ the bandit algorithm will choose a random action $A_t\in \cA$ based on which an estimate $\tilde Y_t$ of $y_t$ will be constructed using the observed loss $Y_t = \ip{y_t,A_t}$.

Let $F$ be a convex function on $\cD\subset \R^d$ to be chosen later, let $X_1=x_1= \argmin_{x\in \cA\cap \cD} F(x)$, and for $t\in [n-1]$ let $X_{t+1}$ be the output of MD when used with learning rate $\eta$ on the input $\tilde Y_t$:

\begin{align*}

X_{t+1} = \argmin_{x\in \cA\cap \cD} \eta \ip{\tilde Y_t, x} + D_F(x,X_t)\,.

\end{align*}

(The upper case letters signify that $X_t$ is random, since $A_1,\dots,A_{t-1}$, and hence $\tilde Y_1,\dots,\tilde Y_{t-1}$, are random.) We will make sure that $F$ is such that the minimum here can be taken over $\cA$. For uniformity, we also use an uppercase letter to denote the first choice $X_1$ of MD. As long as

\begin{align}

\EE{ A_t|X_t } = X_t\,, \qquad \text{ and } \qquad \EE{ \tilde Y_t| X_t } = y_t

\label{eq:md_atyt}

\end{align}

hold for all $t\in [n]$, the regret $R_n(x)= \EE{ \sum_{t=1}^n \ip{y_t,A_t} - \ip{y_t,x} }$ of this algorithm satisfies

\begin{align*}

R_n(x)

& = \EE{ \sum_{t=1}^n \ip{y_t,X_t} - \ip{y_t,x} }

= \EE{ \sum_{t=1}^n \ip{\tilde Y_t,X_t} - \ip{\tilde Y_t,x} }\,.

\end{align*}

Notice that the last expression inside the expectation is the (random) regret of mirror descent on the (recursively constructed) sequence $\tilde Y_t$. Hence, when the conditions of our theorem are satisfied, for any $x\in \cA \cap \cD$, for $F$ Legendre,

\begin{align}

\label{eq:mdlinbandit}

R_n(x) \le \frac{F(x)-F(x_1)}{\eta } +

\frac{1}{\eta} \sum_{t=1}^n \EE{ D_{F^*}( \nabla F(\tilde X_{t+1}), \nabla F(X_t) ) }\,.

\end{align}

Thus, it remains to choose $A_t,\tilde Y_t$ and $F$ with domain $\cD$ such that $\cK \subset \bar \cD$.

Unsurprisingly, $A_t$ and $\tilde Y_t$ will be chosen similarly to what was done in the previous post. In particular, $\tilde Y_t$ will be a linear estimator with a small twist that will simplify some calculations, while $A_t$ will be selected at random to be either $X_t$ or an exploratory action $U_t$ drawn randomly from some exploration distribution $\pi$ supported on $\cA$. The calculation of the previous post suggests that a $G$-optimal exploration distribution should be used. For the unit ball, it turns out that we can simply choose $\pi$ as the uniform distribution on $\{ \pm e_i \}$ where $\{e_i\}_i$ is the standard Euclidean basis. That this is $G$-optimal follows immediately from the Kiefer-Wolfowitz theorem.

To formally define $A_t$, let $P_t\in (0,1]$ be the probability of exploring in round $t$ ($P_t$ will be chosen appropriately later), let $E_t \sim \mathrm{Ber}(P_t)$, $U_t\sim \pi$ such that $E_t$ and $U_t$ are independent given the information available at the time when they are chosen. Then, to satisfy $\EE{A_t|X_t} = X_t$, we can set

\begin{align}

A_t = E_t U_t + \frac{1-E_t}{1-P_t} X_t\,.

\end{align}

Since $A_t\in \cA = B_2^d$ must hold this gives the constraint that

\begin{align}

P_t \le 1-\norm{X_t}_2 \,.

\label{eq:ptconstraint}

\end{align}

This also means that one must somehow ensure that $\norm{X_t}_2<1$ holds for all $t\in [n]$. One simple way to enforce this is to replace the domain used in MD by $(1-\gamma) \cA$ for some small $\gamma>0$. Since this adds at most $\gamma n$ to the regret, as long as $\gamma$ is small enough, the additional regret due to this modification will be negligible.

Having chosen $A_t$, it remains to construct $\tilde Y_t$. Consider

\begin{align*}

\tilde Y_t = Q_t^{-1} E_t A_t Y_t\,, \quad Q_t = \E_t{ E_t A_t A_t^\top }\,,

\end{align*}

where $\E_t{Z} = \EE{Z| X_1,A_1,Y_1,\dots,X_{t-1},A_{t-1},Y_{t-1},X_t }$. Note that this is a slightly modified version of the linear estimator and thus $\E_t{\tilde Y_t} = y_t$. The modification is that in this estimator $A_t$ is multiplied by $E_t$, which means that the estimate is $0$ unless an exploratory action is chosen. While this is not necessary, it simplifies the calculations and in fact slightly improves the bound that we derive. In particular, we have $E_t A_t = E_t U_t$ and hence $Q_t = \frac{P_t}{d} I$ and thus

\begin{align*}

\tilde Y_t = \frac{d\, E_t}{P_t} U_t U_t^\top y_t\,.

\end{align*}
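
Both requirements in \eqref{eq:md_atyt} can be verified by brute force, because the randomness in round $t$ takes only finitely many values. The sketch below enumerates the $2d$ exploratory actions $\pm e_i$ together with the exploit branch and computes the exact conditional means of $A_t$ and $\tilde Y_t$. The function name `exact_means` and the numerical values of $X_t$, $y_t$ and $P_t$ are hypothetical, and $P_t<1$ is assumed so that the exploit action is well defined.

```python
def exact_means(x, y, p):
    """Exact conditional means of A_t and of the estimate of y_t, by enumeration.

    x: current MD iterate X_t, y: loss vector y_t, p: exploration probability P_t < 1.
    """
    d = len(x)
    mean_action = [0.0] * d
    mean_estimate = [0.0] * d
    prob_u = 1.0 / (2 * d)                 # pi is uniform on the 2d points +/- e_i
    for i in range(d):
        for sign in (1.0, -1.0):
            u = [0.0] * d
            u[i] = sign
            observed = sum(yj * uj for yj, uj in zip(y, u))   # Y_t = <y_t, U_t>
            for j in range(d):
                # Explore branch (E_t = 1), probability p * prob_u: A_t = U_t.
                mean_action[j] += p * prob_u * u[j]
                mean_estimate[j] += p * prob_u * (d / p) * u[j] * observed
    for j in range(d):
        # Exploit branch (E_t = 0), probability 1 - p: A_t = X_t / (1 - p), estimate is 0.
        mean_action[j] += (1.0 - p) * x[j] / (1.0 - p)
    return mean_action, mean_estimate

x_t = [0.3, -0.4, 0.0]      # ||X_t||_2 = 0.5, so any P_t <= 0.5 is allowed
y_t = [0.5, -0.2, 0.7]
p_t = 0.25
mean_action, mean_estimate = exact_means(x_t, y_t, p_t)
print(mean_action, mean_estimate)
```

Up to rounding, the first printed vector equals `x_t` and the second equals `y_t`.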

It remains to choose $F$ and the norms. One possibility would be $F(x) = \frac12 \norm{x}_2^2$ which is an appealing choice both because it leads to simple calculations and also because in the full information setting (i.e., when $y_t$ is observed in every round) this choice enjoyed a dimension-free regret bound. It is easy to check that $F=F^*$, hence $D_{F^*}(u,v) = \frac12 \norm{u-v}_2^2$. Hence,

\begin{align}

\frac1\eta \E_t D_{F^*}( \nabla F( \tilde X_{t+1} ), \nabla F( X_t ) )

&=

\frac{\eta}{2} \E_t[ \norm{\tilde Y_t}_2^2 ]

=

\frac{\eta d^2}{2 P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top U_t U_t^\top y_t ]\nonumber \\

&=

\frac{\eta d^2}{2 P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top y_t ]

=

\frac{\eta d}{2 P_t} \norm{y_t}_2^2 \le \frac{\eta d}{2 P_t}\,,

\label{eq:twonormbound}

\end{align}

where the first equality used identity \eqref{eq:uncopt}. To finish, one could choose $P_t = \gamma$ and set the domain of MD to $(1-\gamma) \cA$, but even with these modifications, the best bound that can be derived scales as $n^{2/3}$, which is not what we want.

Hence, if the other parameters of the construction are kept fixed, we need to go beyond the Euclidean potential. The potential we consider is

\begin{align*}

F(x) = -\log(1-\norm{x}) - \norm{x}\,, \quad x\in \cD = \{ x'\,:\, \norm{x'}<1 \}\,.
\end{align*}
This can easily be seen to be Legendre. Further, $\nabla F(x) = \frac{x}{1-\norm{x}}$, and also $F^*(g)=-\log(1+\norm{g})+\norm{g}$.
Using that for $x\ge -1/2$, $\log(1+x)\ge x-x^2$, one can show that for
\begin{align}
\frac{\norm{g}-\norm{h}}{1+\norm{h}}\ge -1/2\,,
\label{eq:ghapart}
\end{align}
it holds that
\begin{align}
D_{F^*}(g,h) & \le \frac{1}{1+\norm{h}}\norm{g-h}^2\,.
\end{align}
Plugging in $g=\nabla F(\tilde X_{t+1})$ and $h=\nabla F(X_t)$, we see that \eqref{eq:ghapart} is satisfied, hence, thanks to $1/(1+\norm{\nabla F(X_t)}) = 1-\norm{X_t}$, after some calculation (and taking $P_t = 1-\norm{X_t}$, the largest value allowed by \eqref{eq:ptconstraint}) we get
\begin{align*}
\E_t D_{F^*}( \nabla F(\tilde X_{t+1}),\nabla F(X_t))
\le \eta^2 d\,.
\end{align*}
Putting together things, again running MD on $(1-\gamma)B_2^d$ and choosing $\eta$ and $\gamma$ appropriately, we get that
\begin{align*}
R_n(x)\le 3 \sqrt{dn\log(n)}\,.
\end{align*}
Surprisingly, this is smaller than the regret that we got for stochastic bandits on the unit ball by a factor of at least $\sqrt{d}$. However, it will be the subject of the next post to discuss how this could happen.
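
To make the construction concrete, here is a rough sketch of the resulting bandit algorithm. Several pieces are assumptions rather than statements from the text: the exploration probability is taken to be $P_t = 1-\norm{X_t}$, the Bregman projection onto $(1-\gamma)B_2^d$ is implemented as a radial rescaling (which is what it reduces to for a radially symmetric potential), and the values $\eta=\sqrt{\log(n)/(dn)}$ and $\gamma=1/n$ are illustrative rather than the tuning behind the stated bound. The function name `run_ball_bandit` and the fixed loss vector are hypothetical.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def grad_F(x):
    """Gradient of F(x) = -log(1-||x||) - ||x||, valid for ||x|| < 1."""
    s = 1.0 - norm(x)
    return [c / s for c in x]

def grad_F_star(theta):
    """Inverse of grad_F, i.e. the gradient of F*(g) = ||g|| - log(1+||g||)."""
    s = 1.0 + norm(theta)
    return [c / s for c in theta]

def run_ball_bandit(losses, eta, gamma, rng):
    """Return (total loss suffered, largest norm of any action played)."""
    d = len(losses[0])
    radius = 1.0 - gamma
    x = [0.0] * d                          # X_1 = argmin F = 0
    total_loss, max_action_norm = 0.0, 0.0
    for y in losses:
        p = 1.0 - norm(x)                  # assumed choice P_t = 1 - ||X_t||
        y_hat = [0.0] * d
        if rng.random() < p:               # explore: A_t = U_t uniform on {+-e_i}
            i = rng.randrange(d)
            sign = rng.choice((-1.0, 1.0))
            action = [0.0] * d
            action[i] = sign
            observed = y[i] * sign         # Y_t = <y_t, A_t>
            y_hat[i] = (d / p) * sign * observed
        else:                              # exploit: A_t = X_t / (1 - P_t)
            action = [c / (1.0 - p) for c in x]
            observed = sum(a * b for a, b in zip(y, action))
        total_loss += observed
        max_action_norm = max(max_action_norm, norm(action))
        # MD step: dual update, map back, then rescale onto the ball of radius 1-gamma.
        theta = [a - eta * b for a, b in zip(grad_F(x), y_hat)]
        x_tilde = grad_F_star(theta)
        r = norm(x_tilde)
        x = x_tilde if r <= radius else [c * radius / r for c in x_tilde]
    return total_loss, max_action_norm

rng = random.Random(0)                     # arbitrary seed
n, d = 2000, 3
y_fixed = [0.6, -0.3, 0.2]                 # hypothetical loss vector, 2-norm 0.7
losses = [y_fixed] * n
eta = math.sqrt(math.log(n) / (d * n))
gamma = 1.0 / n
total_loss, max_action_norm = run_ball_bandit(losses, eta, gamma, rng)
regret = total_loss + n * norm(y_fixed)    # the best fixed action is -y/||y||
print(f"regret = {regret:.1f}, largest action norm = {max_action_norm:.6f}")
```

The regret bound in the text is a statement about the expectation, so a single run like this one only gives a rough impression; what can be checked on every run is that all actions stay inside the unit ball.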

# Notes and departing thoughts

**Computational complexity of MD**: Usually, \eqref{eq:mdunnorm} is trivial to implement, while \eqref{eq:mdproj} can be quite expensive. However, in some important special cases, the projection can be computed efficiently, even in closed form, e.g., for the simplex and the negentropy potential, or the Euclidean unit ball and the Euclidean potential $F(x) = \frac12\norm{x}_2^2$. In general, one can of course use a convex solver to implement this step, but then one can also use the convex solver to find an approximate optimizer of \eqref{eq:mdoneline} directly.

**Legendre functions**: A strictly convex map $F:\cD \to \R$ is Legendre if the interior $\cD^\circ$ of $\cD$ is nonempty, $\mathrm{dom}(\nabla F) = \cD^\circ$ ($F$ is differentiable over the interior of its domain) and $\nabla F(x)$ diverges as $x$ approaches the boundary of $\cD$. Formally, this last condition means that for any $\{x_n\}_n \subset \mathrm{dom}(\nabla F)$ such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD = \bar \cD \setminus \cD^\circ$, $\norm{ \nabla F(x_n) } \to \infty$ as $n\to\infty$. Note that since in a finite-dimensional space all norms are equivalent, it does not matter which norm is used in the above definition. One can show that if the distance generating map $F$ is Legendre then the sequence $\{x_t\}_{t\in [n]}$ computed by MD always satisfies \eqref{eq:mdunnorm}-\eqref{eq:mdproj}, and \eqref{eq:xtingraddomain} also holds.

**Simplex, negentropy, exponential weights**: A canonical example that is instructive to illustrate the concept of Legendre functions is when $\cK = \Delta_d = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$ is the simplex embedded into $\R^d$ and $F(x) = \sum_i x_i \log(x_i)-x_i$ is the unnormalized negentropy function. Here, the domain $\cD$ of $F$ is $\cD = [0,\infty)^d$ where we use the convention that $0 \log(0)=0$ (which is sensible since $\lim_{s\to 0+} s \log(s) = 0$). Then it is easy to see that $\dom(\nabla F) = (0,\infty)^d$ with $\nabla F(x) = \log(x)$ ($\log(\cdot)$ is applied componentwise). Then, $\partial \cD = \{ x\in [0,\infty)^d \,:\, \exists i \, \text{ s.t. } x_i = 0\}$ and whenever $\{x_n\}_n \subset (0,\infty)^d$ is such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD$ as $n\to\infty$ then $\norm{\nabla F(x_n)} \to \infty$. Thus, $F$ is Legendre. As noted earlier, MD in this case is well defined, the computation implemented by MD is captured by the two steps \eqref{eq:mdunnorm}-\eqref{eq:mdproj} and \eqref{eq:xtingraddomain} also holds. In fact, a small computation shows that $x_{t+1}$ in this case satisfies

\begin{align*}

x_{t+1,i} = \frac{ \exp( - \eta g_{ti} ) x_{ti} }{\sum_j \exp( - \eta g_{tj} ) x_{tj} }

= \frac{\exp(-\eta \sum_{s=1}^t g_{si} ) }{\sum_j \exp(-\eta \sum_{s=1}^t g_{sj} ) }\,,

\end{align*}

which we can recognize as the exponential weights algorithm used for example in the previous post, as well as numerous other occasions.
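
The “small computation” can also be checked numerically. The sketch below performs the MD step in its two-stage form (dual update, map back through $\exp$, then normalize, which is what the Bregman projection onto the simplex amounts to for this potential) and compares the final iterate with the closed form in terms of cumulative losses. The losses, seed, learning rate and the name `md_negentropy_step` are made up for the illustration, and $x_1$ is assumed to be uniform.

```python
import math
import random

def md_negentropy_step(x, g, eta):
    """One MD step with the unnormalized negentropy on the simplex."""
    theta = [math.log(xi) - eta * gi for xi, gi in zip(x, g)]   # dual update
    x_tilde = [math.exp(t) for t in theta]                      # back to the primal space
    z = sum(x_tilde)
    return [v / z for v in x_tilde]                             # projection = normalization

rng = random.Random(1)                   # arbitrary seed
n, d, eta = 50, 4, 0.1
losses = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

x = [1.0 / d] * d                        # x_1 is uniform
for g in losses:
    x = md_negentropy_step(x, g, eta)

# Closed form: weights proportional to exp(-eta * cumulative loss).
cumulative = [sum(g[i] for g in losses) for i in range(d)]
weights = [math.exp(-eta * c) for c in cumulative]
closed_form = [w / sum(weights) for w in weights]

print(x)
print(closed_form)
```

The two printed vectors should coincide up to floating-point error.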

Let $f:\dom(f) \to \R$ be convex, $y\in \dom(\nabla f)$ and let $\norm{\cdot}$ be a norm. As was pointed out earlier, $f$ is then lower bounded by its first-order Taylor approximation about the point $y$:

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y }, \qquad \forall x\in \dom(f)\,.

\end{align*}

If we can add $\frac12 \norm{x-y}^2$ to the lower bound and it still remains a lower bound, i.e., when

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac12 \norm{x-y}^2, \qquad \forall x\in \dom(f)

\end{align*}

then we say that $f$ is **strongly convex** w.r.t. $\norm{\cdot}$ at $y$. When this holds for all $y\in \dom(\nabla f)$, we say that $f$ is strongly convex w.r.t. $\norm{\cdot}$. Often, strong convexity is defined with more flexibility, with $\frac12$ above replaced by $\frac{\alpha}{2}$ for some $\alpha>0$:

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac{\alpha}2 \norm{x-y}^2, \qquad \forall x\in \dom(f)

\end{align*}

In this case, we say that $f$ is $\alpha$-strongly convex w.r.t. $\norm{\cdot}$ (at $y$), or that the strong convexity modulus of $f$ at $y$ is $\alpha$. Of course, the two definitions coincide up to a rescaling of the norm. By reordering the above inequalities, we see that strong convexity can also be expressed as a lower bound on the Bregman divergence induced by $f$. For example, the second inequality can be written as

\begin{align*}

D_f(x,y) \ge \frac{\alpha}2 \norm{x-y}^2, \quad \forall x\in \dom(f)\,.

\end{align*}

**Strong convexity, Hessians and matrix-weighted norms**: Let $F:\dom(F)\to \R$ be convex and twice differentiable over its domain. We claim that $F$ is strongly convex w.r.t. the (semi-)norm $\norm{\cdot}$ defined by $\norm{x}^2 \doteq x^\top G x$ where $G \preceq \nabla^2 F(y)$ for all $y\in \dom(F)$. To see this note that

\begin{align}

F(x) \ge F(y) + \ip{\nabla F(y),x-y} + \frac12 \norm{x-y}^2

\label{eq:sochessiansoc}

\end{align}

holds if and only if $g(\cdot) \doteq F(\cdot) - \frac12 \norm{\cdot-y}^2$ is convex. Indeed, $g$ is convex if and only if it is an upper bound of its linear approximations. Now, $g(x)\ge g(y) + \ip{\nabla g(y),x-y} = F(y) + \ip{\nabla F(y),x-y}$ and this is indeed equivalent to \eqref{eq:sochessiansoc}. Furthermore, $\nabla^2 g(x) = \nabla^2 F(x)-G$, hence $\nabla^2 g(x)\succeq 0$ by our assumption on $G$.

The price of bandit information on the unit ball is an extra $\sqrt{d}$ factor: On the unit ball the regret of MD under full information is $\Theta(\sqrt{n})$, while under bandit feedback it is $\Theta(\sqrt{dn})$ (the lower bound can be shown using the standard proof techniques). Up to log factors, the same holds for the simplex: there, too, the price of bandit information is $\sqrt{d}$. Is this true in general? The answer turns out to be no. The price of bandit information can be as high as $\Theta(d)$ and whether this happens depends on the shape of the action set $\cA$. However, when this happens is not very well understood at the moment.

# References

The results presented here are largely based on

- Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems”, Foundations and Trends in Machine Learning: Vol. 5: No. 1, pp 1-122.

The original reference is

- Sébastien Bubeck, Nicolò Cesa-Bianchi and Sham M. Kakade (2012), “Towards Minimax Policies for Online Linear Optimization with Bandit Feedback”, COLT 2012: 41.1-41.14.

# Bandit tutorial slides and update on book

This website has been quiet for some time, but we have not given up on bandits just yet.

First up, we recently gave a short tutorial at AAAI that covered the basics of finite-armed stochastic bandits and stochastic linear bandits. The slides are now available (part1, part2).

We mentioned at some point that these notes would become a book. We’re happy to say this project is coming close to completion. We hope in the next month or so to publish a reasonable quality draft.

Of course the book contains all the content in the blog in a polished and extended form. There are also many new chapters. Some highlights are: combinatorial bandits, non-stationary bandits, ranking, Bayesian methods (including Thompson sampling) and pure exploration. We also have two chapters that peek beyond the world of bandits at partial monitoring and learning in Markov decision processes.

Once the draft is complete we would love to have your feedback.