Optimality concepts and information theory

In this post we introduce the concept of minimax regret and reformulate our previous result on the upper bound on the worst-case regret of UCB in terms of the minimax regret. We briefly discuss the strengths and weaknesses of using minimax regret as a performance yardstick. Then we discuss with more precision what we meant when we mentioned that UCB’s worst-case regret cannot be greatly improved, and sketch the proof (that will be included in all its glory in the next post). We finish by introducing a few of the core concepts of information theory, and especially the relative entropy that will be crucial for the formal lower bound proofs.

How to measure regret?

Previously we have seen that in the worst case the UCB algorithm suffers a regret of at most $O(\sqrt{Kn \log(n)})$. It will prove useful to restate this result here. The following notation will be helpful: Let $\cE_K$ stand for the class of $K$-action stochastic environments where the rewards for action $i\in [K]$ are generated from a distribution $P_i$ with mean $\mu_i\in [0,1]$ where $P_i$ is such that if $X\sim P_i$ then $X-\mu_i$ is $1$-subgaussian (environments with this property will be called $1$-subgaussian). Our previous result then reads as follows:

Theorem (Worst-case regret-bound for UCB): There exists a constant $C>0$ such that the following hold: For all $K>0$, $E\in \cE_K$, $n>1$ if $R_n$ is the (expected) regret of UCB when interacting with $E$,
R_n \le C \sqrt{K n \log(n)}\,.

(In the future when we formulate statements like this, we will say that $C$ is a universal constant. The constant is universal in the sense that its value does not depend on the particulars of how $K$, $n$, or $E$ as long as these satisfy the constraints mentioned.)

We call the bound on the right hand-side of the above display a worst-case bound, because it holds even for the environment that is the worst for UCB. Worst-case bounds can point to certain weaknesses of an algorithm. But what did we mean when we said that this bound cannot be improved in any significant manner? In simple terms, this means that no matter the policy, there will be an environment on which the policy achieves almost the same regret as the upper bound on the above side.

This can be concisely written in terms of the minimax regret. The minimax regret is defined for a given horizon $n>0$ and a fixed number of actions $K>0$ as follows. Let $\cA_{n,K}$ stand for the set of all possible policies on horizon $n$ and action set $[K]$. Then, the minimax regret for horizon $n$ and environment class $\cE$ (such as $\cE_K$ above) is defined as
R_n^*(\cE) = \inf_{A\in \cA_{n,K}} \sup_{E\in \cE} R_n(A,E)\,.
Note how this value is independent of any specific choice of a policy, but only depends on $n$, $\cE$ and $K$ (the dependence on $K$ is hidden in $\cE$). Further, no matter how a policy $A$ is chosen, $R_n^*(\cE)\le R_n^*(A,\cE) \doteq \sup_{E\in \cE} R_n(A,E)$, i.e., the $n$-round minimax regret of $\cE$ is a lower bound on the worst-case regret of any policy $A$. If we lower bound the minimax regret, the lower bound tells us that no policy at all exist that can do better than the lower bound. In other words, such a bound is independent of the policy chosen, it shows the fundamental hardness of the problem.

A minimax optimal policy is a policy $A$ whose worst-case regret is equal to the minimax regret: $R_n^*(A,\cE) = R_n^*(\cE)$. Finding a minimax optimal policy is often exceptionally hard; hence we often resort to the weaker goal of finding a near-minimax policy, i.e., a policy whose worst-case regret is at most a constant multiple of $R_n^*(\cE)$. Technically, this makes sense only if this holds universally over $n$, $K$ and $\cE$ (within some limits): To define this formally, we would need to demand that the policy takes as input $n,K$, but we skip the formal definition, hoping that the idea is clear. UCB, as we specified earlier, can be thought as such a “universal” policy for the class of all $1$-subgaussian environments (and it does not even use the knowledge of $n$). When near-minimaxity is too strong as defined above, we may relax it by demanding that the ratio of $R_n^*(A,\cE)/R_n^*(\cE)$ is bounded by a logarithmic function of $\max(1,R_n^*(\cE))$; the idea being that the logarithm of any number is so substantially smaller than itself that it can be neglected. UCB is near-minimax optimal in this sense.

The value $R_n^*(\cE)$ is of interest on its own. In particular, a small value of $R_n^*(\cE)$ indicates that the underlying bandit problem is less challenging in the worst-case sense. A core activity in bandit theory (and learning theory, more generally) is to understand what makes $R_n^*(\cE)$ large or small, often focusing on its behavior as a function of the number of rounds $n$. This focus is justified when $n$ gets large, but often other quantities, such as the number of actions $K$, are also important.

Since $R_n^*(\cE_K) \le R_n(\mathrm{UCB},\cE_K)$, the above theorem immediately gives that
R_n^*(\cE_K) \le C \sqrt{ K n \log(n)}\,,
while our earlier discussion of the explore-then-commit strategies $\mathrm{ETC}_{n,m}$ with a fixed commitment time $m$ can be concisely stated as
\inf_{m} R_n^*(\mathrm{ETC}_{n,m}, \cE_K) \asymp n^{2/3}\,,
where $\asymp$ hides a constant proportionality factor which is independent of $n$ (the factor scales with $K^{1/3}$).
Thus, we see that UCB does better than ETC with the best a priori $n$-dependent choice of $m$ in a worst-case sense, as $n$ gets large. But perhaps there are are even better policies than UCB? The answer was already given above, but let us state it this time in the form of a theorem:

Theorem (Worst-case regret lower-bound): For any $K\ge 2$, $n\ge 2$,
R_n^*(\cE_K) \ge c \sqrt{Kn}
for some universal constant $c>0$.

In particular, we see that UCB is near-minimax optimal. But how to prove that the above result? The intuition is relatively simple and can be understood by just studying Gaussian tails.

Lower bounding ideas

Given $n$ i.i.d. observations from a Gaussian distribution with a known variance of say one and an unknown mean $\mu$, the sample mean of the observations is a sufficient statistic for the mean. This means that the observations can be replaced by the sample mean $\hat\mu$ without losing any information. The distribution of the sample is also Gaussian, with the same mean as the unknown mean and a variance of $1/n$. Now, assume that we told that the unknown mean can take on only two values: Either it is (say) zero, or $\Delta$, which, without loss of generality, we assume to be positive. The task is to decide whether we are in the first (when $\mu=0$), or the second case (when $\mu=\Delta$). As we said before, there is no loss of generality assuming that the decision is based on $\hat\mu$ only. If $n$ is large compared to $\Delta$, intuitively we feel that the decision is easy: If $\hat\mu$ is closer to zero than to $\Delta$, choose zero, otherwise choose $\Delta$. No matter then whether $\mu$ was indeed zero or $\Delta$, the probability of an incorrect choice will be low. How low? An easy upper bound based on our earlier upper bound on the tails of a Gaussian (see here) is
(2\pi n (\Delta/2)^2)^{-1/2} \exp(-n (\Delta/2)^2/2) \le \exp(-\frac{n(\Delta/2)^2}{2})\,,
where the inequality assumed that $2\pi n (\Delta/2)^2\ge 1$). We see that the error probability decays exponentially with $n$, but the upper would still be useless if $\Delta$ was small compared to $n$! Can we do better than this? One might believe the decision procedure could be improved, but the symmetry of the problem makes this seem improbable.

The other possibility is that the upper bound on the Gaussian tails that we use is loose. This turns out not to be the case either. With a calculation slightly more complex than the one given before (using integration by parts, e.g., see here), we can show that if $X$ has the standard normal distribution,
\Prob{X>\eps} \ge \left(\frac{1}{\eps}-\frac{1}{\eps^3}\right)\, \frac{\exp( – \frac{\eps^2}{2} )}{\sqrt{2\pi}}
also holds, showing that there is basically no room for improvement. In particular, this means that if $n(\Delta/2)^2/2=c$ ($\Delta \le \sqrt{8c/n}$, or, equivalently, $n\le 8c/\Delta^2$) then the probability that anyprocedure will make a mistake in either of the two cases when $\mu \in \{0,\Delta\}$ is at least $C c^{-1/2} (1-1/c)\exp(-c)>0$. This puts a lower limit on how reliably we can decide between the two given alternatives. The take home message is that if $n$ is small compared to the differences in the means that we need to test for, then we are facing an impossible mission.

The problem as described is the most classic hypothesis testing problem there is. The ideas underlying this argument are core to any arguments that show an impossibility result in a statistical problem. The next task is to reduce our bandit problem to a hypothesis testing problem as described here.

The high level idea is to select two bandit problem instances. For simplicity, we select two bandit instances $E,E’$ where the reward distributions are Gaussians with a unit variance. The vector of means are in $[0,1]^K$, thus the instances are in $\cE_K$. Let the means of the two instances be $\mu$ and $\mu’$, respectively. Our goal is to choose $E,E’$ (alternatively, $\mu$ and $\mu’$) in such a way that the following two conditions hold simultaneously:

  1. Competition: Algorithms doing well on one instance, cannot do well on the other instance.
  2. Similarity: The instances are “close” so that no matter what policy interacts with them, given the observations, the two instances are indistinguishable as in the hypothesis testing example above.

In particular, for the second requirement we want that any way of interacting with the instances results in data that no decision procedure that needs to decide about which instance it was running on can achieve a low error probability on both instances. The two requirements are clearly conflicting. The first requirement makes us want to choose $\mu$ and $\mu’$ far from each other, while the second requirement makes us want to choose them to be close to each other. A lower bound on the regret follows then a simple reasoning: We lower bound the regret in terms of how well it does on a hypothesis testing problem. The conflict between the two objectives is resolved by choosing the means so that the lower bound is maximized.

This strategy as described works in simple cases, but for our case a slight amendment is necessary. The twist we add is this: We choose $\mu$ in a specific way so that one arm has a slightly higher mean (by a constant $\Delta>0$ to be chosen later) than all the other arms, call this $i_E$. Then we find $i_{E’}\ne i_E$ that was least favored in expectation by the algorithm $A$ and increase the mean payoff of this arm by $2\Delta$ to crease environment $E’$. In particular, all other means are the same in $E$ and $E’$, but $i_{E’}$ is the optimal action in $E’$ and $i_E$ is the optimal action in $E$. Further, the absolute gap between the immediate rewards of these actions is $\Delta$ in both environments.

Note that $i_{E’}$ cannot be used more than $\approx n/K$ times on expectation when $A$ is run on $E$. And the two environments differ only in terms of the mean payoff of exactly of action $i_{E’}$. To make the two algorithms essentially indistinguishable, set $\Delta = 1/\sqrt{n/K}$. This will also mean that when $A$ is run on $E’$, since $\Delta$ is so small, the interaction of $A$ and $E’$ will produce a near-identically distributed sequence of action-choices than the distribution when $A$ was used on $E$.

Now, when $A$ is run on $E$ and $i_E$ is chosen fewer than $n(1-1/K)$ times on expectation then it will have more than $\Delta (n-n(1-1/K)) \approx c\sqrt{Kn}$ regret (here, $c$ is a universal constant “whose value can change line by line”). How about algorithms that choose $i_E$ more than $n(1-1/K)$ times when interacting with $E$? By the indistinguishability of $E$ and $E’$ by $A$, this algorithm will choose $i_E$ almost the same number of times on $E’$, too, which means that it cannot choose $i_{E’}$ too often, and in particular, it will have a regret of at least $c\sqrt{Kn}$ on $E’$. Thus, we see that no algorithm will be able to do well uniformly on all instances.

The worst-case instance in this construction clearly depends on $A$, $n$ and $K$. Thus, the analysis has little relevance, for example, if an algorithm $A$ is run on a fixed instance $E$ and we consider the regret on this instance in the long run. We will return to this question later.

Information theory

In order to make the above argument rigorous (and easily generalizable to multiple other settings) we will rely on some classic tools from information theory and statistics. In particular, we will need the concept of relative entropy, also known as the Kullback-Leibler divergence, named of Solomon Kullback and Richard Leibler (KL divergence, for short).

Relative entropy puts a numerical value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ when one expected that $X$’s distribution will be $Q$. This can also be said to be the extra information we gain when $X$ is observed, again relative to the expectation that $X$’s distribution would have been $Q$. As such, relative entropy can be used to answer question like how many observations we need to notice that the distribution of our observations follow $P\ne Q$, hence its relevance to our problems! To explain these ideas, the best is to start with the case when $X$ is finite-valued.

Assume thus that $X$ takes on finitely many values. Let us first discuss how to define the amount of information that observing $X$ conveys. One way to start is to define information as the amount of communication needed if we want to “tell” a friend about the value we observed. If $X$ takes on $N$ distinct values, say, $X\in [N]$ (we can assume this for simplicity and without loss of generality), we may be thinking of using $\lceil\log_2(N)\rceil$ bits no matter the value of $X$. However, if we have the luxury of knowing the probabilities $p_i = \Prob{X=i}$, $i=1,\dots,N$ and we also have the luxury to agree with our friend on a specific coding to be used (before even observing any values) then we can do much better.

The idea is to use a code that respects the probability distribution of $X$. But how many bits should the code of $i$ use? Intuitively, the smaller $p_i$ is, the larger the length of the code of $i$ should be. The coding also has to be such that our friend can tell when the code of a symbol ended or it will be useless (this is impossible if and only if there exists $i\ne j$ such that the respective codes $c_i,c_j$ are such that $c_i$ is a prefix of $c_j$). It turns out that the optimal bit-length for observation $i$ under the previous conditions is $\log_2 1/p_i$ and is achieved by Huffman-coding. This is meant in a limiting sense, when we observe an $X_1,X_2,\dots$ which are independent copies of $X$. Thus, the optimal average code length for this setting is
H(p) = \sum_i p_i \log \frac{1}{p_i}\,.
The astute reader may have noticed that we switched from $\log_2$ to $\log$ here. Effectively, this means that we changed the units in which we measure the length of codes from bits to something else, which happens to be called nats. Of course, changing units does not change the meaning of the definition, just the scale, hence the change should be of no concern. The switch is justified because it simplifies some formulae later.

But is $H$ well-defined at all? (This is a question that should be asked after every definition!) What happens if $p_i=0$ for some $i$? This comes up, for example, when we take $N$ to be larger than the number of possible values that $X$ really takes on, i.e., if we introduce “impossible” values. We expect that introducing such superfluous impossible values should not change the value of $H$. Indeed, it is not hard to see that $\lim_{x\to 0+} x\log(1/x)=0$, suggesting that we should define $p \log(1/p)=0$ when $p=0$ and this is indeed what we will do, which also means that we can introduce as many superfluous symbols as we want without every changing the amount $H$.

Thus, we can agree that for $X\in [N]$, $H(p)$ with $p_i = \Prob{X=i}$ measures the (expected) information content of observing $X$ (actually, in a repeated setting, as described above). Another interpretation of $H$ takes a more global view. According to this view, $H$ measures the amount of uncertainty in $p$ (or, equivalently but redundantly, in $X$). But what do we mean by uncertainty? One approach to define uncertainty is to think of how much one should be surprised to see a particular value of $X$. If $X$ is completely deterministic ($p_i=1$ for some $i$) then there is absolutely no surprise in observing $X$, thus the uncertainty measure should be zero and this is the smallest uncertainty there is. If $X$ is uniformly distributed, then we should be equally surprised by seeing any values of $i$ and the uncertainty should be maximal. If we observe a pair, $(X,Y)$, which are independent of each other, the uncertainty of the pair should be the sum of the uncertainties of $X$ and $Y$. Long story short, it turns out that reasonable definitions of uncertainty actually give rise to $H$ as defined by $\eqref{eq:entropy}$. Since in physics, entropy is the expression used to express the uncertainty of a system, the quantity $H$ is also called the entropy of $p$ (or the entropy of $X$ giving rise to $p$).

Of course, optimal code lengths can also be connected to uncertainty directly: Think of recording the independent observations $X_1,X_2,\dots,X_n$ for $n$ large where $X_i \sim X$ in some notebook. The more uncertainty the distribution of $X$ has, the longer the transcript of $X_1,\dots,X_n$ will be.

We can summarize our discussion so far as follows:

Information = Optimal code length per symbol = Uncertainty

Now, let us return to the definition of relative entropy: As described earlier, relative entropy puts a value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ (equivalently, $p=(p_i)_i$, where $p_i$ is the probability of seeing $X=i$) when one expected that $X$’s distribution will be $Q$ (equivalently, $q=(q_i)$, where $q_i$ is the probability of seeing $X=i$). We also said that relative entropy is the information contained in observing $X$ when $X$ actually comes from $P$ (equivalently, $p$) relative to one’s a priori expecting that $X$ will follow distribution $Q$ (equivalently, $q$).

Let us discuss the second approach: The situation described in more details is this. Information is how much we need to communicate with our friend with whom we agreed on a coding protocol. That we expect $X\sim q$ means that before communication even starts, we agree on using a protocol respecting this knowledge. But then if $X\sim p$ then when we send our friend the messages about our observations, we will use more bits than we could have used had we chosen the protocol that is the best for $p$. More bits means more information? Well, yes and no. More bits means more information relative to one’s expectation that $X\sim q$! However, of course, more bits is not more information relative to knowing the true distribution of course. The code we use has some extra redundancy, in other words it has some excess, which we can say measures the extra information that $X$ contains, relative to our expectation on the distribution of $X$. Working out the math gives that the excess information is
\KL(p,q) &\doteq \sum_i p_i \log \frac{1}{q_i} \, – \,\sum_i p_i \log \frac{1}{p_i} \\
&=\sum_i p_i \log \frac{p_i}{q_i}\,.
If this quantity is larger, we also expect to be able to tell that $p\ne q$ with fewer independent observations sharing $p$ than if this quantity was smaller. For example, when $p_i>0$ and $q_i=0$ for some $i$, the first time we see $i$, we can tell that $p\ne q$. In fact, in this case, $\KL(p,q)=+\infty$: With positive probability a single observation can tell the difference between $p$ and $q$.

Still poking around the definition, what happens when $q_i=0$ and $p_i=0$? This means that the symbol $i$ is superfluous and the value of $\KL(p,q)$ should not be impacted by introducing superfluous symbols. By our earlier convention that $x \log(1/x)=0$ when $x=0$, this works out fine based on the definition after $\doteq$. We also see that the sufficient and necessary condition for $\KL(p,q)<+\infty$ is that for each $i$ such that $q_i=0$, we also have that $p_i=0$. The condition we discovered is also expressed as saying that $p$ is absolutely continuous w.r.t. $q$, which is also written as $p\ll q$. More generally, for two measures $P,Q$ on a common measurable space $(\Omega,\cF)$, we say that $P$ is absolutely continuous with respect to $Q$ (and write $P\ll Q$) if for any $A\in \cF$, $Q(A)=0$ implies that $P(A)=0$ (intuitively, $\ll$ is like $\le$ except that it only constraints the values when the right-hand side is also zero). This brings us back to defining relative entropy between two arbitrary probability distributions $P,Q$ defined over a common probability space. The difficulty we face is that if $X\sim P$ takes on uncountably infinitely many values then we cannot really use the ideas that use communication because no matter what coding we use, we would need infinitely many symbols to describe some values of $X$. How can then be the entropy of $X$ be defined at all? This seems to be a truly fundamental difficulty! Luckily, this impasse gets resolved automatically if we only consider relative entropy. While we cannot communicate $X$, for any finite “discretization” of the the possible values that $X$ can take on, the discretized values can be communicated finitely and all our definitions will work. Formally, if $X$ takes values in the measurable space $(\cX,\cG)$, with $\cX$ possibly having uncountably many elements, a discretization to $[N]$ levels would be specified using some function $f:\cX \to [N]$ that is $\cG/2^{[N]}$-measurable map. Then, the entropy of $P$ relative $Q$, $\KL(P,Q)$ can be defined as
\KL(P,Q) \doteq \sup_{f} \KL( P_f, Q_f )\,,
where $P_f$ is the distribution of $Y=f(X)$ when $X\sim P$ and $Q_f$ is the distribution of $Y=f(X)$ when $X\sim Q$ and the supremum is for all $N\in \N$ and all maps $f$ as defined above. In words, we take all possible discretizations $f$ (with no limit on the “fineness” of the discretization) and define $\KL(P,Q)$ as the excess information when expecting to see $f(X)$ with $X\sim Q$ while reality is $X\sim P$. If this is well-defined (i.e., finite), we expect this to be a reasonable definition. As it turns out and as we shall see it soon, this is indeed a reasonable definition.

To state this, we need the concept of densities, or, Radon-Nykodim derivatives, as they are called in measure theory. Given $P,Q$ as above, the density of $P$ with respect to $Q$ is defined as a $\cF/\cB$-measurable map $f:\Omega \to [0,\infty)$ such that for all $A\in \cF$,
\int_A f(\omega) dQ(\omega) = P(A)\,.
(Here, $\cB$ is the Borel $\sigma$-algebra restricted to $[0,\infty)$.) Recalling that $P(A) = \int_A dP(\omega)$, we would write the above as $f dQ = dP$ meaning that if we integrated this equality over any set $A$, we would get identity. Because of this shorthand notation, when the density of $P$ with respect to $Q$ exists, we also write it as
Note that this equation defines the symbol $\frac{dP}{dQ}$ in terms of the function $f$ that satisfies $\eqref{eq:densitydef}$. In particular, $\frac{dP}{dQ}$ is a nonnegative-valued random variable on $(\Omega,\cF)$. When $\frac{dP}{dQ}$ exists then it follows immediately that $P$ is absolutely continuous with respect to $Q$ (just look at $\eqref{eq:densitydef}$). It turns out that this is both necessary and sufficient for the existence of the density of $P$ with respect to $Q$. This is stated as a theorem in a slightly more general form in that $P,Q$ are not restricted to be probability measures (the condition $P(\Omega)=Q(\Omega)=1$ is thus lifted for them). The definition of density for such measures is still given by $\eqref{eq:densitydef}$. To avoid some pathologies we need to assume that $Q$ is $\sigma$-finite, which means that while $Q(\Omega)=\infty$ may hold, there exists a countable covering $\{A_i\}$ of $\Omega$ with $\cF$-measurable sets such that $Q(A_i)<+\infty$ for each $i$.

Theorem (Radon-Nikodym Theorem):
Let $P,Q$ be measures on a common measurable space $(\Omega,\cF)$ and assume that $Q$ is $\sigma$-finite. Then, the density of $P$ with respect to $Q$, $\frac{dP}{dQ}$, exists if and only if $P$ is absolutely continuous with respect to $Q$. Furthermore, $\frac{dP}{dQ}$ is uniquely defined up to a $Q$-null set, i.e., for any $f_1,f_2$ satisfying $\eqref{eq:densitydef}$, $f_1=f_2$ holds $Q$-almost surely.

Densities work as expected: Recall that the central limit theorem states that under mild conditions, $\sqrt{n} \hat\mu_n$ of $n$ iid random variables $X_1,\dots,X_n$ with common zero mean and variance one converges to a distribution of a standard normal random variable $Z$. Without much thinking, we usually write that the density of this random variable is $g(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$. As it happens, this is indeed the density, but now we know that we need to be more precise: This is the density of $Z$ with respect the Lebesgue measure (which we shall often denote by $\lambda$). The densities of “classical” distributions are almost always defined with respect to the Lebesgue measure. A very useful result is the chain rule, which states that if $P\ll Q \ll S$, then $\frac{dP}{dQ} \frac{dQ}{dS} = \frac{dP}{dS}$. The proof of this result follows from the definition of the densities and the “usual machinery”, which means proving the result holds for simple functions and then applying the monotone convergence theorem to take the limit for any measurable function. The chain rule is often used to reduce the calculation of densities to calculation with known densities. Another useful piece of knowledge that we will need is that if $\nu$ is a counting measure on $([N],2^{[N]})$ then if $P$ is a distribution on $([N],2^{[N]})$ then $\frac{dP}{d\nu}(i) = P(\{i\})$ for all $i\in [N]$.

With this much preparation, we are ready to state the promised result:

Theorem (Relative Entropy) :
Let $(\Omega, \cF)$ be a measurable space and let $P$ and $Q$ be measures on this space.
\KL(P, Q) =\begin{cases}
\int \log \left(\frac{dP}{dQ}(\omega)\right) dP(\omega)\,, & \text{if } P \ll Q\,;\\
+\infty\,, & \text{otherwise}\,.

Note that by our earlier remark, this reduces to $\eqref{eq:discreterelentropy}$ for discrete measures. Also note that the integral is always well defined (but may diverge to positive infinity). If $\lambda$ is a common dominating $\sigma$-finite measure for $P$ and $Q$ (i.e., $P\ll \lambda$ and $Q\ll \lambda$ both hold) then letting $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$, if also $P\ll Q$, the chain rule gives $\frac{dP}{dQ} \frac{dQ}{d\lambda} = \frac{dP}{d\lambda}$, which lets us write
\KL(P,Q) = \int p \log( \frac{p}{q} ) d\lambda\,,
which is perhaps the best known expression for relative entropy, which is also often used as a definition. Note that for probability measures, a common dominating $\sigma$-finite measure can always be bound. For example, $\lambda = P+Q$ always dominates both $P$ and $Q$.

Relative entropy is a kind of “distance” measure between distributions $P$ and $Q$. In particular, if $P = Q$, then $\KL(P, Q) = 0$ and otherwise $\KL(P, Q) > 0$. Strictly speaking, relative entropy is not a distance because it satisfies neither the triangle inequality nor is it symmetric. Nevertheless, it serves the same purpose.

The relative entropy between many standard distributions is often quite easy to compute. For example, the relative entropy between two Gaussians with means $\mu_1, \mu_2 \in \R$ and common variance $\sigma^2$ is
\KL(\mathcal N(\mu_1, \sigma^2), \mathcal N(\mu_2, \sigma^2)) = \frac{(\mu_1 – \mu_2)^2}{2\sigma^2}\,.
The dependence on the difference in means and the variance is consistent with our intuition. If $\mu_1$ is close to $\mu_2$, then the “difference” between the distributions should be small, but if the variance is very small, then there is little overlap and the difference is large. The relative entropy between two Bernoulli distributions with means $p,q \in [0,1]$
\KL(\mathcal B(p), \mathcal B(q)) = p \log \left(\frac{p}{q}\right) + (1-p) \log\left(\frac{(1-p)}{(1-q)}\right)\,,
where $0 \log (\cdot) = 0$. The divergence is infinite if $q \in \set{0,1}$ and $p \in (0,1)$ (because absolute continuity is violated).

To summarize, we have developed the minimax optimality concept and given a heuristic argument that no algorithm can improve on $O(\sqrt{Kn})$ regret in the worst case when the noise is Gaussian. The formal proof of this result will appear in the next post and relies on several of the core concepts from information theory, which we introduced here. Finally we also stated the definition and existence of the Radon-Nikodym derivative (or density), that unifies (and generalizes) the probability distribution (for discrete spaces as learned in high school) and a probability density.


Note 1: The worst-case regret has a natural game-theoretic interpretation. This is best described if you imagine a game between a protagonist and an antagonist that works as follows: Given $K>0$ and $n>1$, the protagonist proposes a bandit policy $A$, while the antagonist, after looking at the policy chosen, picks a bandit environment $E$ from the class of environments considered (in our case from the class $\cE_K$). After both players made their choices, the game unfold by the bandit policy deployed in the chosen environment. This leads to some value for the expected regret, which denote by $R_n(A,E)$. The payoff for the antagonist is then $R_n(A,E)$, while the payoff for the protagonist is $-R_n(A,E)$, i.e., the game is zero sum. Both players aim at maximizing their payoffs. The game is completely described by $n,K$ and $\cE$. One characteristic value in a game is its minimax value. As described above, this is a sequential game (the protagonist moves first, followed by the antagonist). The minimax value of this game, from the perspective of the antagonist, is then exactly $R_n^*(\cE)$, while it is $\sup_A \inf_E -R_n(A,E) = -R_n^*(\cE)$ from the perspective of the protagonist.

Note 2: By its definition, $R_n^*(\cE)$ is a worst-case measure. Should we worry about that an algorithm that optimizes for the worst-case may perform very poorly on specific instances? Intuitively, a bandit problem where the action gaps are quite large or quite small should be easier, the former because large gaps require fewer observations to detect, while smaller gaps can be just ignored. Yet, we can perhaps imagine policies that are worst-case optimal whose regret is $R_n^*(\cE)$ regardless of the nature of the instance that they are running on. When such an policy would be used on an “easy” instance as described above, we would certainly be disappointed to see it suffer a large regret. Ideally, what we would like is if in some sense the policy we end up using with would be optimal for every instance, or, in short, if it was instance-optimal. Instance-optimality in its vanilla form is a highly dubious idea: After all, the best regret on any given instance is always zero! How could then an algorithm achieve this for every instance of some nontrivial class $\cE$ (a class $\cE$ is trivial if it contains environments where it is always optimal to choose, say, action one)? This is clearly impossible. The problem comes from that the best policy for a given instance is a very dumb policy for almost all the other instances. We certainly do not wish to choose such a dumb policy! In other words, we want both good performance on individual instances while we also want good worst-case performance on the same horizon $n$. Or we may want to achieve both good performance on individual instances while we also demand good asymptotic behavior on any instance. In any case, the idea is to restrict the class $\cA_{n,K}$ by ruling out all the dumb policies. If $\cA_{n,K}^{\rm smart}$ is the resulting class, then the target is modified to achieve $R_n^*(E) \doteq \inf_{A\in \cA_{n,K}^{\rm smart}} R_n(A,E)$, up to, say, a constant factor. We will return to various notions of instance-optimality later and in particular will discuss whether for some selections of the class of smart policies, instance-optimal (or near-instance optimal) policies exist.

Note 3: The theorem that connects our definition of relative entropies to densities, i.e., the “classic definition”, can be found e.g., Section 5.2 of the information theory book of Gray.

The Upper Confidence Bound Algorithm

We now describe the celebrated Upper Confidence Bound (UCB) algorithm that overcomes all of the limitations of strategies based on exploration followed by commitment, including the need to know the horizon and sub-optimality gaps. The algorithm has many different forms, depending on the distributional assumptions on the noise.

Optimism in the face of uncertainty

Optimism in the face of uncertainty but on overdose: Not recommended!

The algorithm is based on the principle of optimism in the face of uncertainty, which is to choose your actions as if the environment (in this case bandit) is as nice as is plausibly possible. By this we mean that the unknown mean payoffs of each arm is as large as plausibly possible based on the data that has been observed (unfounded optimism will not work — see the illustration on the right!). The intuitive reason that this works is that when acting optimistically one of two things happens. Either the optimism was justified, in which case the learner is acting optimally, or the optimism was not justified. In the latter case the agent takes some action that they believed might give a large reward when in fact it does not. If this happens sufficiently often, then the learner will learn what is the true payoff of this action and not choose it in the future. The careful reader may notice that this explains why this rule will eventually get things right (it will be “consistent” in some sense), but the argument does not quite explain why an optimistic algorithm should actually be a good algorithm among all consistent ones. However, before getting to this, let us clarify what we mean by plausible.

Recall that if $X_1, X_2,\ldots, X_n$ are independent and $1$-subgaussian (which means that $\E[X_i] = 0$) and $\hat \mu = \sum_{t=1}^n X_t / n$, then
\Prob{\hat \mu \geq \epsilon} \leq \exp\left(-n\epsilon^2 / 2\right)\,.
Equating the right-hand side with $\delta$ and solving for $\epsilon$ leads to
\Prob{\hat \mu \geq \sqrt{\frac{2}{n} \log\left(\frac{1}{\delta}\right)}} \leq \delta\,.
This analysis immediately suggests a definition of “as large as plausibly possible”. Using the notation of the previous post, we can say that when the learner is deciding what to do in round $t$ it has observed $T_i(t-1)$ samples from arm $i$ and observed rewards with an empirical mean of $\hat \mu_i(t-1)$ for it. Then a good candidate for the largest plausible estimate of the mean for arm $i$ is
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)}\,.
Then the algorithm chooses the action $i$ that maximizes the above quantity. If $\delta$ is chosen very small, then the algorithm will be more optimistic and if $\delta$ is large, then the optimism is less certain. We have to be very careful when comparing the above display to \eqref{eq:simple-conc} because in one the number of samples is the constant $n$ and in the other it is a random variable $T_i(t-1)$. Nevertheless, this is in some sense a technical issue (that needs to be taken care of properly, of course) and the intuition remains that $\delta$ is approximately an upper bound on the probability of the event that the above quantity is an underestimate of the true mean.

The value of $1-\delta$ is called the confidence level and different choices lead to different algorithms, each with their pros and cons, and sometimes different analysis. For now we will choose $1/\delta = f(t)= 1 + t \log^2(t)$, $t=1,2,\dots$. That is, $\delta$ is time-dependent, and is decreasing to zero slightly faster than $1/t$. Readers are not (yet) expected to understand this choice whose pros and cons we will discuss later. In summary, in round $t$ the UCB algorithm will choose arm $A_t$ given by
A_t = \begin{cases}
\argmax_i \left(\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}}\right)\,, & \text{if } t > K\,; \\
t\,, & \text{otherwise}\,.
The reason for the cases is that the term inside the square root is undefined if $T_i(t-1) = 0$ (as it is when $t = 1$), so we will simply have the algorithm spend the first $K$ rounds choosing each arm once. The value inside the argmax is called the index of arm $i$. Generally speaking, an index algorithm chooses the arm in each round that maximizes some value (the index), which usually only depends on current time-step and the samples from that arm. In the case of UCB, the index is the sum of the empirical mean of rewards experienced and the so-called exploration bonus, also known as the confidence width.

Illustration of UCB with 2 actions. The true means are shown in red ink. The observations are shown by crosses. Action 2 received fewer observations than Action 1. Hence, although its empirical mean is about the same as that of Action 1, Action 2 will be chosen in the next round.

Besides the slightly vague “optimism guarantees optimality or learning” intuition we gave before, it is worth exploring other intuitions for this choice of index. At a very basic level, we should explore arms more often if they are (a) promising (in that $\hat \mu_i(t-1)$ is large) or (b) not well explored ($T_i(t-1)$ is small). As one can plainly see from the definition, the UCB index above exhibits this behaviour. This explanation is unsatisfying because it does not explain why the form of the functions is just so.

An alternative explanation comes from thinking of what we expect from any reasonable algorithm. Suppose in some round we have played some arm (let’s say arm $1$) much more frequently than the others. If we did a good job designing our algorithm we would hope this is the optimal arm. Since we played it so much we can expect that $\hat \mu_1(t-1) \approx \mu_1$. To confirm the hypothesis that arm $1$ is indeed optimal the algorithm better be highly confident about that other arms are indeed worse. This leads very naturally to confidence intervals and the requirement that $T_i(t-1)$ for other arms $i\ne 1$ better be so large that
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_1\,,
because, at a confidence level of $1-\delta$ this guarantees that $\mu_i$ is smaller than $\mu_1$ and if the above inequality did not hold, the algorithm would not be justified in choosing arm $1$ much more often than arm $i$. Then, planning for $\eqref{eq:ucbconstraint}$ to hold makes it reasonable to follow the UCB rule as this will eventually guarantee that this inequality holds when arm $1$ is indeed optimal and arm $i$ is suboptimal. But how to choose $\delta$? If the confidence interval fails, by which we mean, if actually it turns out that arm $i$ is optimal and by unlucky chance it holds that
\hat \mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)} \log\left(\frac{1}{\delta}\right)} \leq \mu_i\,,
then arm $i$ can be disregarded even though it is optimal. In this case the algorithm may pay linear regret (in $n$), so it better be the case that the failure occurs with about $1/n$ probability to fix the upper bound on the expected regret to be constant for the case when the confidence interval fails. Approximating $n \approx t$ leads then (after a few technicalities) to the choice of $f(t)$ in the definition of UCB given in \eqref{eq:ucb}. With this much introduction, we state the main result of this post:

Theorem (UCB Regret): The regret of UCB is bounded by
\begin{align} \label{eq:ucbbound}
R_n \leq \sum_{i:\Delta_i > 0} \inf_{\epsilon \in (0, \Delta_i)} \Delta_i\left(1 + \frac{5}{\epsilon^2} + \frac{2}{(\Delta_i – \epsilon)^2} \left( \log f(n) + \sqrt{\pi \log f(n)} + 1\right)\right)\,.
\begin{align} \label{eq:asucbbound}
\displaystyle \limsup_{n\to\infty} R_n / \log(n) \leq \sum_{i:\Delta_i > 0} \frac{2}{\Delta_i}\,.

Note that in the first display, $\log f(n) \approx \log(n) + 2\log\log(n)$. We thus see that this bound scales logarithmically with the length of the horizon and is able to essentially reproduce the bound that we obtained for the unfeasible version of ETC with $K=2$ (when we tuned the exploration time based on the knowledge of $\Delta_2$). We shall discuss further properties of this bound later, but now let us present a simpler version of the above bound, avoiding all these epsilons and infimums that make for a confusing theorem statement. By choosing $\epsilon = \Delta_i/2$ inside the sum leads to the following corollary:

Corollary (UCB Simplified Regret): The regret of UCB is bounded by
R_n \leq \sum_{i:\Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i}\left(8 \log f(n) + 8\sqrt{\pi \log f(n)} + 28\right)\right)\,.
and in particular there exists some universal constant $C>0$ such that for all $n\ge 2$, $R_n \le \sum_{i:\Delta_i>0} \left(\Delta_i + \frac{C \log n}{\Delta_i}\right)$.

Note that taking the limit of the ratio of the bound above and $\log(n)$ does not result in the same rate as in the theorem, which is the main justification for introducing the epsilons in the first place. In fact, as we shall see the asymptotic bound on the regret given in \eqref{eq:asucbbound}, which is derived from~\eqref{eq:ucbbound} by choosing $\epsilon = \log^{-1/4}(n)$, is unimprovable in a strong sense.

The proof of the theorem relies on the basic regret decomposition identity that expresses the expected regret as the weighted sum of the expected number of times the suboptimal actions are chosen. So why will $\EE{T_i(n)}$ be small for a suboptimal action $i$? This is based on a couple of simple observations: First, (disregarding the initial period when all arms are chosen once) the suboptimal action $i$ can only be chosen if its UCB index is higher than that of an optimal arm. Now, this can only happen if the UCB index of action $i$ is “too high”, i.e., higher than $\mu^*-\epsilon>\mu_i$ or the UCB index of that optimal arm is “too low”, i.e., if it is below $\mu^*-\epsilon<\mu^*$. Since the UCB index of any arm is with reasonably high probability an upper bound on the arm’s mean, we don’t expect the index of any arm to be below its mean. Hence, the total number of times when the optimal arm’s index is “too low” (as defined above) is expected to be negligibly small. Furthermore, if the sub-optimal arm $i$ is played sufficiently often, then its exploration bonus becomes small and simultaneously the empirical estimate of its mean converges to the true value, making the expected total number of times when its index stays above $\mu^*-\epsilon$ small.

We start with a useful lemma that will help us quantify the last argument.

Lemma Let $X_1,X_2,\ldots$ be a sequence of independent $1$-subgaussian random variables, $\hat \mu_t = \sum_{s=1}^t X_s / t$, $\epsilon > 0$ and
\kappa = \sum_{t=1}^n \one{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon}\,.
Then, $\displaystyle \E[\kappa] \leq 1 + \frac{2}{\epsilon^2} (a + \sqrt{\pi a} + 1)$.

Because the $X_i$ are $1$-subgaussian and independent we have $\E[\hat \mu_t] = 0$, so we cannot expect $\hat \mu_t + \sqrt{2a/t}$ to be smaller than $\epsilon$ until $t$ is at least $2a/\epsilon^2$. The lemma confirms that this is indeed of the right order as an estimate for $\EE{\kappa}$.

Let $u = 2a \epsilon^{-2}$. Then, by the concentration theorem for subgaussian variables,
&\leq u + \sum_{t=\ceil{u}}^n \Prob{\hat \mu_t + \sqrt{\frac{2a}{t}} \geq \epsilon} \\
&\leq u + \sum_{t=\ceil{u}}^n \exp\left(-\frac{t\left(\epsilon – \sqrt{\frac{2a}{t}}\right)^2}{2}\right) \\
&\leq 1 + u + \int^\infty_u \exp\left(-\frac{t\left(\epsilon – \sqrt{\frac{2a}{t}}\right)^2}{2}\right) dt \\
&= 1 + \frac{2}{\epsilon^2}(a + \sqrt{\pi a} + 1)\,.

Before the proof of the UCB regret theorem we need a brief diversion back to the bandit model. We have defined $\hat \mu_i(t)$ as the empirical mean of the $i$th arm after the $t$th round, which served us well enough for the analysis of the explore-then-commit strategy where the actions were chosen following a deterministic rule. For UCB it is very useful also to have $\hat \mu_{i,s}$, the empirical average of the $i$th arm after $s$ observations from that arm, which occurs at a random time (or maybe not at all). To define $\hat \mu_{i,s}$ rigorously, we argue that without the loss of generality one may assume that the reward $X_t$ received in round $t$ comes from choosing the $T_i(t)$th element from the reward sequence $(Z_{i,s})_{1\le s \le n}$ associated with arm $i$, where $(Z_{i,s})_s$ is an i.i.d. sequence with $Z_{i,s}\sim P_i$. Formally,
X_t = Z_{A_t,T_{A_t}(t)}\,.
The advantage of introducing $(Z_{i,s})_s$ is that it allows a clean definition (without $Z_{i,s}$, how does one even define $\hat \mu_{i,s}$ if $T_i(n) \leq s$?). In particular, we let
\hat \mu_{i,s} &= \frac{1}{s} \sum_{u=1}^s Z_{i,u}\,.
Note that $\hat \mu_{i,s} = \hat \mu_i(t)$ when $T_i(t)=s$ (formally: $\hat \mu_{i,T_i(t)} = \hat \mu_i(t)$).

Proof of Theorem
As in the analysis of the explore-then-commit strategy we start by writing the regret decomposition.
R_n = \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]\,.
The rest of the proof revolves around bounding $\E[T_i(n)]$. Let $i$ be some sub-optimal arm (so that $\Delta_i > 0$). Following the suggested intuition we decompose $T_i(n)$ into two terms. The first measures the number of times the index of the optimal arm is less than $\mu_1 – \epsilon$. The second term measures the number of times that $A_t = i$ and its index is larger than $\mu_1 – \epsilon$.
&= \sum_{t=1}^n \one{A_t = i} \nonumber \\
&\leq \sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon} + \nonumber \\
&\qquad \sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}\,. \label{eq:ucb1}
The proof of the first part of the theorem is completed by bounding the expectation of each of these two sums. Starting with the first, we again use the concentration guarantee.
\EE{\sum_{t=1}^n \one{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon}}
&= \sum_{t=1}^n \Prob{\hat \mu_1(t-1) + \sqrt{\frac{2 \log f(t)}{T_1(t-1)}} \leq \mu_1 – \epsilon} \\
&\leq \sum_{t=1}^n \sum_{s=1}^n \Prob{\hat \mu_{1,s} + \sqrt{\frac{2 \log f(t)}{s}} \leq \mu_1 – \epsilon} \\
&\leq \sum_{t=1}^n \sum_{s=1}^n \exp\left(-\frac{s\left(\sqrt{\frac{2 \log f(t)}{s}} + \epsilon\right)^2}{2}\right) \\
&\leq \sum_{t=1}^n \frac{1}{f(t)} \sum_{s=1}^n \exp\left(-\frac{s\epsilon^2}{2}\right) \\
&\leq \frac{5}{\epsilon^2}\,.
The first inequality follows from the union bound over all possible values of $T_1(t-1)$. This is an important point. The concentration guarantee cannot be applied directly because $T_1(t-1)$ is a random variable and not a constant. The last inequality is an algebraic exercise. The function $f(t)$ was chosen precisely so this bound would hold. If $f(t) = t$ instead, then the sum would diverge. Since $f(n)$ appears in the numerator below we would like $f$ to be large enough that its reciprocal is summable and otherwise as small as possible. For the second term in \eqref{eq:ucb1} we use the previous lemma.
&\EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}} \\
&\qquad\leq \EE{\sum_{t=1}^n \one{\hat \mu_i(t-1) + \sqrt{\frac{2 \log f(n)}{T_i(t-1)}} \geq \mu_1 – \epsilon \text{ and } A_t = i}} \\
&\qquad\leq \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} + \sqrt{\frac{2 \log f(n)}{s}} \geq \mu_1 – \epsilon}} \\
&\qquad= \EE{\sum_{s=1}^n \one{\hat \mu_{i,s} – \mu_i + \sqrt{\frac{2 \log f(n)}{s}} \geq \Delta_i – \epsilon}} \\
&\qquad\leq 1 + \frac{2}{(\Delta_i – \epsilon)^2} \left(\log f(n) + \sqrt{\pi \log f(n)} + 1\right)\,.
The first part of the theorem follows by substituting the results of the previous two displays into \eqref{eq:ucb1}. The second part follows by choosing $\epsilon = \log^{-1/4}(n)$ and taking the limit as $n$ tends to infinity.

Next week we will see that UCB is close to optimal in several ways. As with the explore-then-commit strategy, the bound given in the previous theorem is not meaningful when the gaps $\Delta_i$ are small. Like that algorithm it is possible to prove a distribution-free bound for UCB by treating the arms $i$ with small $\Delta_i$ differently. Fix $\Delta>0$ to be chosen later. Then, from the proof of the bound on the regret of UCB we can derive that $\EE{T_i(n)} \le \frac{C \log(n)}{\Delta_i^2}$ holds for all $n\ge 2$ with some universal constant $C>0$. Hence, the regret can be bounded without dependence on the sub-optimality gaps by
&= \sum_{i:\Delta_i > 0} \Delta_i \E[T_i(n)]
= \sum_{i:\Delta_i < \Delta} \Delta_i \E[T_i(n)] + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)] \\
&< n \Delta + \sum_{i:\Delta_i \geq \Delta} \Delta_i \E[T_i(n)]
\leq n \Delta + \sum_{i:\Delta_i \geq \Delta} \frac{C \log n}{\Delta_i} \\
&\leq n \Delta + K\frac{C \log n}{\Delta}
= \sqrt{C K n \log(n)}\,,
where in the last step we chose $\Delta = \sqrt{K C \log(n) / n}$, which optimizes the upper bound.

There are many directions to improve or generalize this result. For example, if more is known about the noise model besides that it is subgaussian, then this can often be exploited to improve the regret. The main example is the Bernoulli case, where one should make use of the fact that the variance is small when the mean is close to zero or one. Another direction is improving the worst-case regret to match the lower bound of $\Omega(\sqrt{Kn})$ that we will see next week. This requires a modification of the confidence level and a more complicated analysis.


Note 1: Here we argue that there is no loss in generality in assuming that the rewards experienced satisfy $\eqref{eq:rewardindepmodel}$. Indeed, let $T’ = (A’_1,X’_1,\dots,A’_n,X’_n)$ be any sequence of random variables satisfying that $A_t’ = f_t(A’_1,X’_1,\dots,A’_{t-1},X’_{t-1})$ and that for any $U\subset \R$ open interval
\Prob{X_t’\in U\,|\,A’_1,X’_1,\dots,A’_{t-1},X’_{t-1},A’_t} = P_{A’_t}(U)\,,
where $1\le t\le n$. Then, choosing $(Z_{i,s})_s$ as described in the paragraph before $\eqref{eq:rewardindepmodel}$, we let $T=(A_1,X_1,\dots,A_n,X_n)$ be such that $A_t = f_t(A_1,X_1,\dots,A_{t-1},X_{t-1})$ and $X_t$ be so that it satisfies $\eqref{eq:rewardindepmodel}$. It is not hard to see then that the distributions of $T$ and $T’$ agree. Hence, there is indeed no loss of generality by assuming that the rewards are indeed generated by $\eqref{eq:rewardindepmodel}$.

Note 2: The view that $n$ rewards are generated ahead of time for each arm and the algorithm consumes these rewards as it chooses an action was helpful in the proof as it reduced the argument to the study of averages of independent random variables. The analysis could also have been done directly without relying on the “virtual” rewards $(Z_{i,s})_s$ with the help of martingales, which we will meet later.
A third model of how $X_t$ is generated could have been that $X_t = Z_{A_t,t}$. We will meet this “skipping model” later when studying adversarial bandits. For the stochastic bandit models we study here, all these models coincide (they are indistinguishable in the sense described in the first note above).

Note 3: So is the optimism principle universal? Does it always give good algorithms, even in more complicated settings? Unfortunately, the answer is no. The optimism principle leads to reasonable algorithms when using an action gives feedback that informs the learner about how much the action is worth. If this is not true (i.e., in models where you have to choose action $B$ to learn about the rewards of action $A$, and choosing action $A$ would not give you information about the reward of action $A$), the principle fails! (Why?) Furthermore, even if all actions give information about their own value, the optimistic principle may give rise to algorithms whose regret is overly large compared to what could be achieved with more clever algorithms. Thus, in a way, finite-armed stochastic bandits is a perfect fit for optimistic algorithms. While the more complex feedback models may not make much sense at the moment, we will talk about them later.


The idea of using upper confidence bounds appeared in ’85 in the landmark paper of Lai and Robbins. In this paper they introduced a strategy which plays the leader of the “often sampled” actions except that for any action $j$ in every $K$th round the strategy is checking whether the UCB index of arm $j$ is higher than the estimated reward of the leader. They proved that this strategy, when appropriately tuned, is asymptotically unimprovable the same way UCB as we defined it is asymptotically unimprovable (we still owe the definition of this and a proof, which will come soon). The cleaner UCB idea must have been ready to be found in ’95 because Agrawal and Katehakis & Robbins discovered this idea independently in that year. Auer et al. later modified the strategy slightly and proved a finite-time analysis.