Optimality concepts and information theory

In this post we introduce the concept of minimax regret and reformulate our previous result on the upper bound on the worst-case regret of UCB in terms of the minimax regret. We briefly discuss the strengths and weaknesses of using minimax regret as a performance yardstick. Then we discuss with more precision what we meant when we mentioned that UCB’s worst-case regret cannot be greatly improved, and sketch the proof (that will be included in all its glory in the next post). We finish by introducing a few of the core concepts of information theory, and especially the relative entropy that will be crucial for the formal lower bound proofs.

How to measure regret?

Previously we have seen that in the worst case the UCB algorithm suffers a regret of at most $O(\sqrt{Kn \log(n)})$. It will prove useful to restate this result here. The following notation will be helpful: Let $\cE_K$ stand for the class of $K$-action stochastic environments where the rewards for action $i\in [K]$ are generated from a distribution $P_i$ with mean $\mu_i\in [0,1]$ where $P_i$ is such that if $X\sim P_i$ then $X-\mu_i$ is $1$-subgaussian (environments with this property will be called $1$-subgaussian). Our previous result then reads as follows:

Theorem (Worst-case regret-bound for UCB): There exists a constant $C>0$ such that the following hold: For all $K>0$, $E\in \cE_K$, $n>1$ if $R_n$ is the (expected) regret of UCB when interacting with $E$,
\begin{align}
\label{eq:UCBbound}
R_n \le C \sqrt{K n \log(n)}\,.
\end{align}

(In the future when we formulate statements like this, we will say that $C$ is a universal constant. The constant is universal in the sense that its value does not depend on the particulars of how $K$, $n$, or $E$ as long as these satisfy the constraints mentioned.)

We call the bound on the right hand-side of the above display a worst-case bound, because it holds even for the environment that is the worst for UCB. Worst-case bounds can point to certain weaknesses of an algorithm. But what did we mean when we said that this bound cannot be improved in any significant manner? In simple terms, this means that no matter the policy, there will be an environment on which the policy achieves almost the same regret as the upper bound on the above side.

This can be concisely written in terms of the minimax regret. The minimax regret is defined for a given horizon $n>0$ and a fixed number of actions $K>0$ as follows. Let $\cA_{n,K}$ stand for the set of all possible policies on horizon $n$ and action set $[K]$. Then, the minimax regret for horizon $n$ and environment class $\cE$ (such as $\cE_K$ above) is defined as
\begin{align*}
R_n^*(\cE) = \inf_{A\in \cA_{n,K}} \sup_{E\in \cE} R_n(A,E)\,.
\end{align*}
Note how this value is independent of any specific choice of a policy, but only depends on $n$, $\cE$ and $K$ (the dependence on $K$ is hidden in $\cE$). Further, no matter how a policy $A$ is chosen, $R_n^*(\cE)\le R_n^*(A,\cE) \doteq \sup_{E\in \cE} R_n(A,E)$, i.e., the $n$-round minimax regret of $\cE$ is a lower bound on the worst-case regret of any policy $A$. If we lower bound the minimax regret, the lower bound tells us that no policy at all exist that can do better than the lower bound. In other words, such a bound is independent of the policy chosen, it shows the fundamental hardness of the problem.

A minimax optimal policy is a policy $A$ whose worst-case regret is equal to the minimax regret: $R_n^*(A,\cE) = R_n^*(\cE)$. Finding a minimax optimal policy is often exceptionally hard; hence we often resort to the weaker goal of finding a near-minimax policy, i.e., a policy whose worst-case regret is at most a constant multiple of $R_n^*(\cE)$. Technically, this makes sense only if this holds universally over $n$, $K$ and $\cE$ (within some limits): To define this formally, we would need to demand that the policy takes as input $n,K$, but we skip the formal definition, hoping that the idea is clear. UCB, as we specified earlier, can be thought as such a “universal” policy for the class of all $1$-subgaussian environments (and it does not even use the knowledge of $n$). When near-minimaxity is too strong as defined above, we may relax it by demanding that the ratio of $R_n^*(A,\cE)/R_n^*(\cE)$ is bounded by a logarithmic function of $\max(1,R_n^*(\cE))$; the idea being that the logarithm of any number is so substantially smaller than itself that it can be neglected. UCB is near-minimax optimal in this sense.

The value $R_n^*(\cE)$ is of interest on its own. In particular, a small value of $R_n^*(\cE)$ indicates that the underlying bandit problem is less challenging in the worst-case sense. A core activity in bandit theory (and learning theory, more generally) is to understand what makes $R_n^*(\cE)$ large or small, often focusing on its behavior as a function of the number of rounds $n$. This focus is justified when $n$ gets large, but often other quantities, such as the number of actions $K$, are also important.

Since $R_n^*(\cE_K) \le R_n(\mathrm{UCB},\cE_K)$, the above theorem immediately gives that
\begin{align}
\label{eq:minimaxUCBupper}
R_n^*(\cE_K) \le C \sqrt{ K n \log(n)}\,,
\end{align}
while our earlier discussion of the explore-then-commit strategies $\mathrm{ETC}_{n,m}$ with a fixed commitment time $m$ can be concisely stated as
\begin{align}
\inf_{m} R_n^*(\mathrm{ETC}_{n,m}, \cE_K) \asymp n^{2/3}\,,
\end{align}
where $\asymp$ hides a constant proportionality factor which is independent of $n$ (the factor scales with $K^{1/3}$).
Thus, we see that UCB does better than ETC with the best a priori $n$-dependent choice of $m$ in a worst-case sense, as $n$ gets large. But perhaps there are are even better policies than UCB? The answer was already given above, but let us state it this time in the form of a theorem:

Theorem (Worst-case regret lower-bound): For any $K\ge 2$, $n\ge 2$,
\begin{align}\label{eq:Karmedminimaxlowerbound}
R_n^*(\cE_K) \ge c \sqrt{Kn}
\end{align}
for some universal constant $c>0$.

In particular, we see that UCB is near-minimax optimal. But how to prove that the above result? The intuition is relatively simple and can be understood by just studying Gaussian tails.

Lower bounding ideas

Given $n$ i.i.d. observations from a Gaussian distribution with a known variance of say one and an unknown mean $\mu$, the sample mean of the observations is a sufficient statistic for the mean. This means that the observations can be replaced by the sample mean $\hat\mu$ without losing any information. The distribution of the sample is also Gaussian, with the same mean as the unknown mean and a variance of $1/n$. Now, assume that we told that the unknown mean can take on only two values: Either it is (say) zero, or $\Delta$, which, without loss of generality, we assume to be positive. The task is to decide whether we are in the first (when $\mu=0$), or the second case (when $\mu=\Delta$). As we said before, there is no loss of generality assuming that the decision is based on $\hat\mu$ only. If $n$ is large compared to $\Delta$, intuitively we feel that the decision is easy: If $\hat\mu$ is closer to zero than to $\Delta$, choose zero, otherwise choose $\Delta$. No matter then whether $\mu$ was indeed zero or $\Delta$, the probability of an incorrect choice will be low. How low? An easy upper bound based on our earlier upper bound on the tails of a Gaussian (see here) is
\begin{align*}
(2\pi n (\Delta/2)^2)^{-1/2} \exp(-n (\Delta/2)^2/2) \le \exp(-\frac{n(\Delta/2)^2}{2})\,,
\end{align*}
where the inequality assumed that $2\pi n (\Delta/2)^2\ge 1$). We see that the error probability decays exponentially with $n$, but the upper would still be useless if $\Delta$ was small compared to $n$! Can we do better than this? One might believe the decision procedure could be improved, but the symmetry of the problem makes this seem improbable.

The other possibility is that the upper bound on the Gaussian tails that we use is loose. This turns out not to be the case either. With a calculation slightly more complex than the one given before (using integration by parts, e.g., see here), we can show that if $X$ has the standard normal distribution,
\begin{align*}
\Prob{X>\eps} \ge \left(\frac{1}{\eps}-\frac{1}{\eps^3}\right)\, \frac{\exp( – \frac{\eps^2}{2} )}{\sqrt{2\pi}}
\end{align*}
also holds, showing that there is basically no room for improvement. In particular, this means that if $n(\Delta/2)^2/2=c$ ($\Delta \le \sqrt{8c/n}$, or, equivalently, $n\le 8c/\Delta^2$) then the probability that anyprocedure will make a mistake in either of the two cases when $\mu \in \{0,\Delta\}$ is at least $C c^{-1/2} (1-1/c)\exp(-c)>0$. This puts a lower limit on how reliably we can decide between the two given alternatives. The take home message is that if $n$ is small compared to the differences in the means that we need to test for, then we are facing an impossible mission.

The problem as described is the most classic hypothesis testing problem there is. The ideas underlying this argument are core to any arguments that show an impossibility result in a statistical problem. The next task is to reduce our bandit problem to a hypothesis testing problem as described here.

The high level idea is to select two bandit problem instances. For simplicity, we select two bandit instances $E,E’$ where the reward distributions are Gaussians with a unit variance. The vector of means are in $[0,1]^K$, thus the instances are in $\cE_K$. Let the means of the two instances be $\mu$ and $\mu’$, respectively. Our goal is to choose $E,E’$ (alternatively, $\mu$ and $\mu’$) in such a way that the following two conditions hold simultaneously:

  1. Competition: Algorithms doing well on one instance, cannot do well on the other instance.
  2. Similarity: The instances are “close” so that no matter what policy interacts with them, given the observations, the two instances are indistinguishable as in the hypothesis testing example above.

In particular, for the second requirement we want that any way of interacting with the instances results in data that no decision procedure that needs to decide about which instance it was running on can achieve a low error probability on both instances. The two requirements are clearly conflicting. The first requirement makes us want to choose $\mu$ and $\mu’$ far from each other, while the second requirement makes us want to choose them to be close to each other. A lower bound on the regret follows then a simple reasoning: We lower bound the regret in terms of how well it does on a hypothesis testing problem. The conflict between the two objectives is resolved by choosing the means so that the lower bound is maximized.

This strategy as described works in simple cases, but for our case a slight amendment is necessary. The twist we add is this: We choose $\mu$ in a specific way so that one arm has a slightly higher mean (by a constant $\Delta>0$ to be chosen later) than all the other arms, call this $i_E$. Then we find $i_{E’}\ne i_E$ that was least favored in expectation by the algorithm $A$ and increase the mean payoff of this arm by $2\Delta$ to crease environment $E’$. In particular, all other means are the same in $E$ and $E’$, but $i_{E’}$ is the optimal action in $E’$ and $i_E$ is the optimal action in $E$. Further, the absolute gap between the immediate rewards of these actions is $\Delta$ in both environments.

Note that $i_{E’}$ cannot be used more than $\approx n/K$ times on expectation when $A$ is run on $E$. And the two environments differ only in terms of the mean payoff of exactly of action $i_{E’}$. To make the two algorithms essentially indistinguishable, set $\Delta = 1/\sqrt{n/K}$. This will also mean that when $A$ is run on $E’$, since $\Delta$ is so small, the interaction of $A$ and $E’$ will produce a near-identically distributed sequence of action-choices than the distribution when $A$ was used on $E$.

Now, when $A$ is run on $E$ and $i_E$ is chosen fewer than $n(1-1/K)$ times on expectation then it will have more than $\Delta (n-n(1-1/K)) \approx c\sqrt{Kn}$ regret (here, $c$ is a universal constant “whose value can change line by line”). How about algorithms that choose $i_E$ more than $n(1-1/K)$ times when interacting with $E$? By the indistinguishability of $E$ and $E’$ by $A$, this algorithm will choose $i_E$ almost the same number of times on $E’$, too, which means that it cannot choose $i_{E’}$ too often, and in particular, it will have a regret of at least $c\sqrt{Kn}$ on $E’$. Thus, we see that no algorithm will be able to do well uniformly on all instances.

The worst-case instance in this construction clearly depends on $A$, $n$ and $K$. Thus, the analysis has little relevance, for example, if an algorithm $A$ is run on a fixed instance $E$ and we consider the regret on this instance in the long run. We will return to this question later.

Information theory

In order to make the above argument rigorous (and easily generalizable to multiple other settings) we will rely on some classic tools from information theory and statistics. In particular, we will need the concept of relative entropy, also known as the Kullback-Leibler divergence, named of Solomon Kullback and Richard Leibler (KL divergence, for short).

Relative entropy puts a numerical value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ when one expected that $X$’s distribution will be $Q$. This can also be said to be the extra information we gain when $X$ is observed, again relative to the expectation that $X$’s distribution would have been $Q$. As such, relative entropy can be used to answer question like how many observations we need to notice that the distribution of our observations follow $P\ne Q$, hence its relevance to our problems! To explain these ideas, the best is to start with the case when $X$ is finite-valued.

Assume thus that $X$ takes on finitely many values. Let us first discuss how to define the amount of information that observing $X$ conveys. One way to start is to define information as the amount of communication needed if we want to “tell” a friend about the value we observed. If $X$ takes on $N$ distinct values, say, $X\in [N]$ (we can assume this for simplicity and without loss of generality), we may be thinking of using $\lceil\log_2(N)\rceil$ bits no matter the value of $X$. However, if we have the luxury of knowing the probabilities $p_i = \Prob{X=i}$, $i=1,\dots,N$ and we also have the luxury to agree with our friend on a specific coding to be used (before even observing any values) then we can do much better.

The idea is to use a code that respects the probability distribution of $X$. But how many bits should the code of $i$ use? Intuitively, the smaller $p_i$ is, the larger the length of the code of $i$ should be. The coding also has to be such that our friend can tell when the code of a symbol ended or it will be useless (this is impossible if and only if there exists $i\ne j$ such that the respective codes $c_i,c_j$ are such that $c_i$ is a prefix of $c_j$). It turns out that the optimal bit-length for observation $i$ under the previous conditions is $\log_2 1/p_i$ and is achieved by Huffman-coding. This is meant in a limiting sense, when we observe an $X_1,X_2,\dots$ which are independent copies of $X$. Thus, the optimal average code length for this setting is
\begin{align}\label{eq:entropy}
H(p) = \sum_i p_i \log \frac{1}{p_i}\,.
\end{align}
The astute reader may have noticed that we switched from $\log_2$ to $\log$ here. Effectively, this means that we changed the units in which we measure the length of codes from bits to something else, which happens to be called nats. Of course, changing units does not change the meaning of the definition, just the scale, hence the change should be of no concern. The switch is justified because it simplifies some formulae later.

But is $H$ well-defined at all? (This is a question that should be asked after every definition!) What happens if $p_i=0$ for some $i$? This comes up, for example, when we take $N$ to be larger than the number of possible values that $X$ really takes on, i.e., if we introduce “impossible” values. We expect that introducing such superfluous impossible values should not change the value of $H$. Indeed, it is not hard to see that $\lim_{x\to 0+} x\log(1/x)=0$, suggesting that we should define $p \log(1/p)=0$ when $p=0$ and this is indeed what we will do, which also means that we can introduce as many superfluous symbols as we want without every changing the amount $H$.

Thus, we can agree that for $X\in [N]$, $H(p)$ with $p_i = \Prob{X=i}$ measures the (expected) information content of observing $X$ (actually, in a repeated setting, as described above). Another interpretation of $H$ takes a more global view. According to this view, $H$ measures the amount of uncertainty in $p$ (or, equivalently but redundantly, in $X$). But what do we mean by uncertainty? One approach to define uncertainty is to think of how much one should be surprised to see a particular value of $X$. If $X$ is completely deterministic ($p_i=1$ for some $i$) then there is absolutely no surprise in observing $X$, thus the uncertainty measure should be zero and this is the smallest uncertainty there is. If $X$ is uniformly distributed, then we should be equally surprised by seeing any values of $i$ and the uncertainty should be maximal. If we observe a pair, $(X,Y)$, which are independent of each other, the uncertainty of the pair should be the sum of the uncertainties of $X$ and $Y$. Long story short, it turns out that reasonable definitions of uncertainty actually give rise to $H$ as defined by $\eqref{eq:entropy}$. Since in physics, entropy is the expression used to express the uncertainty of a system, the quantity $H$ is also called the entropy of $p$ (or the entropy of $X$ giving rise to $p$).

Of course, optimal code lengths can also be connected to uncertainty directly: Think of recording the independent observations $X_1,X_2,\dots,X_n$ for $n$ large where $X_i \sim X$ in some notebook. The more uncertainty the distribution of $X$ has, the longer the transcript of $X_1,\dots,X_n$ will be.

We can summarize our discussion so far as follows:

Information = Optimal code length per symbol = Uncertainty

Now, let us return to the definition of relative entropy: As described earlier, relative entropy puts a value on how surprised one should be on average when observing a random quantity $X$ whose distribution is $P$ (equivalently, $p=(p_i)_i$, where $p_i$ is the probability of seeing $X=i$) when one expected that $X$’s distribution will be $Q$ (equivalently, $q=(q_i)$, where $q_i$ is the probability of seeing $X=i$). We also said that relative entropy is the information contained in observing $X$ when $X$ actually comes from $P$ (equivalently, $p$) relative to one’s a priori expecting that $X$ will follow distribution $Q$ (equivalently, $q$).

Let us discuss the second approach: The situation described in more details is this. Information is how much we need to communicate with our friend with whom we agreed on a coding protocol. That we expect $X\sim q$ means that before communication even starts, we agree on using a protocol respecting this knowledge. But then if $X\sim p$ then when we send our friend the messages about our observations, we will use more bits than we could have used had we chosen the protocol that is the best for $p$. More bits means more information? Well, yes and no. More bits means more information relative to one’s expectation that $X\sim q$! However, of course, more bits is not more information relative to knowing the true distribution of course. The code we use has some extra redundancy, in other words it has some excess, which we can say measures the extra information that $X$ contains, relative to our expectation on the distribution of $X$. Working out the math gives that the excess information is
\begin{align}\label{eq:discreterelentropy}
\KL(p,q) &\doteq \sum_i p_i \log \frac{1}{q_i} \, – \,\sum_i p_i \log \frac{1}{p_i} \\
&=\sum_i p_i \log \frac{p_i}{q_i}\,.
\end{align}
If this quantity is larger, we also expect to be able to tell that $p\ne q$ with fewer independent observations sharing $p$ than if this quantity was smaller. For example, when $p_i>0$ and $q_i=0$ for some $i$, the first time we see $i$, we can tell that $p\ne q$. In fact, in this case, $\KL(p,q)=+\infty$: With positive probability a single observation can tell the difference between $p$ and $q$.

Still poking around the definition, what happens when $q_i=0$ and $p_i=0$? This means that the symbol $i$ is superfluous and the value of $\KL(p,q)$ should not be impacted by introducing superfluous symbols. By our earlier convention that $x \log(1/x)=0$ when $x=0$, this works out fine based on the definition after $\doteq$. We also see that the sufficient and necessary condition for $\KL(p,q)<+\infty$ is that for each $i$ such that $q_i=0$, we also have that $p_i=0$. The condition we discovered is also expressed as saying that $p$ is absolutely continuous w.r.t. $q$, which is also written as $p\ll q$. More generally, for two measures $P,Q$ on a common measurable space $(\Omega,\cF)$, we say that $P$ is absolutely continuous with respect to $Q$ (and write $P\ll Q$) if for any $A\in \cF$, $Q(A)=0$ implies that $P(A)=0$ (intuitively, $\ll$ is like $\le$ except that it only constraints the values when the right-hand side is also zero). This brings us back to defining relative entropy between two arbitrary probability distributions $P,Q$ defined over a common probability space. The difficulty we face is that if $X\sim P$ takes on uncountably infinitely many values then we cannot really use the ideas that use communication because no matter what coding we use, we would need infinitely many symbols to describe some values of $X$. How can then be the entropy of $X$ be defined at all? This seems to be a truly fundamental difficulty! Luckily, this impasse gets resolved automatically if we only consider relative entropy. While we cannot communicate $X$, for any finite “discretization” of the the possible values that $X$ can take on, the discretized values can be communicated finitely and all our definitions will work. Formally, if $X$ takes values in the measurable space $(\cX,\cG)$, with $\cX$ possibly having uncountably many elements, a discretization to $[N]$ levels would be specified using some function $f:\cX \to [N]$ that is $\cG/2^{[N]}$-measurable map. Then, the entropy of $P$ relative $Q$, $\KL(P,Q)$ can be defined as
\begin{align*}
\KL(P,Q) \doteq \sup_{f} \KL( P_f, Q_f )\,,
\end{align*}
where $P_f$ is the distribution of $Y=f(X)$ when $X\sim P$ and $Q_f$ is the distribution of $Y=f(X)$ when $X\sim Q$ and the supremum is for all $N\in \N$ and all maps $f$ as defined above. In words, we take all possible discretizations $f$ (with no limit on the “fineness” of the discretization) and define $\KL(P,Q)$ as the excess information when expecting to see $f(X)$ with $X\sim Q$ while reality is $X\sim P$. If this is well-defined (i.e., finite), we expect this to be a reasonable definition. As it turns out and as we shall see it soon, this is indeed a reasonable definition.

To state this, we need the concept of densities, or, Radon-Nykodim derivatives, as they are called in measure theory. Given $P,Q$ as above, the density of $P$ with respect to $Q$ is defined as a $\cF/\cB$-measurable map $f:\Omega \to [0,\infty)$ such that for all $A\in \cF$,
\begin{align}\label{eq:densitydef}
\int_A f(\omega) dQ(\omega) = P(A)\,.
\end{align}
(Here, $\cB$ is the Borel $\sigma$-algebra restricted to $[0,\infty)$.) Recalling that $P(A) = \int_A dP(\omega)$, we would write the above as $f dQ = dP$ meaning that if we integrated this equality over any set $A$, we would get identity. Because of this shorthand notation, when the density of $P$ with respect to $Q$ exists, we also write it as
\begin{align*}
f=\frac{dP}{dQ}\,.
\end{align*}
Note that this equation defines the symbol $\frac{dP}{dQ}$ in terms of the function $f$ that satisfies $\eqref{eq:densitydef}$. In particular, $\frac{dP}{dQ}$ is a nonnegative-valued random variable on $(\Omega,\cF)$. When $\frac{dP}{dQ}$ exists then it follows immediately that $P$ is absolutely continuous with respect to $Q$ (just look at $\eqref{eq:densitydef}$). It turns out that this is both necessary and sufficient for the existence of the density of $P$ with respect to $Q$. This is stated as a theorem in a slightly more general form in that $P,Q$ are not restricted to be probability measures (the condition $P(\Omega)=Q(\Omega)=1$ is thus lifted for them). The definition of density for such measures is still given by $\eqref{eq:densitydef}$. To avoid some pathologies we need to assume that $Q$ is $\sigma$-finite, which means that while $Q(\Omega)=\infty$ may hold, there exists a countable covering $\{A_i\}$ of $\Omega$ with $\cF$-measurable sets such that $Q(A_i)<+\infty$ for each $i$.

Theorem (Radon-Nikodym Theorem):
Let $P,Q$ be measures on a common measurable space $(\Omega,\cF)$ and assume that $Q$ is $\sigma$-finite. Then, the density of $P$ with respect to $Q$, $\frac{dP}{dQ}$, exists if and only if $P$ is absolutely continuous with respect to $Q$. Furthermore, $\frac{dP}{dQ}$ is uniquely defined up to a $Q$-null set, i.e., for any $f_1,f_2$ satisfying $\eqref{eq:densitydef}$, $f_1=f_2$ holds $Q$-almost surely.

Densities work as expected: Recall that the central limit theorem states that under mild conditions, $\sqrt{n} \hat\mu_n$ of $n$ iid random variables $X_1,\dots,X_n$ with common zero mean and variance one converges to a distribution of a standard normal random variable $Z$. Without much thinking, we usually write that the density of this random variable is $g(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$. As it happens, this is indeed the density, but now we know that we need to be more precise: This is the density of $Z$ with respect the Lebesgue measure (which we shall often denote by $\lambda$). The densities of “classical” distributions are almost always defined with respect to the Lebesgue measure. A very useful result is the chain rule, which states that if $P\ll Q \ll S$, then $\frac{dP}{dQ} \frac{dQ}{dS} = \frac{dP}{dS}$. The proof of this result follows from the definition of the densities and the “usual machinery”, which means proving the result holds for simple functions and then applying the monotone convergence theorem to take the limit for any measurable function. The chain rule is often used to reduce the calculation of densities to calculation with known densities. Another useful piece of knowledge that we will need is that if $\nu$ is a counting measure on $([N],2^{[N]})$ then if $P$ is a distribution on $([N],2^{[N]})$ then $\frac{dP}{d\nu}(i) = P(\{i\})$ for all $i\in [N]$.

With this much preparation, we are ready to state the promised result:

Theorem (Relative Entropy) :
Let $(\Omega, \cF)$ be a measurable space and let $P$ and $Q$ be measures on this space.
Then,
\begin{align*}
\KL(P, Q) =\begin{cases}
\int \log \left(\frac{dP}{dQ}(\omega)\right) dP(\omega)\,, & \text{if } P \ll Q\,;\\
+\infty\,, & \text{otherwise}\,.
\end{cases}
\end{align*}

Note that by our earlier remark, this reduces to $\eqref{eq:discreterelentropy}$ for discrete measures. Also note that the integral is always well defined (but may diverge to positive infinity). If $\lambda$ is a common dominating $\sigma$-finite measure for $P$ and $Q$ (i.e., $P\ll \lambda$ and $Q\ll \lambda$ both hold) then letting $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$, if also $P\ll Q$, the chain rule gives $\frac{dP}{dQ} \frac{dQ}{d\lambda} = \frac{dP}{d\lambda}$, which lets us write
\begin{align*}
\KL(P,Q) = \int p \log( \frac{p}{q} ) d\lambda\,,
\end{align*}
which is perhaps the best known expression for relative entropy, which is also often used as a definition. Note that for probability measures, a common dominating $\sigma$-finite measure can always be bound. For example, $\lambda = P+Q$ always dominates both $P$ and $Q$.

Relative entropy is a kind of “distance” measure between distributions $P$ and $Q$. In particular, if $P = Q$, then $\KL(P, Q) = 0$ and otherwise $\KL(P, Q) > 0$. Strictly speaking, relative entropy is not a distance because it satisfies neither the triangle inequality nor is it symmetric. Nevertheless, it serves the same purpose.

The relative entropy between many standard distributions is often quite easy to compute. For example, the relative entropy between two Gaussians with means $\mu_1, \mu_2 \in \R$ and common variance $\sigma^2$ is
\begin{align*}
\KL(\mathcal N(\mu_1, \sigma^2), \mathcal N(\mu_2, \sigma^2)) = \frac{(\mu_1 – \mu_2)^2}{2\sigma^2}\,.
\end{align*}
The dependence on the difference in means and the variance is consistent with our intuition. If $\mu_1$ is close to $\mu_2$, then the “difference” between the distributions should be small, but if the variance is very small, then there is little overlap and the difference is large. The relative entropy between two Bernoulli distributions with means $p,q \in [0,1]$
\begin{align*}
\KL(\mathcal B(p), \mathcal B(q)) = p \log \left(\frac{p}{q}\right) + (1-p) \log\left(\frac{(1-p)}{(1-q)}\right)\,,
\end{align*}
where $0 \log (\cdot) = 0$. The divergence is infinite if $q \in \set{0,1}$ and $p \in (0,1)$ (because absolute continuity is violated).

To summarize, we have developed the minimax optimality concept and given a heuristic argument that no algorithm can improve on $O(\sqrt{Kn})$ regret in the worst case when the noise is Gaussian. The formal proof of this result will appear in the next post and relies on several of the core concepts from information theory, which we introduced here. Finally we also stated the definition and existence of the Radon-Nikodym derivative (or density), that unifies (and generalizes) the probability distribution (for discrete spaces as learned in high school) and a probability density.

Notes

Note 1: The worst-case regret has a natural game-theoretic interpretation. This is best described if you imagine a game between a protagonist and an antagonist that works as follows: Given $K>0$ and $n>1$, the protagonist proposes a bandit policy $A$, while the antagonist, after looking at the policy chosen, picks a bandit environment $E$ from the class of environments considered (in our case from the class $\cE_K$). After both players made their choices, the game unfold by the bandit policy deployed in the chosen environment. This leads to some value for the expected regret, which denote by $R_n(A,E)$. The payoff for the antagonist is then $R_n(A,E)$, while the payoff for the protagonist is $-R_n(A,E)$, i.e., the game is zero sum. Both players aim at maximizing their payoffs. The game is completely described by $n,K$ and $\cE$. One characteristic value in a game is its minimax value. As described above, this is a sequential game (the protagonist moves first, followed by the antagonist). The minimax value of this game, from the perspective of the antagonist, is then exactly $R_n^*(\cE)$, while it is $\sup_A \inf_E -R_n(A,E) = -R_n^*(\cE)$ from the perspective of the protagonist.

Note 2: By its definition, $R_n^*(\cE)$ is a worst-case measure. Should we worry about that an algorithm that optimizes for the worst-case may perform very poorly on specific instances? Intuitively, a bandit problem where the action gaps are quite large or quite small should be easier, the former because large gaps require fewer observations to detect, while smaller gaps can be just ignored. Yet, we can perhaps imagine policies that are worst-case optimal whose regret is $R_n^*(\cE)$ regardless of the nature of the instance that they are running on. When such an policy would be used on an “easy” instance as described above, we would certainly be disappointed to see it suffer a large regret. Ideally, what we would like is if in some sense the policy we end up using with would be optimal for every instance, or, in short, if it was instance-optimal. Instance-optimality in its vanilla form is a highly dubious idea: After all, the best regret on any given instance is always zero! How could then an algorithm achieve this for every instance of some nontrivial class $\cE$ (a class $\cE$ is trivial if it contains environments where it is always optimal to choose, say, action one)? This is clearly impossible. The problem comes from that the best policy for a given instance is a very dumb policy for almost all the other instances. We certainly do not wish to choose such a dumb policy! In other words, we want both good performance on individual instances while we also want good worst-case performance on the same horizon $n$. Or we may want to achieve both good performance on individual instances while we also demand good asymptotic behavior on any instance. In any case, the idea is to restrict the class $\cA_{n,K}$ by ruling out all the dumb policies. If $\cA_{n,K}^{\rm smart}$ is the resulting class, then the target is modified to achieve $R_n^*(E) \doteq \inf_{A\in \cA_{n,K}^{\rm smart}} R_n(A,E)$, up to, say, a constant factor. We will return to various notions of instance-optimality later and in particular will discuss whether for some selections of the class of smart policies, instance-optimal (or near-instance optimal) policies exist.

Note 3: The theorem that connects our definition of relative entropies to densities, i.e., the “classic definition”, can be found e.g., Section 5.2 of the information theory book of Gray.

13 thoughts on “Optimality concepts and information theory”

  1. Interesting and very useful! Especially how the relative entropy comes and its connections with other things. Possible small typo: We use more bits then (than) we could have used!

  2. Hi all,
    Hi stuck on the paragraph below,

    “Note that iE′ cannot be used more than ≈n/K times on expectation when A is run on E. … To make the two algorithms essentially indistinguishable, set Δ=1/sqrt(n/K).”

    I was wondering how to decide the value of Δ in order to have indistinguishable environments and its relation to the aforementioned inequality based on multiple integration by parts.

    Many thanks for any help

    1. Hi,
      The high level argument is as follows: The number of observations for this particular arm $i_{E’}$ is at most $n/K$. So the sample mean for this arm will have a standard deviation of $1/\sqrt{n/K}$. If some other arm’s mean is closer than the standard deviation then the two means cannot be told apart.
      The inequality in the previous paragraph is making this more exact.
      I hope this answers your question.
      Cheers,
      Csaba

  3. Thank you for your posting!
    However, I can’t understand that “Now, when A is run on E and i_E is chosen fewer than n(1−1/K) times on expectation then it will have more than Δn(1−1/K)≈c√K√n regret”. why more than?

    1. Good catch! We should have written $\Delta (n-n(1-1/K))$ (suboptimal actions are chosen at least $n-n(1-1/K)$ when the unique optimal action is chosen at most $n(1-1/K)$ times). I have corrected the text.

      1. Thanks to your response!
        I also have one more question about the information theory; You discussed the definition of relative entropy and in the second approach, you assumed that we expect X~q and said

        “But then if X~p then when we send our friend the messages about our observations, we will use more bits than we could have been used had we chosen the protocol that is the best for p”

        I wonder why we have more bits? It looks like that needed bits= bits for p + bits for q. Is it right?

        1. Say you have $N$ different messages you may want to send. For simplicity, denote the different messages by $1,\dots,N$. Let’s say that you decide to use $b_i\ge 0$ bits to encode message $i\in [N]$. The expected number of bits per message assuming that message $i$ is chosen with probability $q_i$ ($q_i\ge 0$, $\sum_i q_i=1$) is $L(q,b) = \sum_{i=1}^N q_i b_i$. The best possible choice for $b$ is $b_i^*(q) = \log(1/q_i)$: For any reasonable encoding scheme (prefix codes), the expected code length is at least $L(q,b^*(q))$: $L(q,b)\ge L(q,b^*(q))$ for $b\in B$ where $B$ is the set of code-lengths for “reasonable encodings”. This is in what sense $b^*(p)$ gives the best possible code-lengths. It follows that $L(p,b^*(q))\ge L(p,b^*(p))$, that is, using the optimal code derived from $q$ under $p$ gives largest code length than using the optimal code derived from $p$ under $p$. Does this make sense?

          1. Um… why the best possible choice for b is b∗i(q)=log(1/qi) ? I understand Huffman-coding prevents prefix-property but i don’t know the reason why it is the best possible choice..

          2. Ah, yes, we skipped this. The easiest and most commonly seen proof works as follows: First, show that the a set of lengths can be code-lengths of prefix-free coding scheme (or more generally, of a coding scheme that can be uniquely decoded) exactly when the set of lengths satisfy what is called the Kraft-McMillan inequality. Then one needs to consider the constrained optimization problem of minimizing the expected length subject to the constraint that the lengths (which are the free, optimization variables) satisfy the Kraft-McMillan. The solution happens to be as stated above. For details, see, e.g., this lecture.

  4. “Formally, if X takes values in the measurable space (X,G), with X possibly having uncountably many elements, a discretization to [N] levels would be specified using some function f:X→[N] that is F/2[N]-measurable map.”

    Should this say “$/mathcal{G}//2^{[N]}$-measurable map”?

Leave a Reply to Csaba Szepesvari Cancel reply

Your email address will not be published. Required fields are marked *