The Bayesian approach to learning starts by choosing a prior probability distribution over the unknown parameters of the world. Then, as the learner makes observations, the prior is updated using Bayes' rule to form the posterior, which represents the learner's new beliefs. The posterior is then used to make decisions. The Bayesian approach is appealing in its simplicity, but skeptics demanding worst-case guarantees might find Bayesian learning unjustified and subjective.

In this and the next few posts, we will show that even the skeptics should find something useful in the Bayesian approach. In this post we show that, by treating the learning problem as a **zero-sum multi-stage game** and applying Sion's minimax theorem, the worst-case regret that an optimal Bayesian algorithm suffers is equal to the minimax regret. This is an instance of **minimax duality**: the claim can also be phrased as saying that the minimax value and the maximin value (the worst-case Bayesian regret) are the same. At the end of the post we illustrate the power of minimax duality by proving a simple and intuitive upper bound on the minimax regret for $k$-armed adversarial bandits that is stronger than all previous results. Other applications will come in future posts, along with generalizations of the minimax theorem.

The connection between worst-case regret and worst-case Bayesian regret was recognized as early as the 1950s in statistical settings, but it is perhaps less well known than it should be and there are many potential applications in bandits, online learning and beyond.

# Bayesian regret

In the Bayesian setting the adversary doesn’t choose a sequence of loss vectors, like in the oblivious adversarial bandit setting. Instead, they choose a **prior** distribution $\nu$ over sequences of loss vectors, or, equivalently, a distribution over loss matrices. In this way, $\nu$ is a distribution over $\cE = [0,1]^{n\times k}$. We assume that $\nu$ is restricted to a non-empty set $\cQ$ of distributions whose choice we postpone to later. For now, it should suffice to know that $\cQ$ is a convex set and contains at least the point masses (the Dirac measures). Intuitively, this latter constraint on $\cQ$ makes the Bayesian adversary at least as powerful as the adversary of the non-Bayesian setting. The convexity is reasonable if we want the adversary to have a rich set of distributions at its disposal.

The main difference to the non-Bayesian setting is that in the Bayesian setting, before learning starts, the learner is given the prior $\nu$. After this, a specific instance $Y \sim \nu$ is sampled secretly. Subsequently the learner and environment interact for $n$ rounds. In each round $t=1,\dots,n$, the learner first chooses its action $A_t \in [k]$ (which may be randomized), then observes its loss $Y_{t,A_t}$. The learner can follow any history-dependent, randomizing policy $\pi$, as usual. Importantly, because the prior $\nu$ is known in advance, the policy can depend on $\nu$.

We define the **Bayesian regret** of a policy $\pi$ with respect to prior $\nu$ as

\begin{align*}

BR_n(\pi,\nu) = \E\left[\sum_{t=1}^n Y_{tA_t} - Y_{tA^*}\right]\,,

\end{align*}

where $A^* = \argmin_{a \in [k]} \sum_{t=1}^n Y_{ta}$ is the optimal action and the expectation is with respect to the randomness in both $Y$ (which also determines the random $A^*$) and the actions of the learner. If $\nu$ is the Dirac measure supported at $y\in \cE$ (that is, $\Prob{Y=y}=1$), then $BR_n(\pi,\nu)=R_n(\pi,y)$, where $R_n(\pi, y)$ is the usual regret:

\begin{align*}

R_n(\pi, y) = \max_{a \in [k]} \E\left[\sum_{t=1}^n y_{tA_t} - y_{ta}\right]\,,

\end{align*}

where now the expectation is only over the randomness of the actions. Now is a good time to remind you that the minimax regret for $k$-armed adversarial bandits is

\begin{align*}

R^*_n = \inf_{\pi \in \Pi} \sup_{y \in \cE} R_n(\pi, y)\,,

\end{align*}

where $\Pi$ is the space of all policies. This means that you choose your policy and *then* the adversary chooses the bandit. The **worst-case Bayesian regret** over $\cQ$ is

\begin{align*}

BR^*_n(\cQ) = \sup_{\nu \in \cQ} \inf_{\pi \in \Pi} BR_n(\pi, \nu)\,.

\end{align*}

Here the order of the sup and inf has been exchanged compared to the order used in the definition of $R_n^*$. This reflects the fact that the adversary chooses a prior first *and then* the learner chooses a policy, giving the learner the chance to adjust its policy based on the prior. The main benefit of this whole approach is that coming up with good prior-dependent policies is often easier than pulling a near minimax optimal policy out of a hat. Yet the next result shows that as long as we only care about the magnitude of the minimax regret, there is no loss of generality in considering this easier problem.

# The minimax result

The main result of this post is the following theorem:

Theorem (Minimax theorem for adversarial bandits):

For an appropriate choice of $\cQ$,

$BR^*_n(\cQ) = R^*_n$

The proof of this result follows from a powerful generalization of von Neumann’s minimax theorem.

Theorem (Sion’s minimax theorem):

Let $X$ and $Y$ be convex subsets of topological vector spaces and let $f : X \times Y \to \R$ be quasiconvex and lower semicontinuous in its first argument and quasiconcave and upper semicontinuous in its second argument. Suppose that either $X$ or $Y$ is compact. Then,

\begin{align*}

\inf_{x \in X} \sup_{y \in Y} f(x, y) = \sup_{y \in Y} \inf_{x \in X} f(x, y)\,.

\end{align*}

To explain the terminology used in the theorem a bit, a topological vector space is a vector space equipped with a topology under which the vector space operations are continuous. Euclidean spaces with the standard topology, whose open sets are generated by open rectangles, are of course topological vector spaces. The topology is needed so that we can talk about continuity, and requiring the vector space operations to be continuous allows the usual theorems concerning continuity to be used. A function $f:X \to \R$ is lower semicontinuous at $x_0\in X$ if $\liminf_{x\to x_0}f(x) \ge f(x_0)$. Upper semicontinuity is defined analogously. Finally, $f:X\to \R$ is quasiconvex if its level sets $X_L = \{x\in X\,:\, f(x)\le L\}$ are convex for every $L\in \R$. Again, quasiconcavity is defined analogously. The reader is encouraged to construct counterexamples to the conclusion of the theorem when any of its conditions is dropped. Sion’s theorem has quite an easy (if a little magical) topological proof, which we won’t give (see the bibliographic remarks for a reference).

Von Neumann’s theorem is recovered from Sion’s theorem when $X$ and $Y$ are the probability simplex and $f(x, y) = x^\top G y$ for some matrix $G$. Note that in this case both $X$ and $Y$ are compact, when they are viewed as subsets of the respective Euclidean spaces equipped with their usual topology. Further, $f$ is linear both in $x$, and in $y$ (i.e., it is “bilinear”), hence it is continuous, and both convex and concave (hence also quasiconvex and quasiconcave) in both of its arguments.
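For readers who like to experiment, here is a minimal Python sketch (our illustration, not part of the formal development) of von Neumann's setting on the matching-pennies matrix: over pure strategies there is a duality gap, but over mixed strategies the gap closes and the value of the game is $0$.

```python
# Matching pennies: the row player minimizes x^T G y, the column player maximizes.
G = [[1.0, -1.0], [-1.0, 1.0]]

def payoff(p, q):
    # Bilinear payoff for mixed strategies x = (p, 1-p), y = (q, 1-q).
    x, y = (p, 1 - p), (q, 1 - q)
    return sum(x[i] * G[i][j] * y[j] for i in range(2) for j in range(2))

# Pure strategies: minimax and maximin differ (a duality gap of 2).
pure_minimax = min(max(G[i][j] for j in range(2)) for i in range(2))
pure_maximin = max(min(G[i][j] for i in range(2)) for j in range(2))

# Mixed strategies: a grid search over the outer player's mixture. Because the
# payoff is linear in the inner player's mixture, the inner optimum is attained
# at a pure strategy, which the code exploits.
grid = [i / 1000 for i in range(1001)]
mixed_minimax = min(max(payoff(p, q) for q in (0.0, 1.0)) for p in grid)
mixed_maximin = max(min(payoff(p, q) for p in (0.0, 1.0)) for q in grid)
```

The grid search is crude, but for this $2\times 2$ game it recovers the value $0$ exactly at the uniform mixture.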

We won’t prove our minimax theorem in full generality either as the proof is surprisingly technical. Instead, we present the proof for a special case that allows us to show the main non-technical ideas of the proof. Some notes on the difficulties faced and how to address them are presented at the end.

**Proof in a special case:** We consider the case when all losses are binary: $\cE = \{0,1\}^{n\times k}$. In this case, von Neumann’s theorem is sufficient. The way this avoids the technical difficulties is that the set of deterministic policies is now finite. This also allows us to choose $\cQ$ to be the set of all probability distributions on $\cE$, which is really just the probability simplex of dimension $|\cE|-1$; in our earlier notation, $\cQ = \cP_{|\cE|-1}$.

The key observation of the proof is that any policy can be written as a mixture of deterministic policies. Let $\PiD$ be the space of deterministic policies and $\pi$ be any policy. As noted above, $\PiD$ is finite.

We claim there exists a distribution $\mu$ on $\PiD$ such that for any $y$,

\begin{align*}

R_n(\pi, y) = \sum_{\pi' \in \PiD} \mu(\pi') R_n(\pi', y)\,.

\end{align*}

In fact, more is true: there is a one-to-one correspondence between (all) policies and distributions over $\PiD$ such that for any $y$, the distribution over the set $([k]\times \{0,1\})^n$ of observation histories induced by following either the policy or the distribution over $\PiD$ is the same. To see this, take first a policy $\pi$. Then notice that one can implement $\pi$ by choosing a random variable $U$ uniformly distributed on $[0,1]$ and then choosing the action in round $t$ as a deterministic function $f_t$ of the past actions and observations and $U$ (why?). That is, every $\pi$ can be transformed into a sequence of functions $(f_t)_t$ with $f_t: ([k]\times \{0,1\})^{t-1}\times [0,1] \to [k]$. Note that $(f_1(\cdot,u),\dots,f_n(\cdot,u))$ for any $u\in [0,1]$ is a deterministic policy, that is, an element of $\PiD$. It follows that $\mu(\pi')=\Prob{\,(f_1(\cdot,U),\dots,f_n(\cdot,U))=\pi'\,}$ is suitable. The reverse relationship follows from showing that given a distribution $\mu$ over $\PiD$, the conditional distribution of the action taken in some round $t$ given the previous actions $a_1,\dots,a_{t-1}$ and $y\in \cE$ depends only on the previous actions and the previous observations $y_{1,a_1}, \dots, y_{t-1,a_{t-1}}$.
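The correspondence is easy to check in a toy case. The following Python snippet (a toy of our own, with $n = 2$ rounds, $k = 2$ actions and binary losses) verifies that the uniformly randomizing policy induces the same distribution over action sequences as the uniform mixture over the four deterministic policies that ignore the observations:

```python
from collections import Counter

k, n = 2, 2
y = [[0, 1], [1, 0]]  # a fixed binary loss matrix; row t holds the losses of round t

# A deterministic policy for n = 2 rounds: a first action a1, together with a
# map from the observed loss y[0][a1] in {0, 1} to a second action.
det_policies = [(a1, b0, b1)
                for a1 in range(k) for b0 in range(k) for b1 in range(k)]

def run(policy, y):
    a1, b0, b1 = policy
    a2 = b0 if y[0][a1] == 0 else b1
    return (a1, a2)

# Mixture: uniform over the four 'constant' policies (b0 == b1), which should
# reproduce the policy that picks both actions uniformly at random,
# ignoring the observations.
mixture = [p for p in det_policies if p[1] == p[2]]
counts = Counter(run(p, y) for p in mixture)
dist_mixture = {h: c / len(mixture) for h, c in counts.items()}

# The uniformly randomizing policy assigns probability 1/4 to each action pair.
dist_random = {(a1, a2): 0.25 for a1 in range(k) for a2 in range(k)}
```

The same check goes through for any loss matrix `y`, which is the content of the claim: the induced history distributions agree for every environment.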

By this correspondence, from now on we treat distributions on $\PiD$ as policies. With this view, the Bayesian regret for the policy $\pi$ associated with distribution $\mu$ on $\PiD$ is

\begin{align*}

BR_n(\mu, \nu) = \sum_{\pi \in \PiD} \mu(\pi) \sum_{y \in \cE} \nu(y) R_n(\pi, y)\,.

\end{align*}

Note that we have abused notation by letting the Bayesian regret take $\mu$ as an argument. By the above equation, the Bayesian regret is bilinear when viewed as a function mapping a pair of distributions (over deterministic policies and over loss sequences) to the reals. Letting $\cP$ be the space of all distributions over $\PiD$, it holds that

\begin{align*}

BR_n^* = \sup_{\nu \in \cQ} \inf_{\mu \in \cP} BR_n(\mu, \nu)\,.

\end{align*}

On the other hand, the minimax regret is

\begin{align}

R^*_n = \inf_{\mu \in \cP} \sup_{y \in \cE} R_n(\mu, y) = \inf_{\mu \in \cP} \sup_{\nu \in \cQ} BR_n(\mu, \nu)\,,

\label{eq:dualminmaxregret}

\end{align}

where the second equality follows because the prior that maximizes the Bayesian regret for a fixed policy is clearly a Dirac measure on the worst-case bandit for that policy (a linear function is maximized at an extreme point, and here the extreme points are the Dirac measures). Then by von Neumann’s theorem,

\begin{align*}

BR_n^* = \sup_{\nu \in \cQ} \inf_{\mu \in \cP} BR_n(\mu, \nu) = \inf_{\mu \in \cP} \sup_{\nu \in \cQ} BR_n(\mu, \nu) = R^*_n\,.

\end{align*}

**QED**

The result continues to hold when $\cE = [0,1]^{n\times k}$ by using Sion’s theorem, but the proof becomes much more technical because the space of all deterministic policies is no longer finite and choosing appropriate topologies for which continuity and compactness are both guaranteed is not so straightforward. Another difficulty is that we want to avoid measurability restrictions on policies, because these would not be natural: $R_n^*$ is perfectly well-defined even if no measurability assumptions are put on the policies the learner can use (why?). The trick that avoids getting bogged down in measurability issues is to choose $\cQ$ to be the set of finitely supported probability measures on $\cE$. We discuss this and the remaining issues in the notes at the end of the post.

The title of this post mentions **minimax duality**. This terminology comes from optimization theory, where it is used in connection with **saddle-point** problems. To do justice to this vast topic would require multiple posts, but briefly, a saddle-point problem is given by a two-argument function $f: X \times Y \to \R$, where $X,Y$ are non-empty sets. A saddle-point problem asks for $(x^*,y^*)\in X\times Y$ (a “saddle”) such that

\begin{align*}

\sup_{y\in Y} f(x^*,y)=\inf_{x\in X}\sup_{y\in Y} f(x,y) \qquad \text{and} \qquad

\inf_{x\in X} f(x,y^*)=\sup_{y\in Y}\inf_{x\in X} f(x,y)\,.

\end{align*}

Note that when a saddle exists, it follows that $\inf_{x\in X}\sup_{y\in Y} f(x,y) = \sup_{y\in Y} \inf_{x\in X} f(x,y)$ holds (why?). Here $x$, the argument we minimize over, is often called the primal variable, while $y$, the argument we maximize over, is called the dual variable. It is easy to see that $\sup_{y\in Y} \inf_{x\in X} f(x,y)\le \inf_{x\in X}\sup_{y\in Y} f(x,y)$ holds regardless of the choice of $f$. This relationship is called **weak duality**. **Strong**, or **minimax duality** holds when the reverse inequality also holds.
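Weak duality is also easy to verify numerically; the sketch below (our addition) checks it for a random payoff table over finite pure-strategy sets:

```python
import random

random.seed(1)
m = 6
# A random payoff table f(x, y) over finite pure-strategy sets of size m.
f = [[random.random() for _ in range(m)] for _ in range(m)]

# Weak duality: sup_y inf_x f(x, y) <= inf_x sup_y f(x, y), always.
minimax = min(max(f[x][y] for y in range(m)) for x in range(m))
maximin = max(min(f[x][y] for x in range(m)) for y in range(m))
```

Re-running with any seed or table size gives the same ordering, as the inequality holds for every $f$.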

# Bounding the Bayesian regret

In this section we develop an upper bound on the Bayesian regret and then use the minimax theorem to conclude the *existence* of a policy with the same minimax regret.

For the rest of this post we will be in the Bayesian setting. Thus, $Y \in \{0,1\}^{n\times k}$ is sampled from a given prior $\nu$, which is a distribution over $\{0,1\}^{n\times k}$. We let $\Delta_t = Y_{tA_t} - Y_{tA^*}$ where $A^* = \argmin_{a \in [k]} \sum_{t=1}^n Y_{ta}$ is the optimal arm in hindsight. We also let $\cF_t = \sigma(A_1,Y_{1A_1},\ldots,A_t,Y_{tA_t})$ be the $\sigma$-algebra generated by the observed history after $t$ rounds. Then let $\E_t[\cdot] = \E[\cdot \mid \cF_t]$ and $\bbP_t(\cdot) = \bbP(\cdot \mid \cF_t)$. With this notation the Bayesian regret is

\begin{align*}

BR_n = \E\left[\sum_{t=1}^n Y_{tA_t} - Y_{tA^*}\right] = \E\left[\sum_{t=1}^n \E_{t-1}[\Delta_t]\right]\,.

\end{align*}

Of course this depends on the policy and the prior, which have been dropped from the notation to minimize clutter.

The following straightforward theorem is the key to bounding the regret. Remember that for a convex function $F : \R^d \to \R\cup\{\infty\}$ with domain $\cD=\{x\in \R^d\,: F(x)<\infty\}$, the Bregman divergence of $F$, $D_F : \cD \times \cD \to \R\cup\{\infty\}$, is defined for $x,y\in \cD$ by

\begin{align*}

D_F(x, y) = F(x) - F(y) - \nabla F_{x-y}(y)\,,

\end{align*}

where $\nabla F_v(y)$ is the directional derivative of $F$ at $y$ in direction $v$. Note that since $F$ is convex, $\nabla F_{x-y}(y)$ always exists, though it is possibly equal to $-\infty$ when $y$ is an element of the boundary of $\cD$. Recall also that the Bregman divergence is always nonnegative. The diameter of a convex set $\cK$ with respect to $F$ is $\diam_F(\cK) = \sup_{x,y \in \cK} F(x) - F(y)$.
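As a quick numerical illustration (ours, not part of the formal development), the familiar relative entropy arises as the Bregman divergence of the negentropy $F(p) = \sum_i p_i \log p_i$, and the nonnegativity claim can be checked at random points in the interior of the simplex:

```python
import math
import random

random.seed(2)

def negentropy(p):
    return sum(x * math.log(x) for x in p)

def bregman_negentropy(p, q):
    # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>; for the negentropy,
    # grad F(q)_i = log(q_i) + 1 (valid in the interior of the simplex).
    grad = [math.log(x) + 1 for x in q]
    return negentropy(p) - negentropy(q) - sum(g * (a - b) for g, a, b in zip(grad, p, q))

def kl(p, q):
    # Relative entropy (KL divergence) between simplex points p and q.
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def random_simplex(k):
    w = [random.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

checks = []
for _ in range(100):
    p, q = random_simplex(4), random_simplex(4)
    d = bregman_negentropy(p, q)
    # Nonnegative, and on the simplex it coincides with the KL divergence.
    checks.append(d >= -1e-12 and abs(d - kl(p, q)) < 1e-9)
```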

Theorem:

Let $(M_t)_{t=0}^n$ be an $\cF_t$-adapted martingale with $M_t \in \cD \subset \R^d$ almost surely and $F : \R^d \to \R$ be a convex function with $\diam_F(\cD) < \infty$. Suppose there exist constants $\alpha, \beta \geq 0$ such that

\begin{align*}

\E_{t-1}[\Delta_t] \leq \alpha + \sqrt{\beta \E_{t-1}[D_F(M_t, M_{t-1})]}

\end{align*}

almost surely for all $t$. Then the Bayesian regret is bounded by

\begin{align*}

BR_n \leq n \alpha + \sqrt{\beta \diam_F(\cD)}\,.

\end{align*}

The proof is not very hard. Perhaps less clear is why the assumptions are often satisfied by sensible algorithms; we explain this after the proof.

**Proof**

Using the definition of the Bregman divergence,

\begin{align*}

\E_{t-1}[D_F(M_t, M_{t-1})]

&= \E_{t-1}[F(M_t) - F(M_{t-1}) - \nabla F_{M_t-M_{t-1}}(M_{t-1})] \\

&= \E_{t-1}\left[\lim_{h\to 0+} \left( F(M_t) - F(M_{t-1}) - \frac{F((1-h)M_{t-1} + h M_t) - F(M_{t-1})}{h}\right)\right] \\

&\leq \liminf_{h\to 0+} \left(\E_{t-1}[F(M_t)] - F(M_{t-1}) - \frac{1}{h} \E_{t-1}\left[F((1-h)M_{t-1} + h M_t) - F(M_{t-1})\right]\right) \\

&\leq \E_{t-1}[F(M_t)] - F(M_{t-1}) - \limsup_{h\to 0+} \frac{F(\E_{t-1}[(1-h)M_{t-1} + h M_t]) - F(M_{t-1})}{h} \\

&= \E_{t-1}[F(M_t)] - F(M_{t-1})\,,

\end{align*}

where the first inequality follows from Fatou’s lemma which can be applied because $F$ is convex, and the second inequality uses Jensen’s inequality and the convexity of $F$ again. The final equality holds because $\E_{t-1}[M_t] = M_{t-1}$. To finish the proof,

\begin{align*}

BR_n

&= \E\left[\sum_{t=1}^n \E_{t-1}[\Delta_t]\right] \\

&\leq \alpha n + \E\left[\sum_{t=1}^n \sqrt{\beta \E_{t-1}[D_F(M_t, M_{t-1})]}\right] \\

&\leq \alpha n + \sqrt{\beta n \E\left[\sum_{t=1}^n \E_{t-1}[D_F(M_t, M_{t-1})]\right]} \\

&\leq \alpha n + \sqrt{\beta n \E\left[\sum_{t=1}^n F(M_t) - F(M_{t-1})\right]} \\

&\leq \alpha n + \sqrt{\beta n \diam_F(\cD)}\,,

\end{align*}

where the first inequality follows from the assumptions, the second from Jensen’s inequality, and the third from the calculation at the start of the proof. Finally, telescope the sum and use the definition of the diameter.

**QED**

Although the result holds for any martingale, there is at least one natural choice, which is $M_t = P_t^* \in \cP_{k-1}$ where $P_{ta}^* = \bbP(A^* = a \mid \cF_t)$ is the posterior probability that $A^* = a$ based on the observations after round $t$. Using $M_t = P_t^*$ allows for a sensible interpretation of the conditions of the theorem. If you can control the expected regret in terms of the expected change of the posterior, then you will have a regret bound. This corresponds to a ‘learn or suffer regret’ tradeoff: If the regret is large, the posterior is changing a lot (as measured by the Bregman divergence), which means the learner has just learned a lot about $A^*$. The next theorem illustrates an application of these tools.
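To make this concrete, here is a small, self-contained Python simulation (our own toy, with $n = 3$, $k = 2$ and a uniform prior over binary loss matrices, none of which appear in the post) of Thompson sampling with an exact posterior over the optimal arm, estimating the Bayesian regret by Monte Carlo:

```python
import itertools
import random

random.seed(0)
n, k = 3, 2

# Uniform prior over all binary loss matrices Y in {0,1}^(n x k).
support = list(itertools.product(itertools.product((0, 1), repeat=k), repeat=n))

def best_arm(y):
    # Optimal arm in hindsight, ties broken towards the lower index.
    return min(range(k), key=lambda a: sum(y[t][a] for t in range(n)))

def posterior_over_astar(history):
    # Exact posterior P(A* = a | observations): enumerate the matrices in the
    # support that are consistent with everything observed so far.
    consistent = [y for y in support
                  if all(y[t][a] == x for (t, a, x) in history)]
    counts = [0.0] * k
    for y in consistent:
        counts[best_arm(y)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def thompson_regret(y):
    # Thompson sampling: draw A_t from the posterior over the optimal arm.
    history, loss = [], 0
    for t in range(n):
        p = posterior_over_astar(history)
        a = random.choices(range(k), weights=p)[0]
        loss += y[t][a]
        history.append((t, a, y[t][a]))
    return loss - min(sum(y[t][a] for t in range(n)) for a in range(k))

regrets = [thompson_regret(random.choice(support)) for _ in range(1000)]
bayes_regret = sum(regrets) / len(regrets)
```

Comparing the estimate against the $\sqrt{2nk}$ bound of the next theorem is only a sanity ceiling here: with $n = 3$ the regret can never exceed $3$ anyway, so the toy cannot stress the bound, only illustrate the mechanics.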

Theorem:

For $k$-armed adversarial Bernoulli bandits the minimax regret satisfies $R^*_n \leq \sqrt{2nk}$.

**Proof:**

We give the proof for the special case when the losses are binary, that is, $Y \in \{0,1\}^{n\times k}$, as this simplifies the proof a bit. In the notes we comment on the extension of the proof to the general case. Let $F : \cP_{k-1} \to \R$ be the convex function $F(p) = -2\sum_{a=1}^k \sqrt{p_a}$. An easy calculation shows that the Bregman divergence with respect to $F$ is

\begin{align*}

D_F(p, q) = \sum_{a : p_a \neq q_a} \frac{(\sqrt{p_a} - \sqrt{q_a})^2}{\sqrt{q_a}}\,.

\end{align*}
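An easy way to trust the "easy calculation" is to test it. The following check (our addition) verifies that the closed form above agrees with the gradient-based definition of the Bregman divergence at random interior points of the simplex:

```python
import math
import random

random.seed(3)

def F(p):
    return -2.0 * sum(math.sqrt(x) for x in p)

def bregman_def(p, q):
    # Definition via the gradient (valid in the interior of the simplex):
    # grad F(q)_a = -1 / sqrt(q_a).
    grad = [-1.0 / math.sqrt(x) for x in q]
    return F(p) - F(q) - sum(g * (a - b) for g, a, b in zip(grad, p, q))

def bregman_closed_form(p, q):
    # The closed form stated in the post.
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 / math.sqrt(b)
               for a, b in zip(p, q) if a != b)

def random_simplex(k):
    w = [random.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

ok = all(abs(bregman_def(p, q) - bregman_closed_form(p, q)) < 1e-9
         for p, q in ((random_simplex(5), random_simplex(5)) for _ in range(100)))
```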

Consider the algorithm that samples $A_t \sim P_{t-1}^*$: By this we also mean that $A_t$ is independent of $Y_t,\dots,Y_n$ given $\cF_{t-1}$. The algorithm that samples actions this way is usually called **Thompson sampling** and was briefly mentioned at the end of our last post. Then

\begin{align*}

\E_{t-1}[\Delta_t]

&= \E_{t-1}[Y_{tA_t} - Y_{tA^*}] \\

&= \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \left(\E_{t-1}[Y_{ta}] - \E_{t-1}[Y_{ta} \mid A^* = a]\right) \\

&\leq \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \sqrt{\sum_{x \in \{0,1\}} \left(\sqrt{\bbP_{t-1}(Y_{ta} = x)} - \sqrt{\bbP_{t-1}(Y_{ta} = x \mid A^* = a)}\right)^2} \\

&= \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \sqrt{\sum_{x \in \{0,1\}} \left(\sqrt{\bbP_{t-1}(Y_{ta} = x)} - \sqrt{\frac{\bbP_{t-1}(A^* = a \mid Y_{ta} = x) \bbP_{t-1}(Y_{ta} = x)}{\bbP_{t-1}(A^* = a)}}\right)^2} \\

&= \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \sqrt{\sum_{x \in \{0,1\}} \frac{\bbP_{t-1}(Y_{ta} = x)}{\bbP_{t-1}(A^* = a)}\left(\sqrt{\bbP_{t-1}(A^* = a)} - \sqrt{\bbP_{t-1}(A^* = a \mid Y_{ta} = x)}\right)^2} \\

&\leq \sqrt{k^{1/2} \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \sum_{x \in \{0,1\}} \bbP_{t-1}(Y_{ta} = x)\frac{\left(\sqrt{\bbP_{t-1}(A^* = a)} - \sqrt{\bbP_{t-1}(A^* = a \mid Y_{ta} = x)}\right)^2}{\sqrt{\bbP_{t-1}(A^* = a)}}} \\

&\leq \sqrt{k^{1/2} \sum_{a : P_{t-1,a}^* > 0} P_{t-1,a}^* \sum_{x \in \{0,1\}} \bbP_{t-1}(Y_{ta} = x) \sum_{

\substack{ c \,:\, \bbP_{t-1}(A^* = c) \ne\\

\bbP_{t-1}(A^* = c \mid Y_{ta} = x)}

}\frac{\left(\sqrt{\bbP_{t-1}(A^* = c)} - \sqrt{\bbP_{t-1}(A^* = c \mid Y_{ta} = x)}\right)^2}{\sqrt{\bbP_{t-1}(A^* = c)}}} \\

&= \sqrt{k^{1/2} \E_{t-1}[D_F(P_t^*, P_{t-1}^*)]}\,.

\end{align*}

Here, the first inequality is by Le Cam’s inequality, which states that the total variation distance is upper bounded by the Hellinger distance (see the notes). The second inequality follows from two applications of the Cauchy-Schwarz inequality, which for $p \in \cP_{k-1}$ and arbitrary $a \in \R^k$ shows that

\begin{align*}

\sum p_i a_i = \sum p_i^{1/4} p_i^{3/4} a_i

\leq \sqrt{\sum p_i^{1/2} \sum p_i^{3/2} a_i^2}

\leq \sqrt{k^{1/2} \sum p_i^{3/2} a_i^2}\,.

\end{align*}

We also used the fact that $P_{t-1,a}^* = \bbP_{t-1}(A^* = a)$ by definition and that $A_t$ and $Y_t$ are independent of each other under $\bbP_{t-1}$. A straightforward calculation shows that $\diam_F(\cP_{k-1}) \leq 2\sqrt{k}$. Then, by the previous theorem we have

\begin{align*}

BR_n \leq \sqrt{nk^{1/2} \diam_F(\cP_{k-1})} \leq \sqrt{2nk}\,.

\end{align*}

Since this holds for any prior, the minimax theorem allows us to conclude that $R^*_n = BR^*_n \leq \sqrt{2nk}$ as claimed.

**QED**
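Two steps of this proof lend themselves to numerical sanity checks (again our addition, not from the post): the Cauchy-Schwarz chain and the bound $\diam_F(\cP_{k-1}) \le 2\sqrt{k}$:

```python
import math
import random

random.seed(4)

def random_simplex(k):
    w = [random.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

k = 7

# Cauchy-Schwarz chain: sum_i p_i a_i <= sqrt(k^(1/2) * sum_i p_i^(3/2) a_i^2),
# for any simplex point p and arbitrary real vector a.
cs_ok = True
for _ in range(200):
    p = random_simplex(k)
    a = [random.uniform(-2, 2) for _ in range(k)]
    lhs = sum(pi * ai for pi, ai in zip(p, a))
    rhs = math.sqrt(math.sqrt(k) * sum(pi ** 1.5 * ai ** 2 for pi, ai in zip(p, a)))
    cs_ok = cs_ok and lhs <= rhs + 1e-9

# Diameter: F(p) = -2 sum_a sqrt(p_a) ranges between -2 sqrt(k) (uniform p)
# and -2 (a Dirac), so sup F(x) - F(y) <= 2 sqrt(k) - 2 <= 2 sqrt(k).
def F(p):
    return -2.0 * sum(math.sqrt(x) for x in p)

diam_est = max(F(random_simplex(k)) - F(random_simplex(k)) for _ in range(2000))
```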

# Connection to game theory

In this section we explore the connections between bandits and game theory. This is probably boring for anyone already familiar with the language of game theory. However, we hope that the rest of our readers may find the ideas enlightening.

A big part of game theory is concerned with the so-called two-player zero-sum games. Our setting can be viewed as such a game by treating the adversary as one of the players and the learner as the other.

In the language of game theory, adversarial bandits is a **deterministic, two-player, zero-sum, multistage (or sequential), imperfect information game** where one of the players is the learner and the other player is the environment. The environment has only one chance to make a move (a “move” is what game theory calls an action) and this happens at the beginning, when it chooses the outcome $y\in \cE$. After this, the learner makes $n$ “moves” $a_1,\dots,a_n\in [k]$ one by one, while it gradually learns about $y$ as dictated by the rules of bandit problems: that is, in round $t$, after making the move $a_t$, the learner observes $y_{t,a_t}$. This makes the problem an instance of **imperfect information multistage** (or sequential) games (perfect information would mean that each player gets to know all details of the opponent’s last move before it makes its own move). The environment receives the **payoff** $R_n(a,y) = \sum_{t=1}^n y_{t,a_t}-\min_{a\in [k]} \sum_{t=1}^n y_{t,a}$, while the learner receives the payoff $-R_n(a,y)$ (we have overloaded $R_n$ again). An equivalent way of saying this is that the environment is the **maximizing player**, while the learner is the **minimizing player**; we would then say that $R_n(a,y)$ is the payoff of the maximizing player and the **loss** of the minimizing player. The game is zero-sum because the sum of the payoffs of the two players is zero for each possible combination of the players’ moves. Because the game is zero-sum, what is a win for one of the players is a loss for the other; games of this type describe an inherently competitive scenario.

Finally, the game is **deterministic**, because the outcome is a fixed function of the choices of the players. Of course, both players are allowed to randomize, in which case we take as payoff the expected payoff. In particular, when the environment uses a distribution $\nu$ over $\cE$ and the player uses a policy $\pi$ to randomize the actions it takes, the expected payoff of the environment is exactly the Bayesian regret $BR_n(\pi,\nu)$.

In game theory lingo, the above is the **extensive form** description of the game: it describes who does what and when, and what they know, in addition to describing the payoff at the end. The **normal form** of the game would be obtained by “listing” all **pure strategies** for both players and tabulating the payoff for each pair of such pure strategies. Here, a pure strategy is the same as a deterministic policy. For the environment, a pure strategy is just a single outcome $y\in \cE$, while for the learner it means a deterministic policy $\pi\in \PiD$. To complete the language mix-up, what we called a policy is called a **behavior strategy** in game theory. The policies (or behavior strategies) of the environment are of course trivial (distributions over outcomes), as we restricted the environment to make a single choice at the beginning of the game.

In our game, both players have **perfect recall**: this just means that their respective strategies can depend on the full history of observations. (The opposite of perfect recall is imperfect recall, which models forgetful players.) Our game, when $\cE = \{0,1\}^{n\times k}$, is finite because the number of stages and the number of actions of both players are finite. However, this is not the case when $\cE = [0,1]^{n\times k}$.

The central question in two-player, zero-sum games is to figure out the players’ optimal strategies. The minimizing player’s optimal strategy is the one that achieves the smallest worst-case loss. In our case, this is $\overline{BR}_n =\inf_{\pi}\sup_{\nu} BR_n(\pi,\nu)$. The value $\overline{BR}_n$ is called the **minimax value** of the game. Any strategy $\pi^*$ that achieves this is a **minimax strategy**. Analogously, the maximizing player’s optimal strategy is the one that achieves the largest worst-case payoff. In our case, this is

$\underline{BR}_n= \sup_{\nu} \inf_{\pi} BR_n(\pi,\nu)$.

Note that earlier we denoted this value by $BR_n^*$ and called it the worst-case Bayesian regret. Any strategy that achieves this value is a **maximin strategy**. A maximin strategy is necessarily a strategy of the maximizing player, while a minimax strategy is necessarily a strategy of the minimizing player.

It is not hard to see that the minimax value is at least as large as the maximin value; in our case $\underline{BR}_n \le \overline{BR}_n$ . For this reason, the minimax value is also sometimes called the upper value, while the maximin value is called the lower value. When the minimax and maximin values coincide, we say that the **game has a value**. When the minimax and maximin strategies both exist, we say that the game is **strictly determined**. Connecting to saddle-point problems, we could also say that two-player zero-sum games define a saddle-point problem where the function $f$ is just the payoff function of the maximizing player. It also follows that the pair composed of the minimax and maximin strategies is a saddle.

The key step of the proof of our minimax theorem was the identification of policies with mixtures of deterministic policies. In game theory, this result translates to a celebrated theorem of Kuhn, which, in the language of game theory, states that in a finite, perfect recall game, the behavior strategies of a player can be identified with the set of mixtures of pure strategies of that player. This result is perfectly well suited to the case we considered in this post when the losses are binary valued, but is not suitable when they are continuous-valued because in this case the game is infinite. However, the result continues to hold when we restrict the distributions available to the environment to those that are finitely supported (see Note 3 for more information on how this restriction helps). This restriction seems to be nonstandard in game theory, at least we have not seen this type of restriction on the behavioral strategies. As noted above, this restriction makes perfect sense in our case and should perhaps be considered more frequently in game theory, too.

The upshot of all this terminology is that we can very succinctly summarize the proof presented for the case of binary losses: our minimax theorem is an immediate consequence of Kuhn’s theorem, von Neumann’s minimax theorem and $\eqref{eq:dualminmaxregret}$.

But how does our result relate to existing minimax theorems? After all, one suspects that game theorists must have looked at zero-sum, two-player, perfect recall, imperfect information, multistage deterministic games (what a mouthful!). Indeed, searching the literature, we can find various minimax theorems that cover similar cases. Although there is a rich and sophisticated literature on this topic, we are not aware of any result implying the above theorem (in the case when the losses can be continuous valued). A short sample of relevant related works is mentioned in the notes.

# Notes and bibliographic remarks

Note 1: Von Neumann’s minimax theorem is, of course, due to von Neumann. Sion’s minimax theorem is by Sion. A clean topological proof is by Komiya.

Note 2: The application of minimax duality to online learning seems to begin with Abernethy et al., who looked at online convex optimization (the so-called “full information setting”). Bubeck et al. applied the idea to prove the existence of algorithms with $\tilde O(\sqrt{n})$ regret for convex bandits in one dimension. The next note gives hints on the difficulties that one faces in works like these when the action and loss spaces are continuous. Some details seem to be missing in these works in this respect, but the ideas of this post (mainly, using finitely supported distributions) can help to fix them.

Note 3: When $\cE = \{0,1\}^{n \times k}$, a deterministic policy can be represented as a function $\pi : \cup_{t=1}^n \{0,1\}^{t-1} \to [k]$. Clearly the set of all such functions is finite. When $\cE = [0,1]^{n\times k}$, however, a deterministic policy is a function $\pi : \cup_{t=1}^n [0,1]^{t-1} \to [k]$. Obviously there are infinitely many such policies. Let $\PiD$ be the space of all such deterministic policies. Because $[k]$ is compact, $\PiD$ is compact with respect to the product topology by Tychonoff’s theorem. Note that $\PiD$ is still ‘very big’ for a compact set. In particular, it is neither sequentially compact nor metrizable. In order for $\PiD$ to be compact it must have ‘relatively few’ open sets, which makes proving continuity more challenging. For our problem the continuity of the regret eventually works out, but not without a little annoyance. In order to apply Sion’s theorem to this setting we need to be careful about how to define probability measures over deterministic policies. The right approach, as it turns out, is to consider the space of regular probability measures on $(\PiD, \borel(\PiD))$, which is compact with respect to the weak* topology by the Riesz representation theorem and the Banach-Alaoglu theorem. This result can be found in the second volume of Bogachev’s books on measure theory (Theorem 8.9.3). All the details are in our recent paper.

Note 4: Another difficulty, briefly mentioned beforehand, is that in the adversarial setting it is unnatural to restrict the set of policies to probability kernels. After all, in this case we never have to integrate over the loss space. The remedy is to choose $\cQ$ to be the set of finitely supported probability measures. After all, the requirements on $\cQ$ are that it should be convex (so that Sion’s theorem can be applied) and that it should contain the Dirac measures. The set of finitely supported probability measures is the smallest convex set that satisfies these two requirements. With this choice we avoid measurability issues, because all expectations over $\nu$ can be written as finite sums.

Note 5: The theorem bounding the Bayesian regret in terms of Bregman divergences is by us, building on the work of Russo and Van Roy, who specialized the result to the case where the potential function is the negentropy. In this case the expected Bregman divergence is the **mutual information** or **information gain**. Using this potential for bandits leads to a bound of $O(\sqrt{nk \log(k) / 2})$. The main difference is that Le Cam’s inequality is replaced by Pinsker’s inequality.

Note 6: These tools are quite flexible. In the post we proved an optimal bound for $k$-armed adversarial bandits, which also has the best known constants. We mentioned that they can be used for convex bandits, which was groundbreaking at the time (the results in that topic have since been improved). The analysis is known to work for linear bandits, as well as for partial monitoring. We plan to cover some of these results in future posts.

Note 7: Le Cam’s inequality bounds the total variation distance by the Hellinger distance. Suppose $\nu$ and $\mu$ are probability measures on a common measurable space and $\nu \ll \mu$. Then the Hellinger distance is

\begin{align*}
h(\nu, \mu) = \sqrt{\int \left(1 - \sqrt{\frac{d\nu}{d\mu}}\right)^2 d\mu}\,.
\end{align*}

The total variation distance is $\norm{\nu - \mu}_{\textrm{TV}} = \sup_{A} (\nu(A) - \mu(A))$, where the supremum is over all measurable sets $A$. When $X$ is a $[0,1]$-valued random variable, then

\begin{align*}
\int X d\nu - \int X d\mu \leq \norm{\nu - \mu}_{\textrm{TV}} \leq h(\nu, \mu)\,,
\end{align*}

which is the inequality we used in our analysis. You can find a proof in Chapter 2 of Tsybakov’s book.
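
To make the inequalities concrete, here is a small numerical check (our own illustration, not code from the analysis above) for Bernoulli distributions, where the Hellinger distance has the closed form $h(\nu,\mu)^2 = (\sqrt{p}-\sqrt{q})^2 + (\sqrt{1-p}-\sqrt{1-q})^2$ and the total variation distance is $|p-q|$:

```python
import math

def hellinger_bernoulli(p, q):
    """Hellinger distance between Bernoulli(p) and Bernoulli(q):
    h^2 = (sqrt(p) - sqrt(q))^2 + (sqrt(1-p) - sqrt(1-q))^2."""
    return math.sqrt((math.sqrt(p) - math.sqrt(q)) ** 2
                     + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2)

def tv_bernoulli(p, q):
    """Total variation distance between Bernoulli(p) and Bernoulli(q)."""
    return abs(p - q)

for p, q in [(0.5, 0.6), (0.1, 0.9), (0.3, 0.31)]:
    tv, h = tv_bernoulli(p, q), hellinger_bernoulli(p, q)
    assert tv <= h  # Le Cam: total variation <= Hellinger
    print(f"p={p}, q={q}: TV={tv:.4f} <= h={h:.4f}")
```

For $p=0.1$ and $q=0.9$ the bound is already fairly tight: the total variation is $0.8$ while the Hellinger distance is $\sqrt{0.8} \approx 0.894$.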

Note 9: The Bregman divergence is usually defined by $D_F(x, y) = F(x) - F(y) - \ip{\nabla F(y), x - y}$. The advantage of using the directional derivative is that there is no need to assume differentiability of $F$. Directional derivatives are always well defined for convex functions (though they can blow up at the boundary of the domain).
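
The directional-derivative definition is easy to check numerically. The sketch below (the function names and test points are our own, not from the post) approximates the one-sided directional derivative by a forward difference; for the negentropy on the simplex the resulting Bregman divergence agrees with the relative entropy, as the gradient formula predicts.

```python
import math

def negentropy(p):
    """F(p) = sum_a p_a log p_a (negentropy potential)."""
    return sum(x * math.log(x) for x in p)

def directional_derivative(F, y, d, t=1e-7):
    """One-sided directional derivative F'(y; d), approximated by a
    forward difference; exact in the limit t -> 0 for convex F."""
    return (F([yi + t * di for yi, di in zip(y, d)]) - F(y)) / t

def bregman(F, x, y):
    """D_F(x, y) = F(x) - F(y) - F'(y; x - y), defined via the
    directional derivative so differentiability need not be assumed."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return F(x) - F(y) - directional_derivative(F, y, d)

# For the negentropy, D_F(x, y) is the relative entropy KL(x, y):
x, y = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
kl = sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))
print(bregman(negentropy, x, y), kl)  # the two should agree closely
```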

Note 10: Using $F(p) = -2 \sum_{a=1}^k \sqrt{p_a}$ as a regularizer for online mirror descent leads to a bound of $\sqrt{8nk}$ for $k$-armed adversarial bandits. Is it a ‘coincidence’ that the same potential works here? We don’t know!

Note 11: We assumed the losses were Bernoulli in our analysis. Nothing goes wrong for losses in $[0,1]$. The analysis becomes slightly more technical because finite sums must be replaced with integrals and Bayes’ law must be applied with Radon-Nikodym derivatives.

Note 12: You can show that the adversary is not made more powerful by playing in $[0,1]^{n\times k}$ rather than $\{0,1\}^{n\times k}$, at least as far as the expected regret is concerned. To see this, suppose you have a policy such that $R_n \leq B$ for any bandit in $\{0,1\}^{n\times k}$. To apply this policy to an arbitrary bandit, simply observe the loss $y_{tA_t}$ and feed the algorithm $\tilde y_{tA_t}$ sampled from a Bernoulli distribution with mean $y_{tA_t}$. This does not change the expected regret, and the resulting meta-policy enjoys $R_n \leq B$ for any bandit in $[0,1]^{n\times k}$.
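
The reduction is one line of code. The sketch below (the name `binarize` and the sanity check are our own illustration) resamples each observed loss from a Bernoulli distribution with the same mean, which is exactly what the meta-policy feeds to the base algorithm.

```python
import random

def binarize(loss, rng):
    """Replace a loss y in [0,1] with a Bernoulli sample of mean y.
    Conditioned on the history, the expected loss seen by the base
    policy is unchanged, so the expected regret is preserved."""
    assert 0.0 <= loss <= 1.0
    return 1 if rng.random() < loss else 0

# Sanity check: the empirical mean of the binarized losses matches y.
rng = random.Random(0)
y = 0.3
mean = sum(binarize(y, rng) for _ in range(100_000)) / 100_000
print(mean)  # close to 0.3
```

In a bandit loop one would call `binarize` on each observed loss before passing it to a policy designed for $\{0,1\}$-valued losses.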

Note 13: As mentioned before, over the years game theorists have worked out a number of different approaches to derive minimax results for the multi-stage zero-sum two-player case. Among others, P. Bernhard (1992, “Information and strategies in dynamic games”, SIAM Journal on Control and Optimization, 30:1, 212-228) worked with the weak topology, Leao et al. (2000, “Nonstationary zero sum stochastic games with incomplete observation”, CDC, 2278-2283) worked with the so-called weak-strong topology, while Ghosh et al. (2004, “Zero-sum stochastic games with partial information”, Journal of Optimization Theory and Applications, 121:1, 99-118) used a reduction to completely observable games followed by dynamic programming.

Note 14: The analysis of Thompson sampling provides a bound on the Bayesian regret of this algorithm for any prior, which by the minimax theorem implies the *existence* of a policy whose adversarial regret is at most $\sqrt{nk}$. But we do not claim that Thompson sampling is this policy. This is obviously not the most desirable outcome. At present we do not have a constructive version of this approach, though it is worth remembering that the Bayesian regret is interesting in its own right. One direction towards finding minimax optimal policies is to investigate **equalizer** policies. Suppose $\nu$ is a prior maximizing the optimal Bayesian regret and $\pi$ is a policy for which $R_n(\pi, y)$ is constant on the support of $\nu$. Then $\pi$ must be a minimax optimal policy. The extent to which this observation is useful remains to be seen. A recent paper (which we have not yet read in detail) on one-armed bandits has some promising empirical results in a simple setting.
