- Bandits: A new beginning
- Finite-armed stochastic bandits: Warming up
- First steps: Explore-then-Commit
- The Upper Confidence Bound (UCB) Algorithm
- Optimality concepts and information theory
- More information theory and minimax lower bounds
- Instance dependent lower bounds
- Adversarial bandits
- High probability lower bounds
- Contextual bandits, prediction with expert advice and Exp4
- Stochastic Linear Bandits and UCB
- Ellipsoidal confidence sets for least-squares estimators
- Sparse linear bandits
- Lower bounds for stochastic linear bandits
- Adversarial linear bandits
- Adversarial linear bandits and the curious case of linear bandits on the unit ball

## Adversarial linear bandits and the curious case of the unit ball

According to the main result of the previous post, given any finite action set $\cA$ with $K$ actions $a_1,\dots,a_K\in \R^d$, no matter how an adversary selects the loss vectors $y_1,\dots,y_n\in \R^d$, as long as the action losses $\ip{a_k,y_t}$ are in a known interval, the exponential weights algorithm with linear loss estimation and an exploration strategy that mixes the distribution of the exponential weights algorithm with a fixed exploration strategy chosen to maximize information given $\cA$, enjoys a regret of at most $O(\sqrt{d n \log(K)})$. When the action set is continuous, like the unit ball, one can of course discretize it by putting a fine enough net over it. Optimizing the fineness of the discretization level, for convex action sets with bounded volume, this discretization approach delivers a regret of $O(d \sqrt{ n \log(n) })$: a discretization accuracy of $\epsilon=1/\sqrt{n}$ is sufficient to guarantee that the additional regret due to the discretized action set is bounded by $\sqrt{n}$ at most, and the cardinality of the resulting action-set is $O((1/\epsilon)^d) = O(n^{d/2})$.

Can we do better than this? As we shall see, the answer is yes, at least for the unit ball: In fact we will give an algorithm which avoids any kind of discretization (discretization is great for proving theoretical results, but the exponential-sized action-sets are impractical to work with), and in particular, the algorithm will be computationally efficient, while it will be shown to enjoy an improved regret bound of $O(\sqrt{dn})$. The algorithm given is based on the so-called mirror descent algorithm from optimization and online learning, which is a very powerful algorithm. In fact, the exponential weights algorithm is a special case of mirror descent.

# Online linear optimization and mirror descent

**Mirror descent** (MD) is an algorithm that originated from the convex optimization literature and has later been transported to the online learning literature. Here, we describe it in the online learning framework, more specifically, as an algorithm to solve online linear optimization problems.

In online linear optimization one is given a closed convex set $\cK\subset \R^d$ and the goal is to simultaneously compete with all elements of $\cK$ for all linear environments. an environment instance is just a sequence of vectors $g_1,\dots,g_n\in \R^d$. In round $t$, the learning algorithm chooses a point $x_t\in \cK$ from the constraint set (we assume a deterministic choice here), suffers the loss $\ip{g_t,x_t}$ and then observes $y_t\in \R^d$. The algorithm’s performance is measured using its regret. We will find it convenient to discuss the regret against some fixed competitor $x\in \cK$:

\begin{align*}

R_n(x) = \sum_{t=1}^n \ip{g_t,x_t} – \sum_{t=1}^n \ip{g_t,x}\,.

\end{align*}

Then, $R_n = \max_{x\in \cK} R_n(x)$ is the regret over $\cK$.

The basic version of MD has two extra parameters beyond $n$ and $\cK$: A learning rate $\eta>0$ and a convex, “distance generating function” $F: \cD \to \R$, where $\cD\subset \R^d$ is the domain of $F$. The function $F$ is also often called a “potential function”, or a “regularizer” for reasons that will become clear later.

In the first round MD predicts

\begin{align}

x_1 = \arg\min_{x\in \cK} F(x)\,.

\end{align}

In round $t\ge 1$, after $g_t$ is observed, the prediction $x_{t+1}$ for the next round is computed using

\begin{align}

x_{t+1} = \argmin_{x\in \cK \cap \cD} \eta \ip{g_t,x} + D_F(x,x_t)\,.

\label{eq:mdoneline}

\end{align}

Here, $D_F(x,x_t)$ is the so-called $F$-induced **Bregman divergence** of $x$ from $x_t$ defined by

\begin{align}

D_F(x,x_t) = F(x) – F(x_t) – \ip{ \nabla F(x_t), x-x_t }

\end{align}

with $\nabla F(x_t)$ denoting the derivative of $F$ at $x_t$ (i.e., $\nabla F: \mathrm{dom}(\nabla F) \to \R^d$). The minimization in \eqref{eq:mdoneline} is over $\cK\cap \cD$ because the domain of $D_F(\cdot,x_t)$ if the same as that of $F$.

To get a sense of the divergence function $D_F$, note that $D_F(\cdot,y)$ is the difference between $F(\cdot)$ and its first-order Taylor expansion, $F(y)+\ip{\nabla F(y),\cdot-y}$, about the point $y\in \mathrm{dom}(\nabla F)$. Since $F$ is convex, the linear approximation of $F$ is a lower bound on $F$ and so $D_F(\cdot,y)$ is nonnegative over its domain with $D_F(y,y)=0$ (see the **figure** below for an illustration). Furthermore, $D_F(\cdot,y)$ is convex as it is the sum of a convex and a linear function. Hence, \eqref{eq:mdoneline} is a convex optimization problem.

Since $D_F(x,y)$ is defined only when $y\in \mathrm{dom}(\nabla F)$, for the definition \eqref{eq:mdoneline} to make sense we need to have that $F$ is differentiable at all of $x_1,\dots,x_{t-1}$. For our general regret bound we will thus assume that the following holds:

\begin{align}

x_t\in \mathrm{dom}(\nabla F)\,, \qquad \forall t\in [n+1]\,.

\label{eq:xtingraddomain}

\end{align}

Note that $x_n,x_{n+1} \in \mathrm{dom}(\nabla F)$ is not needed for MD to be well defined on the horizon $n$; we included these as they will be convenient for our analysis (here, $x_{n+1}$ is defined by \eqref{eq:mdoneline} as expected). A sufficient condition for \eqref{eq:xtingraddomain} to hold is that $F$ is a turns out to be a Legendre function (the operations in the definition $F$ are defined in a componentwise manner).

## Mirror maps

The update rule of MD aims to achieve two goals: making the loss of $x_{t+1}$ under $g_t$ small, while at the same time staying close to the previous estimate, thus buying stability for the algorithm. When $F(x) = \frac12 \norm{x}_2^2$ with $\cD = \R^d$ and $\R^d$, we get $\nabla F(x) = x$, $D_F(x,x_t) = \frac12 \norm{x-x_t}_2^2$ and the update rule just corresponds to gradient descent:

\begin{align}

x_{t+1} = x_t – \eta g_t\,.

\label{eq:mdgd}

\end{align}

More generally, when $\cK$ is compact nonempty convex set, the update rule corresponds to projected gradient descent: After computing the gradient update as in \eqref{eq:mdgd}, the resulting vector $\tilde x_{t+1}$, if it is lying outside of $\cK$, needs to be projected back to $\cK$ by calculating the point of $\cK$ which is closest to $\tilde x_{t+1}$ in the Euclidean metric.

This two-step computation works also with other appropriate distance generating functions $F$. In particular, when, for example, $F$ is Legendre, one can show that \eqref{eq:mdoneline} is equivalent to the following two-step method:

\begin{align}

\tilde{x}_{t+1} &= (\nabla F)^{-1}( \nabla F(x_t) – \eta g_t )\,, \label{eq:mdunnorm}\\

x_{t+1} &= \argmin_{x\in \cK\cap \cD} D_F(x,\tilde x_{t+1})\,.\label{eq:mdproj}

\end{align}

Here, $\tilde{x}_{t+1} = \argmin_{x\in \cD} \eta \ip{g_t,x} + D_F(x,x_t)$ is the “unconstrained” optimizer of the objective function minimized by MD. For this to be an unconstrained optimizer, what is needed is that $\nabla F$ blows up at the “boundary” of $\cD$ and that $\cD^\circ$, the interior of $\cD$ is nonempty and is the same as the domain of $\nabla F$, the defining properties of Legendre functions.

When $F$ is Legendre, one can show that $\tilde x_{t+1} \in \cD^\circ$. Thus, in this case differentiating the objective function, $\tilde x_{t+1}$ must satisfy

\begin{align}

0=\eta g_t + \nabla F(\tilde x_{t+1}) – \nabla F(x_t)\,.

\label{eq:uncopt}

\end{align}

from which one obtains \eqref{eq:mdunnorm}. The second step \eqref{eq:mdproj} takes the unconstrained optimizer and projects it to $\cK\cap \cD$. The update rules \eqref{eq:mdunnorm}-\eqref{eq:mdproj} explain the name of the algorithm: The algorithm operates in two spaces: The **primal space** where the predictions “live”, and the **dual space**, where the slopes, or gradients “live”. The name “dual” comes from the viewpoint that gradients, or derivatives are linear functionals over the primal space, which thus “live” in the space of linear functionals, which is also called the “dual”. With this language, the “mirror functions” $\nabla F$ and $(\nabla F)^{-1}$ map between subsets of the primal and dual spaces.

Back to the algorithm, from \eqref{eq:mdunnorm} we see that MD first takes the last prediction $x_t$ and maps it to the dual space through $\nabla F$, where it is combined with the observed loss vector $g_t$ (which has the nature of a gradient). This can be thought as the gradient update step. Next, the dual vector obtained is mapped back to the primal space via $(\nabla F)^{-1}$ to get $\tilde x_{t+1}$. Since this can lie outside of the constraint set, it is mapped back to it in the final step \eqref{eq:mdproj}. For the mathematically or computer-language minded, it should be gratifying that the loss vector $g_t$, a “dual type quantity” is combined with another “dual type quantity” rather than with a “primal quantity”, i.e., the “types” of these quantities are respected. In particular, the gradient update \eqref{eq:mdgd} gains new meaning through this: Here $x_t$ is really $\nabla F(x_t)$ for $F(x) = \frac12 \norm{x}_2^2$, which of course, happens to be equal to $x_t$. However, now we know that this is a mere coincidence!

However pleasing the thought of type-compatibility may be, it remains to be seen that the generality of MD will help. As we shall soon see the regret of MD depends on two quantities: the “diameter” of $\cK$ when measured via $F$, and the magnitudes of $g_t$ when measured in a metric compatible with $F$. One can then show that for many sets $\cK$, by choosing $F$ in an appropriate manner, MD becomes a **dimension-free** algorithm in the sense that its regret will be independent of the dimension $d$. This is a huge advantage when the dimension $d$ can be large, as is often the case in modern applications.

A specific example where choosing a non-Euclidean $F$ is beneficial is when $\cK$ is an $\ell^p$-ball. In this case, an appropriate $F$ gives dimension-free regret bounds, while using $F(x) = \norm12 \norm{x}_2^2$ does not share this quality (the reader is invited to check these after we state our general result). Another special case, mentioned earlier, is when $\cK$ is the simplex of $\R^d$, $F$ is the unnormalized negentropy, $\norm{g_t}_\infty\le 1$ (so that $|\ip{x,g_t}|\le 1$ for any $t$ and $x\in \cK$). While in this case the regret will not be dimension-free, the dimension dependence as compared to standard gradient descent is reduced from a linear dependence to a logarithmic dependence, an exponential improvement. On the top of this, the resulting algorithm, exponential weights, is trivial to implement, while a Euclidean projection to the simplex, is more complex to implement.

## The regret of MD

We start with a simple result from convex analysis. As is well known, the minimizer $x^*$ of a differentiable function $\psi:\R^d \to \R$ must satisfy $\nabla \psi(x^*)=0$, which is known as the **first-order optimality condition** (first-order, because the first derivative of $\psi$ is taken). When $psi$ is convex, this is both a sufficient and necessary condition (this is what we like about convex functions). The first-order optimality condition can also be generalized to constrained minima. In particular, if $\cK\subset \R^d$ is a non-empty convex set and $\psi$ is as before then

\begin{align}

x^*\in \argmin_{x\in \cK} \psi(x) \Leftrightarrow \forall x\in \cK: \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.

\label{eq:firstorderoptfull}

\end{align}

The necessity of this condition is easy to understand by a geometric reasoning as shown on the picture on the right: Since $x^*$ is a minimizer of $\psi$ over $\cK$, $-\nabla \psi(x^*)$ must be the outer normal of the supporting hyperplane $H_{x^*}$ of $\cK$ at $x^*$ otherwise $x^*$ could be moved by a small amount while staying inside $\cK$ and improving the value of $\psi$. Since $\cK$ is convex, it thus lies entirely on the side of $H_{x^*}$ that $\nabla \psi(x^*)$ points into. This is clearly equivalent to \eqref{eq:firstorderoptfull}. The sufficiency of the condition also follows from this geometric viewpoint as the reader may verify.

The above statement continues to hold with a small modification even when $\psi$ is not everywhere differentiable. In particular, in this case the equivalence \eqref{eq:firstorderoptfull} holds for any $x^*\in \mathrm{dom}(\nabla \psi)$ with the modification that on both sides of the equivalence, $\cK$ should be replaced by $\cK \cap \dom(\psi)$:

Proposition (first-order optimality condition): Let $\psi:\dom(\psi)\to\R$ be a convex function, $\cK\not=\emptyset$, $\cK\subset \R^d$ convex. Then for any $x^*\in \dom(\nabla \psi)$, it holds that

\begin{align}

x^*\in \argmin_{x\in \cK\cap \dom(\psi)} \psi(x)\\

\Leftrightarrow \forall x\in \cK\cap \dom(\psi): \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.

\label{eq:firstorderopt}

\end{align}

With this we are ready to start bounding the regret of MD. As the potential function $F$ will be kept fixed, to minimize clutter we abbreviate $D_F$ to $D$ in the expressions below.

In round $t\in [n]$, the instantaneous regret of MD against $x\in \cK$ is $\ip{g_t,x_t-x}$. As $x_{t+1} = \argmin_{x\in \cK \cap \cD} \psi_t(x)$ where $\psi_t:\cD \to \R$ is given by $\psi_t(x) = \eta \ip{g_t,x} + D(x,x_t)$, the above first-order optimality condition immediately gives the following result:

Proposition (Instantaneous regret of MD): Let $x\in \cK \cap \cD$. Assume that \eqref{eq:xtingraddomain} holds. Then, for any $t\in [n]$,

\begin{align}

\begin{split}

\ip{g_t,x_t-x} & \le \frac{D(x,x_t)-D(x,x_{t+1})}{\eta} \\

& \qquad + \ip{g_t,x_t-x_{t+1}} – \frac{D(x_{t+1},x_t)}{\eta}\,.

\end{split}

\label{eq:mdir1}

\end{align}

**Proof**: Fix $t\in [n]$. By \eqref{eq:xtingraddomain}, $x_{t+1} \in \dom(\nabla \psi_t)$. Hence, we can apply the first-order optimality proposition to $x_{t+1}$ to get that

\begin{align*}

\ip{ \nabla \psi_t(x_{t+1}),x-x_{t+1} } \ge 0\,.

\end{align*}

Plugging in the definition of $\psi_t$ and lengthy but simple algebraic manipulations give \eqref{eq:mdir1}.

QED.

When summing up the instantaneous regret, the first term on the right-hand side of \eqref{eq:mdir1} telescopes and gives $\frac{D(x,x_1)-D(x,x_{n+1})}{\eta}$. Since $D$ is nonnegative, this can be upper bounded by $D(x,x_1)/\eta$. Further, thanks to the choice of $x_1$, using again the first-order optimality condition, we get that this is upper bounded by $(F(x)-F(x_1))/\eta$:

\begin{align}

\sum_{t=1}^n \frac{D(x,x_t)-D(x,x_{t+1})}{\eta}

\le \frac{F(x)-F(x_1)}{\eta}\,.

\end{align}

To bound the remaining terms in \eqref{eq:mdir1}, we may try to upper bound the inner product $\ip{g_t,x_t-x_{t+1}}$ here by bringing in a norm $\norm{\cdot}$ and its dual $\norm{\cdot}_*$:

\begin{align}

\ip{g_t,x_t-x_{t+1}} \le \norm{g_t}_* \norm{x_t-x_{t+1}}\,,

\label{eq:mdholder}

\end{align}

where we may recall that for a norm $\norm{\cdot}$ its dual $\norm{\cdot}_*$ is defined via $\norm{g}_* = \sup_{x: \norm{x}\le 1} \ip{g,x}$. To combine this with $-D(x_{t+1},x_t)$ is it beneficial if this can be lower bounded by some expression that involves the same norm. To get inspiration consider the case when $F(x) = \frac12 \norm{x}_2^2$. As noted earlier, in this case $D=D_F$ is given by $D_F(x,y) = \frac12 \norm{x-y}_2^2$. This gives the idea that perhaps we should assume that

\begin{align}

D_F(x,y) \ge \frac12 \norm{x-y}^2\,, \quad \forall (x,y)\in \dom(F)\times \dom(\nabla F)\,,

\label{eq:mddfsoc}

\end{align}

which is equivalent to requiring that $F$ is strongly convex w.r.t. $\norm{\cdot}$. To merge \eqref{eq:mdholder} and $-\frac{1}{2\eta} \norm{x_{t}-x_{t+1}}^2$, we may further upper bound the right-hand side of \eqref{eq:mdholder} via using $2ab \le a^2+b^2$. Choosing $a = \eta^{1/2} \norm{g_t}_*$ and $b = \eta^{-1/2} \norm{x_t-x_{t+1}}$, we get that

\begin{align}

\ip{g_t,x_t-x_{t+1}} – \frac{D(x_t,x_{t+1})}{\eta} \le \frac{\eta}{2} \norm{g_t}_*^2\,.

\end{align}

For increased generality (which will prove to be useful later) we may let the $\norm{\cdot}$ be chosen as a function of the round index $t$, leading to the following bound:

Theorem (Regret of MD): Let $\eta>0$, $F$ be a convex function with domain $\cD$. Assume that \eqref{eq:xtingraddomain} holds. Further, for each $t\in [n]$, let $\norm{\cdot}_{(t)}$ be a norm such that

\begin{align}

D_F(x_t,x_{t+1}) \ge \frac12 \norm{ x_t – x_{t+1} }^2_{(t)}

\label{eq:bregsoc}

\end{align}

and let $\norm{\cdot}_{(t)^*}$ be the dual norm of $\norm{\cdot}_{(t)}$. Then, the regret $R_n(x)$ of mirror descent against any $x\in \cK \cap \cD$ satisfies the bound

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \norm{g_t}_{(t)^*}^2\,.

\end{align}

While the theorem is presented for the case of a fixed sequence $\{g_t\}_t$, it is not hard to see that the same bound holds even if $g_t$ is chosen as a function $y_1,x_1,\dots,y_{t-1},x_t$, i.e., when the environment is “strategic”. This is entirely analogous to how the basic regret bound for exponential weights continues to hold in the face of strategically chosen losses.

As an example of how to use this theorem consider first the case of $\cK = B_2^d$ (the unit ball in $\R^d$ with respect to the $2$-norm) with $F(x) = \frac12 \norm{x}_2^2$. Note that $\cD = \dom(\nabla F) = \R^d$ and thus \eqref{eq:xtingraddomain} is automatically satisfied. We have $\diam_F(\cK) = \sup_{x\in \cK} F(x) – F(x_1)\le 1$. Choosing $\norm{\cdot}_{(t)} = \norm{\cdot}_2$, we have $\norm{\cdot}_{(t)^*} = \norm{\cdot}_2$. Assume that $\{g_t\}$ is leads to bounded losses and in particular, $|\ip{x,g_t}|\le 1$ for all $t\in [n]$ and $x\in \cK$. Note that this holds if and only if $\norm{g_t}_2\le 1$. Thus we get that the regret of MD (which is just the projected gradient descent algorithm in this case) satisfies

\begin{align}

R_n(x) \le \frac1\eta + \frac{\eta n}{2} = \sqrt{2n}\,,

\end{align}

where for the last step we set $\eta = \sqrt{\frac{2}{n}}$.

As a second example consider the case when $\cK$ is the unit simplex of $\R^d$: $\cK = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$. Note that $\cK$ lies in a $d-1$-dimensional hyperplane of $\R^d$. In this case, choosing $F(x) = x\log (x) – x$, the unnormalized negentropy function, $\cD = [0,\infty)^d$ and we find that $x_1 = (1/d,\dots,1/d)$, since $F$ is negative valued, $\diam_F(\cK) = F(x) – F(x_1) \le -F(x_1) =\log(d)$. Further, $D_F(x,y) = \KL(x,y)=\sum_i x_i \log(x_i/y_i)$ is the relative entropy of $x$ given $y$, which is known to satisfy

\begin{align}

D_F(x,y) \ge \frac12 \norm{x-y}_1^2\,

\label{eq:pinsker}

\end{align}

(this inequality is known as Pinsker’s inequality). Hence, we choose $\norm{\cdot}_{(t)} = \norm{\cdot}_1$ and thus $\norm{\cdot}_{(t)^*} = \norm{\cdot}_\infty$. In this case the assumption that the losses are properly normalized imply that $\norm{g_t}_\infty \le 1$. Putting things together we get that the regret of MD satisfies

\begin{align}

R_n(x) \le \frac{\log d}{\eta} + \frac{\eta n}{2} = \sqrt{2 \log(d) n }\,,

\end{align}

which matches the regret bound we derived earlier for the exponential weights algorithm. Curiously, MD with this choice is in fact the exponential weights algorithm. What is more, there is a one-to-one map of the steps of the two respective proofs, as the reader may verify.

What would have happened if we used $F(x) = \frac12 \norm{x}_2^2$ with $\norm{\cdot}_{(t)} = \norm{\cdot}_2$ instead of the unnormalized negentropy function? While in this case $\diam_F(\cK) \le 1$, $\norm{g_t}_2^2$ could be as large as $d$ (e.g., $g_t = (1,\dots,1)$), which gives $R_n(x) \le \sqrt{2 d n }$. In fact, this is real: MD with this choice can suffer a much larger regret than MD with the unnormalized negentropy potential. Thus we see that at least in this case the increased generality of MD allows a nontrivial improvement of the regret.

The reader is also invited to check the regret bounds that follow with various potential functions when $\cK$ is the unit $\ell^p$ ball, i.e., $\cK = \{ x\,: \norm{x}_p \le 1 \}$ where $\norm{x}_p$ is defined through $\norm{x}_p^p = \sum_i |x_i|^p$, while the loss per round in still in $[-1,1]$. For example, what is the regret for $F(x) = \frac12 \norm{x}_2^d$, or $F(x) = \frac12 \norm{x}_p^2$? For the calculation it may be useful to know that $F$ is $(p-1)$ strongly convex with respect to $\norm{\cdot}_p$, as the reader may also verify.

From these examples we see that two things matter when using MD with some potential function $F$ and a norm $\norm{\cdot}$: the magnitude of $\diam_F(\cK)$ and that of $G = \sup \{ \norm{g}_* \,: \sup_{x\in \cK} |\ip{g,x}| \le 1 \}$. It follows that if $\cK$ is a subset of the unit ball underlying $\norm{\cdot}$, $\norm{g}_* = \sup_{x:\norm{x}\le 1} |\ip{g,x}| \le \sup_{x\in \cK} |\ip{g,x}| \le 1$. Thus, for a subset of unit balls $G\le 1$ regardless the choice of $F$. Thus, in this case the choice of $F$ boils down to minimizing the diameter of $\cK$ under $F$ subject to the constraint \eqref{eq:bregsoc}.

Our next goal will be to design an algorithm based on MD when the action set $\cA$ is the $d$-dimensional unit ball $B_2^d$. It turns out that for this it will be useful to further upper bound the regret of MD. In particular, this way we will get a bound that will be easier to bound than bounding the regret directly. For this, we will assume that $F:\cD\to \R$ is Legendre with a “large enough” domain.

Theorem (Regret for MD with Legendre potential): Let $F:\cD \to \R$ be Legendre such that $\cK \subset \bar \cD$. For any $x\in \cK$, the regret $R_n(x)$ of MD against $x$ satisfies

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_F(x_t,\tilde x_{t+1}) \,.

\end{align}

**Proof**: Note that \eqref{eq:xtingraddomain} is satisfied because $\bar \cD \subset \cK$ and $F$ is Legendre. We start from \eqref{eq:mdir1}. Earlier, we argued that the sum of first terms on the right-hand side of this is bounded by $\frac{F(x)-F(x_1)}{\eta}$. Hence, it remains to bound the remaining terms. We claim that

\begin{align*}

\ip{g_t,x_t-x_{t+1}} – \frac{D_F(x_{t+1},x_t)}{\eta}

= D_F(x_t,\tilde x_{t+1}) – D_F(x_{t+1},\tilde x_{t+1})\,,

\end{align*}

from which the results follow since $D_F$ is nonnegative. Now, the above identity can be shown using some algebra and the optimality property of $x_{t+1}$, namely \eqref{eq:uncopt}.

QED.

The reason this result is useful is because $D_F(x_t,\tilde x_{t+1})$ can also be written as the Bregman divergence of $\nabla F(\tilde x_{t+1})$ from $\nabla F(x_t)$ with respect to what is known as the Legendre-Fenchel dual $F^*$ of $F$ and oftentimes this divergence is easier to bound than the expression we had previously. The Legendre-Fenchel dual of $F$ is defined via $F^*(g) \doteq \sup_{x\in \cD} \ip{g,x} – F(x)$. Now, the identity mentioned states that for any $F$ Legendre and any $x,x’ \in \dom(\nabla F)$,

\begin{align}

D_F(x,x’) = D_{F^*}(\nabla F(x’), \nabla F(x))\,.

\label{eq:dualbregman}

\end{align}

In particular, this also means that

\begin{align}

R_n(x) \le

\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_{F^*}(\nabla F(\tilde x_{t+1}), \nabla F( x_{t}) ) \,.

\end{align}

For calculations it is often easier to obtain $F^*$ from the identity $\nabla F^*(g) = (\nabla F)^{-1}(g)$, which holds for all $g\in \dom(\nabla F^*)$. This means that one can find $F^*$ by calculating the primitive function of $(\nabla F)^{-1}$.

# Linear bandits on the unit ball

In accordance with the notation of the previous post, the sequence of losses will be denoted by $y_t$ (and not $g_t$ as above) and we shall assume that

\begin{align}

\sup_{t,a\in \cA} |\ip{y_t,a}| \le 1\,.

\label{eq:mdboundedloss}

\end{align}

Per our discussion this is equivalent to $\max_t \norm{y_t}_2\le 1$.

To use MD we will use randomized estimates of the losses $\{y_t\}$. In particular, in round $t$ the bandit algorithm will choose a random action $A_t\in \cA$ based on which an estimate $\tilde Y_t$ of $y_t$ will be constructed using the observed loss $Y_t = \ip{y_t,A_t}$.

Let $F$ be a convex function on $\cD\subset \R^d$ to be chosen later, let $X_1=x_1= \argmin_{x\in \cA\cap \cD} F(x)$, and for $t\in [n-1]$ let $X_{t+1}$ be the output of MD when used with learning rate $\eta$ on the input $\tilde Y_t$:

\begin{align*}

X_{t+1} = \argmin_{x\in \cA\cap \cD} \eta \ip{\tilde Y_t, x} + D_F(x,x_t)\,.

\end{align*}

(The upper case letters signify that $X_t$ is random, owing to that $A_1,\dots,A_{t-1}$, hence $\tilde Y_1,\dots,\tilde Y_{t-1}$ are random.) We will make sure that $F$ is such that the minimum here can be taken over $\cA$. For uniformity, we also use an uppercase letter to denote the first choice $X_1$ of MD. As long as

\begin{align}

\EE{ A_t|X_t } = X_t\,, \qquad \text{ and } \qquad \EE{ \tilde Y_t| X_t } = y_t

\label{eq:md_atyt}

\end{align}

hold for all $t\in [n]$, the regret $R_n(x)= \EE{ \sum_{t=1}^n \ip{y_t,A_t} – \ip{y_t,x} }$ of this algorithm satisfies

\begin{align*}

R_n(x)

& = \EE{ \sum_{t=1}^n \ip{y_t,X_t} – \ip{y_t,x} }

= \EE{ \sum_{t=1}^n \ip{\tilde Y_t,X_t} – \ip{\tilde Y_t,x} }\,.

\end{align*}

Notice that the last expression inside the expectation is the (random) regret of mirror descent on the (recursively constructed) sequence $\tilde Y_t$. Hence, when the conditions of our theorem are satisfied, for any $x\in \cA \cap \cD$, for $F$ Legendre,

\begin{align}

\label{eq:mdlinbandit}

R_n(x) \le \frac{F(x)-F(x_1)}{\eta } +

\frac{1}{\eta} \sum_{t=1}^n \EE{ D_{F^*}( \nabla F(\tilde X_{t+1}), \nabla F(X_t) ) }\,.

\end{align}

Thus, it remains to choose $A_t,\tilde Y_t$ and $F$ with domain $\cD$ such that $\cK \subset \bar \cD$.

Unsurprisingly, $A_t$ and $\tilde Y_t$ will be chosen similarly to what was done in the previous post. In particular, $\tilde Y_t$ will be a linear estimator with a small twist that will simplify some calculations, while $A_t$ will be selected at random to be either $X_t$ or an exploratory action $U_t$ drawn randomly from some exploration distribution $\pi$ supported on $\cA$. The calculation of the previous post suggests that a $G$-optimal exploration distribution should be used. For the unit ball, it turns out that we can simply choose $\pi$ as the uniform distribution on $\{ \pm e_i \}$ where $\{e_i\}_i$ is the standard Euclidean basis. That this is $G$-optimal follows immediately from the Kiefer-Wolfowitz theorem.

To formally define $A_t$, let $P_t\in (0,1]$ be the probability of exploring in round $t$ ($P_t$ will be chosen appropriately later), let $E_t \sim \mathrm{Ber}(P_t)$, $U_t\sim \pi$ such that $E_t$ and $U_t$ are independent given the information available at the time when they are chosen. Then, to satisfy $\EE{A_t|X_t} = X_t$, we can set

\begin{align}

A_t = E_t U_t + \frac{1-E_t}{1-P_t} X_t\,.

\end{align}

Since $A_t\in \cA = B_2^d$ must hold this gives the constraint that

\begin{align}

P_t \le 1-\norm{X_t}_2 \,.

\label{eq:ptconstraint}

\end{align}

This also means that one must somehow ensure that $\norm{X_t}_2<1$ holds for all $t\in [n]$. One simple way to enforce this is to replace the domain used in MD by $(1-\gamma) \cA$ for some small $\gamma>0$. Since this introduces at most a factor of $\gamma n$ in the regret, as long as $\gamma$ is small enough, the additional regret due to this modification will be negligible.

Having chosen $A_t$, it remains to construct $\tilde Y_t$. Consider

\begin{align*}

\tilde Y_t = Q_t^{-1} E_t A_t Y_t\,, \quad Q_t = \E_t{ E_t A_t A_t^\top }\,,

\end{align*}

where $\E_t{Z} = \EE{Z| X_1,A_1,Y_1,\dots,X_{t-1},A_{t-1},Y_{t-1},X_t }$. Note that this is a slightly modified version of the linear estimator and thus $\E_t{\tilde Y_t} = y_t$. The modification is that in this estimator $A_t$ is multiplied by $E_t$, which means that the estimate is $0$ unless an exploratory action is chosen. While this is not necessary, it simplifies the calculations and in fact slightly improves the bound that we derive. In particular, we have $E_t A_t = U_t$ and hence $Q_t = \frac{P_t}{d} I$ and thus

\begin{align*}

\tilde Y_t = \frac{d\, E_t}{P_t} U_t U_t^\top y_t\,.

\end{align*}

It remains to choose $F$ and the norms. One possibility would be $F(x) = \frac12 \norm{x}_2^2$ which is an appealing choice both because it leads to simple calculations and also because in the full information setting (i.e., when $y_t$ is observed in every round) this choice enjoyed a dimension-free regret bound. It is easy to check that $F=F^*$, hence $D_{F^*}(u,v) = \frac12 \norm{u-v}_2^2$. Hence,

\begin{align}

\frac1\eta \E_t D_{F^*}( \nabla F( \tilde X_{t+1} ), \nabla F( X_t ) )

&=

\frac{\eta}{2} \E_t[ \norm{\tilde Y_t}_2^2 ]

=

\frac{d^2}{P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top U_t U_t^\top y_t ]\nonumber \\

&=

\frac{d^2}{P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top y_t ]

=

\frac{d}{P_t} \norm{y_t}_2^2 \le \frac{d}{P_t}\,,

\label{eq:twonormbound}

\end{align}

where the first equality used identity \eqref{eq:uncopt}. To finish, one could choose $P_t = \gamma$ and set the domain of MD to $(1-\gamma) \cA$, but even with these modifications, the best bound that can be derived scales only with $n^{2/3}$, which is not what we want.

Hence, if the other parameters of the construction are kept fixed, we need to go beyond the Euclidean potential. The potential we consider is

\begin{align*}

F(x) = -\log(1-\norm{x}) – \norm{x}\,, \quad x\in \cD = \{ x’\,:\, \norm{x’}<1 \}\,.
\end{align*}
This can easily be seen Legendre. Further, $\nabla F(x) = \frac{x}{1-\norm{x}}$, and also $F^*(g)=-\log(1+\norm{g})+\norm{g}$.
Using that for $x\ge -1/2$, $\log(1+x)\ge x-x^2$, one can show that for
\begin{align}
\frac{\norm{g}-\norm{h}}{1+\norm{h}}\ge -1/2\,,
\label{eq:ghapart}
\end{align}
it holds that
\begin{align}
D_{F^*}(g,h) & = \frac{1}{1+\norm{h}}\norm{g-h}^2\,.
\end{align}
Plugging in $g=\nabla F(\tilde x_{t+1})$ and $h=\nabla F(x_t)$, we see that \eqref{eq:ghapart} is satisfied, hence, thanks to $1/(1+\norm{\nabla F(X_t)}) = 1-\norm{X_t}$, after some calculation we get
\begin{align*}
\E_t D_{F^*}( \nabla F(\tilde X_{t+1}),\nabla F(x_t))
\le d\,.
\end{align*}
Putting together things, again running MD on $(1-\gamma)B_2^d$ and choosing $\eta$ and $\gamma$ appropriately, we get that
\begin{align*}
R_n(x)\le 3 \sqrt{dn\log(n)}\,.
\end{align*}
Surprisingly, this is smaller than the regret that we got for stochastic bandits on the unit ball by a factor of at least $\sqrt{d}$. However, it will be the subject of the next post to discuss how this could happen.

# Notes and departing thoughts

**Computational complexity of MD**: Usually, \eqref{eq:mdunnorm} is trivial to implement, while \eqref{eq:mdproj} can be quite expensive. However, in some important special cases, the projection can be computed efficiently, even in closed form, e.g., for the simplex and the negentropy potential, or the Euclidean unit ball and the Euclidean potential $F(x) = \frac12\norm{x}_2^2$. In general, one can of course use a convex solver to implement this step, but then one can also use the convex solver to find an approximate optimizer of \eqref{eq:mdoneline} directly.

**Legendre functions**: A strictly convex map $F:\cD \to \R$ is Legendre if the interior $\cD^\circ$ of $\cD$ is nonempty, $\mathrm{dom}(\nabla F) = \cD^\circ$ ($F$ is differentiable over the interior of its domain) and $\nabla F(x)$ diverges as $x$ approaches the boundary of $\cD$. Formally, this last condition means that for any $\{x_n\}_n \subset \mathrm{dom}(\nabla F)$ such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD = \bar \cD \setminus \cD^\circ$, $\norm{ \nabla F(x_n) } \to \infty$ as $n\to\infty$. Note that since in a finite-dimensional space all norms are equivalent, it does not matter which norm is used in the above definition. One can show that if the distance generating map $F$ is Legendre then the sequence $\{x_t\}_{t\in [n]}$ computed by MD always satisfies \eqref{eq:mdunnorm}-\eqref{eq:mdproj}, and \eqref{eq:xtingraddomain} also holds.

**Simplex, negentropy, exponential weights**: A canonical example that is instructive to illustrate the concept of Legendre functions is when $\cK = \Delta_d = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$ is the simplex embedded into $\R^d$ and $F(x) = \sum_i x_i \log(x_i)-x_i$ is the unnormalized negentropy function. Here, the domain $\cD$ of $F$ is $\cD = [0,\infty)^d$ where we use the convention that $0 \log(0)=0$ (which is sensible since $\lim_{s\to 0+} s \log(s) = 0$). Then it is easy to see that $\dom(\nabla F) = (0,\infty)^d$ with $\nabla F(x) = \log(x)$ ($\log(\cdot)$ is applied componentwise). Then, $\partial \cD = \{ x\in [0,\infty)^d \,:\, \exists i \, \text{ s.t. } x_i = 0\}$ and whenever $\{x_n\}_n \subset (0,\infty)$ is such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD$ as $n\to\infty$ then $\norm{\nabla F(x_n)} \to \infty$. Thus, $F$ is Legendre. As noted earlier, MD in this case is well defined, the computation implemented by MD is captured by the two steps \eqref{eq:mdunnorm}-\eqref{eq:mdproj} and \eqref{eq:xtingraddomain} also holds. In fact, a small computation shows that $x_{t+1}$ in this case satisfies

\begin{align*}

x_{t+1,i} = \frac{ \exp( – \eta g_{ti} ) x_{ti} }{\sum_j \exp( – \eta g_{tj} ) x_{tj} }

= \frac{\exp(-\eta \sum_{s=1}^t g_{si} ) }{\sum_j \exp(-\eta \sum_{s=1}^t g_{sj} ) }\,,

\end{align*}

which we can recognize as the exponential weights algorithm used for example in the previous post, as well as numerous other occasions.

Let $f:\dom(f) \to \R$ be convex, $y\in \dom(\nabla f)$ and let $\norm{\cdot}$ be a norm. As it was pointed out earlier $f$ is then lower bound by its first-order Taylor approximation about the point $y$:

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y }, \qquad \forall x\in \dom(f)\,.

\end{align*}

If we can add $\frac12 \norm{x-y}^2$ to the lower bound and it still remain a lower bound, i.e., when

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac12 \norm{x-y}^2, \qquad \forall x\in \dom(f)

\end{align*}

then we say that $f$ is **strongly convex** w.r.t. $\norm{\cdot}$ at $y$. When this holds for all $y\in \dom(\nabla f)$, we say that $f$ is strongly convex w.r.t. $\norm{\cdot}$. Often, strong convexity is defined with more flexibility when $\frac12$ above is replaced by $\frac{\alpha}{2}$ with some positive $\alpha>0$:

\begin{align*}

f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac{\alpha}2 \norm{x-y}^2, \qquad \forall x\in \dom(f)

\end{align*}

In this case, we say that $f$ is $\alpha$-strongly convex w.r.t. $\norm{\cdot}$ (at $y$), or that the strong convexity modulus of $f$ at $y$ is $\alpha$. Of course, the two definitions are the same when we scale the norm between the two of them. By reordering the above inequalities, we see that strong convexity can also be expressed as a lower bound on the Bregman-divergence induced by $f$. For example, the second inequality can be written as

\begin{align*}

D_f(x,y) \ge \frac{\alpha}2 \norm{x-y}^2, \quad \forall x\in \dom(f)\,.

\end{align*}

**Strong convexity, Hessians and matrix-weighted norms**: Let $F:\dom(F)\to \R$ be convex and twice differentiable over its domain. We claim that $F$ is strongly convex w.r.t. the (semi-)norm $\norm{\cdot}$ defined by $\norm{x}^2 \doteq x^\top G x$ where $G \preceq \nabla^2 F(y)$ for all $y\in \dom(F)$. To see this note that

\begin{align}

F(x) \ge F(y) + \ip{\nabla F(y),x-y} + \frac12 \norm{x-y}^2

\label{eq:sochessiansoc}

\end{align}

holds if and only if $g(\cdot) \doteq F(\cdot) – \frac12 \norm{\cdot-y}^2$ is convex. Indeed, $g$ is convex if and only if it is an upper bound of its linear approximations. Now, $g(x)\ge g(y) + \ip{\nabla g(y),x-y} = F(y) + \ip{\nabla F(x),x-y}$ and this is indeed equivalent to \eqref{eq:sochessiansoc}. Furthermore, $\nabla^2 g(x) = \nabla^2 F(x)-G$, hence $\nabla^2 g(x)\succeq 0$ by our assumption on $G$.

The price of bandit information on the unit ball is an extra $\sqrt{d}$ factor: On the unit ball the regret of MD under full information is $\Theta(\sqrt{n})$, while it is $\Theta(\sqrt{dn})$ (the lower bound can be shown using the standard proof techniques). Up to log factors, The same, that is, that the price of bandit information is $\sqrt{d}$, holds for the simplex, too. Is this true in general? The answer turns out to be no. The price of bandit information can be as high as $\Theta(d)$ and whether this happens depends on the shape of the action set $\cA$. However, when this happens is not very well understood at the moment.

# References

The results presented here are largely based on

- Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems“, Foundations and Trends in Machine Learning: Vol. 5: No. 1, pp 1-122.

The original reference is

- Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade:

Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. COLT 2012: 41.1-41.14

## Adversarial linear bandits

In the next few posts we will consider adversarial linear bandits, which, up to a crude first approximation, can be thought of as the adversarial version of stochastic linear bandits. The discussion of the exact nature of the relationship between adversarial and stochastic linear bandits is postponed until a later post. In the present post we consider only the the finite action case with $K$ actions.

# The model

The learner is given a finite action set $\cA\subset \R^d$ and the number of rounds $n$. An instance of the adversarial problem is a sequence of loss vectors $y_1,\dots,y_n$ where for each $t\in [n]$, $y_t\in \R^d$ (as usual in the adversarial setting, it is convenient to switch to losses). Our standing assumption will be that the scalar loss for any of the action is in $[-1,1]$:

Assumption (bounded loss): The losses are such that for any $t\in [n]$ and action $a\in \cA$, $| \ip{a,y_t} | \le 1$.

As usual, in each round $t$ the learner selects an action $A_t \in \cA$ and receives and observes a loss $Y_t = \ip{A_t,y_t}$. The learner does not observe the loss vector $y_t$ (if the loss vector is observed, then we call it the *full information setting*, but this is a topic for another blog). The regret of the learner after $n$ rounds is

\begin{align*}

R_n = \EE{\sum_{t=1}^n Y_t} – \min_{a\in \cA} \sum_{t=1}^n \ip{a,y_t}\,.

\end{align*}

Clearly the finite-armed adversarial bandit framework discussed in a previous post is a special case of adversarial linear bandits corresponding to the choice $\cA = \{e_1,\dots,e_d\}$ where $e_1,\dots,e_d$ are the unit vectors of the $d$-dimensional standard Euclidean basis. The strategy will also be an adaptation of the exponential weights strategy that was so effective in the standard finite-action adversarial bandit problem.

# Strategy

As noted, we will adapt the exponential weights algorithm. Like in that setting we need a way to estimate the individual losses for each action, but now we make use of the linear structure to share information between the arms and decrease the variance of our estimators. Let $t\in [n]$ be the index of the current round. Assuming that the loss estimates for action $a\in \cA$ are $(\hat Y_s(a))_{s < t}$, the distribution that the exponential weights algorithm proposes is

\begin{align*}

\tilde P_t(a) = \frac{ \exp\left( -\eta \hat \sum_{s=1}^{t-1} \hat Y_s(a) \right)}

{\sum_{a’\in \cA} \exp\left( -\eta \hat \sum_{s=1}^{t-1} \hat Y_s(a’) \right)}\,,

\qquad a\in \cA

\end{align*}

To control the variance of the loss estimates, it will be useful to mix this distribution with an **exploration distribution**, which we will denote by $\pi$. In particular, $\pi: \cA \to [0,1]$ with $\sum_a\pi(a)=1$. The mixture distribution is

\begin{align*}

P_t(a) = (1-\gamma) \tilde P_t(a) + \gamma \pi(a)\,,

\end{align*}

where $\gamma$ is a constant mixing factor to be chosen later. Then, in round $t$, the algorithm draws an action from $P_t$:

\begin{align*}

A_t \sim P_t(\cdot)\,.

\end{align*}

Recall that $Y_t = \ip{A_t,y_t}$ is the observed loss after taking action $A_t$. The next question is how to estimate $y_t(a) \doteq \ip{a,y_t}$? The idea will be to estimate $y_t$ with some vector $\hat Y_t\in \R^d$ and then letting $\hat Y_t(a) = \ip{a,\hat Y_t}$.

It remains to construct $\hat Y_t$. For this we will use the least-squares estimator $\hat Y_t = R_t A_t Y_t$, where $R_t$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. In particular, letting $\E_t[X] = \EE{ X \,|\, A_1,\dots,A_{t-1} }$, we calculate

\begin{align*}

\E_t [\hat Y_t ] = R_t \E_t [ A_t A_t^\top ] y = R_t \underbrace{\left(\sum_a P_t(a) a a^\top\right)}_{Q_t} y\,.

\end{align*}

Hence, using $R_t = Q_t^{-1}$, we get $\E_t [ \hat Y_t ] = y$. There is one minor technical detail, which is that $Q_t$ should be non-singular. This means that both the action set $\cA$ and the support of $P_t$ must span $\R^d$. To ensure this is true we will simply assume that $\cA$ spans $\R^d$ and eventually we will choose $P_t$ in such a way that its span is indeed $\R^d$. The assumption that $\cA$ spans $\R^d$ is non-restrictive, since if not we can simply adjust the coordinate system and reduce the dimension.

To summarize, in each round $t$ the algorithm estimates the unknown loss vector $\hat Y_t$ using the least-squares estimator, which is then used to directly estimate the loss for each action.

\begin{align*}

\hat Y_t = Q_t^{-1} A_t Y_t\,,\qquad \hat Y_t(a) = \ip{a,\hat Y_t}\,.

\end{align*}

Given the loss estimates all that remains is to apply the exponential weights algorithm with an appropriate exploration distribution and we will be done. Of course, the devil is in the details, as we shall see.

# Regret analysis

By modifying our previous regret proof and assuming that for each $t\in [n]$,

\begin{align}

\label{eq:exp2constraint}

\eta \hat Y_t(a) \ge -1, \qquad \forall a\in \cA\,,

\end{align}

then the regret is bounded by

\begin{align}

R_n \le \frac{\log K}{\eta} + 2\gamma n + \eta \sum_t \EE{ \sum_a P_t(a) \hat Y_t^2(a) }\,.

\label{eq:advlinbanditbasicregretbound}

\end{align}

Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates will be upper bounded by one. To get a regret bound it remains to set $\gamma, \eta$ so that \eqref{eq:exp2constraint} is satisfied, and we also need to bound $\EE{ \sum_a P_t(a) \hat Y_t^2(a) }$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat Y_t^2(a)$. Since

\begin{align*}

\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2 A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,

\end{align*}

we have $M_t = \sum_a P_t(a) \hat Y_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t\le A_t^\top Q_t^{-1} A_t$ and so

\begin{align*}

\E_t[ M_t ] \le \trace \left(\sum_a P_t(a) a a^\top Q_t^{-1} \right) = d\,.

\end{align*}

It remains to choose $\gamma$ and $\eta$. To begin, we strengthen \eqref{eq:exp2constraint} to $|\eta \hat Y_t(a) | \le 1$ and note that since $|Y_t| \leq 1$,

\begin{align*}

|\eta \hat Y_t(a) | = |\eta a^\top Q_t^{-1} A_t Y_t| \le \eta |a^\top Q_t^{-1} A_t|\,.

\end{align*}

Let $Q(\pi) = \sum_{\nu \in \cA} \pi(\nu) \nu \nu^\top$, then $Q_t \succ \gamma Q(\pi)$ and so by Cauchy-Schwartz we have

\begin{align*}

|a^\top Q_t^{-1} A_t| \leq \norm{a}_{Q_t^{-1}} \norm{A_t}_{Q_t^{-1}}

\leq \max_{\nu \in \cA} \nu^\top Q_t^{-1} \nu \leq \frac{1}{\gamma}\max_{\nu \in \cA} \nu^\top Q^{-1}(\pi) \nu\,,

\end{align*}

which implies that

\begin{align*}

|\eta \hat Y_t(a)| \leq \frac{\eta}{\gamma} \max_{\nu \in \cA} \nu^\top Q^{-1}(\pi)\nu\,.

\end{align*}

Since this quantity only depends on the action set and the choice of $\pi$, we can make it small by solving a *design problem*.

\begin{align}

\begin{split}

\text{Find a sampling distribution } \pi \text{ to minimize } \\

\max_{v \in \cA} v^\top Q^{-1}(\pi) v\,.

\end{split}

\label{eq:adlinbanditoptpbl}

\end{align}

If $D$ is the optimal value of the above minimization problem, the constraint \eqref{eq:exp2constraint} will hold whenever $\eta D \le \gamma$. This suggests choosing $\gamma = \eta D$ since \eqref{eq:advlinbanditbasicregretbound} is minimized if we choose a smaller $\gamma$. Plugging into \eqref{eq:advlinbanditbasicregretbound}, we get

\begin{align*}

R_n \le \frac{\log K}{\eta} +\eta n(2D+d) = 2 \sqrt{ (2D+d) \log(K) n }\,,

\end{align*}

where for the last equality we chose $\eta$ to minimize the upper bound. Hence, it remains to calculate, or bound $D$. In particular, in the next section we will show that $D\le d$ (in fact, equality also holds), which leads to the following result:

Theorem (Worst-case regret of adversarial linear bandits):

Let $R_n$ be the expected regret of the exponential weights algorithm as described above. Then, for an appropriate choice of $\pi$ and an appropriate shifting of $\cA$, if the bounded-loss assumption holds for the shifted action set,

\begin{align*}

R_n \le 2 \sqrt{3dn \log(K)}\,.

\end{align*}

## Optimal design, volume minimizing ellipsoids and John’s theorem

It remains to show that $D\le d$. For this we need to choose a norm and a distribution $\pi$ according to \eqref{eq:adlinbanditoptpbl}. Consider now the problem of choosing a distribution $\pi$ on $\cA$ to minimize

\begin{align}

g(\pi) \doteq \max_{v\in \cA} v^\top Q^{-1}(\pi) v\,.

\label{eq:gcrit}

\end{align}

This is known as the **$G$-optimal design problem** in statistics, studied by Kiefer and Wolfowitz in 1960. In statistics, more precisely, in optimal experiment design, $\pi$ would be called a “design”. A $G$-optimal design $\pi$ is the one that minimizes the maximum variance of least-squares predictions over the “design space” $\cA$ when the independent variables are chosen with frequencies proportional to the probabilities given by $\pi$ and the response follows a linear model with independent zero mean noise and constant variance. After all the work we have done for stochastic linear bandits, the reader hopefully does recognize that $v^\top Q^{-1}(\pi) v$ is indeed the variance of the prediction under a least-squares predictor.

Theorem (Kiefer-Wolfowitz, 1960): The following are equivalent:

- $\pi$ is a minimizer of $f(\pi)\doteq -\log \det Q(\pi)$;
- $\pi$ is a minimizer of $g$;
- $g(\pi) = d$.

So there we have it. $D = d$ (with equality!) and our proof of the theorem is complete. A disadvantage of this approach is that it is not very apparent from the above theorem what the optimal sampling distribution $\pi$ actually looks like. We now make an effort to clarify this, and to briefly discuss the computational issues.

Consider the first equivalent claim of the Kiefer-Wolfowitz result which concerns the problem of minimizing $-\log \det Q(\pi)$, which is also is known as the **$D$-optimal design problem** ($D$ after “determinant”). The criterion in $D$-optimal design can be given an information theoretic interpretation, but we will find it more useful to consider it from a geometric angle. For this, we will need to consider ellipsoids.

An **ellipsoid** $E$ with center zero (“central ellipsoid”) is simply the image of the unit ball $B_2^d$ under some nonsingular linear map $L$: $E = L B_2^d$ where for a set $S\subset \R^d$, $LS = \{Lx \,:x\in S\}$. Let $H= (LL^\top)^{-1}$. Then, for any vector $y\in E$, $L^{-1}y\in B_2^d$, or $\norm{L^{-1}y}_2^2 = \norm{y}_{H}^2 \le 1$. We shall denote the ellipsoid $LB_2^d$ by $E(0,H)$ (the $0$ signifies that the center of the ellipsoid is at zero). Now $\vol(E(0,H)) = |\det L| \vol(B_2^d) = \mathrm{const}(d)/\sqrt{\det H} = \exp(\mathrm{const}(d) – \frac12 \log \det H)$. Hence, minimizing $-\log \det Q(\pi)$ is equivalent to maximizing the volume of the ellipsoid $E(0,Q^{-1}(\pi))$. By convex duality, one can then show that this is exactly the same problem as minimizing the volume of the central ellipsoid $E(0,H)$ that contains $\cA$. Fritz John’s celebrated result, which concerns **minimum-volume enclosing ellipsoids** (MVEEs) with no restriction on their center, gives the missing piece:

John’s theorem (1948): Let $\cK\subset \R^d$ be convex, closed and assume that $\mathrm{span}(\cK) = \R^d$. Then there exists a unique MVEE of $\cK$. Furthermore, this MVEE is the unit ball $B_2^d$ if and only if there exists $m\in \N$ contact points (“the core set”) $u_1,\dots,u_m$ that belong to both $\cK$ and the surface of $B_2^d$ and there also exist positive reals $c_1,\dots, c_m$ such that

\begin{align}

\sum_i c_i u_i=0 \quad \text{ and } \quad \sum_i c_i u_i u_i^\top = I.

\label{eq:johncond}

\end{align}

Note that John’s theorem allows the center of the MVEE to be arbitrary. Again, by shifting the set $\cK$, we can always arrange for the center to be at zero. So how can we use this result? Let $\cK = \co(\cA)$ be the convex hull of $\cA$ and let $E = L B_2^d$ be its MVEE with some nonsingular $L$. Without loss of generality, we can shift $\cA$ to allow that the center of this MVEE to be at zero (why? actually, here we lose a factor of two: think of the bounded reward condition).

To pick $\pi$, note that the MVEE of $L^{-1} \cK$ is $L^{-1} L B_2^d = B_2^d$. Hence, by the above result, there exists $u_1,\dots,u_m\in L^{-1}\cK \cap \partial B_2^d$ and positive reals $c_1,\cdots,c_m$ such that \eqref{eq:johncond} holds. Note that $u_1,\dots,u_m$ must also be elements of $L^{-1} cA$ (why?). Therefore, $Lu_1,\dots,L u_m\in \cA \cap \partial E$. Thanks to the rotation property of the trace and that $\norm{u_i}_2 = 1$, taking the trace of the second equality in \eqref{eq:johncond} we see that $\sum_i c_i = d$. Hence, we can choose $\pi(L u_i) = c_i/d$, $i\in [m]$ and set $\pi(a)=0$ for all $a\in \cA \setminus \{ L u_1,\dots, L u_m \}$. Therefore

\begin{align*}

Q(\pi) = \frac{1}{d} L \left(\sum_i c_i u_i u_i^\top \right) L^\top = \frac{1}{d} L L^\top\,

\end{align*}

and so

\begin{align}

\sup_{v \in \cA} v^\top Q^{-1}(\pi) v

\leq \sup_{v\in E=LB_2^d} v^\top Q^{-1}(\pi) v

= d \sup_{u: \norm{u}_2\le 1} (L u)^\top (L L^\top)^{-1} (L u) = d\,.

\label{eq:ellipsoiddesign}

\end{align}

# Notes and departing thoughts

Note 1: In linear algebra, a **frame** of a vector space $V$ with an inner product can be seen as a generalization of the idea of a basis to sets which may be linearly dependent. In particular, given $0A<B<+\infty$, the vectors $v_1,\dots,v_m \in V$ is said to form an $(A,B)$-frame in $V$ if for any $x\in V$, $A\norm{x}^2 \le \sum_k |\ip{x,v_k}|^2 \le B \norm{x}^2$ where $\norm{x}^2 = \ip{x,x}$. Here, $A$ and $B$ are called the lower and upper frame bounds. When $\{v_1,\dots,v_m\}$ forms a frame, it must span $V$ (why?). However, $\span(\{v_1,\dots,v_m\}) = V$ is not a sufficient condition for $\{v_1,\dots,v_m\}$ to be a frame. The frame is called **tight** if $A=B$ and in this case the frame obeys a generalized

Parseval’s identity. If $A = B = 1$ then the frame is called Parseval or normalized. A frame can be used almost as a basis. John’s theorem can be seen as asserting that for any convex body, after an appropriate affine transformation of the body, there exist a tight frame inside the transformed convex body. Indeed, if $u_1,\dots,u_m$ are the vectors whose existence is guaranteed by John’s theorem and $\{c_i\}$ are the corresponding coefficients, $v_i = \sqrt{c_i/d}\, u_i$ will form a tight frame inside $\cK$. Given a frame $\{v_1,\dots,v_m\}$ and a given point $x\in V$ if we define $\norm{x}^2 \doteq \sum_k |\ip{x,v_k}|^2$, this is indeed a norm. For the frame coming from John’s theorem, for any $x\in \cK$, $\norm{x}^2\le 1$. Frames are widely used in harminoc (e.g., wavelet) analysis.

Note 2: The prediction with expert advice problem can also be framed as a linear prediction problem with changing action sets. In this generalization in every round the learner is given $M$ vectors, $a_{1t},\dots,a_{Mt}\in \R^d$, where $a_{it}$ is the recommendation of expert $i$. The environment’s choice for the round is $y_t\in \R^d$. The loss suffered by expert $i$ is $\ip{a_{it},y_t}$. The goal is to compete with the best expert in advice by selecting in every round based on past information one of the experts. If $A_t\in \{a_{1t},\dots,a_{Mt}\}$ is the vector recommended by the expert selected, the regret is

\begin{align*}

R_n = \EE{ \sum_{t=1}^n \ip{A_t,y_t} } – \min_{i\in [M]} \sum_{t=1}^n \ip{ a_{it}, y_t }\,.

\end{align*}

The algorithm discussed can be easily adapted to this case. The only change needed is that in every round one needs to choose a new exploration distribution $\pi_t$ that is adapted to the current “action set” $\cA_t = \{ a_{1t},\dots,a_{Mt} \}$. The regret bound proved still holds for this case. The setting of Exp4 discussed earlier corresponds to when $\cA_t$ is the subset of the $d$-dimensional simplex. The price of the increased generality of the current result is that while Exp4 enjoyed (in the notation of the current post) a regret of $\sqrt{2n (M \wedge d) \log(M) }$, here we would get on $O(\sqrt{n d \log(M)})$ with slightly higher constants.

Note 3 (Computation): The computation of the MVEE is a convex problem when the center of the ellipsoid is fixed and numerous algorithms have been developed for it, enjoying polynomial runtime guarantees exist. In general, the computation of the MVEE is hard. The approximate computation of optimal design is also widely researched; one can use, e.g., the so-called Franke-Wolfe algorithm for this purpose, which is known as Wynn’s method in optimal experimental design. John has also shown (implicitly) that the cardinality of the core set is at most $d(d+3)/2$.

# References

Using John’s theorem for guiding exploration in linear bandits was proposed by

- Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade:

Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. COLT 2012: 41.1-41.14

where the exponential weights algorithm adopted to this setting is called Expanded Exp, or Exp2. Our proof departs from the proofs given here in some minor ways (we tried to make the argument a bit more pedagogical).

The review work

- Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems“, Foundations and Trends in Machine Learning: Vol. 5: No. 1, pp 1-122

also discussed Exp2.

According to our best knowledge, the connection to optimal experimental design through the Kiefer-Wolfowitz theorem and the proof that solely relied on this result has not been pointed out in the literature beforehand, though the connection between the Kiefer-Wolfowitz theorem and MVEEs is well known. It is discussed, for example in the recent nice book of Michael J. Todd

- Michael J. Todd: Minimum-Volume Ellipsoids: Theory and Algorithms, MOS-SIAM Series on Optimization, 2016.

This book also discusses algorithmic issues, the mentioned duality.

The theorem of Kiefer and Wolfowitz is due to them:

- J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960

John’s theorem is due to John Fritz:

- F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays, presented to R. Courant on his 60th birthday, January 8, 1948, pages 187–204. Interscience, New York, 1948.

The duality mentioned in the text was proved by Siber:

- R. Sibson. Discussion on the papers by Wynn and Laycock. Journal of the Royal Statistical Society, 34:181–183, 1972.

See also:

- F. Gürtuna. Duality of ellipsoidal approximations via semi-infinite programming. SIAM Journal on Optimization, 20:1421–1438, 2009.

Finally, one need not find the *exact* optimal solution to the design problem. An approximation up to reasonable accuracy is sufficient for small regret. The issues of finding a good exploration distribution efficiently (even in infinite action sets) are addressed in