According to the main result of the previous post, given any finite action set $\cA$ with $K$ actions $a_1,\dots,a_K\in \R^d$, no matter how an adversary selects the loss vectors $y_1,\dots,y_n\in \R^d$, as long as the action losses $\ip{a_k,y_t}$ are in a known interval, the exponential weights algorithm with linear loss estimation and an exploration strategy that mixes the distribution of the exponential weights algorithm with a fixed exploration strategy chosen to maximize information given $\cA$, enjoys a regret of at most $O(\sqrt{d n \log(K)})$. When the action set is continuous, like the unit ball, one can of course discretize it by putting a fine enough net over it. Optimizing the fineness of the discretization level, for convex action sets with bounded volume, this discretization approach delivers a regret of $O(d \sqrt{ n \log(n) })$: a discretization accuracy of $\epsilon=1/\sqrt{n}$ is sufficient to guarantee that the additional regret due to the discretized action set is bounded by $\sqrt{n}$ at most, and the cardinality of the resulting action-set is $O((1/\epsilon)^d) = O(n^{d/2})$.

Can we do better than this? As we shall see, the answer is yes, at least for the unit ball: In fact we will give an algorithm which avoids any kind of discretization (discretization is great for proving theoretical results, but the exponential-sized action-sets are impractical to work with), and in particular, the algorithm will be computationally efficient, while it will be shown to enjoy an improved regret bound of $O(\sqrt{dn})$. The algorithm given is based on the so-called mirror descent algorithm from optimization and online learning, which is a very powerful algorithm. In fact, the exponential weights algorithm is a special case of mirror descent.

Online linear optimization and mirror descent

Mirror descent (MD) is an algorithm that originated from the convex optimization literature and has later been transported to the online learning literature. Here, we describe it in the online learning framework, more specifically, as an algorithm to solve online linear optimization problems.

In online linear optimization one is given a closed convex set $\cK\subset \R^d$ and the goal is to simultaneously compete with all elements of $\cK$ for all linear environments. an environment instance is just a sequence of vectors $g_1,\dots,g_n\in \R^d$. In round $t$, the learning algorithm chooses a point $x_t\in \cK$ from the constraint set (we assume a deterministic choice here), suffers the loss $\ip{g_t,x_t}$ and then observes $y_t\in \R^d$. The algorithm’s performance is measured using its regret. We will find it convenient to discuss the regret against some fixed competitor $x\in \cK$:
\begin{align*}
R_n(x) = \sum_{t=1}^n \ip{g_t,x_t} – \sum_{t=1}^n \ip{g_t,x}\,.
\end{align*}
Then, $R_n = \max_{x\in \cK} R_n(x)$ is the regret over $\cK$.

The basic version of MD has two extra parameters beyond $n$ and $\cK$: A learning rate $\eta>0$ and a convex, “distance generating function” $F: \cD \to \R$, where $\cD\subset \R^d$ is the domain of $F$. The function $F$ is also often called a “potential function”, or a “regularizer” for reasons that will become clear later.

In the first round MD predicts
\begin{align}
x_1 = \arg\min_{x\in \cK} F(x)\,.
\end{align}
In round $t\ge 1$, after $g_t$ is observed, the prediction $x_{t+1}$ for the next round is computed using
\begin{align}
x_{t+1} = \argmin_{x\in \cK \cap \cD} \eta \ip{g_t,x} + D_F(x,x_t)\,.
\label{eq:mdoneline}
\end{align}
Here, $D_F(x,x_t)$ is the so-called $F$-induced Bregman divergence of $x$ from $x_t$ defined by
\begin{align}
D_F(x,x_t) = F(x) – F(x_t) – \ip{ \nabla F(x_t), x-x_t }
\end{align}
with $\nabla F(x_t)$ denoting the derivative of $F$ at $x_t$ (i.e., $\nabla F: \mathrm{dom}(\nabla F) \to \R^d$). The minimization in \eqref{eq:mdoneline} is over $\cK\cap \cD$ because the domain of $D_F(\cdot,x_t)$ if the same as that of $F$.

To get a sense of the divergence function $D_F$, note that $D_F(\cdot,y)$ is the difference between $F(\cdot)$ and its first-order Taylor expansion, $F(y)+\ip{\nabla F(y),\cdot-y}$, about the point $y\in \mathrm{dom}(\nabla F)$. Since $F$ is convex, the linear approximation of $F$ is a lower bound on $F$ and so $D_F(\cdot,y)$ is nonnegative over its domain with $D_F(y,y)=0$ (see the figure below for an illustration). Furthermore, $D_F(\cdot,y)$ is convex as it is the sum of a convex and a linear function. Hence, \eqref{eq:mdoneline} is a convex optimization problem.

Since $D_F(x,y)$ is defined only when $y\in \mathrm{dom}(\nabla F)$, for the definition \eqref{eq:mdoneline} to make sense we need to have that $F$ is differentiable at all of $x_1,\dots,x_{t-1}$. For our general regret bound we will thus assume that the following holds:
\begin{align}
x_t\in \mathrm{dom}(\nabla F)\,, \qquad \forall t\in [n+1]\,.
\label{eq:xtingraddomain}
\end{align}
Note that $x_n,x_{n+1} \in \mathrm{dom}(\nabla F)$ is not needed for MD to be well defined on the horizon $n$; we included these as they will be convenient for our analysis (here, $x_{n+1}$ is defined by \eqref{eq:mdoneline} as expected). A sufficient condition for \eqref{eq:xtingraddomain} to hold is that $F$ is a turns out to be a Legendre function (the operations in the definition $F$ are defined in a componentwise manner).

Mirror maps

The update rule of MD aims to achieve two goals: making the loss of $x_{t+1}$ under $g_t$ small, while at the same time staying close to the previous estimate, thus buying stability for the algorithm. When $F(x) = \frac12 \norm{x}_2^2$ with $\cD = \R^d$ and $\R^d$, we get $\nabla F(x) = x$, $D_F(x,x_t) = \frac12 \norm{x-x_t}_2^2$ and the update rule just corresponds to gradient descent:
\begin{align}
x_{t+1} = x_t – \eta g_t\,.
\label{eq:mdgd}
\end{align}
More generally, when $\cK$ is compact nonempty convex set, the update rule corresponds to projected gradient descent: After computing the gradient update as in \eqref{eq:mdgd}, the resulting vector $\tilde x_{t+1}$, if it is lying outside of $\cK$, needs to be projected back to $\cK$ by calculating the point of $\cK$ which is closest to $\tilde x_{t+1}$ in the Euclidean metric.

This two-step computation works also with other appropriate distance generating functions $F$. In particular, when, for example, $F$ is Legendre, one can show that \eqref{eq:mdoneline} is equivalent to the following two-step method:
\begin{align}
\tilde{x}_{t+1} &= (\nabla F)^{-1}( \nabla F(x_t) – \eta g_t )\,, \label{eq:mdunnorm}\\
x_{t+1} &= \argmin_{x\in \cK\cap \cD} D_F(x,\tilde x_{t+1})\,.\label{eq:mdproj}
\end{align}
Here, $\tilde{x}_{t+1} = \argmin_{x\in \cD} \eta \ip{g_t,x} + D_F(x,x_t)$ is the “unconstrained” optimizer of the objective function minimized by MD. For this to be an unconstrained optimizer, what is needed is that $\nabla F$ blows up at the “boundary” of $\cD$ and that $\cD^\circ$, the interior of $\cD$ is nonempty and is the same as the domain of $\nabla F$, the defining properties of Legendre functions.

When $F$ is Legendre, one can show that $\tilde x_{t+1} \in \cD^\circ$. Thus, in this case differentiating the objective function, $\tilde x_{t+1}$ must satisfy
\begin{align}
0=\eta g_t + \nabla F(\tilde x_{t+1}) – \nabla F(x_t)\,.
\label{eq:uncopt}
\end{align}
from which one obtains \eqref{eq:mdunnorm}. The second step \eqref{eq:mdproj} takes the unconstrained optimizer and projects it to $\cK\cap \cD$. The update rules \eqref{eq:mdunnorm}-\eqref{eq:mdproj} explain the name of the algorithm: The algorithm operates in two spaces: The primal space where the predictions “live”, and the dual space, where the slopes, or gradients “live”. The name “dual” comes from the viewpoint that gradients, or derivatives are linear functionals over the primal space, which thus “live” in the space of linear functionals, which is also called the “dual”. With this language, the “mirror functions” $\nabla F$ and $(\nabla F)^{-1}$ map between subsets of the primal and dual spaces.

Back to the algorithm, from \eqref{eq:mdunnorm} we see that MD first takes the last prediction $x_t$ and maps it to the dual space through $\nabla F$, where it is combined with the observed loss vector $g_t$ (which has the nature of a gradient). This can be thought as the gradient update step. Next, the dual vector obtained is mapped back to the primal space via $(\nabla F)^{-1}$ to get $\tilde x_{t+1}$. Since this can lie outside of the constraint set, it is mapped back to it in the final step \eqref{eq:mdproj}. For the mathematically or computer-language minded, it should be gratifying that the loss vector $g_t$, a “dual type quantity” is combined with another “dual type quantity” rather than with a “primal quantity”, i.e., the “types” of these quantities are respected. In particular, the gradient update \eqref{eq:mdgd} gains new meaning through this: Here $x_t$ is really $\nabla F(x_t)$ for $F(x) = \frac12 \norm{x}_2^2$, which of course, happens to be equal to $x_t$. However, now we know that this is a mere coincidence!

However pleasing the thought of type-compatibility may be, it remains to be seen that the generality of MD will help. As we shall soon see the regret of MD depends on two quantities: the “diameter” of $\cK$ when measured via $F$, and the magnitudes of $g_t$ when measured in a metric compatible with $F$. One can then show that for many sets $\cK$, by choosing $F$ in an appropriate manner, MD becomes a dimension-free algorithm in the sense that its regret will be independent of the dimension $d$. This is a huge advantage when the dimension $d$ can be large, as is often the case in modern applications.

A specific example where choosing a non-Euclidean $F$ is beneficial is when $\cK$ is an $\ell^p$-ball. In this case, an appropriate $F$ gives dimension-free regret bounds, while using $F(x) = \norm12 \norm{x}_2^2$ does not share this quality (the reader is invited to check these after we state our general result). Another special case, mentioned earlier, is when $\cK$ is the simplex of $\R^d$, $F$ is the unnormalized negentropy, $\norm{g_t}_\infty\le 1$ (so that $|\ip{x,g_t}|\le 1$ for any $t$ and $x\in \cK$). While in this case the regret will not be dimension-free, the dimension dependence as compared to standard gradient descent is reduced from a linear dependence to a logarithmic dependence, an exponential improvement. On the top of this, the resulting algorithm, exponential weights, is trivial to implement, while a Euclidean projection to the simplex, is more complex to implement.

The regret of MD

We start with a simple result from convex analysis. As is well known, the minimizer $x^*$ of a differentiable function $\psi:\R^d \to \R$ must satisfy $\nabla \psi(x^*)=0$, which is known as the first-order optimality condition (first-order, because the first derivative of $\psi$ is taken). When $psi$ is convex, this is both a sufficient and necessary condition (this is what we like about convex functions). The first-order optimality condition can also be generalized to constrained minima. In particular, if $\cK\subset \R^d$ is a non-empty convex set and $\psi$ is as before then
\begin{align}
x^*\in \argmin_{x\in \cK} \psi(x) \Leftrightarrow \forall x\in \cK: \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.
\label{eq:firstorderoptfull}
\end{align}
The necessity of this condition is easy to understand by a geometric reasoning as shown on the picture on the right: Since $x^*$ is a minimizer of $\psi$ over $\cK$, $-\nabla \psi(x^*)$ must be the outer normal of the supporting hyperplane $H_{x^*}$ of $\cK$ at $x^*$ otherwise $x^*$ could be moved by a small amount while staying inside $\cK$ and improving the value of $\psi$. Since $\cK$ is convex, it thus lies entirely on the side of $H_{x^*}$ that $\nabla \psi(x^*)$ points into. This is clearly equivalent to \eqref{eq:firstorderoptfull}. The sufficiency of the condition also follows from this geometric viewpoint as the reader may verify.

The above statement continues to hold with a small modification even when $\psi$ is not everywhere differentiable. In particular, in this case the equivalence \eqref{eq:firstorderoptfull} holds for any $x^*\in \mathrm{dom}(\nabla \psi)$ with the modification that on both sides of the equivalence, $\cK$ should be replaced by $\cK \cap \dom(\psi)$:

Proposition (first-order optimality condition): Let $\psi:\dom(\psi)\to\R$ be a convex function, $\cK\not=\emptyset$, $\cK\subset \R^d$ convex. Then for any $x^*\in \dom(\nabla \psi)$, it holds that
\begin{align}
x^*\in \argmin_{x\in \cK\cap \dom(\psi)} \psi(x)\\
\Leftrightarrow \forall x\in \cK\cap \dom(\psi): \ip{ \nabla \psi(x^*),x-x^* } \ge 0\,.
\label{eq:firstorderopt}
\end{align}

With this we are ready to start bounding the regret of MD. As the potential function $F$ will be kept fixed, to minimize clutter we abbreviate $D_F$ to $D$ in the expressions below.

In round $t\in [n]$, the instantaneous regret of MD against $x\in \cK$ is $\ip{g_t,x_t-x}$. As $x_{t+1} = \argmin_{x\in \cK \cap \cD} \psi_t(x)$ where $\psi_t:\cD \to \R$ is given by $\psi_t(x) = \eta \ip{g_t,x} + D(x,x_t)$, the above first-order optimality condition immediately gives the following result:

Proposition (Instantaneous regret of MD): Let $x\in \cK \cap \cD$. Assume that \eqref{eq:xtingraddomain} holds. Then, for any $t\in [n]$,
\begin{align}
\begin{split}
\ip{g_t,x_t-x} & \le \frac{D(x,x_t)-D(x,x_{t+1})}{\eta} \\
& \qquad + \ip{g_t,x_t-x_{t+1}} – \frac{D(x_{t+1},x_t)}{\eta}\,.
\end{split}
\label{eq:mdir1}
\end{align}

Proof: Fix $t\in [n]$. By \eqref{eq:xtingraddomain}, $x_{t+1} \in \dom(\nabla \psi_t)$. Hence, we can apply the first-order optimality proposition to $x_{t+1}$ to get that
\begin{align*}
\ip{ \nabla \psi_t(x_{t+1}),x-x_{t+1} } \ge 0\,.
\end{align*}
Plugging in the definition of $\psi_t$ and lengthy but simple algebraic manipulations give \eqref{eq:mdir1}.
QED.

When summing up the instantaneous regret, the first term on the right-hand side of \eqref{eq:mdir1} telescopes and gives $\frac{D(x,x_1)-D(x,x_{n+1})}{\eta}$. Since $D$ is nonnegative, this can be upper bounded by $D(x,x_1)/\eta$. Further, thanks to the choice of $x_1$, using again the first-order optimality condition, we get that this is upper bounded by $(F(x)-F(x_1))/\eta$:
\begin{align}
\sum_{t=1}^n \frac{D(x,x_t)-D(x,x_{t+1})}{\eta}
\le \frac{F(x)-F(x_1)}{\eta}\,.
\end{align}

To bound the remaining terms in \eqref{eq:mdir1}, we may try to upper bound the inner product $\ip{g_t,x_t-x_{t+1}}$ here by bringing in a norm $\norm{\cdot}$ and its dual $\norm{\cdot}_*$:
\begin{align}
\ip{g_t,x_t-x_{t+1}} \le \norm{g_t}_* \norm{x_t-x_{t+1}}\,,
\label{eq:mdholder}
\end{align}
where we may recall that for a norm $\norm{\cdot}$ its dual $\norm{\cdot}_*$ is defined via $\norm{g}_* = \sup_{x: \norm{x}\le 1} \ip{g,x}$. To combine this with $-D(x_{t+1},x_t)$ is it beneficial if this can be lower bounded by some expression that involves the same norm. To get inspiration consider the case when $F(x) = \frac12 \norm{x}_2^2$. As noted earlier, in this case $D=D_F$ is given by $D_F(x,y) = \frac12 \norm{x-y}_2^2$. This gives the idea that perhaps we should assume that
\begin{align}
D_F(x,y) \ge \frac12 \norm{x-y}^2\,, \quad \forall (x,y)\in \dom(F)\times \dom(\nabla F)\,,
\label{eq:mddfsoc}
\end{align}
which is equivalent to requiring that $F$ is strongly convex w.r.t. $\norm{\cdot}$. To merge \eqref{eq:mdholder} and $-\frac{1}{2\eta} \norm{x_{t}-x_{t+1}}^2$, we may further upper bound the right-hand side of \eqref{eq:mdholder} via using $2ab \le a^2+b^2$. Choosing $a = \eta^{1/2} \norm{g_t}_*$ and $b = \eta^{-1/2} \norm{x_t-x_{t+1}}$, we get that
\begin{align}
\ip{g_t,x_t-x_{t+1}} – \frac{D(x_t,x_{t+1})}{\eta} \le \frac{\eta}{2} \norm{g_t}_*^2\,.
\end{align}
For increased generality (which will prove to be useful later) we may let the $\norm{\cdot}$ be chosen as a function of the round index $t$, leading to the following bound:

Theorem (Regret of MD): Let $\eta>0$, $F$ be a convex function with domain $\cD$. Assume that \eqref{eq:xtingraddomain} holds. Further, for each $t\in [n]$, let $\norm{\cdot}_{(t)}$ be a norm such that
\begin{align}
D_F(x_t,x_{t+1}) \ge \frac12 \norm{ x_t – x_{t+1} }^2_{(t)}
\label{eq:bregsoc}
\end{align}
and let $\norm{\cdot}_{(t)^*}$ be the dual norm of $\norm{\cdot}_{(t)}$. Then, the regret $R_n(x)$ of mirror descent against any $x\in \cK \cap \cD$ satisfies the bound
\begin{align}
R_n(x) \le
\frac{F(x)-F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \norm{g_t}_{(t)^*}^2\,.
\end{align}

While the theorem is presented for the case of a fixed sequence $\{g_t\}_t$, it is not hard to see that the same bound holds even if $g_t$ is chosen as a function $y_1,x_1,\dots,y_{t-1},x_t$, i.e., when the environment is “strategic”. This is entirely analogous to how the basic regret bound for exponential weights continues to hold in the face of strategically chosen losses.

As an example of how to use this theorem consider first the case of $\cK = B_2^d$ (the unit ball in $\R^d$ with respect to the $2$-norm) with $F(x) = \frac12 \norm{x}_2^2$. Note that $\cD = \dom(\nabla F) = \R^d$ and thus \eqref{eq:xtingraddomain} is automatically satisfied. We have $\diam_F(\cK) = \sup_{x\in \cK} F(x) – F(x_1)\le 1$. Choosing $\norm{\cdot}_{(t)} = \norm{\cdot}_2$, we have $\norm{\cdot}_{(t)^*} = \norm{\cdot}_2$. Assume that $\{g_t\}$ is leads to bounded losses and in particular, $|\ip{x,g_t}|\le 1$ for all $t\in [n]$ and $x\in \cK$. Note that this holds if and only if $\norm{g_t}_2\le 1$. Thus we get that the regret of MD (which is just the projected gradient descent algorithm in this case) satisfies
\begin{align}
R_n(x) \le \frac1\eta + \frac{\eta n}{2} = \sqrt{2n}\,,
\end{align}
where for the last step we set $\eta = \sqrt{\frac{2}{n}}$.

As a second example consider the case when $\cK$ is the unit simplex of $\R^d$: $\cK = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$. Note that $\cK$ lies in a $d-1$-dimensional hyperplane of $\R^d$. In this case, choosing $F(x) = x\log (x) – x$, the unnormalized negentropy function, $\cD = [0,\infty)^d$ and we find that $x_1 = (1/d,\dots,1/d)$, since $F$ is negative valued, $\diam_F(\cK) = F(x) – F(x_1) \le -F(x_1) =\log(d)$. Further, $D_F(x,y) = \KL(x,y)=\sum_i x_i \log(x_i/y_i)$ is the relative entropy of $x$ given $y$, which is known to satisfy
\begin{align}
D_F(x,y) \ge \frac12 \norm{x-y}_1^2\,
\label{eq:pinsker}
\end{align}
(this inequality is known as Pinsker’s inequality). Hence, we choose $\norm{\cdot}_{(t)} = \norm{\cdot}_1$ and thus $\norm{\cdot}_{(t)^*} = \norm{\cdot}_\infty$. In this case the assumption that the losses are properly normalized imply that $\norm{g_t}_\infty \le 1$. Putting things together we get that the regret of MD satisfies
\begin{align}
R_n(x) \le \frac{\log d}{\eta} + \frac{\eta n}{2} = \sqrt{2 \log(d) n }\,,
\end{align}
which matches the regret bound we derived earlier for the exponential weights algorithm. Curiously, MD with this choice is in fact the exponential weights algorithm. What is more, there is a one-to-one map of the steps of the two respective proofs, as the reader may verify.

What would have happened if we used $F(x) = \frac12 \norm{x}_2^2$ with $\norm{\cdot}_{(t)} = \norm{\cdot}_2$ instead of the unnormalized negentropy function? While in this case $\diam_F(\cK) \le 1$, $\norm{g_t}_2^2$ could be as large as $d$ (e.g., $g_t = (1,\dots,1)$), which gives $R_n(x) \le \sqrt{2 d n }$. In fact, this is real: MD with this choice can suffer a much larger regret than MD with the unnormalized negentropy potential. Thus we see that at least in this case the increased generality of MD allows a nontrivial improvement of the regret.

The reader is also invited to check the regret bounds that follow with various potential functions when $\cK$ is the unit $\ell^p$ ball, i.e., $\cK = \{ x\,: \norm{x}_p \le 1 \}$ where $\norm{x}_p$ is defined through $\norm{x}_p^p = \sum_i |x_i|^p$, while the loss per round in still in $[-1,1]$. For example, what is the regret for $F(x) = \frac12 \norm{x}_2^d$, or $F(x) = \frac12 \norm{x}_p^2$? For the calculation it may be useful to know that $F$ is $(p-1)$ strongly convex with respect to $\norm{\cdot}_p$, as the reader may also verify.

From these examples we see that two things matter when using MD with some potential function $F$ and a norm $\norm{\cdot}$: the magnitude of $\diam_F(\cK)$ and that of $G = \sup \{ \norm{g}_* \,: \sup_{x\in \cK} |\ip{g,x}| \le 1 \}$. It follows that if $\cK$ is a subset of the unit ball underlying $\norm{\cdot}$, $\norm{g}_* = \sup_{x:\norm{x}\le 1} |\ip{g,x}| \le \sup_{x\in \cK} |\ip{g,x}| \le 1$. Thus, for a subset of unit balls $G\le 1$ regardless the choice of $F$. Thus, in this case the choice of $F$ boils down to minimizing the diameter of $\cK$ under $F$ subject to the constraint \eqref{eq:bregsoc}.

Our next goal will be to design an algorithm based on MD when the action set $\cA$ is the $d$-dimensional unit ball $B_2^d$. It turns out that for this it will be useful to further upper bound the regret of MD. In particular, this way we will get a bound that will be easier to bound than bounding the regret directly. For this, we will assume that $F:\cD\to \R$ is Legendre with a “large enough” domain.

Theorem (Regret for MD with Legendre potential): Let $F:\cD \to \R$ be Legendre such that $\cK \subset \bar \cD$. For any $x\in \cK$, the regret $R_n(x)$ of MD against $x$ satisfies
\begin{align}
R_n(x) \le
\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_F(x_t,\tilde x_{t+1}) \,.
\end{align}

Proof: Note that \eqref{eq:xtingraddomain} is satisfied because $\bar \cD \subset \cK$ and $F$ is Legendre. We start from \eqref{eq:mdir1}. Earlier, we argued that the sum of first terms on the right-hand side of this is bounded by $\frac{F(x)-F(x_1)}{\eta}$. Hence, it remains to bound the remaining terms. We claim that
\begin{align*}
\ip{g_t,x_t-x_{t+1}} – \frac{D_F(x_{t+1},x_t)}{\eta}
= D_F(x_t,\tilde x_{t+1}) – D_F(x_{t+1},\tilde x_{t+1})\,,
\end{align*}
from which the results follow since $D_F$ is nonnegative. Now, the above identity can be shown using some algebra and the optimality property of $x_{t+1}$, namely \eqref{eq:uncopt}.
QED.

The reason this result is useful is because $D_F(x_t,\tilde x_{t+1})$ can also be written as the Bregman divergence of $\nabla F(\tilde x_{t+1})$ from $\nabla F(x_t)$ with respect to what is known as the Legendre-Fenchel dual $F^*$ of $F$ and oftentimes this divergence is easier to bound than the expression we had previously. The Legendre-Fenchel dual of $F$ is defined via $F^*(g) \doteq \sup_{x\in \cD} \ip{g,x} – F(x)$. Now, the identity mentioned states that for any $F$ Legendre and any $x,x’ \in \dom(\nabla F)$,
\begin{align}
D_F(x,x’) = D_{F^*}(\nabla F(x’), \nabla F(x))\,.
\label{eq:dualbregman}
\end{align}
In particular, this also means that
\begin{align}
R_n(x) \le
\frac{F(x)-F(x_1)}{\eta} + \frac{1}{\eta}\, \sum_{t=1}^n D_{F^*}(\nabla F(\tilde x_{t+1}), \nabla F( x_{t}) ) \,.
\end{align}
For calculations it is often easier to obtain $F^*$ from the identity $\nabla F^*(g) = (\nabla F)^{-1}(g)$, which holds for all $g\in \dom(\nabla F^*)$. This means that one can find $F^*$ by calculating the primitive function of $(\nabla F)^{-1}$.

Linear bandits on the unit ball

In accordance with the notation of the previous post, the sequence of losses will be denoted by $y_t$ (and not $g_t$ as above) and we shall assume that
\begin{align}
\sup_{t,a\in \cA} |\ip{y_t,a}| \le 1\,.
\label{eq:mdboundedloss}
\end{align}
Per our discussion this is equivalent to $\max_t \norm{y_t}_2\le 1$.

To use MD we will use randomized estimates of the losses $\{y_t\}$. In particular, in round $t$ the bandit algorithm will choose a random action $A_t\in \cA$ based on which an estimate $\tilde Y_t$ of $y_t$ will be constructed using the observed loss $Y_t = \ip{y_t,A_t}$.

Let $F$ be a convex function on $\cD\subset \R^d$ to be chosen later, let $X_1=x_1= \argmin_{x\in \cA\cap \cD} F(x)$, and for $t\in [n-1]$ let $X_{t+1}$ be the output of MD when used with learning rate $\eta$ on the input $\tilde Y_t$:
\begin{align*}
X_{t+1} = \argmin_{x\in \cA\cap \cD} \eta \ip{\tilde Y_t, x} + D_F(x,x_t)\,.
\end{align*}
(The upper case letters signify that $X_t$ is random, owing to that $A_1,\dots,A_{t-1}$, hence $\tilde Y_1,\dots,\tilde Y_{t-1}$ are random.) We will make sure that $F$ is such that the minimum here can be taken over $\cA$. For uniformity, we also use an uppercase letter to denote the first choice $X_1$ of MD. As long as
\begin{align}
\EE{ A_t|X_t } = X_t\,, \qquad \text{ and } \qquad \EE{ \tilde Y_t| X_t } = y_t
\label{eq:md_atyt}
\end{align}
hold for all $t\in [n]$, the regret $R_n(x)= \EE{ \sum_{t=1}^n \ip{y_t,A_t} – \ip{y_t,x} }$ of this algorithm satisfies
\begin{align*}
R_n(x)
& = \EE{ \sum_{t=1}^n \ip{y_t,X_t} – \ip{y_t,x} }
= \EE{ \sum_{t=1}^n \ip{\tilde Y_t,X_t} – \ip{\tilde Y_t,x} }\,.
\end{align*}
Notice that the last expression inside the expectation is the (random) regret of mirror descent on the (recursively constructed) sequence $\tilde Y_t$. Hence, when the conditions of our theorem are satisfied, for any $x\in \cA \cap \cD$, for $F$ Legendre,
\begin{align}
\label{eq:mdlinbandit}
R_n(x) \le \frac{F(x)-F(x_1)}{\eta } +
\frac{1}{\eta} \sum_{t=1}^n \EE{ D_{F^*}( \nabla F(\tilde X_{t+1}), \nabla F(X_t) ) }\,.
\end{align}
Thus, it remains to choose $A_t,\tilde Y_t$ and $F$ with domain $\cD$ such that $\cK \subset \bar \cD$.

Unsurprisingly, $A_t$ and $\tilde Y_t$ will be chosen similarly to what was done in the previous post. In particular, $\tilde Y_t$ will be a linear estimator with a small twist that will simplify some calculations, while $A_t$ will be selected at random to be either $X_t$ or an exploratory action $U_t$ drawn randomly from some exploration distribution $\pi$ supported on $\cA$. The calculation of the previous post suggests that a $G$-optimal exploration distribution should be used. For the unit ball, it turns out that we can simply choose $\pi$ as the uniform distribution on $\{ \pm e_i \}$ where $\{e_i\}_i$ is the standard Euclidean basis. That this is $G$-optimal follows immediately from the Kiefer-Wolfowitz theorem.

To formally define $A_t$, let $P_t\in (0,1]$ be the probability of exploring in round $t$ ($P_t$ will be chosen appropriately later), let $E_t \sim \mathrm{Ber}(P_t)$, $U_t\sim \pi$ such that $E_t$ and $U_t$ are independent given the information available at the time when they are chosen. Then, to satisfy $\EE{A_t|X_t} = X_t$, we can set
\begin{align}
A_t = E_t U_t + \frac{1-E_t}{1-P_t} X_t\,.
\end{align}
Since $A_t\in \cA = B_2^d$ must hold this gives the constraint that
\begin{align}
P_t \le 1-\norm{X_t}_2 \,.
\label{eq:ptconstraint}
\end{align}
This also means that one must somehow ensure that $\norm{X_t}_2<1$ holds for all $t\in [n]$. One simple way to enforce this is to replace the domain used in MD by $(1-\gamma) \cA$ for some small $\gamma>0$. Since this introduces at most a factor of $\gamma n$ in the regret, as long as $\gamma$ is small enough, the additional regret due to this modification will be negligible.

Having chosen $A_t$, it remains to construct $\tilde Y_t$. Consider
\begin{align*}
\tilde Y_t = Q_t^{-1} E_t A_t Y_t\,, \quad Q_t = \E_t{ E_t A_t A_t^\top }\,,
\end{align*}
where $\E_t{Z} = \EE{Z| X_1,A_1,Y_1,\dots,X_{t-1},A_{t-1},Y_{t-1},X_t }$. Note that this is a slightly modified version of the linear estimator and thus $\E_t{\tilde Y_t} = y_t$. The modification is that in this estimator $A_t$ is multiplied by $E_t$, which means that the estimate is $0$ unless an exploratory action is chosen. While this is not necessary, it simplifies the calculations and in fact slightly improves the bound that we derive. In particular, we have $E_t A_t = U_t$ and hence $Q_t = \frac{P_t}{d} I$ and thus
\begin{align*}
\tilde Y_t = \frac{d\, E_t}{P_t} U_t U_t^\top y_t\,.
\end{align*}

It remains to choose $F$ and the norms. One possibility would be $F(x) = \frac12 \norm{x}_2^2$ which is an appealing choice both because it leads to simple calculations and also because in the full information setting (i.e., when $y_t$ is observed in every round) this choice enjoyed a dimension-free regret bound. It is easy to check that $F=F^*$, hence $D_{F^*}(u,v) = \frac12 \norm{u-v}_2^2$. Hence,
\begin{align}
\frac1\eta \E_t D_{F^*}( \nabla F( \tilde X_{t+1} ), \nabla F( X_t ) )
&=
\frac{\eta}{2} \E_t[ \norm{\tilde Y_t}_2^2 ]
=
\frac{d^2}{P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top U_t U_t^\top y_t ]\nonumber \\
&=
\frac{d^2}{P_t^2} \E_t[ E_t y_t^\top U_t U_t^\top y_t ]
=
\frac{d}{P_t} \norm{y_t}_2^2 \le \frac{d}{P_t}\,,
\label{eq:twonormbound}
\end{align}
where the first equality used identity \eqref{eq:uncopt}. To finish, one could choose $P_t = \gamma$ and set the domain of MD to $(1-\gamma) \cA$, but even with these modifications, the best bound that can be derived scales only with $n^{2/3}$, which is not what we want.

Hence, if the other parameters of the construction are kept fixed, we need to go beyond the Euclidean potential. The potential we consider is
\begin{align*}
F(x) = -\log(1-\norm{x}) – \norm{x}\,, \quad x\in \cD = \{ x’\,:\, \norm{x’}<1 \}\,. \end{align*} This can easily be seen Legendre. Further, $\nabla F(x) = \frac{x}{1-\norm{x}}$, and also $F^*(g)=-\log(1+\norm{g})+\norm{g}$. Using that for $x\ge -1/2$, $\log(1+x)\ge x-x^2$, one can show that for \begin{align} \frac{\norm{g}-\norm{h}}{1+\norm{h}}\ge -1/2\,, \label{eq:ghapart} \end{align} it holds that \begin{align} D_{F^*}(g,h) & = \frac{1}{1+\norm{h}}\norm{g-h}^2\,. \end{align} Plugging in $g=\nabla F(\tilde x_{t+1})$ and $h=\nabla F(x_t)$, we see that \eqref{eq:ghapart} is satisfied, hence, thanks to $1/(1+\norm{\nabla F(X_t)}) = 1-\norm{X_t}$, after some calculation we get \begin{align*} \E_t D_{F^*}( \nabla F(\tilde X_{t+1}),\nabla F(x_t)) \le d\,. \end{align*} Putting together things, again running MD on $(1-\gamma)B_2^d$ and choosing $\eta$ and $\gamma$ appropriately, we get that \begin{align*} R_n(x)\le 3 \sqrt{dn\log(n)}\,. \end{align*} Surprisingly, this is smaller than the regret that we got for stochastic bandits on the unit ball by a factor of at least $\sqrt{d}$. However, it will be the subject of the next post to discuss how this could happen.

Notes and departing thoughts

Computational complexity of MD: Usually, \eqref{eq:mdunnorm} is trivial to implement, while \eqref{eq:mdproj} can be quite expensive. However, in some important special cases, the projection can be computed efficiently, even in closed form, e.g., for the simplex and the negentropy potential, or the Euclidean unit ball and the Euclidean potential $F(x) = \frac12\norm{x}_2^2$. In general, one can of course use a convex solver to implement this step, but then one can also use the convex solver to find an approximate optimizer of \eqref{eq:mdoneline} directly.

Legendre functions: A strictly convex map $F:\cD \to \R$ is Legendre if the interior $\cD^\circ$ of $\cD$ is nonempty, $\mathrm{dom}(\nabla F) = \cD^\circ$ ($F$ is differentiable over the interior of its domain) and $\nabla F(x)$ diverges as $x$ approaches the boundary of $\cD$. Formally, this last condition means that for any $\{x_n\}_n \subset \mathrm{dom}(\nabla F)$ such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD = \bar \cD \setminus \cD^\circ$, $\norm{ \nabla F(x_n) } \to \infty$ as $n\to\infty$. Note that since in a finite-dimensional space all norms are equivalent, it does not matter which norm is used in the above definition. One can show that if the distance generating map $F$ is Legendre then the sequence $\{x_t\}_{t\in [n]}$ computed by MD always satisfies \eqref{eq:mdunnorm}-\eqref{eq:mdproj}, and \eqref{eq:xtingraddomain} also holds.

Simplex, negentropy, exponential weights: A canonical example that is instructive to illustrate the concept of Legendre functions is when $\cK = \Delta_d = \{ x\in [0,1]^d\,:\, \sum_i x_i = 1\}$ is the simplex embedded into $\R^d$ and $F(x) = \sum_i x_i \log(x_i)-x_i$ is the unnormalized negentropy function. Here, the domain $\cD$ of $F$ is $\cD = [0,\infty)^d$ where we use the convention that $0 \log(0)=0$ (which is sensible since $\lim_{s\to 0+} s \log(s) = 0$). Then it is easy to see that $\dom(\nabla F) = (0,\infty)^d$ with $\nabla F(x) = \log(x)$ ($\log(\cdot)$ is applied componentwise). Then, $\partial \cD = \{ x\in [0,\infty)^d \,:\, \exists i \, \text{ s.t. } x_i = 0\}$ and whenever $\{x_n\}_n \subset (0,\infty)$ is such that either $\norm{x_n}\to\infty$ or $x_n \to \partial \cD$ as $n\to\infty$ then $\norm{\nabla F(x_n)} \to \infty$. Thus, $F$ is Legendre. As noted earlier, MD in this case is well defined, the computation implemented by MD is captured by the two steps \eqref{eq:mdunnorm}-\eqref{eq:mdproj} and \eqref{eq:xtingraddomain} also holds. In fact, a small computation shows that $x_{t+1}$ in this case satisfies
\begin{align*}
x_{t+1,i} = \frac{ \exp( – \eta g_{ti} ) x_{ti} }{\sum_j \exp( – \eta g_{tj} ) x_{tj} }
= \frac{\exp(-\eta \sum_{s=1}^t g_{si} ) }{\sum_j \exp(-\eta \sum_{s=1}^t g_{sj} ) }\,,
\end{align*}
which we can recognize as the exponential weights algorithm used for example in the previous post, as well as numerous other occasions.

Let $f:\dom(f) \to \R$ be convex, $y\in \dom(\nabla f)$ and let $\norm{\cdot}$ be a norm. As it was pointed out earlier $f$ is then lower bound by its first-order Taylor approximation about the point $y$:
\begin{align*}
f(x) \ge f(y) + \ip{ \nabla f(y), x-y }, \qquad \forall x\in \dom(f)\,.
\end{align*}
If we can add $\frac12 \norm{x-y}^2$ to the lower bound and it still remain a lower bound, i.e., when
\begin{align*}
f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac12 \norm{x-y}^2, \qquad \forall x\in \dom(f)
\end{align*}
then we say that $f$ is strongly convex w.r.t. $\norm{\cdot}$ at $y$. When this holds for all $y\in \dom(\nabla f)$, we say that $f$ is strongly convex w.r.t. $\norm{\cdot}$. Often, strong convexity is defined with more flexibility when $\frac12$ above is replaced by $\frac{\alpha}{2}$ with some positive $\alpha>0$:
\begin{align*}
f(x) \ge f(y) + \ip{ \nabla f(y), x-y } + \frac{\alpha}2 \norm{x-y}^2, \qquad \forall x\in \dom(f)
\end{align*}
In this case, we say that $f$ is $\alpha$-strongly convex w.r.t. $\norm{\cdot}$ (at $y$), or that the strong convexity modulus of $f$ at $y$ is $\alpha$. Of course, the two definitions are the same when we scale the norm between the two of them. By reordering the above inequalities, we see that strong convexity can also be expressed as a lower bound on the Bregman-divergence induced by $f$. For example, the second inequality can be written as
\begin{align*}
D_f(x,y) \ge \frac{\alpha}2 \norm{x-y}^2, \quad \forall x\in \dom(f)\,.
\end{align*}

Strong convexity, Hessians and matrix-weighted norms: Let $F:\dom(F)\to \R$ be convex and twice differentiable over its domain. We claim that $F$ is strongly convex w.r.t. the (semi-)norm $\norm{\cdot}$ defined by $\norm{x}^2 \doteq x^\top G x$ where $G \preceq \nabla^2 F(y)$ for all $y\in \dom(F)$. To see this note that
\begin{align}
F(x) \ge F(y) + \ip{\nabla F(y),x-y} + \frac12 \norm{x-y}^2
\label{eq:sochessiansoc}
\end{align}
holds if and only if $g(\cdot) \doteq F(\cdot) – \frac12 \norm{\cdot-y}^2$ is convex. Indeed, $g$ is convex if and only if it is an upper bound of its linear approximations. Now, $g(x)\ge g(y) + \ip{\nabla g(y),x-y} = F(y) + \ip{\nabla F(x),x-y}$ and this is indeed equivalent to \eqref{eq:sochessiansoc}. Furthermore, $\nabla^2 g(x) = \nabla^2 F(x)-G$, hence $\nabla^2 g(x)\succeq 0$ by our assumption on $G$.

The price of bandit information on the unit ball is an extra $\sqrt{d}$ factor: On the unit ball the regret of MD under full information is $\Theta(\sqrt{n})$, while it is $\Theta(\sqrt{dn})$ (the lower bound can be shown using the standard proof techniques). Up to log factors, The same, that is, that the price of bandit information is $\sqrt{d}$, holds for the simplex, too. Is this true in general? The answer turns out to be no. The price of bandit information can be as high as $\Theta(d)$ and whether this happens depends on the shape of the action set $\cA$. However, when this happens is not very well understood at the moment.

References

The results presented here are largely based on

Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems“, Foundations and Trends in Machine Learning: Vol. 5: No. 1, pp 1-122.

The original reference is

Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade:
Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. COLT 2012: 41.1-41.14

Adversarial linear bandits and the curious case of the unit ball