# Applications and origin

> or would you pull the right arm a few more times, thinking that maybe the initial poorer average payoff is due to “bad luck” only?

You could illustrate this notion of “bad luck” by plotting an example probability distribution for each arm, together with where the first 5 pulls fall.
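To make that suggestion concrete, here is a minimal sketch of the kind of experiment such a plot would show. Everything in it is assumed for illustration (two Gaussian arms, means 0.5 and 0.6, unit noise): it estimates how often the truly better arm *looks* worse after only 5 pulls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed setup: the "right" arm is better in expectation,
# but with only 5 pulls its sample mean can easily look worse.
means = {"left": 0.5, "right": 0.6}  # true means (assumed for illustration)
sigma = 1.0                          # common noise level (assumed)

n_experiments = 10_000
n_pulls = 5
unlucky = 0
for _ in range(n_experiments):
    right = rng.normal(means["right"], sigma, n_pulls).mean()
    left = rng.normal(means["left"], sigma, n_pulls).mean()
    if right < left:  # "bad luck": the better arm looks worse after 5 pulls
        unlucky += 1

print(f"P(right arm looks worse after {n_pulls} pulls) ~ {unlucky / n_experiments:.2f}")
```

With these (made-up) parameters the better arm looks worse a large fraction of the time, which is exactly the “bad luck” the quoted sentence is about.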

# The language of bandits

> For a fixed environment, the worst-case regret is just a shift and a sign change of the total reward: Maximizing reward is the same as minimizing regret.

I’ve struggled with this, probably because I cannot understand the regret intuitively. What’s the point of looking at what happens when we keep taking the same action?

You could make this sentence clearer by being more explicit: “For a deterministic environment, the total reward gained when $a$ is used for all $n$ rounds is constant, so the worst-case regret is just…”.
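The suggested rewording can be backed by the one-line calculation it summarizes. Writing $x_k$ for the per-round reward of action $k$ in a deterministic environment (this notation is mine, chosen just for the sketch):

```latex
% Regret of playing the fixed action a for all n rounds in a
% deterministic environment with per-round rewards x_1, ..., x_K:
R_n(a) = n \max_k x_k - n x_a
```

Since $n \max_k x_k$ does not depend on $a$, maximizing the total reward $n x_a$ is the same as minimizing $R_n(a)$: a shift and a sign change.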

> A good learner achieves sublinear worst-case regret: That is, if $R_n$ is the regret after $n$ rounds of interaction of a learner, $R_n = o(n)$.

I don’t understand this. Is it a definition of a good learner?
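One way to see what the $R_n = o(n)$ condition rules out is to compare regret growth curves. Below is a hedged sketch (the Bernoulli arms, horizon, and the use of UCB1 as the “good learner” are all my choices, not the post’s): a uniformly random player accrues regret linearly, while UCB1's regret flattens out.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.3, 0.5, 0.7])  # assumed Bernoulli arms, for illustration
K, n = len(means), 20_000

def regret_random():
    # Uniformly random pulls: pseudo-regret grows linearly, ~ c * n.
    pulls = rng.integers(0, K, size=n)
    return (means.max() - means[pulls]).cumsum()

def regret_ucb1():
    # UCB1 as an example of a learner with sublinear (logarithmic) regret.
    counts = np.zeros(K)
    sums = np.zeros(K)
    reg = np.zeros(n)
    total = 0.0
    for t in range(n):
        if t < K:
            a = t  # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * math.log(t + 1) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.random() < means[a]  # Bernoulli reward
        total += means.max() - means[a]
        reg[t] = total
    return reg

r_rand, r_ucb = regret_random(), regret_ucb1()
print("random:", r_rand[-1], " ucb1:", r_ucb[-1])
```

A learner is “good” in the quoted sense exactly when its curve is in the second camp: $R_n / n \to 0$.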

> The astute reader may notice that regret compares payoffs of fixed actions

The term “fixed” is ambiguous here, since above you used it to mean “deterministic” (“For a fixed environment, the worst-case regret is just a shift and a sign change of the total reward”).

> In particular, in a non-stationary environment

Sounds risky to assume the reader is familiar with this notion of stationarity.

Thanks!

Thanks for the feedback; you are indeed right. I changed $a$ to $k$ in the definition of $D_t$.

Cheers,

Csaba

Here’s a quick comment. In the “Stochastic Linear Bandits” section, the regret equation contains the expression $a \in D_t$. However, just under it, we see that the definition of $D_t$ is:

$D_t = \{\phi(c_t, a) : a \in [K]\}$

I feel that using ‘a’ as a dummy variable here twice decreases the readability. If that was intentional and I’m misunderstanding something, please let me know!

“Indeed, since $i$ is suboptimal in $\nu$ and $i$ is optimal in $\nu’$ and for any $j\neq i$, $\Delta_j'(\nu) \geq \lambda - \Delta_i$”

It should be (the prime in the last inequality is on the wrong symbol):

“Indeed, since $i$ is suboptimal in $\nu$ and $i$ is optimal in $\nu’$ and for any $j\neq i$, $\Delta_j(\nu') \geq \lambda - \Delta_i$”

– Csaba

Maybe it might be corrected like this: $\tilde L_n = \sum_{t=1}^n Y_t$

The choice of f(t) is very delicate and one has to be careful about comparisons when the underlying variance is different. When the noise is Gaussian with variance V you generally want the confidence interval to look like

sqrt(2V / T_k log f(t))

where f(t) = t + O(log^p(t)) for as small a p as your analysis will allow. Actually, for the Gaussian case you may use f(t) = t, but even for the subgaussian case we do not know if this works. Roughly speaking, things are easy if sum_t 1/f(t) converges.

Now in Shipra’s notes the rewards are bounded in [0,1]. Of course they cannot be Gaussian, but the maximum variance (or subgaussian constant) of [0,1]-bounded rewards is 1/4. If we substitute this into the formula above you get a confidence interval of

sqrt(2(1/4) / T_k log f(t)) = sqrt(1/(2T_k) log f(t))

Choosing f(t) = t^2 yields the choice in those notes and this is definitely summable. In general, if what you care about is expected regret, then you want f(t) to be “as small as possible”, but you will pay a price for this in terms of the variance of the regret, so take care. Finally, there are lots of more sophisticated choices. For example the arm-dependent confidence interval

sqrt(2V/T_k log(t/T_k))

will give you a big boost. I recently wrote a big literature review on all these choices. Feel free to email me if you want a copy.
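A small sketch of the widths just discussed, for comparison. V = 1/4 follows the comment (worst-case subgaussian constant for [0,1]-bounded rewards); the particular values of t and T_k are made up for illustration.

```python
import math

def width(T_k, t, V=0.25, f=lambda t: t ** 2):
    # sqrt(2V / T_k * log f(t)), with the default f(t) = t^2
    return math.sqrt(2 * V / T_k * math.log(f(t)))

def width_arm_dependent(T_k, t, V=0.25):
    # the arm-dependent choice: log(t / T_k) instead of log f(t)
    return math.sqrt(2 * V / T_k * math.log(t / T_k))

t, T_k = 10_000, 100                 # made-up round count and pull count
print(width(T_k, t))                 # f(t) = t^2
print(width(T_k, t, f=lambda t: t))  # f(t) = t, fine for Gaussian noise
print(width_arm_dependent(T_k, t))   # tighter for frequently pulled arms
```

Note that with V = 1/4 and f(t) = t^2 the width collapses to sqrt(log(t) / T_k), matching the substitution shown above, and the arm-dependent version is the narrowest of the three whenever T_k is large.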

Best,

Tor
