Comments for Bandit Algorithms

Comment on Ellipsoidal Confidence Sets for Least-Squares Estimators by Tor Lattimore

Tor Lattimore — Sat, 14 Mar 2020 14:09:36 +0000

Yes. It is assumed that V is non-singular, which is possible only when $(a_s)$ span $\R^d$.
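To spell out why, here is a short derivation. It assumes, as an interpretation of the notation in the post, that $V = \sum_s a_s a_s^\top$ is the unregularised Gram matrix of the played actions. For any $x \in \R^d$,

$$x^\top V x = \sum_s \langle a_s, x \rangle^2 \ge 0,$$

with equality if and only if $x$ is orthogonal to every $a_s$. Since $V$ is positive semi-definite, it is singular exactly when some nonzero $x$ satisfies $x^\top V x = 0$, that is, when $(a_s)$ does not span $\R^d$.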

Comment on Ellipsoidal Confidence Sets for Least-Squares Estimators by Zeyad

Zeyad — Fri, 13 Mar 2020 14:32:23 +0000

Hi,

To derive equation (1), do we need some additional assumptions on the choice of $a_s$ (e.g. they form a basis)?

Best,
Zeyad Emam

Comment on Bayesian/minimax duality for adversarial bandits by Tiancheng Yu

Tiancheng Yu — Mon, 17 Feb 2020 23:02:00 +0000

Suppose we are learning an $M \times N$ two-player zero-sum game. One might expect the number of steps needed to learn the Nash equilibrium to be proportional to $MN$, but with the above-mentioned algorithms we only need a number of steps proportional to $M+N$. Is there an intuitive explanation for this improvement? Are we implicitly doing something clever?

Comment on Ellipsoidal Confidence Sets for Least-Squares Estimators by Claire

Claire — Fri, 14 Feb 2020 17:49:55 +0000

Thanks for your reply!

Could you kindly point out a reference for the general setting, or briefly mention what the corresponding martingale is in the general setting?

Comment on Ellipsoidal Confidence Sets for Least-Squares Estimators by Tor Lattimore

Tor Lattimore — Fri, 14 Feb 2020 17:45:28 +0000

Hi Claire,

Yes, this result holds very generally. Only a martingale structure is being used and even the estimates can be any appropriately measurable function.

Comment on Ellipsoidal Confidence Sets for Least-Squares Estimators by Claire

Claire — Wed, 15 Jan 2020 13:27:46 +0000

Hi,

Thanks a lot for your great blog and book!

I was reading Exercise 20.12 in the book on the sequential likelihood ratio confidence set extracted from Lemma 2 of Lai and Robbins (1985). This construction seems to be for the iid bandit. Can it be generalized to linear bandits as well?

Comment on Stochastic Linear Bandits and UCB by Tor Lattimore

Tor Lattimore — Thu, 05 Dec 2019 15:48:54 +0000

For the first question: off the top of my head, I would guess they are considering an action set like the sphere, where the curvature allows you to achieve $\sqrt{T}$ regret using an explore-then-commit algorithm.

For the second question, you might start with this paper on generalised linear bandits: https://arxiv.org/pdf/1706.00136.pdf. Wouter Koolen and Remy Degenne also recently presented an algorithm using online learning to incrementally update the “policy”, but in the structured setting. I think that paper has not appeared yet.

Comment on Stochastic Linear Bandits and UCB by Tiancheng Yu

Tiancheng Yu — Sun, 03 Nov 2019 16:38:20 +0000

Hi authors

I have recently been reading the chapters on the stochastic linear bandit and find the material covered here super useful. There are two things I am a little confused about:
1. Unlike in the chapters on finite-armed bandits, ETC-type methods (like PEGE in “Linearly Parameterized Bandits” by Paat Rusmevichientong et al.) are not covered here. In that early paper, the authors claim that PEGE can also achieve $\sqrt{T}$ regret because after $c$ rounds of uniform exploration the regret shrinks as $\frac{1}{c}$. This sounds very counter-intuitive, because the finite-armed case is a special case of this and we already know that this is impossible there. What do you think of that?
2. In the proposed methods like LinUCB, a least-squares problem is solved in each step. I wonder if anyone has tried to use an SGD-style method instead of solving the least-squares problem directly. Of course, in that case the construction of the UCB could be quite different. I am only curious about this possibility.

Thanks!

Comment on Stochastic Linear Bandits and UCB by Csaba Szepesvari

Csaba Szepesvari — Sat, 12 Oct 2019 06:31:05 +0000

I guess the text is not very clear. In this case the $\beta_t$ are defined by the requirement that $\langle \hat \theta_t-\theta_*, a \rangle \le 2 \norm{a}_{V_{t-1}^{-1}} \sqrt{\beta_t}$ should hold with the desired probability for every suboptimal action $a$, which can be satisfied without having the extra $\sqrt{d}$ in $\beta$. The proof of Theorem 19.2 still goes through without any other changes.

Comment on Stochastic Linear Bandits and UCB by Csaba Szepesvari

Csaba Szepesvari — Sat, 12 Oct 2019 06:10:35 +0000

If I may paraphrase, I guess you want a shift-invariant result where the regret scales with the *span* $s=\max_{a,a'} \langle \theta_*,a-a'\rangle$ of the mean rewards rather than with the “scale” $H=\max(1,\max_a |\langle \theta_*, a \rangle|)$, which is what is hidden in Assumption 19.1(a). It seems, though, that the proof of Theorem 19.2 could be rewritten to use $r_t \le s$, which would give a bound that essentially scales with $\max(s,2)$. If the algorithm is also made scale invariant, one can then drop the $\max(\cdot)$ and the regret will scale with $s$ (even when $s\to 0$). However, the algorithm we ultimately have uses regularized least-squares (cf. Chapter 20). This is not scale invariant: the regularization constant sets the scale. Some sort of adaptive regularization would be needed to avoid this (the scale effect washes out with time even if we don’t do anything, but there is still a finite-time effect). The algorithm of Chapter 22, on the other hand, will be scale invariant. The setting considered here is the real “in-between case”: the action set is fixed and finite, as in the standard finite-armed bandit case. Accordingly, this algorithm will be shift and scale invariant. Resolving the general case remains for future work!
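To make the difference between the span $s$ and the scale $H$ concrete, here is a minimal Python sketch. The function names and the mean-reward values are hypothetical, chosen only for illustration; the sketch works directly with the mean rewards $\langle \theta_*, a \rangle$ of a finite action set and shows that adding a constant to every mean leaves $s$ unchanged while $H$ grows.

```python
def span(means):
    """Span s = max_{a,a'} <theta_*, a - a'>, i.e. largest minus smallest mean reward."""
    return max(means) - min(means)


def scale(means):
    """Scale H = max(1, max_a |<theta_*, a>|)."""
    return max(1.0, max(abs(m) for m in means))


# Hypothetical mean rewards <theta_*, a> for three actions (illustrative values only).
means = [0.25, 0.5, 0.75]
# The same problem with every mean reward shifted by the same constant.
shifted = [m + 8.0 for m in means]

print("original:", span(means), scale(means))
print("shifted: ", span(shifted), scale(shifted))
```

A bound stated in terms of $H$ gets worse under the shift even though the problem of telling the actions apart is no harder, which is the motivation for a bound in terms of $s$.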