Nevertheless, ideas based on optimism can work in reinforcement learning, with some modifications and assumptions. Somehow you need to construct confidence sets describing what the world might be like, and then act as if the world is as nice as plausibly possible. You'll have to make assumptions to do this.
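To make the optimism principle concrete, here is a minimal sketch of UCB1 in the Bernoulli bandit setting (the arm means, horizon, and confidence bonus $\sqrt{2\log t / T_i(t)}$ follow the standard textbook form; the specific numbers are made up for illustration):

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 sketch on Bernoulli arms: play each arm once,
    then repeatedly choose the arm maximising (empirical mean + bonus)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # number of pulls per arm
    sums = [0.0] * k       # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # initialisation: pull each arm once
        else:
            # Optimism: act as if each arm is as good as plausibly possible,
            # i.e. maximise the upper confidence bound.
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# With a clear gap between the means, most pulls concentrate on the best arm.
counts = ucb1([0.3, 0.5, 0.7], horizon=5000)
```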

Another big caveat: optimism works well in bandits because you can never suffer too much regret from a single wrong decision. This is obviously not true more generally, so caution is advised!

Tali Sharot’s book “The Optimism Bias” is a nice exploration of optimistic behavior in humans, which may interest you.

Usually, the same life decision cannot be made twice. That is, even if we can reduce uncertainty by making a specific choice, we will never be able to choose among the same set of alternatives again. Is UCB still applicable in such situations?

Why is it $\max_{p \in \mathcal{A}} R_n(p) + 1$? Shouldn’t it be $\max_{p \in \mathcal{A}} R_n(p) + (k-1)$? If we take, w.l.o.g., $\max_{p \in \mathcal{A}} = [1 - (k-1)/n, 1/n, \ldots, 1/n]$, $\max_{p \in P_{k-1}} = [1, 0, \ldots, 0]$ and $y_t = [1, 0, \ldots, 0]$ for every $t$, then $R_n \leq \max_{p \in \mathcal{A}} R_n(p) + (k-1)$, right?

I’m wondering if the regret bounds given here (or in Chapter 19 of the text) can be used to show that LinUCB is Hannan consistent. That is, the theorem shows the regret is sublinear with probability $1-\delta$, which can be used to show the expected regret is sublinear. But can it be proven that the regret of LinUCB is sublinear with probability 1? (If I have the definition of Hannan consistency correct…)

I don’t really see how this happens, or how it would make it ‘too large’. Could you please explain?
