I’m wondering if the regret bounds given here (or Chapter 19 of the text) can be used to show that LinUCB is Hannan consistent? That is, the theorem shows the regret is sub-linear with probability 1-delta, which can be used to show the expected regret is sub-linear. But can it be proven that the regret of LinUCB is sublinear with probability 1? (If I have the definition of Hannan consistency correct…)

]]>I am not really seeing how this is happening and how that would make it ‘too large’. Could you please explain this?

]]>I am not really seeing how this is happening and how that would make it ‘too large’. Could you please explain this?

]]>Now we can write this event as a union of other events:

$A = \cup_s A_s$ where $A_s$ is the event that $\hat \mu_1(t-1) + \sqrt{2/T_1(t-1) \log f(t)} \le \mu_1 – \epsilon \text{ and } T_1(t-1) = s$.

On the event $A_s$ we have $T_1(t-1) = s$, so $A_s$ can also be written as the event that $\hat \mu_{1,s} + \sqrt{2/s \log f(t)} \le \mu_1 – \epsilon \text{ and } T_1(t-1) = s$. Now this is a subset of the event $B_s$ defined to occur when $\hat \mu_{1,s} + \sqrt{2/s \log f(t)} \le \mu_1 – \epsilon$ (clearly if $A_s$ happens, then $B_s$ also happens). So now we have $\Prob{A} = \Prob{\cup_s A_s} \le \sum_s \Prob{A_s} \le \sum_s \Prob{B_s}$.

First inequality is the union bound. Second inequality because $B_s$ is a subset of $A_s$. The price we pay for the fact that $T_1(t-1)$ is a random quantity is that we must sum over all its possible values when applying our concentration inequality.

]]>But you are right that I'm interpreting the above probability as a conditional. If I instead thought of it as P((^mu_i(t-1) – mu_i < g(T_i(t-1)) & (T_i(t-1) was reached)), it makes much more sense. Under this interpretation, if I take the strategy of stopping when x_i=1, then (using my previous example) when T_i(t-1)=3:

P(^mu_i(t-1) – mu_i < g(3)) = P(observing 0, 0, 1 from arm i).

And in general, P(^mu_i(t-1) – mu_i < g(T_i(t-1)) is the probability of observing a particular set of data. Given that, the union bound makes perfect sense.

Is that a reasonably way of thinking about this?

I know you keep mentioning a simpler bound with an easier proof, but it's driving me nuts that I don't understand this one.

Thanks,

Peter

Probably the misconception is coming from your definition of P(^mu(T_i=3) – mu_i < g(3)). My interpretation of your argument is that this is meant to be a conditional probability. These are very difficult to handle in sequential settings and we avoid them. Our argument comes from the following view. To each arm associate a big stack of rewards, which are sampled independently at the beginning of the game and not observed. Each time the learner chooses an action, it gets the top reward on the stack corresponding to that arm. Then ^mu_{1,s} is the mean of the first s rewards in the stack corresponding to arm 1. Since these really are independent we can apply our concentration bound to show that with high probability ^mu_{1,s} + sqrt(2/s log f(t)) is never much smaller than mu_1 - epsilon for any s. Now the event F that ^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) < mu_1 - epsilon is definitely a subset of the union of F_s with s in [n] where F_s is the event that ^mu_{1,s} + sqrt(2/s log f(t)) < mu_1 - epsilon. This is true because T_1(t-1) must be in [n] = {1,2,...,n}. Hence Prob(^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) < mu_1 - epsilon) <= sum_{s=1}^n Prob(^mu_{1,s} + sqrt(2/s log f(t)) < mu_1 - epsilon)) Notice that in the left-hand side probability there is no conditioning. By the way, in the pdf book we have two versions of UCB, the first of which has slightly worse bounds than what we present here, but an easier proof (see Chapters 7 and 8).

]]>