^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) <= mu_1 - epsilon

Now we can write this event as a union of other events: A = union_s A_s, where A_s is the event that

^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) <= mu_1 - epsilon AND T_1(t-1) = s.

On the event A_s we have T_1(t-1) = s, so A_s can also be written as the event that

^mu_{1,s} + sqrt(2/s log f(t)) <= mu_1 - epsilon AND T_1(t-1) = s.

This is a subset of the event B_s, defined to occur when

^mu_{1,s} + sqrt(2/s log f(t)) <= mu_1 - epsilon

(clearly if A_s happens, then B_s also happens). So now we have

Prob(A) = Prob(union_s A_s) <= sum_s Prob(A_s) <= sum_s Prob(B_s).

The first inequality is the union bound; the second holds because A_s is a subset of B_s. The price we pay for the fact that T_1(t-1) is a random quantity is that we must sum over all its possible values when applying our concentration inequality.
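This chain of inequalities can be checked numerically. Below is a small Monte Carlo sketch (my own toy setup, not from the post: a Bernoulli(0.6) arm, log f(t) replaced by a fixed constant, and a stop-at-first-success rule standing in for the random T_1(t-1)). In every trial, the indicator of A is at most the sum of the indicators of the B_s, which is exactly why Prob(A) <= sum_s Prob(B_s).

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, eps, n, trials = 0.6, 0.05, 50, 20000
log_ft = np.log(2.0)  # stand-in for log f(t); any fixed constant works here

count_A = 0                    # occurrences of the event A
count_B = np.zeros(n)          # count_B[s-1] counts occurrences of B_s

for _ in range(trials):
    x = rng.random(n) < mu1                        # rewards of arm 1
    s = np.arange(1, n + 1)
    means = np.cumsum(x) / s                       # ^mu_{1,s} for s = 1..n
    bad = means + np.sqrt(2 / s * log_ft) <= mu1 - eps   # indicator of B_s
    count_B += bad
    # a data-dependent sample count standing in for T_1(t-1):
    # stop at the first success, capped at n
    T = int(np.argmax(x)) + 1 if x.any() else n
    count_A += bool(bad[T - 1])                    # event A at the random T

print(count_A / trials, count_B.sum() / trials)    # P(A) <= sum_s P(B_s)
```

Note that the inequality even holds pointwise per trial: whenever the bound fails at the random T, that same failure is counted in one of the B_s, which is the union-bound step made concrete.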

But you are right that I'm interpreting the above probability as a conditional. If I instead thought of it as P((^mu_i(t-1) – mu_i < g(T_i(t-1))) & (T_i(t-1) was reached)), it makes much more sense. Under this interpretation, if I take the strategy of stopping when x_i=1, then (using my previous example) when T_i(t-1)=3:

P(^mu_i(t-1) – mu_i < g(3)) = P(observing 0, 0, 1 from arm i).

And in general, P(^mu_i(t-1) – mu_i < g(T_i(t-1))) is the probability of observing a particular set of data. Given that, the union bound makes perfect sense.

Is that a reasonable way of thinking about this?

I know you keep mentioning a simpler bound with an easier proof, but it's driving me nuts that I don't understand this one.

Thanks,

Peter

Probably the misconception is coming from your definition of P(^mu(T_i=3) – mu_i < g(3)). My interpretation of your argument is that this is meant to be a conditional probability. These are very difficult to handle in sequential settings and we avoid them.

Our argument comes from the following view. To each arm, associate a big stack of rewards, which are sampled independently at the beginning of the game and not observed. Each time the learner chooses an action, it gets the top reward on the stack corresponding to that arm. Then ^mu_{1,s} is the mean of the first s rewards in the stack corresponding to arm 1. Since these really are independent, we can apply our concentration bound to show that with high probability ^mu_{1,s} + sqrt(2/s log f(t)) is never much smaller than mu_1 - epsilon for any s.

Now the event F that ^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) < mu_1 - epsilon is definitely a subset of the union of F_s with s in [n], where F_s is the event that ^mu_{1,s} + sqrt(2/s log f(t)) < mu_1 - epsilon. This is true because T_1(t-1) must be in [n] = {1,2,...,n}. Hence

Prob(^mu_1(t-1) + sqrt(2/T_1(t-1) log f(t)) < mu_1 - epsilon) <= sum_{s=1}^n Prob(^mu_{1,s} + sqrt(2/s log f(t)) < mu_1 - epsilon).

Notice that in the left-hand-side probability there is no conditioning.

By the way, in the pdf book we have two versions of UCB, the first of which has slightly worse bounds than what we present here, but an easier proof (see Chapters 7 and 8).
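The stack-of-rewards view can be sketched in a few lines (a toy illustration with made-up means; `pull` and `mu_hat` are hypothetical helpers, not from the book): the stacks are sampled before play starts, so ^mu_{1,s} is an average of s i.i.d. samples no matter what the learner does.

```python
import numpy as np

rng = np.random.default_rng(1)
n, means = 10, [0.7, 0.4]      # horizon and (made-up) arm means

# pre-sample a stack of n rewards per arm before the game begins
stacks = [(rng.random(n) < mu).astype(int) for mu in means]
consumed = [0, 0]              # how far into each stack we have looked

def pull(arm):
    """The learner receives the top unobserved reward of the chosen arm's stack."""
    r = stacks[arm][consumed[arm]]
    consumed[arm] += 1
    return int(r)

def mu_hat(arm, s):
    """^mu_{arm,s}: mean of the first s pre-sampled rewards, fixed before play."""
    return stacks[arm][:s].mean()
```

Because mu_hat(arm, s) is determined by the pre-sampled stack alone, a concentration bound for a fixed sample size s applies to it directly; the randomness of when (or whether) the learner actually reaches s pulls never enters.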

x_i \in {0,1}; for definiteness let’s say P(x_i=1) = 1/2, so mu_i = 1/2.

strategy: choose arm i until x_i=1; then stop. under this strategy, if T_i = 3 then ^mu_i = 1/3.

let g(s) = 1/2 if s=3 and -1 if s \ne 3.

with this setup,

P(^mu(T_i=3) – mu_i < g(3)) = 1,

whereas

sum_s P(^mu_{i,s} – mu_i < g(s)) = P(^mu_{i,3} – mu_i < 1/2) < 1.

Thus, P(^mu(T_i=3) – mu_i < g(3)) is not less than sum_s P(^mu_{i,s} – mu_i < g(s)).
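To make the two quantities in this example concrete, here is a quick simulation (a sketch of my reading of the setup, with Bernoulli(1/2) rewards): conditional on T_i = 3, the observed data is forced to be 0, 0, 1, so ^mu_i = 1/3 with conditional probability 1, while by my arithmetic the fixed-sample-size probability P(^mu_{i,3} – mu_i < 1/2) = P(not observing 1, 1, 1) = 7/8.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 100_000
cond_hits = cond_total = uncond_hits = 0

for _ in range(trials):
    x = (rng.random(3) < 0.5).astype(int)   # first three Bernoulli(1/2) rewards
    if x.tolist() == [0, 0, 1]:             # strategy stopped with T_i = 3
        cond_total += 1
        cond_hits += (x.mean() == 1/3)      # ^mu_i given T_i = 3
    uncond_hits += (x.mean() - 0.5 < 0.5)   # fixed s = 3: ^mu_{i,3} - mu_i < 1/2

print(cond_hits / cond_total)   # conditional probability: exactly 1.0
print(uncond_hits / trials)     # close to 7/8 = 0.875
```

The conditional probability is 1 precisely because conditioning on T_i = 3 under this stopping rule pins down the data, which is the difficulty with conditional probabilities that the replies in this thread point to.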

I'm guessing I have a serious misconception, but I can't figure out where.

Thanks,

Peter

P.S. The first time I left a comment, it took me forever to get past the robot with the missing eye, because its right eye is missing, and it asks to add the left eye. Is that part of the test? 😉

The core is that F is indeed a subset of the union of all the A_s, and so the probability of F is at most the probability of the union. The second important part is that the concentration analysis can bound the probability of each A_s.

By the way, a slightly more straightforward analysis of a simpler algorithm is given in Chapter 7 of the book.

Let A_s be the event {T(t-1) = s and ^mu_s – mu <= g(s)}. Then the event F = {^mu_{T(t-1)} – mu <= g(T(t-1))} is a subset of the union of A_1,...,A_n, since T(t-1) must be between 1 and n. Then the union bound says that Prob(F) <= Prob(union_s A_s) <= sum_s Prob(A_s)

P(\hat{mu}(t-1) – mu_1 < g(n,T(t-1))) \le sum_s P(\hat{mu}(s) – mu_1 < g(n,s))

Presumably this should hold for arbitrary g(n,s). But I've been trying to convince myself of that for days, with no luck. Is there an easy way to see (or prove) it?

Thanks,

P

Re sublinear regret and the definition of good learners: Yes, this was meant to be a minimum requirement for what a good learner should aim for.

Re next mention of fixed action: I hope my explanation to the first concern clears this up, too.

Stationary, non-stationary: Yes, we were ahead of ourselves here, though this particular paragraph is luckily not essential for understanding the rest. In the book, we have a separate chapter dealing with nonstationary environments, which could also help. I am not sure whether you are looking for an explanation here, but just in case, by stationary we simply meant that the laws governing the response of the environment (how rewards are generated in response to the learner pulling arms) do not change over time. Then, nonstationarity is just when this does not hold.
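As a minimal illustration of this definition (my own toy example, not from the book): in a stationary environment the reward distribution of each arm is the same for every t, while in a nonstationary one it may depend on t.

```python
import numpy as np

rng = np.random.default_rng(3)

def stationary_reward(arm, t):
    """Stationary: the law of the reward depends only on the arm, never on t."""
    means = [0.7, 0.4]
    return float(rng.random() < means[arm])

def nonstationary_reward(arm, t, horizon=1000):
    """Nonstationary: arm 0's mean drifts downward as the game goes on."""
    means = [0.7 - 0.5 * min(t, horizon) / horizon, 0.4]
    return float(rng.random() < means[arm])
```

Under the stationary rule a fixed arm has the same expected reward at every round; under the drifting rule, the arm that is best early may no longer be best late, which is what makes nonstationary environments harder.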
