Problem Setting

The problem we consider is the stochastic multi-armed bandit. In this setting, the player is presented with a collection of actions, or arms, and pulls one arm in each round of play. Each arm \(i\) yields a random reward \(X_{i, t}\) drawn from a subgaussian distribution supported on \([0,1]\), with \(\mathbb{E}[X_{i, t}] = \mu_i\). The goal of the player is to minimize the cumulative regret. If the player chooses arm \(a_t\) at round \(t\), the regret after \(T\) rounds can be written as

\begin{equation} R_T = \max\limits_{j} \mathbb{E}[\sum_{t=1}^{T}(X_{j, t}-X_{a_t, t})] \end{equation}

Another way to write this uses the reward gap \(\Delta_i = \max_j \mu_j - \mu_i\), with which the regret can also be written as

\begin{equation} R_T = \mathbb{E}\left[\sum_{t=1}^{T} \Delta_{a_t}\right] = \sum_{i=1}^{n} \Delta_i \mathbb{E}[T_i] \end{equation} in which \(T_i\) is the number of times arm \(i\) is chosen during the \(T\) rounds.
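To make the setting concrete, here is a minimal Python sketch of such an environment; the class name BernoulliBandit and its interface are illustrative choices, not part of the formal setup. It draws \(X_{i,t}\) from a Bernoulli distribution with mean \(\mu_i\) (a special case of the \([0,1]\)-bounded rewards above) and accumulates the pseudo-regret \(\sum_t \Delta_{a_t}\).

```python
import numpy as np

class BernoulliBandit:
    """Toy stochastic bandit with Bernoulli {0,1} rewards.

    means[i] plays the role of mu_i; pull(i) returns X_{i,t} and
    accumulates the gap Delta_i of the pulled arm as pseudo-regret.
    """

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.gaps = self.means.max() - self.means   # Delta_i for each arm
        self.pseudo_regret = 0.0                    # running sum of Delta_{a_t}

    def pull(self, arm):
        # Sample X_{arm,t} ~ Bernoulli(mu_arm) and track the incurred gap.
        self.pseudo_regret += self.gaps[arm]
        return float(self.rng.random() < self.means[arm])
```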

Classic Tools

Upper Confidence Bound (UCB)

Idea

Estimate the reward of each arm optimistically by an upper confidence bound (or, for costs, a lower confidence bound) on its mean, and always play the arm whose estimate is most favorable.

Algorithm

To be done.
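In the meantime, here is a minimal sketch of the standard index-based loop, assuming the index \(UCB_{i,t} = \hat{\mu_{i, T_i}} + \sqrt{\frac{\alpha \log t}{2T_i}}\) analyzed in the next subsection; the function name ucb_play and the bandit interface (an object with a pull(arm) method, such as the BernoulliBandit sketch above) are illustrative.

```python
import math
import numpy as np

def ucb_play(bandit, n_arms, horizon, alpha=2.0):
    """Sketch of the UCB loop: pull every arm once, then always pull the arm
    maximizing the index  mu_hat_i + sqrt(alpha * log t / (2 * T_i))."""
    counts = np.zeros(n_arms)   # T_i: number of pulls of arm i
    means = np.zeros(n_arms)    # mu_hat_i: empirical mean reward of arm i

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # initialization rounds: try each arm once
        else:
            index = means + np.sqrt(alpha * math.log(t) / (2 * counts))
            arm = int(np.argmax(index))
        reward = bandit.pull(arm)
        counts[arm] += 1
        # incremental update of the empirical mean of the chosen arm
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means
```

For example, one can create bandit = BernoulliBandit([0.5, 0.6, 0.9]), call ucb_play(bandit, 3, 10000), and read an empirical estimate of \(R_T\) from bandit.pseudo_regret.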

Regret Bound

We prove a more general form of the bound. Suppose the index used in the algorithm is \(UCB_{i,t} = \hat{\mu_{i, T_i}} + \sqrt{\frac{\alpha \log t}{2T_i}}\), where \(\hat{\mu_{i, T_i}}\) is the empirical mean of the rewards received from arm \(i\) so far. We can bound the cumulative regret as follows.

Theorem 1 (Regret Bound for UCB) For \(T \ge 1\) and \(\alpha > 1\), \begin{equation} R_T \le \sum_{i:\Delta_i > 0} \left( \frac{2\alpha \log T}{\Delta_i} + \frac{2\alpha}{\alpha-1}\Delta_i \right) \end{equation}
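For instance, with the common choice \(\alpha = 2\), the bound specializes to

\[R_T \le \sum_{i:\Delta_i > 0} \left( \frac{4 \log T}{\Delta_i} + 4\Delta_i \right),\]

which exhibits the familiar logarithmic growth in \(T\).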

Proof. Suppose, without loss of generality, that the first arm is optimal. From the formula for \(R_T\), the key is to bound \(\mathbb{E}[T_i]\) for each suboptimal arm \(i\). If we select such an arm at round \(t\), there are two possible scenarios: either the confidence bounds are no longer valid, or arm \(i\) has not been sampled enough times to guarantee \(UCB_{i,t} < UCB_{1, t}\). We need to bound how often each case can happen.

We first analyze the latter case. Since arm \(i\) is selected at round \(t\), its index is the largest, so

\[\hat{\mu_{i, T_i}} + \sqrt{\frac{\alpha \log t}{2T_i}} \ge \hat{\mu_{1, T_1}} + \sqrt{\frac{\alpha \log t}{2T_1}}\]

Since in this case we assume the confidence bounds are still valid, the following two events hold:

  • \[LB_t: \mu_i \ge \hat{\mu_{i, T_i}} - \sqrt{\frac{\alpha \log t}{2T_i}}\]
  • \[UB_t: \mu_1 \le \hat{\mu_{1, T_1}} + \sqrt{\frac{\alpha \log t}{2T_1}}\]

The intuition is that if \(LB_t\) fails, even the lower confidence bound exceeds \(\mu_i\), so \(UCB_{i,t}\) overestimates \(\mu_i\) badly; symmetrically, if \(UB_t\) fails, the whole confidence interval of arm 1 lies below \(\mu_1\). On \(LB_t\) and \(UB_t\), chaining \(LB_t\), the selection inequality above, and \(UB_t\) gives \(\mu_i + 2\sqrt{\frac{\alpha \log t}{2T_i}} \ge \hat{\mu_{i, T_i}} + \sqrt{\frac{\alpha \log t}{2T_i}} \ge \hat{\mu_{1, T_1}} + \sqrt{\frac{\alpha \log t}{2T_1}} \ge \mu_1\), that is,

\[\mu_i + \sqrt{\frac{2\alpha \log t}{T_i}} \ge \mu_1.\]

Rearranging this as \(\Delta_i = \mu_1 - \mu_i \le \sqrt{\frac{2\alpha \log t}{T_i}}\) and using \(t \le T\), we get

\[T_i \le \frac{2\alpha\log T}{\Delta_i^2}.\]
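The next step relies on a standard concentration inequality. For i.i.d. samples \(X_1, \dots, X_n \in [0,1]\) with mean \(\mu\) and empirical mean \(\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} X_k\), Hoeffding's inequality states that

\[P(\hat{\mu}_n - \mu > \varepsilon) \le \exp(-2n\varepsilon^2) \quad \text{for every } \varepsilon > 0.\]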

Next, we bound the probability that the former case happens, i.e., the event \(LB_t^c \cup UB_t^c\), which can be controlled with a union bound. The event \(LB_t^c\) is equivalent to \(\hat{\mu_{i, T_i}} - \mu_i > \sqrt{\frac{\alpha \log t}{2T_i}}\). Since \(\mu_i = \mathbb{E}[\hat{\mu_{i, T_i}}]\) and the rewards lie in \([0,1]\), we can apply Hoeffding's inequality (a Chernoff-type bound, recalled above) to bound the probability of \(LB_t^c\):

\[P(LB_t^c) = P\left(\hat{\mu_{i, T_i}} - \mu_i > \sqrt{\frac{\alpha \log t}{2T_i}}\right) \le \exp\left(-2T_i \cdot \frac{\alpha \log t}{2T_i}\right) = \exp(-\alpha \log t) = \frac{1}{t^\alpha}\]

The inequality applies Hoeffding's bound with \(n = T_i\) samples and \(\varepsilon = \sqrt{\frac{\alpha \log t}{2T_i}}\). Similarly, we can prove that \(P(UB_t^c) \le \frac{1}{t^\alpha}\).

Thus, we have

\[\begin{array}{rcl} \mathbb{E}[T_i] & = & \sum_{t=1}^T P(a_t = i) \\ & \le & \frac{2\alpha \log T}{\Delta_i^2} + \sum_{t=1}^T P(LB_t^c \cup UB_t^c) \\ & \le & \frac{2\alpha \log T}{\Delta_i^2} + 2\sum_{t=1}^T \frac{1}{t^\alpha} \\ & \le & \frac{2\alpha \log T}{\Delta_i^2} + \frac{2\alpha}{\alpha-1} \end{array}\]
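The last inequality uses an integral comparison, valid for \(\alpha > 1\):

\[\sum_{t=1}^T \frac{1}{t^\alpha} \le 1 + \int_1^\infty \frac{ds}{s^\alpha} = 1 + \frac{1}{\alpha - 1} = \frac{\alpha}{\alpha - 1}.\]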

Substituting this bound on \(\mathbb{E}[T_i]\) into the formula for \(R_T\) yields the result.\(\tag*{$\blacksquare$}\)

Thompson Sampling

To be done.