Home Upper Confidence Bound (UCB)
Post
Cancel

Upper Confidence Bound (UCB)

What is UCB?

UCB finds a way to combine “exploration” with “exploitation” to get the best result.

Algorithm of UCB



Let’s assume that we are planning an advertisement. There are three steps to get the result:

  • Step 1.
    At each round $n$, we consider two numbers for each ad $i$:

    • $N_{i}(n)$ : The number of times the ad $i$ was selected up to round $n$.
    • $R_{i}(n)$ : The sum of rewards of the ad $i$ up to round $n$.
  • Step 2.
    From these two numbers, we compute:

    • the average reward of ad $i$ up to round $n$.
    $\bar{r_{i}} = \frac{R_{i}(n)}{N_{i}(n)}$
    • the confidence interval $\bar{r_{i}}\ -\ \Delta_{i}(n),\ \bar{r_{i}}\ +\ \Delta_{i}(n)$ at round $n$ with
    $\Delta_{i}(n)\ =\ \sqrt{\frac{3log(n)}{2N_{i}(n)}}$
  • Step 3. We select the ad $i$ that has the maximum UCB($\bar{r_{i}}\ +\ \Delta_{i}(n)$).



The process of deriving the result value of UCB



Let’s assume we have an initial distribution for ads.



And I’m going to take this value and put it on the vertical axis.



Assume starting points from the distributions D1, D2, D3, D4, and D5. I’ll put them at the same level, assuming that all the same profits return.



Then, they create confidence bands. These bands are built with a very high level of certainty and contain actual returns or actual expected returns.



After that, the algorithm will choose one box randomly. It returns either true or false depending on the observed value. In our case, we chose D3 and it returned false. If so, the red dotted line goes down, and the confidence interval (box) shrinks because we have additional observation, we are more confident in our predictions.



In the next step, we find the one with the highest confidence bound. If there is more than one highest confidence bound, choose randomly. Then, repeat the above steps.



If we repeat what we have done so far, the best value is found.





Example



Code



1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import math

N = 10000
d = 10
ads_selected = []
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
total_reward = 0
for n in range(N):
  ad = 0
  max_upper_bound = 0
  for i in range(d):
    if numbers_of_selections[i] > 0:
      average_reward = sums_of_rewards[i] / numbers_of_selections[i]
      delta_i = math.sqrt((3/2) * (math.log(n+1)/numbers_of_selections[i]))
      upper_bound = average_reward + delta_i
    else:
      upper_bound = 1e400
    if upper_bound > max_upper_bound:
      max_upper_bound = upper_bound
      ad = i
  ads_selected.append(ad)
  numbers_of_selections[ad] += 1
  reward = dataset.values[n, ad]
  sums_of_rewards[ad] += reward
  total_reward += reward

plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()



Result



This post is licensed under CC BY 4.0 by the author.