Home Thompson Sampling
Post
Cancel

Thompson Sampling

What is Thompson Sampling?

Unlike UCB, which selects a high confidence interval, Thompson sampling creates a probabilistic perception.

Algorithm of Thompson Sampling



Let’s assume that we are planning an advertisement.
There are three steps to get the result:

  • Step 1.
    At each round $n$, we consider two numbers for each ad $i$:
    • $N_{i}^{1}(n)$ : the number of times the ad $i$ got reward 1 up to round $n$.
    • $N_{i}^{0}(n)$ : the number of times the ad $i$ got reward 0 up to round $n$.

  • Step 2.
    For each ad $i$, we take a random draw from the distribution below:
$\theta_{i}(n) = \beta(N_{i}^{1}(n)\ +\ 1,\ N_{i}^{0}(n)\ +\ 1)$


  • Step 3.
    Select the ad that has the highest $\theta_{i}(n)$.

The process of deriving the result value of Thompson Sampling



You can see the graph. The x-axis represents the expected profit from the ad, and the bar represents the center of the distribution.



After that, Let’s assume we try a round to blue ad and get the result like:



Thompson Sampling creates a new distribution by incorporating the results into a distribution.



As above, apply it to yellow ad and blue ad and get the result.



As you can see, the expected value produced by the machine does not match the actual distribution. To solve this problem, we can help machine by giving real expected value.
And the distribution will be change like:



Repeat the above process until the distribution created by the machine is sufficiently refined.



As you can see in the image above, the yellow ad will give us the highest profit.



Example



Code



1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import random

N = 10000
d =10
ads_selected =[]
numbers_of_reward_1 = [0] * d
numbers_of_reward_0 = [0] * d
total_reward = 0
for n in range(N):
  ad = 0
  max_random = 0
  for i in range(d):
    # Step 2
    random_beta = random.betavariate(numbers_of_reward_1[i] + 1, numbers_of_reward_0[i] + 1)
    # Step 3
    if random_beta > max_random:
      max_random = random_beta
      ad = i
  ads_selected.append(ad)
  reward = dataset.values[n, ad]
  if reward == 1:
    numbers_of_reward_1[ad] += 1
  elif reward == 0:
    numbers_of_reward_0[ad] += 1
  total_reward += reward


plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()     



Result





This post is licensed under CC BY 4.0 by the author.