# Thompson Sampling
## Overview
Thompson Sampling is a probabilistic algorithm for decision-making in multi-armed bandit problems. It balances exploration and exploitation by maintaining a probability distribution over the expected reward of each option; for the binary (click / no-click) rewards used here, that distribution is a Beta distribution.
Unlike Upper Confidence Bound (UCB), which is deterministic and requires an update after every round, Thompson Sampling is randomized and copes well with delayed feedback. This makes it practical for applications such as online advertising, where feedback can arrive in batches rather than in real time.
## Algorithm Steps
1. **Initialize**: Maintain two counters for each ad:
   - The number of times the ad received a reward of 1.
   - The number of times the ad received a reward of 0.
2. **Sampling**: At each round, for each ad, draw a random sample from a Beta(number of 1-rewards + 1, number of 0-rewards + 1) distribution; a short standalone sketch of this step follows the list.
3. **Selection**: Show the ad with the highest sampled value.
4. **Update**: Record the reward obtained from the chosen ad and increment the matching counter.
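As a toy illustration of the sampling step (the counts below are hypothetical, not taken from the dataset used later), the snippet compares one posterior draw for two ads. The ad with more observed successes usually wins the round, but not always, which is what keeps the algorithm exploring:
```python
import random

# Assumed counts for two hypothetical ads:
# ad A: 8 rewards of 1, 2 rewards of 0  ->  posterior Beta(9, 3)
# ad B: 3 rewards of 1, 7 rewards of 0  ->  posterior Beta(4, 8)
sample_a = random.betavariate(8 + 1, 2 + 1)
sample_b = random.betavariate(3 + 1, 7 + 1)

# The higher draw wins this round; ad A wins most rounds because its
# posterior concentrates on a higher click-through rate, yet ad B still
# wins occasionally, preserving exploration.
print(f'ad A: {sample_a:.3f}, ad B: {sample_b:.3f}')
```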
## Code Implementation
### Importing Libraries
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random # For beta distribution sampling
```
### Loading Dataset
```python
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')
```
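Each row of the CSV represents one simulated user and each column one ad, with a 1 where that user would click the ad. If the file is not at hand, a synthetic stand-in can be generated; the click-through rates below are assumed values chosen only so the rest of the script runs end to end:
```python
import numpy as np
import pandas as pd

# Assumed per-ad click-through rates (illustrative, not from the real dataset)
ctrs = [0.05, 0.13, 0.09, 0.16, 0.11, 0.04, 0.20, 0.08, 0.10, 0.07]
rng = np.random.default_rng(42)  # Fixed seed for reproducibility

# 500 users x 10 ads; entry (n, i) is 1 if user n would click ad i
dataset = pd.DataFrame(rng.binomial(1, ctrs, size=(500, 10)))
```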
### Thompson Sampling Implementation
```python
N = 500  # Number of rounds (users to whom an ad is shown)
d = 10   # Number of ads
ads_selected = []  # Ad chosen in each round

# Step 1 (Initialize): counters for observed rewards
numbers_of_reward_1 = [0] * d  # Times ad i received reward 1
numbers_of_reward_0 = [0] * d  # Times ad i received reward 0
total_reward = 0               # Total accumulated reward

for n in range(N):
    ad = 0          # Ad selected this round
    max_random = 0  # Highest Beta sample seen this round
    # Step 2 (Sampling): draw from each ad's Beta posterior
    for i in range(d):
        random_beta = random.betavariate(numbers_of_reward_1[i] + 1,
                                         numbers_of_reward_0[i] + 1)
        # Step 3 (Selection): keep the ad with the highest sample
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)         # Record the selected ad
    reward = dataset.values[n, ad]  # Observe the reward for the selected ad
    # Step 4 (Update): record the reward in the counters
    if reward == 1:
        numbers_of_reward_1[ad] += 1
    else:
        numbers_of_reward_0[ad] += 1
    total_reward += reward
```
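A quick sanity check after the loop (reusing the variables defined above):
```python
# With 500 rounds, the best ad should already dominate the selections
print(f'Total reward over {N} rounds: {total_reward}')
print(f'Most frequently selected ad: {max(set(ads_selected), key=ads_selected.count)}')
```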
### Visualizing the Results
```python
plt.hist(ads_selected, bins=np.arange(d+1)-0.5, edgecolor='black')
plt.title('Histogram of Ads Selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.xticks(range(d))
plt.show()
```
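If the run has converged, the histogram should be dominated by a single bar: the ad with the highest underlying click-through rate collects most of the selections, while the other ads show only the small number of exploratory picks made in the early rounds.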
## Comparison: UCB vs. Thompson Sampling
| Feature | UCB (Upper Confidence Bound) | Thompson Sampling |
|----------------------|-----------------------------|-------------------|
| Nature | Deterministic | Probabilistic |
| Update Requirement | Every round | Can be updated in batches |
| Randomness | None | Uses random sampling |
| Practicality | Works well with immediate updates | Suitable for delayed feedback |
## Conclusion
Thompson Sampling is often more practical than UCB in real-world applications due to its ability to handle delayed feedback and batch updates. This makes it a preferred choice for problems like online advertising, where ads can be updated periodically rather than immediately.
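To make the batch-update point concrete, here is a minimal standalone sketch (reusing N, d, and dataset from above, with the counters re-initialized; the batch size of 50 is an assumed value). The Beta posteriors are frozen while a whole batch of users is served, and the counters are updated only once per batch, mimicking feedback that arrives with a delay:
```python
import random

batch_size = 50  # Assumed batch size for illustration
numbers_of_reward_1 = [0] * d  # Re-initialized counters for this standalone run
numbers_of_reward_0 = [0] * d

for start in range(0, N, batch_size):
    # Freeze the posteriors as they stand at the start of the batch
    frozen_1 = numbers_of_reward_1[:]
    frozen_0 = numbers_of_reward_0[:]
    batch_choices = []
    for n in range(start, min(start + batch_size, N)):
        samples = [random.betavariate(frozen_1[i] + 1, frozen_0[i] + 1) for i in range(d)]
        ad = samples.index(max(samples))
        batch_choices.append((n, ad))
    # Feedback for the whole batch arrives at once; update counters in bulk
    for n, ad in batch_choices:
        if dataset.values[n, ad] == 1:
            numbers_of_reward_1[ad] += 1
        else:
            numbers_of_reward_0[ad] += 1
```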