
Multi-armed Bandits

The multi-armed bandit problem is the simplest reinforcement learning setting in Sutton and Barto's development. There is no state transition structure and no delayed reward. On each step the agent chooses one of k actions and receives a reward drawn from that action's distribution. The difficulty is not planning through time, but balancing exploitation of the action that currently looks best against exploration of actions that might be better.

Bandits are important because they isolate a core statistical tension that appears everywhere else in reinforcement learning. A value estimate can only improve for actions tried often enough, but choosing apparently suboptimal actions has an opportunity cost. Action-value methods, ϵ-greedy exploration, optimistic initial values, upper-confidence bounds, and gradient bandits are all different answers to that tension.

Definitions

In a k-armed bandit, the action set is \mathcal{A} = \{1, \dots, k\}. Each action a has an unknown true action value

q_*(a) = \mathbb{E}[R_t \mid A_t = a].

The learner keeps an estimate Q_t(a) of q_*(a). The sample-average estimate after N_t(a) selections is

Q_t(a) = \frac{\text{sum of rewards received after action } a}{N_t(a)}.

The incremental form is

Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right),

where n counts selections of that action. More generally, a constant step size α gives

Q_{n+1} = Q_n + \alpha(R_n - Q_n),

which is useful when rewards are nonstationary.
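
As a rough illustration (not from the book), the sketch below compares the two updates on a reward stream whose mean shifts halfway through; the distributions and step size are made up for the example.

import numpy as np

# Illustrative sketch: sample-average vs. constant step-size estimates on a
# nonstationary reward stream whose true mean jumps from 0.0 to 2.0.
rng = np.random.default_rng(0)
rewards = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])

q_sample, q_const, alpha = 0.0, 0.0, 0.1
for n, r in enumerate(rewards, start=1):
    q_sample += (r - q_sample) / n      # step size 1/n: old data never forgotten
    q_const += alpha * (r - q_const)    # constant step size: tracks recent rewards

print(f"sample average: {q_sample:.2f}")   # near 1.0, dragged down by old data
print(f"constant alpha: {q_const:.2f}")    # close to the current mean of 2.0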

The ϵ-greedy action rule chooses a greedy action with probability 1 - ϵ and a random action with probability ϵ. Upper-confidence-bound action selection chooses

A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right],

where c > 0 controls the amount of exploration. Gradient bandit methods do not select actions directly from estimated reward means; instead, they learn preferences H_t(a) and use a softmax policy:

\Pr(A_t = a) = \frac{\exp(H_t(a))}{\sum_b \exp(H_t(b))}.
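
A minimal sketch of this softmax policy, using arbitrary preference values chosen only for illustration:

import numpy as np

# Convert preferences H(a) into softmax action probabilities and sample an action.
H = np.array([0.5, 2.0, -1.0])     # illustrative preferences
pi = np.exp(H - H.max())           # subtracting the max improves numerical stability
pi /= pi.sum()
print(np.round(pi, 3), pi.sum())   # probabilities sum to 1

rng = np.random.default_rng(0)
action = rng.choice(len(H), p=pi)  # higher-preference actions are chosen more often
print("sampled action:", int(action))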

Key results

The incremental update

new estimate ← old estimate + step size × error

is a pattern that recurs through the book. In bandits the error is R_n - Q_n. In temporal-difference learning it becomes a reward plus a bootstrapped next value minus the current value.

Sample averages converge to true means in stationary problems under repeated sampling. Their weakness is inertia: after many observations, the effective step size 1/n becomes tiny. If the reward distribution changes, old data dominate the estimate. Constant step-size methods weight recent observations more heavily. Expanding the recursion shows that

Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i,

so rewards receive exponentially decaying weights.
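
The identity is easy to verify numerically. The sketch below (with arbitrary rewards and step size) runs the incremental update and compares it with the expanded weighted sum:

import numpy as np

# Check that n incremental constant step-size updates equal the expanded
# exponentially weighted sum of rewards plus the decayed initial estimate.
rng = np.random.default_rng(1)
rewards = rng.normal(0.0, 1.0, size=20)
alpha, Q1 = 0.1, 0.0

Q = Q1
for r in rewards:
    Q += alpha * (r - Q)

n = len(rewards)
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))       # weight on reward R_i
expanded = (1 - alpha) ** n * Q1 + np.sum(weights * rewards)
print(Q, expanded)   # identical up to floating-point error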

Optimistic initial values encourage early exploration by making untried actions look attractive. This works naturally with greedy selection in stationary tasks, but it is a temporary exploration device. Once all actions have been tried and estimates settle, optimism fades.
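
A small sketch of the idea, with made-up action values and an arbitrary optimistic constant:

import numpy as np

# Optimistic initial values on a 3-armed bandit: every estimate starts well above
# any plausible reward, so a purely greedy learner still tries each action early on.
rng = np.random.default_rng(2)
true_q = np.array([0.2, 0.5, 1.0])   # illustrative true action values
Q = np.full(3, 5.0)                  # optimistic initial estimates
alpha = 0.1

for t in range(200):
    a = int(np.argmax(Q))            # greedy selection throughout
    r = rng.normal(true_q[a], 1.0)
    Q[a] += alpha * (r - Q[a])       # each pull lowers the estimate toward its true value

print("estimates:", np.round(Q, 2), "| greedy choice:", int(np.argmax(Q)))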

UCB exploration explicitly adds uncertainty bonuses. Actions selected rarely have larger bonuses because N_t(a) is small. As an action is sampled more often, the uncertainty term shrinks. This implements a more targeted form of exploration than ϵ-greedy, which continues to select random actions even when some are clearly poor.
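
A sketch of UCB selection is shown below; treating untried actions as maximally urgent is one common convention, echoed in the pitfalls later in this section. The example values are made up.

import numpy as np

# UCB action selection: value estimate plus an uncertainty bonus that shrinks
# as an action is sampled more often. Untried actions are selected first.
def ucb_action(Q, N, t, c=2.0):
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# Illustrative values: a rarely tried action can win despite a lower estimate.
Q = np.array([0.4, 0.1, 0.55])
N = np.array([12, 3, 20])
print(ucb_action(Q, N, t=35))   # prints 1, the least-sampled action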

Gradient bandits are useful when the goal is to learn a stochastic policy directly. With the average-reward baseline \bar{R}_t, the preference update is

H_{t+1}(a) = H_t(a) + \alpha (R_t - \bar{R}_t)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right).

The baseline reduces variance without changing the expected direction of improvement. This idea foreshadows policy gradient methods.
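
A single preference update, sketched with illustrative numbers (the step size, baseline, and observed reward below are made up):

import numpy as np

# One gradient-bandit step: raise the preference of the chosen action in proportion
# to how much the reward exceeds the baseline, and lower the others, weighted by pi.
H = np.zeros(3)                        # initial preferences
pi = np.exp(H) / np.exp(H).sum()       # softmax policy (uniform here)
alpha, baseline, reward = 0.1, 0.0, 1.5

rng = np.random.default_rng(3)
a = rng.choice(3, p=pi)                # sampled action A_t

one_hot = np.zeros(3)
one_hot[a] = 1.0
H += alpha * (reward - baseline) * (one_hot - pi)   # update all preferences at once
print("new preferences:", np.round(H, 3))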

Regret is another useful way to interpret bandit behavior, even when the chapter's main experiments report average reward and percentage of optimal action selections. If the optimal action has value q_*(a_*), then choosing action A_t incurs expected opportunity cost q_*(a_*) - q_*(A_t). Exploration is worthwhile only if the information gained can reduce later opportunity cost. This framing explains why purely random exploration is easy to implement but statistically crude: it does not ask which uncertainty is most valuable to resolve.
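
The bookkeeping is straightforward when the true values are known, as in a simulation. A toy sketch (uniform random action selection on a made-up bandit) that accumulates expected regret:

import numpy as np

# Cumulative expected regret of purely random action selection on a toy bandit.
# true_q is known here only because this is a simulation.
rng = np.random.default_rng(4)
true_q = rng.normal(0.0, 1.0, size=10)
best = true_q.max()

regret = 0.0
for t in range(1000):
    a = rng.integers(10)           # uniformly random choice, no learning
    regret += best - true_q[a]     # expected opportunity cost of this choice

print("cumulative expected regret:", round(float(regret), 1))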

Bandits also clarify the difference between evaluation and selection. An action-value estimate is a statistical summary of past rewards, while an action-selection rule is a decision procedure using those summaries. The same estimates can be paired with greedy selection, ϵ\epsilon-greedy selection, UCB bonuses, or softmax probabilities. Later RL algorithms preserve this separation: one part estimates values or preferences, and another part converts them into behavior with some amount of exploration.

Contextual bandits add state-like information without adding delayed consequences. The agent observes a situation, chooses an action, and receives an immediate reward, but the action does not change a future state in the MDP sense. This makes contextual bandits a useful halfway point between supervised prediction and full reinforcement learning.

Visual

| Method | Main state kept | Exploration mechanism | Strength | Limitation |
| --- | --- | --- | --- | --- |
| Greedy sample average | Q(a), N(a) | None after initialization | Simple and exploits fast | Can lock onto a noisy early winner |
| ϵ-greedy | Q(a), N(a) | Random action with probability ϵ | Robust baseline | Wastes trials on clearly bad actions |
| Constant-step ϵ-greedy | Q(a) | Same as ϵ-greedy | Tracks nonstationarity | Estimates do not converge exactly in stationary cases |
| Optimistic initial values | Q(a) | High initial estimates | Simple directed early exploration | Weak for long nonstationary tasks |
| UCB | Q(a), N(a) | Uncertainty bonus | Targets under-sampled actions | Needs time count and tuning constant |
| Gradient bandit | Preferences H(a) | Softmax probabilities | Learns stochastic preferences | Sensitive to step size and reward scale |

Worked example 1: Incremental sample-average update

Problem: Action a is selected four times and gives rewards 1, 3, 2, and 6. Start with Q_1(a) = 0 only as a placeholder before the first observed reward. Compute the sample-average estimate after each reward using the incremental rule.

Step 1: After the first reward, n = 1:

\begin{aligned} Q_2 &= Q_1 + \frac{1}{1}(R_1 - Q_1) \\ &= 0 + 1(1 - 0) \\ &= 1. \end{aligned}

Step 2: After the second reward, n = 2:

\begin{aligned} Q_3 &= 1 + \frac{1}{2}(3 - 1) \\ &= 1 + 1 \\ &= 2. \end{aligned}

Step 3: After the third reward, n = 3:

\begin{aligned} Q_4 &= 2 + \frac{1}{3}(2 - 2) \\ &= 2. \end{aligned}

Step 4: After the fourth reward, n = 4:

\begin{aligned} Q_5 &= 2 + \frac{1}{4}(6 - 2) \\ &= 2 + 1 \\ &= 3. \end{aligned}

Check: the direct average is (1 + 3 + 2 + 6)/4 = 12/4 = 3, which matches the final incremental estimate Q_5(a) = 3.
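
A two-line check of the same arithmetic, using the incremental rule directly:

# Verify worked example 1: incremental updates reproduce the plain average.
Q = 0.0
for n, r in enumerate([1, 3, 2, 6], start=1):
    Q += (r - Q) / n
print(Q)   # 3.0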

Worked example 2: UCB action choice

Problem: At time t = 100, three actions have estimates and counts:

Q(a_1) = 1.0, \quad Q(a_2) = 1.3, \quad Q(a_3) = 0.8,
N(a_1) = 50, \quad N(a_2) = 5, \quad N(a_3) = 45.

Use UCB with c = 2. Which action is selected?

Step 1: Compute \ln(100) \approx 4.6052.

Step 2: Compute each exploration bonus:

\begin{aligned} b_1 &= 2\sqrt{4.6052/50} = 2\sqrt{0.0921} \approx 0.6070, \\ b_2 &= 2\sqrt{4.6052/5} = 2\sqrt{0.9210} \approx 1.9194, \\ b_3 &= 2\sqrt{4.6052/45} = 2\sqrt{0.10234} \approx 0.6398. \end{aligned}

Step 3: Add bonuses to estimates:

\begin{aligned} u_1 &= 1.0 + 0.6070 = 1.6070, \\ u_2 &= 1.3 + 1.9194 = 3.2194, \\ u_3 &= 0.8 + 0.6398 = 1.4398. \end{aligned}

Step 4: Select the action with the largest UCB score. The selected action is a_2: even though its estimate is only moderately higher, it has been sampled much less often, so its uncertainty bonus dominates.
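
The same computation as a quick script, for checking the arithmetic:

import numpy as np

# Verify worked example 2: UCB scores with c = 2 at t = 100.
Q = np.array([1.0, 1.3, 0.8])
N = np.array([50, 5, 45])
scores = Q + 2 * np.sqrt(np.log(100) / N)
print(np.round(scores, 4))                  # approx [1.607, 3.2194, 1.4398]
print("selected:", int(np.argmax(scores)))  # 1, i.e. a_2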

Code

import numpy as np

# 10-armed testbed with stationary Gaussian rewards and an epsilon-greedy learner
# that uses incremental sample-average estimates.
rng = np.random.default_rng(7)
k = 10
steps = 1000
epsilon = 0.1

true_q = rng.normal(loc=0.0, scale=1.0, size=k)   # unknown true action values
Q = np.zeros(k)                                   # sample-average estimates
N = np.zeros(k, dtype=int)                        # selection counts
rewards = []

for t in range(steps):
    # Explore with probability epsilon, otherwise act greedily (ties broken at random).
    if rng.random() < epsilon:
        action = rng.integers(k)
    else:
        action = rng.choice(np.flatnonzero(Q == Q.max()))

    reward = rng.normal(true_q[action], 1.0)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # incremental sample-average update
    rewards.append(reward)

print("True best action:", int(np.argmax(true_q)))
print("Estimated best action:", int(np.argmax(Q)))
print("Average reward:", float(np.mean(rewards)))
print("Counts:", N.tolist())

Common pitfalls

  • Evaluating exploration only by the immediate reward of exploratory actions. Exploration is useful because it improves future action selection.
  • Forgetting that sample averages assume stationarity. In nonstationary bandits, constant step sizes are usually a better fit.
  • Setting ϵ = 0 too early. Greedy methods can permanently favor an action that was lucky in early samples.
  • Treating optimistic initial values as a complete exploration strategy. They mainly help at the beginning of learning.
  • Applying UCB without handling actions with N_t(a) = 0. In practice, select each action once or define untried actions as maximally urgent.
  • Comparing gradient bandit preferences H(a) as if they were value estimates. Preferences determine probabilities but are not expected rewards.

Connections