Multi-armed Bandits
The multi-armed bandit problem is the simplest reinforcement learning setting in Sutton and Barto's development. There is no state transition structure and no delayed reward. On each step the agent chooses one of $k$ actions and receives a reward drawn from that action's distribution. The difficulty is not planning through time, but balancing exploitation of the action that currently looks best against exploration of actions that might be better.
Bandits are important because they isolate a core statistical tension that appears everywhere else in reinforcement learning. A value estimate can only improve for actions tried often enough, but choosing apparently suboptimal actions has an opportunity cost. Action-value methods, $\varepsilon$-greedy exploration, optimistic initial values, upper-confidence bounds, and gradient bandits are all different answers to that tension.
Definitions
In a $k$-armed bandit, the action set is $\{1, 2, \dots, k\}$. Each action $a$ has an unknown true action value
$$q_*(a) = \mathbb{E}[R_t \mid A_t = a].$$
The learner keeps an estimate $Q(a)$ of $q_*(a)$. The sample-average estimate after $n$ selections is
$$Q_{n+1} = \frac{R_1 + R_2 + \dots + R_n}{n}.$$
The incremental form is
$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right),$$
where $n$ counts selections of that action. More generally, a constant step size $\alpha \in (0, 1]$ gives
$$Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right),$$
which is useful when rewards are nonstationary.
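The practical difference between the two step-size choices shows up on nonstationary rewards. A minimal sketch (made-up reward sequence whose mean jumps halfway through):

```python
import numpy as np

rng = np.random.default_rng(1)

# Rewards whose mean jumps from 0 to 2 halfway through.
rewards = np.concatenate([rng.normal(0.0, 0.1, 500),
                          rng.normal(2.0, 0.1, 500)])

q_avg, n = 0.0, 0   # sample-average estimate, step size 1/n
q_const = 0.0       # constant step size
alpha = 0.1
for r in rewards:
    n += 1
    q_avg += (r - q_avg) / n
    q_const += alpha * (r - q_const)

print(q_avg)    # stuck near the overall mean, about 1.0
print(q_const)  # tracks the new mean, close to 2.0
```

The sample average weights all 1000 rewards equally, so it lands near the overall mean; the constant-$\alpha$ estimate forgets the early regime exponentially fast.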
The $\varepsilon$-greedy action rule chooses a greedy action with probability $1 - \varepsilon$ and a random action with probability $\varepsilon$. Upper-confidence-bound action selection chooses
$$A_t = \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],$$
where $c > 0$ controls exploration. Gradient bandits do not use estimated reward means directly as the action-selection rule. They learn preferences $H_t(a)$ and use a softmax policy:
$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}.$$
Key results
The incremental update
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$
is a pattern that recurs through the book. In bandits the error is $R_n - Q_n$. In temporal-difference learning it becomes a reward plus a bootstrapped next value minus the current value.
Sample averages converge to true means in stationary problems under repeated sampling. Their weakness is inertia: after many observations, the effective step size $1/n$ becomes tiny. If the reward distribution changes, old data dominate the estimate. Constant-step-size methods weight recent observations more heavily. Expanding the recursion shows that
$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i,$$
so rewards receive exponentially decaying weights.
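The expansion can be verified numerically: running the recursion and evaluating the closed form give the same estimate, and the reward weights plus the initial-value weight sum to one. A small sketch with an arbitrary reward sequence:

```python
import numpy as np

alpha = 0.1
n = 50
rewards = np.arange(1.0, n + 1.0)  # arbitrary reward sequence
q1 = 0.0                           # initial estimate Q_1

# Run the recursion Q_{n+1} = Q_n + alpha * (R_n - Q_n).
q = q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n Q_1 + sum_i alpha (1 - alpha)^(n - i) R_i.
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
closed = (1 - alpha) ** n * q1 + np.sum(weights * rewards)

print(q, closed)                          # identical up to rounding
print(weights.sum() + (1 - alpha) ** n)   # total weight is exactly 1
```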
Optimistic initial values encourage early exploration by making untried actions look attractive. This works naturally with greedy selection in stationary tasks, but it is a temporary exploration device. Once all actions have been tried and estimates settle, optimism fades.
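The early-exploration effect is easy to see in simulation: a purely greedy learner with inflated initial estimates still tries every action, because each observed reward drags that action's estimate below the remaining optimistic ones. A sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
Q = np.full(k, 10.0)            # optimistic: true values are roughly N(0, 1)
N = np.zeros(k, dtype=int)
true_q = rng.normal(size=k)

for t in range(50):
    action = int(np.argmax(Q))  # purely greedy, no epsilon
    reward = rng.normal(true_q[action], 1.0)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

print(N)  # every action was selected at least once early on
```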
UCB exploration explicitly adds uncertainty bonuses. Actions selected rarely have larger bonuses because $N_t(a)$ is small. As an action is sampled more often, the uncertainty term shrinks. This implements a more targeted form of exploration than $\varepsilon$-greedy, which continues to select random actions even when some are clearly poor.
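UCB selection fits in a few lines; the helper below (my own naming) treats untried actions as maximally urgent by giving them an infinite bonus:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c * sqrt(ln t / N(a))], treating
    untried actions (N == 0) as maximally urgent."""
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    with np.errstate(divide="ignore"):
        bonus = c * np.sqrt(np.log(t) / N)  # inf where N == 0
    return int(np.argmax(Q + bonus))

# An under-sampled action wins despite a middling estimate.
print(ucb_action(Q=[1.0, 1.2, 0.9], N=[50, 10, 40], t=100))  # -> 1
```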
Gradient bandits are useful when the goal is to learn a stochastic policy directly. With average-reward baseline $\bar{R}_t$, the preference update is
$$H_{t+1}(A_t) = H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right), \qquad H_{t+1}(a) = H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \quad \text{for } a \neq A_t.$$
The baseline reduces variance without changing the expected direction of improvement. This idea foreshadows policy gradient methods.
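One step of the preference update can be sketched as follows (a minimal illustration with arbitrary numbers, not a full bandit run):

```python
import numpy as np

def softmax(H):
    e = np.exp(H - H.max())  # numerically stabilized softmax
    return e / e.sum()

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    """Apply H += alpha * (R - baseline) * (indicator - pi)."""
    pi = softmax(H)
    grad = -pi
    grad[action] += 1.0      # (1 - pi) for the chosen action
    return H + alpha * (reward - baseline) * grad

H = np.zeros(3)
H = gradient_bandit_step(H, action=0, reward=2.0, baseline=0.5)
print(H, softmax(H))  # preference and probability of action 0 increase
```

Because the reward exceeded the baseline, the chosen action's preference rises and the others fall, shifting probability toward it.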
Regret is another useful way to interpret bandit behavior, even when the chapter's main experiments report average reward and percentage of optimal action selections. If the optimal action $a^*$ has value $q_*(a^*)$, then choosing action $a$ incurs expected opportunity cost $q_*(a^*) - q_*(a)$. Exploration is worthwhile only if the information gained can reduce later opportunity cost. This framing explains why purely random exploration is easy to implement but statistically crude: it does not ask which uncertainty is most valuable to resolve.
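When the true values are known, per-step expected regret is a direct computation; a sketch with made-up values:

```python
import numpy as np

true_q = np.array([0.2, 1.0, 0.5])
gaps = true_q.max() - true_q  # expected opportunity cost per action

# Uniform random exploration pays the average gap on every step.
random_regret = gaps.mean()

# Epsilon-greedy, once the best arm is identified, explores uniformly
# only with probability epsilon, so it pays epsilon times that average.
epsilon = 0.1
eps_greedy_regret = epsilon * random_regret

print(random_regret, eps_greedy_regret)
```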
Bandits also clarify the difference between evaluation and selection. An action-value estimate is a statistical summary of past rewards, while an action-selection rule is a decision procedure using those summaries. The same estimates can be paired with greedy selection, $\varepsilon$-greedy selection, UCB bonuses, or softmax probabilities. Later RL algorithms preserve this separation: one part estimates values or preferences, and another part converts them into behavior with some amount of exploration.
Contextual bandits add state-like information without adding delayed consequences. The agent observes a situation, chooses an action, and receives an immediate reward, but the action does not change a future state in the MDP sense. This makes contextual bandits a useful halfway point between supervised prediction and full reinforcement learning.
Visual
| Method | Main state kept | Exploration mechanism | Strength | Limitation |
|---|---|---|---|---|
| Greedy sample average | $Q(a)$, $N(a)$ | None after initialization | Simple and exploits fast | Can lock onto a noisy early winner |
| $\varepsilon$-greedy | $Q(a)$, $N(a)$ | Random action with probability $\varepsilon$ | Robust baseline | Wastes trials on clearly bad actions |
| Constant-step $\varepsilon$-greedy | $Q(a)$ with constant $\alpha$ | Random action with probability $\varepsilon$ | Tracks nonstationarity | Estimates do not converge exactly in stationary cases |
| Optimistic initial values | $Q(a)$ with high initial estimates | Greedy choice among inflated estimates | Simple directed early exploration | Weak for long nonstationary tasks |
| UCB | $Q(a)$, $N(a)$ | Uncertainty bonus | Targets under-sampled actions | Needs time count and tuning constant $c$ |
| Gradient bandit | Preferences $H(a)$ | Softmax probabilities | Learns stochastic preferences | Sensitive to step size and reward scale |
Worked example 1: Incremental sample-average update
Problem: Action $a$ is selected four times and gives rewards $R_1 = 1$, $R_2 = 2$, $R_3 = 6$, and $R_4 = 3$. Start with $Q_1 = 0$ only as a placeholder before the first observed reward. Compute the sample-average estimate after each reward using the incremental rule.
Step 1: After the first reward, $n = 1$:
$$Q_2 = Q_1 + \tfrac{1}{1}(R_1 - Q_1) = 0 + (1 - 0) = 1.$$
Step 2: After the second reward, $n = 2$:
$$Q_3 = Q_2 + \tfrac{1}{2}(R_2 - Q_2) = 1 + \tfrac{1}{2}(2 - 1) = 1.5.$$
Step 3: After the third reward, $n = 3$:
$$Q_4 = Q_3 + \tfrac{1}{3}(R_3 - Q_3) = 1.5 + \tfrac{1}{3}(6 - 1.5) = 3.0.$$
Step 4: After the fourth reward, $n = 4$:
$$Q_5 = Q_4 + \tfrac{1}{4}(R_4 - Q_4) = 3.0 + \tfrac{1}{4}(3 - 3.0) = 3.0.$$
Check: The direct average is $(1 + 2 + 6 + 3)/4 = 3.0$. The checked final estimate is $Q_5 = 3.0$.
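The incremental rule can be cross-checked against the direct average in a few lines; a sketch with four sample rewards:

```python
import numpy as np

rewards = [1.0, 2.0, 6.0, 3.0]
q, n = 0.0, 0  # placeholder estimate before the first reward
history = []
for r in rewards:
    n += 1
    q += (r - q) / n  # Q_{n+1} = Q_n + (R_n - Q_n) / n
    history.append(q)

print(history)                  # [1.0, 1.5, 3.0, 3.0]
print(float(np.mean(rewards)))  # direct average: 3.0
```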
Worked example 2: UCB action choice
Problem: At time $t = 100$, three actions have estimates and counts:
$$Q(1) = 1.0,\ N(1) = 50; \qquad Q(2) = 1.2,\ N(2) = 10; \qquad Q(3) = 0.9,\ N(3) = 40.$$
Use UCB with $c = 2$. Which action is selected?
Step 1: Compute $\ln t = \ln 100 \approx 4.605$.
Step 2: Compute each exploration bonus:
$$2\sqrt{4.605 / 50} \approx 0.607, \qquad 2\sqrt{4.605 / 10} \approx 1.357, \qquad 2\sqrt{4.605 / 40} \approx 0.679.$$
Step 3: Add bonuses to estimates:
$$1.0 + 0.607 = 1.607, \qquad 1.2 + 1.357 = 2.557, \qquad 0.9 + 0.679 = 1.579.$$
Step 4: Select the largest UCB score. The checked answer is action 2, even though it has only a moderately higher estimate, because it has been sampled much less often.
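A UCB selection of this kind can be replicated numerically; a sketch with illustrative estimates and counts at $t = 100$, $c = 2$:

```python
import numpy as np

Q = np.array([1.0, 1.2, 0.9])  # current estimates (illustrative)
N = np.array([50, 10, 40])     # selection counts
t, c = 100, 2.0

bonus = c * np.sqrt(np.log(t) / N)
scores = Q + bonus
choice = int(np.argmax(scores))

print(np.round(bonus, 3))   # [0.607 1.357 0.679]
print(np.round(scores, 3))  # [1.607 2.557 1.579]
print(choice)               # index 1: the under-sampled action wins
```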
Code
```python
import numpy as np

rng = np.random.default_rng(7)
k = 10
steps = 1000
epsilon = 0.1

true_q = rng.normal(loc=0.0, scale=1.0, size=k)
Q = np.zeros(k)
N = np.zeros(k, dtype=int)
rewards = []

for t in range(steps):
    # Explore with probability epsilon, otherwise act greedily
    # (ties broken at random).
    if rng.random() < epsilon:
        action = rng.integers(k)
    else:
        action = rng.choice(np.flatnonzero(Q == Q.max()))
    reward = rng.normal(true_q[action], 1.0)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # incremental sample average
    rewards.append(reward)

print("True best action:", int(np.argmax(true_q)))
print("Estimated best action:", int(np.argmax(Q)))
print("Average reward:", float(np.mean(rewards)))
print("Counts:", N.tolist())
```
Common pitfalls
- Evaluating exploration only by the immediate reward of exploratory actions. Exploration is useful because it improves future action selection.
- Forgetting that sample averages assume stationarity. In nonstationary bandits, constant step sizes are usually a better fit.
- Setting $\varepsilon = 0$ too early. Greedy methods can permanently favor an action that was lucky in early samples.
- Treating optimistic initial values as a complete exploration strategy. They mainly help at the beginning of learning.
- Applying UCB without handling actions with $N_t(a) = 0$. In practice, select each action once or define untried actions as maximally urgent.
- Comparing gradient bandit preferences as if they were value estimates. Preferences determine probabilities but are not expected rewards.