n-step Bootstrapping
n-step methods sit between one-step temporal-difference learning and full Monte Carlo learning. Instead of updating from one reward plus one bootstrapped estimate, or waiting for a complete episode return, an n-step method looks ahead $n$ steps for rewards and then bootstraps. This gives a continuum of targets: $n = 1$ is ordinary TD, while reaching the episode end gives a Monte Carlo target.
Sutton and Barto use n-step bootstrapping to show that algorithms are not isolated tricks. Prediction, SARSA, off-policy learning, tree-backup methods, and later eligibility traces can all be understood as different ways to combine sampled rewards, bootstrapped estimates, and policy probabilities across multiple time steps.
Definitions
The n-step return for state-value prediction is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$

when $t + n$ occurs before episode termination. If the episode ends at time $T \le t + n$, the return truncates at the terminal reward and has no bootstrap term:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_{T}.$$
The n-step TD update is

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right].$$
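As a minimal sketch of these two formulas in code (the helper name `n_step_return`, the two-entry value table, and the example numbers are illustrative assumptions, not from the chapter):

```python
import numpy as np

def n_step_return(rewards, V, next_state, gamma, n, terminated):
    """n-step return from a window of rewards R_{t+1..t+n}; bootstraps from
    V[next_state] only if the episode has not terminated inside the window."""
    G = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    if not terminated:
        G += gamma ** n * V[next_state]
    return G

# Tiny check: three rewards 1, 0, 2, bootstrap value 10, gamma = 0.9.
V = np.array([0.0, 10.0])
G = n_step_return([1.0, 0.0, 2.0], V, next_state=1, gamma=0.9, n=3, terminated=False)
print(G)                      # 1 + 0 + 1.62 + 7.29 = 9.91
V[0] += 0.1 * (G - V[0])      # n-step TD update with alpha = 0.1
```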
For control, n-step SARSA uses action values:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n}),$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q(S_t, A_t) \right].$$
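A similarly small sketch for the action-value case, assuming a stored window of experience (the table size, state and action indices, and reward values are illustrative):

```python
import numpy as np

Q = np.zeros((6, 2))                      # illustrative: 6 states, 2 actions
alpha, gamma, n = 0.1, 0.9, 2

# A stored window of experience: (S_t, A_t), the next n rewards, and (S_{t+n}, A_{t+n}).
s0, a0 = 2, 1
rewards = [1.0, 0.5]
s_n, a_n = 4, 0

G = sum(gamma ** k * r for k, r in enumerate(rewards))
G += gamma ** n * Q[s_n, a_n]             # bootstrap from Q(S_{t+n}, A_{t+n}), not V
Q[s0, a0] += alpha * (G - Q[s0, a0])
print(Q[s0, a0])
```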
Off-policy n-step learning introduces importance-sampling ratios over the part of the trajectory whose action probabilities differ between target policy $\pi$ and behavior policy $b$:

$$\rho_{t:h} = \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}, \qquad V(S_t) \leftarrow V(S_t) + \alpha\, \rho_{t:t+n-1} \left[ G_{t:t+n} - V(S_t) \right].$$
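A sketch of the ratio product, assuming tabular policies stored as state-by-action probability arrays (the names `pi` and `b` and the example numbers are assumptions, not from the text):

```python
import numpy as np

def rho(pi, b, states, actions, t, h, T):
    """Importance-sampling ratio over actions A_t .. A_min(h, T-1)."""
    r = 1.0
    for k in range(t, min(h, T - 1) + 1):
        r *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return r

# Example: two states, two actions; target policy is greedier than behavior.
pi = np.array([[0.9, 0.1], [0.2, 0.8]])
b  = np.array([[0.5, 0.5], [0.5, 0.5]])
print(rho(pi, b, states=[0, 1, 0], actions=[0, 1, 1], t=0, h=2, T=3))
```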
Tree-backup methods avoid ordinary importance sampling by backing up expectations over actions not selected, combining sampled state transitions with expected action branches under the target policy.
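One way to sketch the recursive tree-backup target in code, assuming tabular $Q$ and a target policy stored as a state-by-action probability table (the argument layout, names, and example arrays are illustrative):

```python
import numpy as np

def tree_backup_return(rewards, states, actions, Q, pi, gamma, t, n, T):
    """Recursive n-step tree-backup target G_{t:t+n}.
    rewards[k] is R_k; states[k], actions[k] are S_k, A_k; pi[s, a] is the
    target-policy probability of action a in state s."""
    if t + 1 >= T:                       # last transition: no bootstrap
        return rewards[t + 1]
    s_next, a_next = states[t + 1], actions[t + 1]
    expected = sum(pi[s_next, a] * Q[s_next, a]
                   for a in range(Q.shape[1]) if a != a_next)
    if n == 1:
        expected += pi[s_next, a_next] * Q[s_next, a_next]   # leaf: expectation over all actions
        return rewards[t + 1] + gamma * expected
    cont = tree_backup_return(rewards, states, actions, Q, pi, gamma, t + 1, n - 1, T)
    return rewards[t + 1] + gamma * (expected + pi[s_next, a_next] * cont)

# Tiny usage with 2 states and 2 actions (all numbers illustrative).
Q = np.array([[1.0, 2.0], [0.5, 1.5]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
rewards = [0.0, 1.0, 0.0]          # dummy R_0, then R_1, R_2
states, actions = [0, 1, 0], [1, 0, 1]
print(tree_backup_return(rewards, states, actions, Q, pi, gamma=0.9, t=0, n=2, T=2))
```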
Key results
The n-step return creates an explicit bias-variance knob. Small $n$ uses more bootstrapping, so it can have lower variance but more bias from inaccurate value estimates. Large $n$ uses more sampled reward information, so it can reduce bootstrap bias but increase variance and delay updates.
The update timing matters. At time $t + n$, the algorithm can update the value for $S_t$ because the required rewards $R_{t+1}$ through $R_{t+n}$ and the state $S_{t+n}$ are known. In episodic tasks, a common implementation keeps arrays of recent states, actions, and rewards, then updates the state whose n-step target has just become available.
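A small sketch of the index arithmetic, assuming buffers of length $n + 1$ indexed modulo $n + 1$ (the loop only prints which slots would be read; no environment is involved):

```python
# With buffers of length n + 1, the entries needed to update time tau
# are still present in the buffer when t = tau + n.
n = 3
buffer_len = n + 1

def slot(t):
    return t % buffer_len

for t in range(8):
    tau = t - n                       # state whose n-step target is now complete
    if tau >= 0:
        needed = [slot(i) for i in range(tau + 1, tau + n + 1)]
        print(f"t={t}: update tau={tau} using reward slots {needed}, "
              f"state slot {slot(tau + n)}")
```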
n-step SARSA extends one-step SARSA directly. It remains on-policy when the sampled actions come from the same policy being improved. If the policy is $\varepsilon$-greedy with respect to $Q$, then the n-step return includes the consequences of that exploratory policy.
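For concreteness, a minimal $\varepsilon$-greedy selector of the kind whose exploratory choices end up inside the stored n-step return (function name and table shape are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """Pick an action for state s: explore with probability eps, else exploit Q."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Because these exploratory actions are the ones actually stored in the window,
# the n-step SARSA target reflects the epsilon-greedy policy, keeping it on-policy.
Q = np.zeros((5, 2))
print(epsilon_greedy(Q, s=0))
```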
Off-policy n-step learning is more delicate than one-step off-policy TD because probability ratios multiply across several actions. Long ratios can have high variance. This is one reason Sutton and Barto introduce per-decision importance sampling, control variates, and tree-backup algorithms.
The tree-backup idea replaces a sampled next action with an expected backup over all possible actions under the target policy, recursively. It is especially important because it points toward algorithms that are off-policy without relying on large ordinary importance-sampling products.
n-step methods also prepare the ground for eligibility traces. A forward-view $\lambda$-return can be seen as a geometrically weighted average of n-step returns. Eligibility traces then provide a backward-view mechanism for implementing a similar credit assignment online.
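In symbols, the forward-view return averages the n-step returns with geometrically decaying weights:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n},$$

with the episodic form placing the remaining weight $\lambda^{T-t-1}$ on the full return $G_t$.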
The backup diagram for n-step methods is best understood as a sliding window over experience. At each new time step, one more reward becomes available, and the oldest state in the window can be updated. Near the end of an episode, the window naturally shrinks because termination cuts off the bootstrap. Correctly handling these boundary cases is one of the main implementation challenges, and many errors in n-step code are off-by-one mistakes rather than conceptual misunderstandings.
The choice of $n$ is environment-dependent. In a task with dense, low-noise rewards, one-step TD may already propagate enough information. In sparse-reward tasks, longer returns can carry the rare reward back faster. In highly stochastic tasks, however, long returns may add so much variance that learning slows. This is why Sutton and Barto present n-step methods as a family rather than as a search for one universally best horizon.
For control, n-step targets also interact with policy improvement. If the policy changes rapidly, the later actions inside a stored n-step return may have been generated by an older policy. On-policy algorithms usually tolerate this as part of continual learning, but it explains why step sizes and policy improvement rates matter.
The chapter's unifying message is that backup length is a design dimension. Once that is visible, one-step TD, Monte Carlo, tree backup, and later trace methods become related points in a larger space of algorithms.
Visual
| Target | Formula shape | Update delay | Variance | Bootstrap bias |
|---|---|---|---|---|
| One-step TD | $R_{t+1} + \gamma V(S_{t+1})$ | 1 step | Low | Higher |
| n-step TD | $n$ rewards plus $\gamma^{n} V(S_{t+n})$ | $n$ steps | Medium | Medium |
| Monte Carlo | Rewards to episode end | To termination | Higher | None from bootstrapping |
| Tree backup | Sampled states, expected actions | $n$ steps | Often lower than IS | Depends on estimates |
Worked example 1: Three-step return for prediction
Problem: Starting at time $t$, the next rewards are $R_{t+1} = 1$, $R_{t+2} = 0$, and $R_{t+3} = 2$. The state at $t + 3$ has current estimate $V(S_{t+3}) = 10$. Let $\gamma = 0.9$. Compute the three-step return.
Step 1: Write the formula: $G_{t:t+3} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3})$.
Step 2: Substitute numbers: $G_{t:t+3} = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 + 0.729 \cdot 10$.
Step 3: Compute each term: $1$, $0$, $1.62$, and $7.29$.
Step 4: Add: $G_{t:t+3} = 9.91$.
Check: The target includes exactly three sampled rewards and then one bootstrap estimate. The checked answer is $9.91$.
Worked example 2: n-step SARSA update
Problem: A two-step SARSA target has rewards $R_{t+1} = 2$ and $R_{t+2} = 1$, with $\gamma = 0.5$ and bootstrap value $Q(S_{t+2}, A_{t+2}) = 4$. The old value is $Q(S_t, A_t) = 3$ and $\alpha = 0.1$. Compute the update.
Step 1: Write the two-step action-value return: $G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} Q(S_{t+2}, A_{t+2})$.
Step 2: Substitute: $G_{t:t+2} = 2 + 0.5 \cdot 1 + 0.25 \cdot 4$.
Step 3: Compute terms: $2$, $0.5$, and $1$.
Step 4: Add: $G_{t:t+2} = 3.5$.
Step 5: Update: $Q(S_t, A_t) \leftarrow 3 + 0.1\,(3.5 - 3) = 3.05$.
Check: The target is larger than the old value, so the estimate increases. The checked answer is $3.05$.
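A quick arithmetic check of both worked examples:

```python
# Worked example 1: three-step return with gamma = 0.9.
gamma = 0.9
G1 = 1 + gamma * 0 + gamma**2 * 2 + gamma**3 * 10
print(round(G1, 2))                        # 9.91

# Worked example 2: two-step SARSA target and update with gamma = 0.5, alpha = 0.1.
gamma, alpha = 0.5, 0.1
G2 = 2 + gamma * 1 + gamma**2 * 4
q_new = 3 + alpha * (G2 - 3)
print(G2, round(q_new, 2))                 # 3.5 3.05
```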
Code
```python
import numpy as np

rng = np.random.default_rng(2)

n_states = 7                      # terminal states 0 and 6; nonterminal 1..5
V = np.zeros(n_states)
alpha, gamma, n = 0.1, 1.0, 3

def generate_episode():
    """Random walk: start in the middle, step left or right until a terminal state."""
    s = 3
    states = [s]
    rewards = [0.0]               # dummy entry so rewards[t+1] is valid
    while 0 < s < 6:
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == 6 else 0.0)   # +1 only on reaching the right terminal
        states.append(s)
    return states, rewards

for _ in range(1000):
    states, rewards = generate_episode()
    T = len(states) - 1           # time of termination
    for tau in range(T):          # tau is the time of the state being updated
        horizon = min(tau + n, T)
        G = 0.0
        for i in range(tau + 1, horizon + 1):
            G += (gamma ** (i - tau - 1)) * rewards[i]
        if tau + n < T:           # bootstrap only if the horizon is before termination
            G += (gamma ** n) * V[states[tau + n]]
        s_tau = states[tau]
        V[s_tau] += alpha * (G - V[s_tau])

print(np.round(V[1:6], 3))        # true values are 1/6, 2/6, ..., 5/6
```
Common pitfalls
- Including the wrong number of rewards. An n-step return uses rewards $R_{t+1}$ through $R_{t+n}$, then bootstraps from the value at time $t + n$.
- Bootstrapping after termination. If the episode ends before $t + n$, the terminal continuation value is zero.
- Forgetting update delay. The value at time $t$ cannot be updated with an n-step target until the needed future rewards have been observed.
- Multiplying importance-sampling ratios over the wrong range. The range depends on whether the update is for state values or action values.
- Assuming larger $n$ is always better. Larger $n$ can reduce bootstrap bias but may greatly increase variance.
- Treating tree-backup as just another sampled SARSA return. It uses expectations over target-policy actions for branches not sampled.