
n-step Bootstrapping

n-step methods sit between one-step temporal-difference learning and full Monte Carlo learning. Instead of updating from one reward plus one bootstrapped estimate, or waiting for a complete episode return, an n-step method looks ahead for n rewards and then bootstraps. This gives a continuum of targets: n = 1 is ordinary TD, while n reaching the episode end gives a Monte Carlo target.

Sutton and Barto use n-step bootstrapping to show that algorithms are not isolated tricks. Prediction, SARSA, off-policy learning, tree-backup methods, and later eligibility traces can all be understood as different ways to combine sampled rewards, bootstrapped estimates, and policy probabilities across multiple time steps.

Definitions

The n-step return for state-value prediction is

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}),

when t + n occurs before episode termination. If the episode ends at time T < t + n, the return truncates at the terminal reward and has no bootstrap term:

G_{t:T} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T.

The n-step TD update is

V(S_t) \leftarrow V(S_t) + \alpha\left(G_{t:t+n} - V(S_t)\right).
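These definitions translate directly into code. A minimal sketch (the function names and calling convention here are illustrative, not from the book): pass the sampled rewards R_{t+1}..R_{t+n} as a list, and pass a bootstrap value only when the window did not cross termination.

```python
def n_step_return(rewards, gamma, v_bootstrap=None):
    """G_{t:t+n} from rewards R_{t+1}..R_{t+n} plus an optional bootstrap.

    v_bootstrap=None means the episode terminated inside the window,
    so the return truncates with no gamma**n * V(S_{t+n}) term.
    """
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if v_bootstrap is not None:
        G += gamma ** len(rewards) * v_bootstrap
    return G

def n_step_td_update(v_old, target, alpha):
    """V(S_t) <- V(S_t) + alpha * (G - V(S_t))."""
    return v_old + alpha * (target - v_old)
```

For example, `n_step_return([1.0, 2.0, 3.0], 0.5, v_bootstrap=10.0)` returns 4.0, matching Worked example 1 below.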

For control, n-step SARSA uses action values:

G_{t:t+n} = R_{t+1} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q(S_{t+n},A_{t+n}).

Off-policy n-step learning introduces importance-sampling ratios over the part of the trajectory whose action probabilities differ between the target policy π and the behavior policy b:

\rho_{t:h} = \prod_{k=t}^{h}\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.
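A quick sketch shows why these products are delicate (the policies here are hypothetical): a deterministic target policy evaluated under a uniform two-action behavior policy doubles the ratio at every step.

```python
import numpy as np

def is_ratio(pi_probs, b_probs):
    """rho_{t:h}: product of pi(A_k | S_k) / b(A_k | S_k) over a trajectory slice."""
    return float(np.prod(np.asarray(pi_probs, dtype=float) /
                         np.asarray(b_probs, dtype=float)))

# Hypothetical example: greedy target policy, uniform-random behavior policy.
print(is_ratio([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]))  # 8.0 after only three steps
```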

Tree-backup methods avoid ordinary importance sampling by backing up expectations over actions not selected, combining sampled state transitions with expected action branches under the target policy.
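One level of that recursion can be sketched as follows. This is an illustrative helper, not the book's pseudocode: at the next state it takes the expectation of Q over all actions under the target policy, and if a deeper n-step return is available for the one sampled action, that branch uses it in place of its current estimate.

```python
import numpy as np

def tree_backup_target(r, gamma, q_next, pi_next, a_next=None, g_deeper=None):
    """One level of a tree-backup target.

    r       : sampled reward R_{t+1}
    q_next  : Q(S_{t+1}, .) for all actions
    pi_next : target-policy probabilities pi(. | S_{t+1})
    a_next  : the sampled action, if a deeper return g_deeper exists for it
    """
    q_next = np.asarray(q_next, dtype=float)
    pi_next = np.asarray(pi_next, dtype=float)
    expected = float(pi_next @ q_next)  # expected backup over all actions
    if a_next is not None and g_deeper is not None:
        # Replace the sampled action's leaf estimate with the deeper return.
        expected += pi_next[a_next] * (g_deeper - q_next[a_next])
    return r + gamma * expected
```

With no deeper return, this one-step case coincides with the Expected SARSA target.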

Key results

The n-step return creates an explicit bias-variance knob. Small n uses more bootstrapping, so it can have lower variance but more bias from inaccurate value estimates. Large n uses more sampled reward information, so it can reduce bootstrap bias but increase variance and delay updates.

The update timing matters. At time t + n, the algorithm can update the value for S_t because the required rewards through R_{t+n} and the state S_{t+n} are known. In episodic tasks, a common implementation keeps arrays of recent states, actions, and rewards, then updates the state whose n-step target has just become available.
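The delay is easiest to see by listing which state becomes updatable at each step. Following the chapter's indexing convention, at step t the state at time τ = t − n + 1 receives its update; the helper below is only a sketch of that schedule.

```python
def update_schedule(n, T):
    """(t, tau) pairs: at step t, S_tau's n-step target first becomes available."""
    schedule = []
    for t in range(T + n - 1):  # keep updating for n - 1 steps after termination at T
        tau = t - n + 1
        if 0 <= tau < T:
            schedule.append((t, tau))
    return schedule

print(update_schedule(3, 5))  # every update lags its state by n - 1 = 2 steps
```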

n-step SARSA extends one-step SARSA directly. It remains on-policy when the sampled actions come from the same policy being improved. If the policy is ε-greedy with respect to Q, then the n-step return includes the consequences of that exploratory policy.

Off-policy n-step learning is more delicate than one-step off-policy TD because probability ratios multiply across several actions. Long ratios can have high variance. This is one reason Sutton and Barto introduce per-decision importance sampling, control variates, and tree-backup algorithms.

The tree-backup idea replaces a sampled next action with an expected backup over all possible actions under the target policy, recursively. It is especially important because it points toward algorithms that are off-policy without relying on large ordinary importance-sampling products.

n-step methods also prepare the ground for eligibility traces. A forward-view λ-return can be seen as a geometrically weighted average of n-step returns. Eligibility traces then provide a backward-view mechanism for implementing a similar credit assignment online.
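The geometric weighting can be checked numerically: each n-step return G_{t:t+n} receives weight (1 − λ)λ^(n−1), and the weights sum to 1 in the limit (in episodic tasks the remaining mass goes to the full Monte Carlo return).

```python
lam = 0.8  # illustrative value of lambda
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 50)]
print(round(sum(weights), 4))  # the geometric series approaches 1
```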

The backup diagram for n-step methods is best understood as a sliding window over experience. At each new time step, one more reward becomes available, and the oldest state in the window can be updated. Near the end of an episode, the window naturally shrinks because termination cuts off the bootstrap. Correctly handling these boundary cases is one of the main implementation challenges, and many errors in n-step code are off-by-one mistakes rather than conceptual misunderstandings.

The choice of n is environment-dependent. In a task with dense, low-noise rewards, one-step TD may already propagate enough information. In sparse-reward tasks, longer returns can carry the rare reward back faster. In highly stochastic tasks, however, long returns may add so much variance that learning slows. This is why Sutton and Barto present n-step methods as a family rather than as a search for one universally best horizon.

For control, n-step targets also interact with policy improvement. If the policy changes rapidly, the later actions inside a stored n-step return may have been generated by an older policy. On-policy algorithms usually tolerate this as part of continual learning, but it explains why step sizes and policy improvement rates matter.

The chapter's unifying message is that backup length is a design dimension. Once that is visible, one-step TD, Monte Carlo, tree backup, and later trace methods become related points in a larger space of algorithms.

Visual

| Target | Formula shape | Update delay | Variance | Bootstrap bias |
| --- | --- | --- | --- | --- |
| One-step TD | R_{t+1} + γV(S_{t+1}) | 1 step | Low | Higher |
| n-step TD | n rewards plus γ^n V(S_{t+n}) | n steps | Medium | Medium |
| Monte Carlo | Rewards to episode end | To termination | Higher | None from bootstrapping |
| Tree backup | Sampled states, expected actions | n steps | Often lower than IS | Depends on estimates |

Worked example 1: Three-step return for prediction

Problem: Starting at time t, the next rewards are R_{t+1} = 1, R_{t+2} = 2, and R_{t+3} = 3. The state at t + 3 has current estimate V(S_{t+3}) = 10. Let γ = 0.5. Compute the three-step return.

Step 1: Write the formula:

G_{t:t+3} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3}).

Step 2: Substitute numbers:

G_{t:t+3} = 1 + 0.5(2) + 0.5^2(3) + 0.5^3(10).

Step 3: Compute each term:

0.5(2)=1,\qquad 0.5^2(3)=0.25(3)=0.75,\qquad 0.5^3(10)=0.125(10)=1.25.

Step 4: Add:

G_{t:t+3}=1+1+0.75+1.25=4.

Check: The target includes exactly three sampled rewards and then one bootstrap estimate. The checked answer is 4.
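The arithmetic can be replayed in a couple of lines:

```python
gamma = 0.5
rewards = [1.0, 2.0, 3.0]  # R_{t+1}, R_{t+2}, R_{t+3}
G = sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** 3 * 10.0
print(G)  # 4.0
```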

Worked example 2: n-step SARSA update

Problem: A two-step SARSA target has rewards R_{t+1} = -1, R_{t+2} = 4, γ = 0.9, and Q(S_{t+2}, A_{t+2}) = 5. The old value is Q(S_t, A_t) = 2 and α = 0.2. Compute the update.

Step 1: Write the two-step action-value return:

G_{t:t+2}=R_{t+1}+\gamma R_{t+2}+\gamma^2 Q(S_{t+2},A_{t+2}).

Step 2: Substitute:

G_{t:t+2}=-1+0.9(4)+0.9^2(5).

Step 3: Compute terms:

0.9(4)=3.6,\qquad 0.9^2(5)=0.81(5)=4.05.

Step 4: Add:

G_{t:t+2}=-1+3.6+4.05=6.65.

Step 5: Update:

\begin{aligned} Q_{\text{new}}(S_t,A_t) &= 2 + 0.2(6.65-2) \\ &= 2 + 0.2(4.65) \\ &= 2.93. \end{aligned}

Check: The target is larger than the old value, so the estimate increases. The checked answer is 2.93.
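The same check in code:

```python
gamma, alpha = 0.9, 0.2
G = -1 + gamma * 4 + gamma ** 2 * 5  # two-step SARSA target
q_new = 2 + alpha * (G - 2)          # Q(S_t, A_t) update
print(round(G, 2), round(q_new, 2))  # 6.65 2.93
```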

Code

import numpy as np

rng = np.random.default_rng(2)
n_states = 7  # terminal states 0 and 6; nonterminal 1..5
V = np.zeros(n_states)
alpha, gamma, n = 0.1, 1.0, 3

def generate_episode():
    s = 3
    states = [s]
    rewards = [0.0]  # rewards indexed so rewards[t+1] is valid
    while 0 < s < 6:
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == 6 else 0.0)
        states.append(s)
    return states, rewards

for _ in range(1000):
    states, rewards = generate_episode()
    T = len(states) - 1
    for tau in range(T):
        horizon = min(tau + n, T)  # truncate the window at termination
        G = 0.0
        for i in range(tau + 1, horizon + 1):
            G += (gamma ** (i - tau - 1)) * rewards[i]
        if tau + n < T:  # bootstrap only if S_{tau+n} is nonterminal
            G += (gamma ** n) * V[states[tau + n]]
        s_tau = states[tau]
        V[s_tau] += alpha * (G - V[s_tau])

print(np.round(V[1:6], 3))

Common pitfalls

  • Including the wrong number of rewards. An n-step return uses rewards R_{t+1} through R_{t+n}, then bootstraps from time t + n.
  • Bootstrapping after termination. If the episode ends before t + n, the terminal continuation value is zero.
  • Forgetting update delay. The value at time t cannot be updated with an n-step target until the needed future rewards have been observed.
  • Multiplying importance-sampling ratios over the wrong range. The range depends on whether the update is for state values or action values.
  • Assuming larger n is always better. Larger n can reduce bootstrap bias but may greatly increase variance.
  • Treating tree-backup as just another sampled SARSA return. It uses expectations over target-policy actions for branches not sampled.

Connections