Temporal-Difference Learning
Temporal-difference learning combines two ideas that earlier appeared separately: sampling from Monte Carlo methods and bootstrapping from dynamic programming. A TD method learns from actual experience, but it does not wait for the final return. It updates a value estimate toward a target that includes the observed reward plus the current estimate of the next state's value.
This makes TD learning central in Sutton and Barto's account. TD(0), SARSA, Q-learning, Expected SARSA, and Double Q-learning are compact algorithms that explain much of tabular RL practice. They also introduce themes that continue through approximation: bootstrapping bias, online learning, off-policy control, and maximization bias.
Definitions
TD(0) prediction for a fixed policy $\pi$ updates
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right].$$
The quantity
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
is the TD error. It measures the discrepancy between the current value and a one-step bootstrapped target.
SARSA is on-policy TD control. Its name comes from the transition tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$. The update is
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].$$
Q-learning is off-policy TD control:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$
Expected SARSA replaces the sampled next action value with the expectation under the current policy:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$
Double Q-learning maintains two value estimates and uses one to select the maximizing action while the other evaluates it, reducing maximization bias.
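In code, the decoupling is a coin flip between two tables. The following is a minimal sketch of one Double Q-learning update; the table names `Q1` and `Q2` and the transition values are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def double_q_update(s, a, r, ns, done):
    """One Double Q-learning update: a coin flip picks the table to
    update; that table selects the argmax action, and the other
    table evaluates it."""
    if rng.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    best = int(np.argmax(select[ns]))  # selection by one table
    target = r if done else r + gamma * evaluate[ns, best]  # evaluation by the other
    select[s, a] += alpha * (target - select[s, a])

double_q_update(0, 1, -0.01, 1, False)
```

Because the evaluating table's noise is independent of the selecting table's argmax, the target is no longer systematically optimistic.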
Key results
TD methods can learn before an episode ends. This is a major operational advantage over Monte Carlo learning for long or continuing tasks. A single transition gives a valid TD update because the target uses $R_{t+1}$ and the current estimate $V(S_{t+1})$ for the remaining return.
TD(0) has a bias-variance tradeoff relative to Monte Carlo prediction. The TD target is biased when $V(S_{t+1})$ is inaccurate, but it often has lower variance because it does not include all future random rewards. Monte Carlo targets are unbiased samples of the return but can have high variance.
For tabular prediction under a fixed policy, TD(0) converges to $v_\pi$ under standard stochastic approximation conditions, such as sufficient state visitation and decreasing step sizes satisfying
$$\sum_t \alpha_t = \infty \quad \text{and} \quad \sum_t \alpha_t^2 < \infty.$$
SARSA learns the value of the policy it actually follows. If the behavior policy is $\varepsilon$-greedy, SARSA's learned action values reflect the costs of exploratory actions. This matters in tasks where exploration can be dangerous or costly.
Q-learning learns the greedy target policy while following an exploratory behavior policy, provided sufficient exploration continues. This is off-policy control: the update target uses $\max_a Q(S_{t+1}, a)$ rather than the action actually taken next.
Expected SARSA can reduce variance relative to SARSA because it uses an expectation over next actions instead of a sample. It can also be off-policy if the expectation is taken under a target policy different from the behavior policy.
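For an $\varepsilon$-greedy behavior policy the expectation can be formed explicitly. A minimal sketch, assuming a tabular `Q` with illustrative values:

```python
import numpy as np

def expected_sarsa_target(Q, ns, r, gamma=0.9, epsilon=0.1):
    """Expected SARSA target: average the next state's action values
    under an epsilon-greedy policy instead of sampling one action."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[ns])] += 1.0 - epsilon  # greedy action gets the rest
    return r + gamma * probs @ Q[ns]

Q = np.array([[0.0, 0.0],
              [0.5, 2.0]])
print(round(expected_sarsa_target(Q, ns=1, r=1.0), 4))  # -> 2.7325
```

Sampling $A_{t+1}$ instead would give a target of either $1 + 0.9 \cdot 0.5$ or $1 + 0.9 \cdot 2.0$; the expected target removes that action-selection noise.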
Maximization bias occurs because the maximum of noisy estimates tends to overestimate the maximum true value. Double learning reduces this by decoupling selection from evaluation.
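The bias is easy to see numerically: draw unbiased noisy estimates of several actions whose true values are all zero, and compare the average maximum against a double estimator. The setup below is an assumed illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 100_000, 4
# Unbiased estimates of true action values that are all exactly 0.
noise = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
single_max = noise.max(axis=1).mean()  # max of noisy estimates: biased high

# Double estimator: one independent sample set selects the argmax,
# a second, independent set evaluates the selected action.
noise2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
picks = noise.argmax(axis=1)
double_est = noise2[np.arange(n_trials), picks].mean()

print(f"max of estimates: {single_max:.3f}")  # well above the true max of 0
print(f"double estimate:  {double_est:.3f}")  # near 0
```

Each individual estimate is unbiased, yet the max of four of them averages roughly one standard deviation above the true maximum; decoupling selection from evaluation removes the systematic offset.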
TD learning is often described as learning from a guess, but the phrase should be read carefully. The update target is partly a guess because it includes $V(S_{t+1})$ or $Q(S_{t+1}, A_{t+1})$, yet it is anchored by a real observed reward and transition. This makes TD updates local in time and cheap to compute. The cost is that early errors can bootstrap into other estimates, which is why step size, exploration, and sufficient revisitation remain important.
The cliff-walking example in Sutton and Barto is a useful lens on SARSA versus Q-learning. SARSA's values include the possibility that the current exploratory policy may choose a bad action near the cliff, so it tends to learn a safer path while exploration continues. Q-learning learns the value of the greedy path along the cliff edge, even while its exploratory behavior sometimes falls off. Neither update is universally "better"; they answer different questions about the relationship between behavior and target policy.
Expected SARSA is a reminder that sampling is optional in parts of the target when probabilities are known. If the next policy distribution is available, averaging over actions can remove action-selection noise. This is a small version of the expected-backup versus sample-backup tradeoff that also appears in dynamic programming and planning.
Visual
| Method | Target uses | Policy relation | Learns online? | Typical use |
|---|---|---|---|---|
| TD(0) | $R_{t+1} + \gamma V(S_{t+1})$ | Prediction for fixed $\pi$ | Yes | State-value prediction |
| SARSA | $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ | On-policy | Yes | Control with exploration costs included |
| Q-learning | $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ | Off-policy | Yes | Greedy optimal control |
| Expected SARSA | $R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a)$ | On- or off-policy | Yes | Lower-variance TD control |
| Double Q-learning | Cross-evaluated max, $R_{t+1} + \gamma Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a))$ | Off-policy | Yes | Reducing maximization bias |
Worked example 1: TD(0) prediction update
Problem: In state $s$, the current estimate is $V(s) = 0.5$. The agent receives reward $R_{t+1} = 1.0$, moves to $s'$, and $V(s') = 0.8$. Let $\gamma = 0.9$ and $\alpha = 0.1$. Compute the TD error and updated value.
Step 1: Compute the TD target: $1.0 + 0.9 \cdot 0.8 = 1.72$.
Step 2: Compute the TD error: $\delta = 1.72 - 0.5 = 1.22$.
Step 3: Apply the update: $V(s) \leftarrow 0.5 + 0.1 \cdot 1.22 = 0.622$.
Check: The target $1.72$ is above the old estimate $0.5$, so the value should increase. It does, from $0.5$ to $0.622$.
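A TD(0) update is small enough to verify in a few lines of Python; the specific values below are illustrative:

```python
def td0_update(v_s, r, v_ns, alpha, gamma):
    """One TD(0) update: move V(s) toward the target r + gamma * V(s')."""
    target = r + gamma * v_ns
    delta = target - v_s  # TD error
    return v_s + alpha * delta

# V(s)=0.5, r=1.0, V(s')=0.8, gamma=0.9, alpha=0.1
print(round(td0_update(0.5, 1.0, 0.8, alpha=0.1, gamma=0.9), 3))  # -> 0.622
```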
Worked example 2: Comparing SARSA and Q-learning targets
Problem: A transition has $R_{t+1} = 0$, $\gamma = 0.9$, and old $Q(s, a) = 1.0$. In next state $s'$, action values are $Q(s', \text{safe}) = 0.5$ and $Q(s', \text{risky}) = 2.0$.
The behavior policy actually selects safe as $A_{t+1}$. With $\alpha = 0.5$, compute the SARSA and Q-learning updates for $Q(s, a)$.
Step 1: SARSA target uses the actually selected next action: $0 + 0.9 \cdot 0.5 = 0.45$.
Step 2: SARSA update: $Q(s, a) \leftarrow 1.0 + 0.5 \, (0.45 - 1.0) = 0.725$.
Step 3: Q-learning target uses the greedy next action: $0 + 0.9 \cdot \max(0.5, 2.0) = 1.8$.
Step 4: Q-learning update: $Q(s, a) \leftarrow 1.0 + 0.5 \, (1.8 - 1.0) = 1.4$.
Check: The updates move in opposite directions because SARSA evaluates the exploratory behavior that chose safe ($0.725 < 1.0$), while Q-learning evaluates a greedy target that would choose risky ($1.4 > 1.0$).
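The contrast between the two targets fits in a few lines; the numbers here are illustrative:

```python
def sarsa_target(r, gamma, q_next_action):
    # SARSA bootstraps on the action the behavior policy actually took.
    return r + gamma * q_next_action

def q_learning_target(r, gamma, q_next_values):
    # Q-learning bootstraps on the best available next action.
    return r + gamma * max(q_next_values)

q_sa, alpha, gamma = 1.0, 0.5, 0.9
q_next = {"safe": 0.5, "risky": 2.0}

sarsa = q_sa + alpha * (sarsa_target(0.0, gamma, q_next["safe"]) - q_sa)
qlearn = q_sa + alpha * (q_learning_target(0.0, gamma, q_next.values()) - q_sa)
print(sarsa, qlearn)  # SARSA moves the estimate down, Q-learning moves it up
```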
Code
```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.2, 0.9, 0.1

def step(s, a):
    # Simple chain: action 1 moves right, action 0 moves left.
    # State 4 is terminal with reward 1; every other step costs 0.01.
    ns = min(4, s + 1) if a == 1 else max(0, s - 1)
    reward = 1.0 if ns == 4 else -0.01
    done = ns == 4
    return ns, reward, done

def eps_greedy(s):
    # Epsilon-greedy selection with random tie-breaking among maxima.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

# SARSA: the update target uses the next action actually selected.
for episode in range(200):
    s = 0
    a = eps_greedy(s)
    done = False
    while not done:
        ns, r, done = step(s, a)
        na = eps_greedy(ns)
        target = r if done else r + gamma * Q[ns, na]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = ns, na

print(np.round(Q, 3))
print("Greedy actions:", np.argmax(Q, axis=1).tolist())
```
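For comparison, turning the SARSA agent above into Q-learning changes only the target line: it bootstraps on the max over next actions rather than the sampled next action. A self-contained sketch on the same chain environment:

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.zeros((5, 2))
alpha, gamma, epsilon = 0.2, 0.9, 0.1

def step(s, a):
    # Same chain: action 1 moves right, action 0 moves left; state 4 terminal.
    ns = min(4, s + 1) if a == 1 else max(0, s - 1)
    return ns, (1.0 if ns == 4 else -0.01), ns == 4

for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy.
        if rng.random() < epsilon:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(Q[s]))
        ns, r, done = step(s, a)
        # Q-learning target: max over next actions, regardless of behavior.
        target = r if done else r + gamma * Q[ns].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = ns

print("Greedy actions:", np.argmax(Q, axis=1).tolist())
```

On this simple chain both methods converge to moving right; the difference between their learned values would show up in tasks like cliff walking, where exploratory actions are costly.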
Common pitfalls
- Calling any one-step update "TD" without identifying the target. SARSA, Q-learning, and Expected SARSA differ exactly in the next-value part of the target.
- Forgetting to zero out the next value at terminal states.
- Treating Q-learning's behavior as greedy during training. It can follow an exploratory behavior policy while learning a greedy target policy.
- Assuming TD is always better than Monte Carlo. TD often learns faster, but bootstrapping can introduce bias and instability with approximation.
- Ignoring maximization bias. A max over noisy estimates can be overly optimistic even if each estimate is unbiased.
- Using too large a step size and mistaking oscillation for exploration.