Eligibility Traces
Eligibility traces provide a bridge between one-step TD methods and Monte Carlo methods. They let recent states or state-action pairs remain eligible for credit when later TD errors arrive. The forward view describes the target as a weighted mixture of n-step returns. The backward view gives an online mechanism: maintain a trace vector, update it each step, and use the current TD error to adjust all recently eligible predictions.
Sutton and Barto use eligibility traces to unify many earlier ideas. TD(λ), SARSA(λ), true online TD(λ), Watkins's Q(λ), and tree-backup traces differ in how traces are accumulated, cleared, or weighted, but they share the idea that credit assignment should extend over time without waiting for a full episode.
Definitions
The λ-return is a geometrically weighted average of n-step returns,

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_{t:t+n},$$

in the continuing case, with appropriate truncation in episodic tasks. The parameter λ ∈ [0, 1] controls how much weight goes to longer returns: λ = 0 gives the one-step TD target, and λ = 1 approaches the Monte Carlo return in episodic tasks.
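In episodic tasks the sum is truncated at the terminal time T, with the leftover weight assigned to the conventional full return; one standard way to write this is

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T - t - 1} \lambda^{\,n-1} G_{t:t+n} + \lambda^{\,T - t - 1} G_t.$$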
For linear value prediction with weight vector w and feature vector x(S_t), the accumulating eligibility trace is

$$\mathbf{z}_t = \gamma \lambda \, \mathbf{z}_{t-1} + \mathbf{x}(S_t), \qquad \mathbf{z}_{-1} = \mathbf{0}.$$

The TD(λ) update is

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \delta_t \, \mathbf{z}_t,$$

where

$$\delta_t = R_{t+1} + \gamma \, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t).$$

For tabular state values, x(S_t) is a one-hot vector, so the trace for the current state increases by 1 while all previous traces decay by γλ.
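As a concrete illustration, here is a minimal sketch of one linear TD(λ) step in NumPy; the feature vectors, reward, and constants are made-up placeholder values, not from any particular environment.

```python
import numpy as np

# One linear TD(lambda) step with an accumulating trace.
# Features, reward, and constants are illustrative placeholders.
alpha, gamma, lam = 0.1, 0.99, 0.9
w = np.zeros(4)                          # weight vector
z = np.zeros(4)                          # eligibility trace in parameter space

x_t  = np.array([1.0, 0.0, 0.5, 0.0])    # features of S_t
x_tp = np.array([0.0, 1.0, 0.0, 0.5])    # features of S_{t+1}
r = 1.0                                  # reward R_{t+1}

z = gamma * lam * z + x_t                # for linear v-hat the gradient is just x_t
delta = r + gamma * w @ x_tp - w @ x_t   # TD error
w = w + alpha * delta * z                # every recently eligible weight moves
```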
SARSA(λ) uses traces over state-action pairs:

$$\mathbf{z}_t = \gamma \lambda \, \mathbf{z}_{t-1} + \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \qquad \delta_t = R_{t+1} + \gamma \, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$
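A single tabular SARSA(λ) backup looks much like the TD(λ) step, with the trace indexed by state-action pairs; the transition and constants below are illustrative only.

```python
import numpy as np

# One tabular SARSA(lambda) backup with accumulating state-action traces.
# The transition (s, a, r, s', a') and all constants are illustrative.
alpha, gamma, lam = 0.1, 0.95, 0.9
Q = np.zeros((5, 2))
z = np.zeros((5, 2))

s, a, r, s_next, a_next = 2, 1, 0.0, 3, 0        # one on-policy transition

z *= gamma * lam                                 # decay every state-action trace
z[s, a] += 1.0                                   # the pair just taken becomes (more) eligible
delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # on-policy TD error
Q += alpha * delta * z                           # credit flows to all recent pairs
```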
Watkins's Q(λ) is an off-policy control method that cuts traces when a nongreedy action is selected, preserving the connection to Q-learning's greedy target.
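A sketch of one Watkins-style update step, assuming a tabular Q and a behavior policy that has already chosen the next action; all states, actions, and numbers are placeholders.

```python
import numpy as np

# One Watkins's Q(lambda) step: the target is always greedy, and traces are
# cut whenever the behavior policy's next action is not greedy.
alpha, gamma, lam = 0.1, 0.95, 0.9
Q = np.zeros((5, 2))
z = np.zeros((5, 2))

s, a, r, s_next = 2, 1, 0.0, 3
a_next = 1                                       # action actually taken next (possibly exploratory)

delta = r + gamma * Q[s_next].max() - Q[s, a]    # Q-learning's greedy target
z *= gamma * lam
z[s, a] += 1.0
Q += alpha * delta * z

if Q[s_next, a_next] < Q[s_next].max():
    z[:] = 0.0                                   # nongreedy action: cut all traces
```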
True online TD(λ) modifies the backward view so that it matches the online forward view exactly at every time step, not only approximately or at episode end.
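A rough sketch of a single true online TD(λ) step in the linear case, following the usual dutch-trace form; the feature vectors and constants here are illustrative placeholders.

```python
import numpy as np

# One true online TD(lambda) step (linear case) using the dutch trace.
alpha, gamma, lam = 0.1, 0.99, 0.9
w = np.zeros(4)
z = np.zeros(4)
v_old = 0.0                                  # carried across steps; reset to 0 at each episode start

x  = np.array([1.0, 0.0, 0.5, 0.0])          # features of S_t
xp = np.array([0.0, 1.0, 0.0, 0.5])          # features of S_{t+1}
r = 1.0

v, vp = w @ x, w @ xp
delta = r + gamma * vp - v
z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x   # dutch trace
w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
v_old = vp
```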
Key results
Eligibility traces solve a practical credit-assignment problem. One-step TD propagates information one step per update. Monte Carlo propagates complete returns but waits until episode end and can have high variance. TD(λ) can propagate a TD error backward through many recent states immediately, with influence decaying by γλ per step of recency.
The forward view is conceptually clean: update toward the λ-return G_t^λ. The backward view is computationally useful: update with the trace-weighted TD error α δ_t z_t. Under appropriate conditions, these views are equivalent for offline updates, and true online TD(λ) sharpens the equivalence for online learning with changing weights.
The trace parameter λ is not just "memory length." The effective per-step decay is γλ, so discounting also shortens credit assignment: a trace shrinks by the factor γλ on every step before any new activation is added.
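A quick numeric illustration (with arbitrarily chosen γ and λ) of how fast eligibility fades under the combined decay:

```python
# Remaining credit after k steps under the combined decay gamma * lam
# (gamma and lam here are arbitrary illustrative values).
gamma, lam = 0.9, 0.9
for k in (1, 5, 10):
    print(k, round((gamma * lam) ** k, 3))   # 0.81, 0.349, 0.122
```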
Accumulating traces can exceed 1 for repeatedly visited states or features. Replacing traces set the active trace to 1 instead of adding 1, which can improve behavior in some tabular tasks by preventing repeated visits from producing very large traces. With function approximation, trace variants require more care.
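The difference is easy to see for a state visited on consecutive steps; the following sketch uses arbitrary γ and λ.

```python
import numpy as np

# Accumulating vs. replacing traces when the same state is visited three
# steps in a row (gamma and lam are arbitrary illustrative values).
gamma, lam = 1.0, 0.9
z_acc = np.zeros(3)
z_rep = np.zeros(3)
s = 1
for _ in range(3):
    z_acc *= gamma * lam
    z_acc[s] += 1.0        # accumulating: trace can exceed 1
    z_rep *= gamma * lam
    z_rep[s] = 1.0         # replacing: trace is capped at 1
print(z_acc[s], z_rep[s])  # about 2.71 vs 1.0
```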
Control with traces must respect the policy relationship. On-policy SARSA(λ) can keep traces through exploratory actions because it evaluates the exploratory policy. Watkins's Q(λ) cuts traces after nongreedy actions because Q-learning's target is greedy, not the exploratory behavior.
Eligibility traces can be understood as a memory of responsibility. A state visited several steps ago may still deserve part of the current TD error if its trace has not decayed away. This does not mean the old state caused the new reward in a philosophical sense; it means the update rule assigns credit according to recency, discounting, and the chosen trace parameter. The trace is therefore an algorithmic credit-assignment device, not an additional environment model.
In function approximation, the trace vector lives in parameter space rather than as a table over states. If v̂ is linear, the trace accumulates feature vectors. If the approximator is nonlinear, the trace accumulates gradients. This makes traces compatible with approximation in principle, but it also means that feature scaling and changing gradients affect how credit is distributed.
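As a sketch, with a toy nonlinear approximator v̂(s; w) = tanh(w·x) chosen purely for illustration, the trace accumulates the gradient (1 − v̂²)x rather than the raw features.

```python
import numpy as np

# With a nonlinear approximator the trace accumulates gradients, not features.
# Toy example: v_hat(s; w) = tanh(w . x), gradient = (1 - v_hat**2) * x.
alpha, gamma, lam = 0.05, 0.99, 0.9
w = np.array([0.1, -0.2, 0.3])
z = np.zeros(3)

x  = np.array([1.0, 0.5, 0.0])        # features of S_t
xp = np.array([0.0, 1.0, 0.5])        # features of S_{t+1}
r = 0.0

v, vp = np.tanh(w @ x), np.tanh(w @ xp)
grad = (1.0 - v ** 2) * x             # gradient of v_hat at S_t
z = gamma * lam * z + grad            # trace lives in parameter space
delta = r + gamma * vp - v
w = w + alpha * delta * z
```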
True online TD(λ) is important because ordinary accumulating-trace TD can diverge from the intended online forward view as weights change during an episode. The true-online correction keeps the computational advantages of the backward view while matching the forward-view target more faithfully at each step.
Visual
| λ value | Target behavior | Credit assignment | Typical effect |
|---|---|---|---|
| 0 | One-step TD | Only current state strongly updated | Low variance, more bootstrap bias |
| Between 0 and 1 | Mixture of n-step returns | Recent states get decayed credit | Balanced learning speed and variance |
| 1 (episodic) | Monte Carlo-like | Long credit assignment to episode start | Less bootstrap bias, higher variance |
| Any λ, with trace cutting | Greedy off-policy consistency | Stops credit after nongreedy actions | Useful for Watkins's Q(λ) |
Worked example 1: Tabular TD(lambda) trace and update
Problem: Three states A, B, and C have current values V(A) = 0.5, V(B) = 0.2, V(C) = 0.0. The agent is in B, the previous traces are z(A) = 0.6, z(B) = 0.0, z(C) = 0.0, and it receives reward R = 1.0 before moving to C. Let α = 0.1, γ = 1.0, and λ = 0.8. Use accumulating traces and compute the updates.
Step 1: Decay old traces: z(A) = γλ · 0.6 = 0.8 · 0.6 = 0.48, z(B) = 0, z(C) = 0.
Step 2: Increment the trace for the current state B: z(B) = 0 + 1 = 1.0.
Now z(A) = 0.48, z(B) = 1.0, and z(C) = 0.
Step 3: Compute the TD error: δ = R + γ V(C) − V(B) = 1.0 + 1.0 · 0.0 − 0.2 = 0.8.
Step 4: Update each value: V(A) ← 0.5 + 0.1 · 0.8 · 0.48 = 0.5384, V(B) ← 0.2 + 0.1 · 0.8 · 1.0 = 0.28, and V(C) is unchanged.
Check: The current state B receives the largest update (0.08 versus 0.0384), but the previously eligible state A also receives credit.
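The arithmetic can be checked with a few lines of NumPy, using the same values as above.

```python
import numpy as np

# Verify worked example 1.
alpha, gamma, lam = 0.1, 1.0, 0.8
V = np.array([0.5, 0.2, 0.0])      # V(A), V(B), V(C)
z = np.array([0.6, 0.0, 0.0])      # previous traces

z *= gamma * lam                   # Step 1: decay
z[1] += 1.0                        # Step 2: current state B
delta = 1.0 + gamma * V[2] - V[1]  # Step 3: TD error
V += alpha * delta * z             # Step 4: updates
print(np.round(z, 3), round(delta, 3), np.round(V, 4))
# [0.48 1.   0.  ] 0.8 [0.5384 0.28   0.    ]
```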
Worked example 2: Lambda-return as a weighted mixture
Problem: Suppose the one-step, two-step, and three-step returns from a time t are G_{t:t+1} = 2, G_{t:t+2} = 3, and G_{t:t+3} = 5, and the episode ends at t + 3. Let λ = 0.5. Compute the truncated episodic λ-return.
Step 1: Compute weights: (1 − λ) = 0.5 for the one-step return, (1 − λ)λ = 0.25 for the two-step return, and λ² = 0.25 for the final return.
The final term gets the remaining probability mass because the episode ends.
Step 2: Multiply each return: 0.5 · 2 = 1.0, 0.25 · 3 = 0.75, 0.25 · 5 = 1.25.
Step 3: Add: 1.0 + 0.75 + 1.25 = 3.0.
Check: The weights sum to 1, and the result lies between the smallest and largest component return. The checked answer is G_t^λ = 3.0.
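A quick check of the weights and the mixture, using the same values as above:

```python
# Verify worked example 2: truncated lambda-return weights and mixture.
lam = 0.5
returns = [2.0, 3.0, 5.0]                        # 1-, 2-, and 3-step returns
weights = [1 - lam, (1 - lam) * lam, lam ** 2]   # last term takes the leftover mass
print(sum(weights))                              # 1.0
print(sum(wt * g for wt, g in zip(weights, returns)))   # 3.0
```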
Code
```python
import numpy as np

rng = np.random.default_rng(9)

# TD(lambda) prediction on a 5-state random walk: states 1..5 are interior,
# 0 and 6 are terminal, and a reward of 1 is received only when the walk
# terminates on the right.
n_states = 7
V = np.zeros(n_states)
alpha, gamma, lam = 0.1, 1.0, 0.8

def episode():
    """Generate one random-walk episode starting from the center state."""
    s = 3
    states = [s]
    rewards = []
    while 0 < s < 6:
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == 6 else 0.0)
        states.append(s)
    return states, rewards

for _ in range(1000):
    z = np.zeros(n_states)            # traces are reset at each episode start
    states, rewards = episode()
    for t, r in enumerate(rewards):
        s = states[t]
        sp = states[t + 1]
        z *= gamma * lam              # decay all traces
        z[s] += 1.0                   # accumulating trace for the current state
        target_next = 0.0 if sp in (0, 6) else V[sp]   # terminal states have value 0
        delta = r + gamma * target_next - V[s]         # TD error
        V += alpha * delta * z        # credit every recently visited state
V[0] = V[6] = 0.0                     # keep terminal values pinned at 0
print(np.round(V[1:6], 3))
```
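With enough episodes and a small step size, the printed values should settle near the true values for this random walk, 1/6 through 5/6 (roughly 0.167, 0.333, 0.5, 0.667, 0.833), up to noise from the constant step size.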
Common pitfalls
- Treating λ as the only decay factor. Traces decay by γλ, not just λ.
- Forgetting to reset traces at episode boundaries.
- Mixing forward-view and backward-view formulas without checking whether the algorithm is offline, online, accumulating, replacing, or true online.
- Letting traces grow unintentionally in loops with accumulating traces. Replacing traces may be more appropriate in some tabular control tasks.
- Failing to cut traces in Watkins's Q(λ) after nongreedy actions.
- Assuming traces remove the need for exploration. They improve credit assignment after experience occurs; they do not guarantee useful experience.