On-policy Prediction with Approximation
Tabular methods store one value per state or state-action pair. That becomes infeasible when states are numerous, continuous, or described by rich observations. On-policy prediction with approximation replaces the table with a parameterized function $\hat{v}(s, \mathbf{w})$. The learning problem becomes one of adjusting the weights $\mathbf{w}$ so that predictions of return are accurate for the states encountered under a policy.
Sutton and Barto present approximation as more than a scaling trick. It changes the objective, the stability questions, and the role of features. With linear methods, tile coding, radial basis functions, Fourier bases, and neural networks, value learning becomes a bridge between reinforcement learning and supervised learning, but the targets are still generated by interaction and bootstrapping.
Definitions
An approximate state-value function is written

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s),$$

where $\mathbf{w} \in \mathbb{R}^d$ is a weight vector. In linear approximation,

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i\, x_i(s),$$

where $\mathbf{x}(s)$ is a feature vector.

A common prediction objective is the mean squared value error under a state distribution $\mu$:

$$\overline{\mathrm{VE}}(\mathbf{w}) = \sum_{s} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \mathbf{w})\bigr]^2.$$

If the true value target $v_\pi(S_t)$ were available, a stochastic gradient update would be

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t).$$

Because $v_\pi(S_t)$ is unknown, Monte Carlo uses the return $G_t$ as a sampled target:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[G_t - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t).$$

Semi-gradient TD uses the one-step TD target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t)$ but treats the target as fixed when differentiating:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t).$$
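As a minimal sketch of these two sampled updates (written for the linear case, where the gradient is just the feature vector; the function names are illustrative):

```python
import numpy as np

def gradient_mc_update(w, x_s, G, alpha):
    """Gradient Monte Carlo: the sampled target is the full return G."""
    v = w @ x_s                        # current prediction v_hat(S_t, w)
    return w + alpha * (G - v) * x_s   # for linear v_hat, the gradient is x(S_t)

def semi_gradient_td0_update(w, x_s, r, x_next, gamma, alpha):
    """Semi-gradient TD(0): the target bootstraps from w but is not differentiated."""
    v = w @ x_s
    target = r + gamma * (w @ x_next)  # treated as a constant in the update
    return w + alpha * (target - v) * x_s
```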
Key results
The "semi" in semi-gradient matters. The TD target contains through , but the update does not include the derivative of the target. This is not ordinary gradient descent on mean squared error. It is a practical bootstrapping method that is stable for important on-policy linear cases but can fail in more general off-policy settings.
For linear approximation, the gradient is simply

$$\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s),$$

so semi-gradient TD becomes

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{x}(S_t),$$

with

$$\delta_t = R_{t+1} + \gamma\,\mathbf{w}_t^\top \mathbf{x}(S_{t+1}) - \mathbf{w}_t^\top \mathbf{x}(S_t).$$
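A small sketch of this linear update applied over one recorded trajectory, assuming a hypothetical `features(s)` function returning a NumPy vector and transitions stored as `(s, r, s_next, done)` tuples:

```python
import numpy as np

def linear_td0_pass(w, transitions, features, gamma=0.99, alpha=0.1):
    """One pass of linear semi-gradient TD(0) over a recorded trajectory."""
    for s, r, s_next, done in transitions:
        x = features(s)
        v_next = 0.0 if done else w @ features(s_next)  # terminal states bootstrap to 0
        delta = r + gamma * v_next - w @ x               # TD error
        w = w + alpha * delta * x                        # move along the feature vector
    return w
```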
Feature construction is crucial. Coarse coding activates broad overlapping regions. Tile coding uses multiple offset grids to create sparse binary features with local generalization. Fourier bases represent smooth functions through cosine features. Radial basis functions use distance to centers. Neural networks learn features and values together but bring nonlinear optimization issues.
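Two of these constructions sketched in one dimension (the tiling scheme and function names are illustrative simplifications, not the book's implementations):

```python
import numpy as np

def fourier_features(s, order=4):
    """Fourier cosine basis on s in [0, 1]: x_i(s) = cos(i * pi * s)."""
    return np.cos(np.pi * np.arange(order + 1) * s)

def tile_features(s, n_tilings=4, tiles_per_tiling=8):
    """Sparse binary features from offset 1-D tilings, s in [0, 1)."""
    x = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)  # shift each tiling by a fraction of a tile
        idx = int((s + offset) * tiles_per_tiling) % tiles_per_tiling
        x[t * tiles_per_tiling + idx] = 1.0          # exactly one active tile per tiling
    return x
```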
Approximation creates generalization. Updating one state changes predictions for other states with overlapping features or shared network parameters. This can be beneficial when the representation captures meaningful similarity, but harmful when unrelated states interfere.
The distribution $\mu$ in the objective is not neutral. On-policy learning emphasizes states visited by the policy. States rarely visited under $\pi$ have little influence on the learned weights, even if their value errors are large.
The approximation objective also reveals why there may be no weight vector that represents the true value exactly. If the feature space cannot express $v_\pi$, learning must choose a compromise. With linear features, the learned value is constrained to the subspace spanned by the feature vectors. With neural networks, the function class is larger, but optimization is nonconvex and still shaped by the data distribution. Approximation error is therefore not a bug in the algorithm; it is part of the problem definition once tables are abandoned.
Step-size choice becomes more delicate under approximation. In a table, updating one state's entry changes only that state. With shared features, a large update can disrupt many predictions at once. Sparse binary features such as tile coding help because the number of active features is controlled, making reasonable step-size scaling easier. Dense features or neural networks usually require more careful normalization and optimizer choices.
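One rule of thumb discussed in the chapter sets $\alpha = \bigl(\tau\,\mathbb{E}[\mathbf{x}^\top \mathbf{x}]\bigr)^{-1}$, where $\tau$ is roughly the number of presentations over which a new target should be absorbed. With tile coding, $\mathbf{x}^\top \mathbf{x}$ is simply the number of active tilings, so the step size falls out directly (a small sketch):

```python
def tile_coding_step_size(n_tilings, tau=10):
    """Step size alpha = 1 / (tau * E[x.x]) for binary tile-coded features,
    where exactly n_tilings features are active in any state."""
    return 1.0 / (tau * n_tilings)

alpha = tile_coding_step_size(n_tilings=8)  # 0.0125
```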
Least-squares and memory-based methods in the chapter show that gradient descent is not the only possible approximation tool. The common thread is still prediction of return under a policy; the difference is how the function approximator represents and fits those predictions.
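A minimal sketch of the least-squares idea, in the spirit of LSTD: instead of descending a gradient, accumulate the statistics of the linear TD fixed point and solve for the weights directly (the `features` function and transition format here are illustrative assumptions):

```python
import numpy as np

def lstd(transitions, features, gamma=0.99, epsilon=1e-3):
    """Batch least-squares TD(0): solve A w = b from recorded transitions."""
    d = len(features(transitions[0][0]))
    A = epsilon * np.eye(d)            # small ridge term keeps A invertible
    b = np.zeros(d)
    for s, r, s_next, done in transitions:
        x = features(s)
        x_next = np.zeros(d) if done else features(s_next)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```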
This is also where notation begins to matter more. The estimate $\hat{v}(S_t, \mathbf{w})$ is a deterministic function of the weights, while the states $S_t$, $S_{t+1}$ and the rewards $R_{t+1}$ are random variables generated by interaction. Keeping those roles separate prevents many derivation errors.
Visual
| Representation | Feature shape | Generalization pattern | Typical strength | Typical weakness |
|---|---|---|---|---|
| State aggregation | One active group | Same value inside group | Very simple | Blocky approximation |
| Coarse coding | Overlapping receptive fields | Nearby states share features | Smooth local generalization | Needs coverage design |
| Tile coding | Sparse binary tilings | Local and computationally efficient | Strong tabular-to-continuous bridge | Many implementation details |
| Fourier basis | Global cosine features | Smooth global functions | Compact for smooth values | Poor for sharp discontinuities |
| RBFs | Distance-to-center features | Smooth local bumps | Intuitive geometry | Center and width selection |
| Neural network | Learned nonlinear features | Data-driven | Flexible | Stability and tuning |
Worked example 1: Linear semi-gradient TD update
Problem: A linear value approximator has weights $\mathbf{w} = (0.5, -0.2)$ and features $\mathbf{x}(S_t) = (1, 0.5)$, $\mathbf{x}(S_{t+1}) = (0, 1)$. The reward is $R_{t+1} = 1$, $\gamma = 0.9$, and $\alpha = 0.1$. Compute the semi-gradient TD update.
Step 1: Current prediction: $\hat{v}(S_t, \mathbf{w}) = 0.5 \cdot 1 + (-0.2) \cdot 0.5 = 0.4$.
Step 2: Next prediction: $\hat{v}(S_{t+1}, \mathbf{w}) = 0.5 \cdot 0 + (-0.2) \cdot 1 = -0.2$.
Step 3: TD target: $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) = 1 + 0.9 \cdot (-0.2) = 0.82$.
Step 4: TD error: $\delta = 0.82 - 0.4 = 0.42$.
Step 5: Linear gradient is $\mathbf{x}(S_t)$, so $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\delta\,\mathbf{x}(S_t) = (0.5, -0.2) + 0.1 \cdot 0.42 \cdot (1, 0.5) = (0.542, -0.179)$.
Check: Since the current prediction was too low relative to the target, the weights move in the direction that increases $\hat{v}(S_t, \mathbf{w})$: the new prediction is $0.542 - 0.179 \cdot 0.5 \approx 0.45$, closer to the target of $0.82$.
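A few lines of NumPy reproduce the arithmetic:

```python
import numpy as np

w = np.array([0.5, -0.2])
x_s, x_next = np.array([1.0, 0.5]), np.array([0.0, 1.0])
r, gamma, alpha = 1.0, 0.9, 0.1

target = r + gamma * (w @ x_next)   # 0.82
delta = target - w @ x_s            # 0.42
w = w + alpha * delta * x_s         # [0.542, -0.179]
print(target, delta, w)
```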
Worked example 2: Feature overlap and generalization
Problem: A linear value function uses three binary features. State $s_1$ has $\mathbf{x}(s_1) = (1, 1, 0)$ and state $s_2$ has $\mathbf{x}(s_2) = (0, 1, 1)$. Initial weights are $\mathbf{w} = (0.2, 0.1, 0.3)$. A Monte Carlo update from state $s_1$ has target $G = 1.0$ and $\alpha = 0.1$. How does this update affect predictions for both $s_1$ and $s_2$?
Step 1: Prediction for $s_1$ before the update: $\hat{v}(s_1, \mathbf{w}) = 0.2 + 0.1 = 0.3$ (for reference, $\hat{v}(s_2, \mathbf{w}) = 0.1 + 0.3 = 0.4$).
Step 2: Error: $G - \hat{v}(s_1, \mathbf{w}) = 1.0 - 0.3 = 0.7$.
Step 3: Weight update: $\mathbf{w} \leftarrow \mathbf{w} + \alpha \cdot 0.7 \cdot \mathbf{x}(s_1) = (0.2 + 0.07,\ 0.1 + 0.07,\ 0.3) = (0.27, 0.17, 0.3)$.
Step 4: New prediction for $s_1$: $0.27 + 0.17 = 0.44$.
Step 5: New prediction for $s_2$: $0.17 + 0.3 = 0.47$.
Check: Updating $s_1$ also changed $\hat{v}(s_2)$, from $0.4$ to $0.47$, because both states share the second feature. This is the central benefit and risk of approximation.
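The same check in NumPy:

```python
import numpy as np

w = np.array([0.2, 0.1, 0.3])
x1, x2 = np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])
G, alpha = 1.0, 0.1

w = w + alpha * (G - w @ x1) * x1   # Monte Carlo update from s1
print(w)                            # [0.27 0.17 0.3 ]
print(w @ x1, w @ x2)               # 0.44 and 0.47: s2 changed too
```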
Code
import torch
torch.manual_seed(0)
gamma = 0.9
alpha = 0.05
# Five states on a line, represented by one scalar feature s / 4.
values = torch.nn.Linear(1, 1, bias=True)
optimizer = torch.optim.SGD(values.parameters(), lr=alpha)
# One fixed episode of (state, reward, next_state, done) transitions.
transitions = [
    (0.0, 0.0, 1.0, False),
    (1.0, 0.0, 2.0, False),
    (2.0, 0.0, 3.0, False),
    (3.0, 1.0, 4.0, True),
]
for epoch in range(200):
    for s, r, sp, done in transitions:
        s_tensor = torch.tensor([[s / 4.0]])
        sp_tensor = torch.tensor([[sp / 4.0]])
        v = values(s_tensor)
        # Semi-gradient: the bootstrapped target is computed under no_grad,
        # so it is treated as a constant when the loss is differentiated.
        with torch.no_grad():
            target = torch.tensor([[r]]) if done else torch.tensor([[r]]) + gamma * values(sp_tensor)
        loss = 0.5 * (target - v).pow(2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Evaluate the learned values on all five states.
with torch.no_grad():
    grid = torch.arange(5, dtype=torch.float32).view(-1, 1) / 4.0
    print(values(grid).view(-1).round(decimals=3))
Common pitfalls
- Treating function approximation as a table with compression. Shared parameters mean every update can affect many states.
- Forgetting that semi-gradient TD is not true gradient descent on the usual value-error objective.
- Using features with wildly different scales without adjusting step sizes.
- Assuming better supervised-learning fit always means better control. Prediction accuracy under one state distribution may not support policy improvement elsewhere.
- Ignoring the on-policy state distribution. The approximator learns most about states visited by the policy.
- Expecting nonlinear neural networks to inherit all tabular convergence guarantees.