
On-policy Prediction with Approximation

Tabular methods store one value per state or state-action pair. That is impossible when states are numerous, continuous, or described by rich observations. On-policy prediction with approximation replaces the table with a parameterized function \hat v(s,\mathbf{w}). The learning problem becomes one of adjusting weights so predictions of return are accurate for states encountered under a policy.

Sutton and Barto present approximation as more than a scaling trick. It changes the objective, the stability questions, and the role of features. With linear methods, tile coding, radial basis functions, Fourier bases, and neural networks, value learning becomes a bridge between reinforcement learning and supervised learning, but the targets are still generated by interaction and bootstrapping.

Definitions

An approximate state-value function is written

\hat v(s,\mathbf{w}) \approx V_\pi(s),

where \mathbf{w}\in\mathbb{R}^d is a weight vector. In linear approximation,

\hat v(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s),

where \mathbf{x}(s) is a feature vector.

A common prediction objective is mean squared value error under a state distribution \mu:

\overline{\mathrm{VE}}(\mathbf{w}) = \sum_s \mu(s)\left[V_\pi(s)-\hat v(s,\mathbf{w})\right]^2.
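To make the objective concrete, here is a minimal sketch that evaluates \overline{\mathrm{VE}} for a three-state example; the distribution, true values, and predictions are made-up numbers for illustration, not outputs of any algorithm.

# Illustrative numbers only: evaluating the VE objective for three states.
mu = [0.5, 0.3, 0.2]        # on-policy state distribution
v_pi = [1.0, 2.0, 0.0]      # true values (assumed known for this check)
v_hat = [0.8, 2.5, 0.4]     # current approximate values

ve = sum(m * (v - q) ** 2 for m, v, q in zip(mu, v_pi, v_hat))
print(ve)  # 0.5*0.04 + 0.3*0.25 + 0.2*0.16 = 0.127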

If the true value target were available, a stochastic gradient update would be

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[V_\pi(S_t)-\hat v(S_t,\mathbf{w}_t)\right] \nabla \hat v(S_t,\mathbf{w}_t).

Because V_\pi(S_t) is unknown, Monte Carlo uses G_t as a sampled target:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[G_t-\hat v(S_t,\mathbf{w}_t)\right] \nabla \hat v(S_t,\mathbf{w}_t).
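As a sketch of how this update is applied, the snippet below runs gradient Monte Carlo over one made-up episode with a linear approximator, accumulating returns from the end of the episode so each visited state is updated toward its own G_t. The episode data and constants are hypothetical.

import numpy as np

# Gradient Monte Carlo with a linear approximator (illustrative data).
w = np.zeros(2)
alpha, gamma = 0.1, 0.9
episode = [(np.array([1.0, 0.0]), 0.0),   # (features of S_t, reward R_{t+1})
           (np.array([0.0, 1.0]), 2.0)]

G = 0.0
for x, r in reversed(episode):            # accumulate returns backward
    G = r + gamma * G
    w += alpha * (G - w @ x) * x          # w += alpha * (G_t - v_hat) * gradient
print(w)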

Semi-gradient TD uses the one-step TD target but treats the target as fixed when differentiating:

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[R_{t+1}+\gamma\hat v(S_{t+1},\mathbf{w}_t)-\hat v(S_t,\mathbf{w}_t)\right]\nabla \hat v(S_t,\mathbf{w}_t).

Key results

The "semi" in semi-gradient matters. The TD target contains \mathbf{w}_t through \hat v(S_{t+1},\mathbf{w}_t), but the update does not include the derivative of the target. This is not ordinary gradient descent on mean squared error. It is a practical bootstrapping method that is stable for important on-policy linear cases but can fail in more general off-policy settings.
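A small sketch of what "semi" means in the linear case: holding the target fixed gives update direction \mathbf{x}(S_t), whereas differentiating the full squared TD error would also pull in a -\gamma\mathbf{x}(S_{t+1}) term. The numbers below are illustrative.

import numpy as np

# Illustrative comparison of update directions for one transition.
w = np.array([1.0, -2.0])
x_s, x_sp = np.array([1.0, 1.0]), np.array([0.0, 2.0])
alpha, gamma, r = 0.1, 0.9, 1.0

delta = r + gamma * (w @ x_sp) - (w @ x_s)          # TD error
semi_step = alpha * delta * x_s                     # semi-gradient TD(0) step
full_step = alpha * delta * (x_s - gamma * x_sp)    # step from the true gradient of 0.5*delta^2
print(semi_step, full_step)                         # the second is not what TD(0) uses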

For linear approximation, the gradient is simply

\nabla \hat v(s,\mathbf{w}) = \mathbf{x}(s),

so semi-gradient TD becomes

\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\delta_t\mathbf{x}(S_t),

with

\delta_t = R_{t+1}+\gamma\mathbf{w}_t^\top\mathbf{x}(S_{t+1}) - \mathbf{w}_t^\top\mathbf{x}(S_t).

Feature construction is crucial. Coarse coding activates broad overlapping regions. Tile coding uses multiple offset grids to create sparse binary features with local generalization. Fourier bases represent smooth functions through cosine features. Radial basis functions use distance to centers. Neural networks learn features and values together but bring nonlinear optimization issues.
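The sketch below builds two of these feature families for a one-dimensional state scaled to [0, 1]: a crude tile coding made from a few slightly offset uniform grids, and a Fourier cosine basis. Both are simplified illustrations rather than the chapter's full constructions, and the helper names, tiling counts, and basis order are arbitrary choices.

import numpy as np

def tile_features(s, n_tilings=4, tiles_per_tiling=5):
    """Sparse binary features from several slightly offset uniform tilings."""
    feats = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)   # shift each tiling a little
        idx = int(min((s + offset) * tiles_per_tiling, tiles_per_tiling - 1))
        feats[t * tiles_per_tiling + idx] = 1.0
    return feats

def fourier_features(s, order=4):
    """One-dimensional Fourier cosine basis: x_i(s) = cos(i * pi * s)."""
    return np.cos(np.pi * s * np.arange(order + 1))

print(tile_features(0.37))     # exactly n_tilings entries are active
print(fourier_features(0.37))  # dense, smooth global features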

Approximation creates generalization. Updating one state changes predictions for other states with overlapping features or shared network parameters. This can be beneficial when the representation captures meaningful similarity, but harmful when unrelated states interfere.

The distribution \mu in the objective is not neutral. On-policy learning emphasizes states visited by the policy. States rarely visited under \pi have little influence on the learned weights, even if their value errors are large.

The approximation objective also reveals why there may be no weight vector that represents the true value exactly. If the feature space cannot express V_\pi, learning must choose a compromise. With linear features, the learned value is constrained to the subspace spanned by the feature vectors. With neural networks, the function class is larger but optimization is nonconvex and still shaped by the data distribution. Approximation error is therefore not a bug in the algorithm; it is part of the problem definition once tables are abandoned.
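One way to see the compromise is to compute the best linear fit directly: under a fixed distribution \mu it is a weighted least-squares projection of V_\pi onto the span of the features. The three-state numbers below are made up and chosen so that no weight vector matches V_\pi exactly.

import numpy as np

# Illustrative: best linear value function as a mu-weighted projection.
X = np.array([[1.0, 0.0],          # one feature row per state
              [1.0, 1.0],
              [0.0, 1.0]])
v_pi = np.array([2.0, 5.0, 1.0])   # not representable by these features
mu = np.array([0.5, 0.3, 0.2])     # on-policy state distribution

D = np.diag(mu)
w_star = np.linalg.solve(X.T @ D @ X, X.T @ D @ v_pi)   # weighted least squares
print(w_star)
print(X @ w_star)                  # best representable values; some error remains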

Step-size choice becomes more delicate under approximation. In a table, updating one state's entry changes only that state. With shared features, a large update can disrupt many predictions at once. Sparse binary features such as tile coding help because the number of active features is controlled, making reasonable step-size scaling easier. Dense features or neural networks usually require more careful normalization and optimizer choices.
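For sparse binary features, a rough sanity check is that with n active features a step size of 1/n moves the prediction all the way to the target in a single update, so step sizes are naturally expressed as fractions of 1/n. The numbers below are illustrative.

# Illustrative: step-size scale for sparse binary features (e.g. tile coding).
n_active = 8                      # say, 8 tilings with one active tile each
target, prediction = 5.0, 2.0
alpha = 0.1 / n_active            # aim to close about 10% of the gap per update

delta = target - prediction
# Each active weight moves by alpha * delta, and the prediction is the sum of
# the active weights, so the prediction changes by n_active * alpha * delta.
print(n_active * alpha * delta)   # 0.3, i.e. 10% of the 3.0 error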

Least-squares and memory-based methods in the chapter show that gradient descent is not the only possible approximation tool. The common thread is still prediction of return under a policy; the difference is how the function approximator represents and fits those predictions.
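As a sketch of the least-squares idea, the snippet below implements a bare-bones LSTD(0) for linear prediction: it accumulates a matrix A and vector b from sampled transitions and solves for the weights in one shot rather than by stochastic updates. The lstd helper, its sample data, and the small ridge term are illustrative assumptions, not the chapter's exact formulation.

import numpy as np

def lstd(samples, gamma=0.9, eps=1e-3):
    """Bare-bones LSTD(0): samples are (x_t, R_{t+1}, x_{t+1}) triples."""
    d = len(samples[0][0])
    A = eps * np.eye(d)                      # small ridge term keeps A invertible
    b = np.zeros(d)
    for x, r, x_next in samples:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

samples = [
    (np.array([1.0, 0.0]), 0.0, np.array([0.0, 1.0])),
    (np.array([0.0, 1.0]), 1.0, np.array([0.0, 0.0])),  # terminal: next features are zero
]
print(lstd(samples))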

This is also where notation begins to matter more. The estimate \hat v(s,\mathbf{w}) is a deterministic function of weights, while S_t and rewards are random variables generated by interaction. Keeping those roles separate prevents many derivation errors.

Visual

| Representation | Feature shape | Generalization pattern | Typical strength | Typical weakness |
| --- | --- | --- | --- | --- |
| Aggregation | One active group | Same value inside group | Very simple | Blocky approximation |
| Coarse coding | Overlapping receptive fields | Nearby states share features | Smooth local generalization | Needs coverage design |
| Tile coding | Sparse binary tilings | Local and computationally efficient | Strong tabular-to-continuous bridge | Many implementation details |
| Fourier basis | Global cosine features | Smooth global functions | Compact for smooth values | Poor for sharp discontinuities |
| RBFs | Distance-to-center features | Smooth local bumps | Intuitive geometry | Center and width selection |
| Neural network | Learned nonlinear features | Data-driven | Flexible | Stability and tuning |

Worked example 1: Linear semi-gradient TD update

Problem: A value approximator has weights \mathbf{w}=(1, -2) and features \mathbf{x}(s)=(2,1), \mathbf{x}(s')=(0,3). The reward is R_{t+1}=4, \gamma=0.5, and \alpha=0.1. Compute the semi-gradient TD update.

Step 1: Current prediction:

\hat v(s,\mathbf{w}) = (1)(2)+(-2)(1)=0.

Step 2: Next prediction:

\hat v(s',\mathbf{w}) = (1)(0)+(-2)(3)=-6.

Step 3: TD target:

R_{t+1}+\gamma\hat v(s',\mathbf{w}) = 4 + 0.5(-6) = 1.

Step 4: TD error:

\delta = 1 - 0 = 1.

Step 5: The linear gradient is \mathbf{x}(s)=(2,1), so

\mathbf{w}_{new} = (1,-2)+0.1(1)(2,1) = (1.2,-1.9).

Check: Since the current prediction was too low relative to the target, the weights move in the direction that increases \hat v(s,\mathbf{w}).
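A quick sketch to verify the arithmetic above with NumPy, using the same numbers as the worked example:

import numpy as np

w = np.array([1.0, -2.0])
x_s, x_sp = np.array([2.0, 1.0]), np.array([0.0, 3.0])
r, gamma, alpha = 4.0, 0.5, 0.1

delta = r + gamma * (w @ x_sp) - (w @ x_s)
print(delta)                    # 1.0
print(w + alpha * delta * x_s)  # [1.2 -1.9]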

Worked example 2: Feature overlap and generalization

Problem: A linear value function uses three binary features. State A has \mathbf{x}(A)=(1,1,0) and state B has \mathbf{x}(B)=(0,1,1). Initial weights are \mathbf{w}=(0,0,0). A Monte Carlo update from state A has target G=10 and \alpha=0.2. How does this update affect predictions for both A and B?

Step 1: Prediction for A before the update:

\hat v(A,\mathbf{w})=0.

Step 2: Error:

G-\hat v(A,\mathbf{w})=10-0=10.

Step 3: Weight update:

\mathbf{w}_{new} = (0,0,0)+0.2(10)(1,1,0) = (2,2,0).

Step 4: New prediction for A:

\hat v(A,\mathbf{w}_{new})=(2)(1)+(2)(1)+(0)(0)=4.

Step 5: New prediction for B:

\hat v(B,\mathbf{w}_{new})=(2)(0)+(2)(1)+(0)(1)=2.

Check: Updating A also changed B because both states share the second feature. This is the central benefit and risk of approximation.
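The same shared-feature effect, checked in a few lines with the numbers from the worked example:

import numpy as np

w = np.zeros(3)
x_A, x_B = np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])
G, alpha = 10.0, 0.2

w = w + alpha * (G - w @ x_A) * x_A
print(w)        # [2. 2. 0.]
print(w @ x_A)  # 4.0
print(w @ x_B)  # 2.0 -- B changed although it was never updated directly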

Code

import torch

torch.manual_seed(0)
gamma = 0.9
alpha = 0.05

# Five states on a line, represented by one scalar feature s / 4.
values = torch.nn.Linear(1, 1, bias=True)
optimizer = torch.optim.SGD(values.parameters(), lr=alpha)

transitions = [
    (0.0, 0.0, 1.0, False),   # (state, reward, next state, terminal?)
    (1.0, 0.0, 2.0, False),
    (2.0, 0.0, 3.0, False),
    (3.0, 1.0, 4.0, True),
]

for epoch in range(200):      # sweep the fixed transitions repeatedly
    for s, r, sp, done in transitions:
        s_tensor = torch.tensor([[s / 4.0]])
        sp_tensor = torch.tensor([[sp / 4.0]])
        v = values(s_tensor)
        with torch.no_grad():
            # Semi-gradient: the bootstrapped target is held fixed, so no
            # gradient flows through the next-state prediction.
            target = torch.tensor([[r]])
            if not done:
                target = target + gamma * values(sp_tensor)
        loss = 0.5 * (target - v).pow(2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

with torch.no_grad():
    grid = torch.arange(5, dtype=torch.float32).view(-1, 1) / 4.0
    print(values(grid).view(-1).round(decimals=3))

Common pitfalls

  • Treating function approximation as a table with compression. Shared parameters mean every update can affect many states.
  • Forgetting that semi-gradient TD is not true gradient descent on the usual value-error objective.
  • Using features with wildly different scales without adjusting step sizes.
  • Assuming better supervised-learning fit always means better control. Prediction accuracy under one state distribution may not support policy improvement elsewhere.
  • Ignoring the on-policy state distribution. The approximator learns most about states visited by the policy.
  • Expecting nonlinear neural networks to inherit all tabular convergence guarantees.

Connections