On-policy Control with Approximation
On-policy control with approximation learns action values or policies when tabular storage is no longer practical. The agent improves the same policy it uses to collect data, often through semi-gradient SARSA. This is the natural approximate counterpart of tabular SARSA: it learns from trajectories generated by the current exploratory policy and updates a parameterized action-value function $\hat{q}(s, a, \mathbf{w})$.
Sutton and Barto use examples such as mountain car to show why approximation is not optional. Continuous states require generalization across nearby situations. The chapter also introduces average reward as a continuing-task objective, which becomes important when discounting is an awkward way to express long-run performance.
Definitions
An approximate action-value function is
$$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a),$$
where $\mathbf{w} \in \mathbb{R}^d$ is a learned weight vector.
The one-step semi-gradient SARSA update is
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t).$$
The n-step version replaces the one-step target with an n-step action-value return:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}),$$
used in the update $\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})$.
For continuing problems, the average reward objective uses the long-run average reward rate
$$r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \right].$$
Differential returns compare rewards to the average reward:
$$G_t = \left(R_{t+1} - r(\pi)\right) + \left(R_{t+2} - r(\pi)\right) + \left(R_{t+3} - r(\pi)\right) + \cdots$$
Differential semi-gradient SARSA learns action values relative to an estimated average reward $\bar{R}_t$.
Key results
Semi-gradient SARSA is on-policy, so the target action is selected by the same behavior policy that is being improved. When the policy is $\varepsilon$-greedy with respect to $\hat{q}$, the learned values reflect the consequences of exploration. This can be desirable in domains where exploratory moves have real costs.
Linear action-value approximation often uses features of both state and action:
$$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a) = \sum_{i=1}^{d} w_i\, x_i(s, a).$$
For discrete actions and continuous states, a common construction gives each action its own block of features. Updating one action then changes only that action's weights, while features still generalize across nearby states for the same action.
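One way to realize this construction, as a minimal sketch: stack one block of state features per discrete action and zero out every block except the chosen action's. The helper name per_action_features and the sizes used here are illustrative assumptions, not from the text.

import numpy as np

def per_action_features(state_features, action, n_actions):
    # One block of state features per action; only the chosen action's
    # block is non-zero, so an update touches only that action's weights.
    d = len(state_features)
    x = np.zeros(d * n_actions)
    x[action * d:(action + 1) * d] = state_features
    return x

phi = np.array([0.2, 0.8, 0.0])                     # illustrative state features
print(per_action_features(phi, action=0, n_actions=2))  # block 0 active
print(per_action_features(phi, action=1, n_actions=2))  # block 1 active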
n-step semi-gradient SARSA can learn faster than one-step SARSA because it propagates reward information over multiple steps. In mountain car, for example, rewards are delayed until the car reaches the goal. Multi-step returns can move useful information farther back along trajectories.
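A minimal sketch of computing the n-step action-value return defined earlier from a stored window of rewards, assuming a linear approximator; the function names and the illustrative numbers are assumptions, not from the text.

import numpy as np

def q(w, x):
    # Linear action-value estimate q(s, a, w) = w . x(s, a).
    return float(np.dot(w, x))

def n_step_return(rewards, boot_x, w, gamma):
    # rewards holds [R_{t+1}, ..., R_{t+n}]; boot_x are the features of
    # (S_{t+n}, A_{t+n}) used for bootstrapping at the end of the window.
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    return g + gamma**len(rewards) * q(w, boot_x)

# Illustrative numbers only: three delayed -1 rewards before bootstrapping.
w = np.array([0.5, -0.2, 0.1])
G = n_step_return([-1.0, -1.0, -1.0], np.array([1.0, 0.0, 1.0]), w, gamma=0.99)
print(G)  # n-step target used in place of the one-step target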
Average-reward control avoids discounting in continuing tasks where there is no natural episode and no reason to prefer earlier reward except through the long-run rate. The differential TD error for action values has the form
$$\delta_t = R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$$
for a control variant, with a separate update $\bar{R}_{t+1} = \bar{R}_t + \beta\, \delta_t$ for the average reward estimate. In the on-policy SARSA form, the next action value is $\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$, with $A_{t+1}$ selected by the current policy.
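A sketch of one differential semi-gradient SARSA update for a linear approximator, following the error and average-reward updates above; the function name and step sizes are illustrative assumptions.

import numpy as np

def differential_sarsa_step(w, r_bar, x, x_next, reward, alpha=0.1, beta=0.01):
    # Differential TD error: the reward is compared to the average-reward estimate.
    delta = reward - r_bar + np.dot(w, x_next) - np.dot(w, x)
    r_bar = r_bar + beta * delta          # update the average-reward estimate
    w = w + alpha * delta * x             # semi-gradient update of the weights
    return w, r_bar

# Illustrative numbers only.
w = np.array([0.0, 0.5])
w, r_bar = differential_sarsa_step(
    w, r_bar=0.2, x=np.array([1.0, 0.0]),
    x_next=np.array([0.0, 1.0]), reward=0.3)
print(w, r_bar)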
Mountain car illustrates why local generalization and multi-step credit assignment matter together. The agent often must move away from the goal to build momentum, so immediate reward and greedy one-step intuition are misleading. A representation such as tile coding lets successful experience near one position-velocity pair influence nearby pairs, while n-step SARSA can propagate the eventual success signal back through the momentum-building sequence.
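To make the tile-coding idea concrete, here is a minimal sketch of a single tiling over position and velocity; a practical coder would use several offset tilings, and the grid size and ranges below are illustrative assumptions rather than constants taken from the text.

import numpy as np

def tile_index(position, velocity, pos_range=(-1.2, 0.6),
               vel_range=(-0.07, 0.07), n_tiles=8):
    # Map a position-velocity pair to one active tile in an n_tiles x n_tiles grid.
    p = (position - pos_range[0]) / (pos_range[1] - pos_range[0])
    v = (velocity - vel_range[0]) / (vel_range[1] - vel_range[0])
    pi = min(n_tiles - 1, int(p * n_tiles))
    vi = min(n_tiles - 1, int(v * n_tiles))
    return pi * n_tiles + vi

# Nearby states fall into the same tile, so their value estimates generalize.
print(tile_index(-0.50, 0.010), tile_index(-0.49, 0.011))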
Control also amplifies approximation errors because the policy depends on the estimates being learned. A small value error can change which action is greedy, which changes the future data distribution, which changes later updates. On-policy methods keep this feedback loop somewhat aligned by learning about the same exploratory policy that generates the data. This does not remove all instability, but it avoids the extra mismatch introduced by off-policy learning.
In average-reward problems, adding the same constant to every reward raises the long-run rate $r(\pi)$ by that constant and leaves differential values unchanged, whereas it shifts every discounted return by $c/(1-\gamma)$. What matters is reward relative to the long-run rate. This is often a better conceptual fit for continuing control tasks such as queueing, routing, or ongoing resource management.
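As a quick check of this invariance, write the shifted rate and a shifted differential-return term using the definitions above:
$$r(\pi) \;\longrightarrow\; r(\pi) + c, \qquad (R_{t+k} + c) - \left(r(\pi) + c\right) = R_{t+k} - r(\pi),$$
so every term of the differential return is unchanged, while each discounted return gains $c \sum_{k \ge 0} \gamma^k = c/(1-\gamma)$.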
Visual
| Setting | Objective | Typical update | Why it matters |
|---|---|---|---|
| Episodic discounted | Expected discounted return | Semi-gradient SARSA | Natural for tasks with terminal goals |
| Episodic n-step | n-step return | Semi-gradient n-step SARSA | Propagates sparse rewards faster |
| Continuing discounted | Discounted return from each time | Discounted TD control | Sometimes mathematically convenient |
| Continuing average reward | Reward rate | Differential SARSA | Better when no start or terminal state matters |
Worked example 1: Linear action-value SARSA update
Problem: Let $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, with current weights $\mathbf{w}$, current feature vector $\mathbf{x}_t = \mathbf{x}(S_t, A_t)$, next feature vector $\mathbf{x}_{t+1} = \mathbf{x}(S_{t+1}, A_{t+1})$, reward $R_{t+1}$, step size $\alpha$, and discount $\gamma$. Compute the one-step semi-gradient SARSA update.
Step 1: Current action-value estimate: $\hat{q}(S_t, A_t, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}_t$.
Step 2: Next action-value estimate: $\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}_{t+1}$.
Step 3: SARSA target: $R_{t+1} + \gamma\, \mathbf{w}^\top \mathbf{x}_{t+1}$.
Step 4: TD error: $\delta_t = R_{t+1} + \gamma\, \mathbf{w}^\top \mathbf{x}_{t+1} - \mathbf{w}^\top \mathbf{x}_t$.
Step 5: Gradient is the current feature vector: $\nabla_{\mathbf{w}}\, \hat{q}(S_t, A_t, \mathbf{w}) = \mathbf{x}_t$.
Step 6: Compute components: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \mathbf{x}_t$, so component $i$ changes by $\alpha\, \delta_t\, x_{t,i}$.
Check: Only weights attached to active current features change; components with $x_{t,i} = 0$ are left untouched.
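A short numeric run of these steps, with illustrative values chosen here rather than taken from the text: three features, $\alpha = 0.1$, $\gamma = 0.9$.

import numpy as np

# Illustrative values, not from the text.
alpha, gamma = 0.1, 0.9
w = np.array([0.5, -0.2, 0.1])
x_t = np.array([1.0, 0.0, 1.0])      # current feature vector x(S_t, A_t)
x_next = np.array([0.0, 1.0, 1.0])   # next feature vector x(S_{t+1}, A_{t+1})
r = -1.0

q_t = np.dot(w, x_t)                 # Step 1: 0.6
q_next = np.dot(w, x_next)           # Step 2: -0.1
target = r + gamma * q_next          # Step 3: -1.09
delta = target - q_t                 # Step 4: -1.69
w = w + alpha * delta * x_t          # Steps 5-6: only components 0 and 2 change
print(w)                             # approximately [0.331, -0.2, -0.069]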
Worked example 2: Differential reward update
Problem: A continuing task uses an average reward estimate $\bar{R}$. A transition gives reward $R$, the current estimate is $\hat{q}(S, A, \mathbf{w})$, and the next selected action has value $\hat{q}(S', A', \mathbf{w})$. Use step size $\alpha$ for the value update and $\beta$ for the average reward update. Compute the differential SARSA TD error and the updates for a scalar active feature $x$ of $(S, A)$.
Step 1: Differential SARSA error: $\delta = R - \bar{R} + \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w})$.
Step 2: Substitute the observed reward, the average reward estimate, and the two action-value estimates into this expression to get a numerical $\delta$.
Step 3: Value parameter update with active feature $x$: $w \leftarrow w + \alpha\, \delta\, x$.
Step 4: Average reward update: $\bar{R} \leftarrow \bar{R} + \beta\, \delta$.
Check: When $\delta > 0$, the reward and next value were better than expected relative to the current average reward, so both the active value parameter and the average reward estimate increase.
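The same steps with illustrative numbers chosen here, not taken from the text: $\bar{R} = 0.5$, $R = 1.0$, $\hat{q}(S, A) = 2.0$, $\hat{q}(S', A') = 2.2$, $\alpha = 0.1$, $\beta = 0.05$, active feature $x = 1$.

# Illustrative numbers, not from the text.
r_bar, reward = 0.5, 1.0
q_sa, q_next = 2.0, 2.2
alpha, beta, x = 0.1, 0.05, 1.0
w_active = 0.7                            # weight attached to the active feature

delta = reward - r_bar + q_next - q_sa    # 1.0 - 0.5 + 2.2 - 2.0 = 0.7
w_active += alpha * delta * x             # 0.7 + 0.1 * 0.7 = 0.77
r_bar += beta * delta                     # 0.5 + 0.05 * 0.7 = 0.535
print(delta, w_active, r_bar)             # approx. 0.7 0.77 0.535: both increase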
Code
import torch

torch.manual_seed(1)

# Small deterministic corridor: states 0..5, actions 0 (left) and 1 (right).
n_states, n_actions = 6, 2
gamma, epsilon = 0.95, 0.1

# Linear action-value function: one output per action over a one-hot state.
q_net = torch.nn.Linear(n_states, n_actions, bias=False)
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.1)

def one_hot(s):
    x = torch.zeros(1, n_states)
    x[0, s] = 1.0
    return x

def step(s, a):
    # Move right on a == 1, left otherwise; reaching state 5 ends the episode.
    ns = min(5, s + 1) if a == 1 else max(0, s - 1)
    reward = 1.0 if ns == 5 else -0.01
    done = ns == 5
    return ns, reward, done

def choose(s):
    # Epsilon-greedy action selection with respect to the current estimates.
    if torch.rand(()) < epsilon:
        return int(torch.randint(n_actions, ()).item())
    with torch.no_grad():
        return int(torch.argmax(q_net(one_hot(s))).item())

for episode in range(100):
    s = 0
    a = choose(s)
    done = False
    while not done:
        ns, r, done = step(s, a)
        na = choose(ns)  # next action chosen on-policy
        q = q_net(one_hot(s))[0, a]
        # Semi-gradient: the bootstrapped target is computed without gradients,
        # so only q(S_t, A_t, w) is differentiated.
        with torch.no_grad():
            target = torch.tensor(r) if done else torch.tensor(r) + gamma * q_net(one_hot(ns))[0, na]
        loss = 0.5 * (target - q).pow(2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        s, a = ns, na

# Print the learned action values for every state.
with torch.no_grad():
    print(torch.round(q_net(torch.eye(n_states)) * 1000) / 1000)
Common pitfalls
- Updating the gradient at the next state instead of the current state-action pair. Semi-gradient SARSA differentiates only $\hat{q}(S_t, A_t, \mathbf{w})$, not the bootstrapped target.
- Forgetting that on-policy control learns values for the exploratory policy, not an idealized greedy policy during learning.
- Using one shared feature vector for all actions unintentionally. If action identity is omitted, the approximator may be unable to distinguish action values.
- Applying discounted episodic intuition to continuing tasks where average reward is the intended objective.
- Choosing tile widths or feature scales without considering the dynamics. Bad features can make nearby states generalize when they should not.
- Expecting function approximation to fix sparse rewards by itself. n-step targets, traces, and exploration still matter.