Policy Gradient Methods
Policy gradient methods optimize a parameterized policy directly. Instead of learning action values and then choosing greedily, the agent adjusts policy parameters in a direction that increases expected return. Sutton and Barto present this as a major alternative to value-based control, especially useful for stochastic policies, continuous action spaces, and settings where smooth changes in behavior are desirable.
The basic REINFORCE algorithm uses complete returns to estimate the gradient. Baselines reduce variance without changing the expected gradient. Actor-critic methods combine policy gradients with learned value functions, using the critic's TD error as a lower-variance learning signal for the actor. These methods connect the gradient bandit idea from Chapter 2 to full sequential decision making.
Definitions
A parameterized policy is written $\pi(a\mid s,\boldsymbol{\theta})$, where $\boldsymbol{\theta}\in\mathbb{R}^{d'}$ are policy parameters. The objective $J(\boldsymbol{\theta})$ is expected return from a start distribution in episodic tasks, or average reward in continuing tasks.
The policy gradient theorem states, in one common episodic form, that
$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a\mid s,\boldsymbol{\theta}),$$
which is often written using the log-derivative trick:
$$\nabla J(\boldsymbol{\theta}) \propto \mathbb{E}_\pi\!\left[\, G_t\, \nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}) \,\right]$$
for Monte Carlo REINFORCE-style estimates.
The REINFORCE update is
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, G_t\, \nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t).$$
With a baseline $b(S_t)$, the update becomes
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \bigl(G_t - b(S_t)\bigr)\, \nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t).$$
The baseline may depend on the state but not on the action. A learned state-value function $\hat v(S_t,\mathbf{w})$ is a common baseline.
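As a concrete sketch, a single REINFORCE-with-baseline update for a tabular softmax policy might look like this (the return, baseline, and step size values are illustrative):

```python
import torch

# Tabular softmax policy over 2 actions in one state:
# the preferences h are the policy parameters theta.
h = torch.zeros(2, requires_grad=True)

G = 1.0         # sampled return (illustrative)
baseline = 0.4  # state baseline b(s), independent of the action
alpha = 0.1     # step size
a = 1           # sampled action index

log_pi = torch.log_softmax(h, dim=0)[a]
log_pi.backward()  # fills h.grad with grad of ln pi(a|s, h)

with torch.no_grad():
    h += alpha * (G - baseline) * h.grad  # gradient ascent step
print(h)
```

The selected action's preference rises and the other falls, by equal amounts here because the softmax score gradient sums to zero.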
Actor-critic methods maintain an actor, the policy parameters $\boldsymbol{\theta}$, and a critic, often a value estimate $\hat v(s,\mathbf{w})$ with weights $\mathbf{w}$. The critic estimates returns or advantages; the actor updates the policy.
Key results
The log-derivative identity is the algebraic engine:
$$\nabla \pi(a\mid s,\boldsymbol{\theta}) = \pi(a\mid s,\boldsymbol{\theta})\, \nabla \ln \pi(a\mid s,\boldsymbol{\theta}).$$
It allows gradients of expected returns to be estimated from sampled actions. If an action produced better-than-expected return, increase the log probability of that action in that state; if it produced worse-than-expected return, decrease it.
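The identity can be checked numerically with autograd; the parameter values here are illustrative:

```python
import torch

theta = torch.tensor([0.2, -0.1], requires_grad=True)

# Left side: gradient of pi(a|s, theta) for a = 0.
pi = torch.softmax(theta, dim=0)
grad_pi = torch.autograd.grad(pi[0], theta)[0]

# Right side: pi(a|s, theta) times gradient of ln pi(a|s, theta).
log_pi = torch.log_softmax(theta, dim=0)
grad_log_pi = torch.autograd.grad(log_pi[0], theta)[0]
rhs = pi[0].detach() * grad_log_pi

print(grad_pi, rhs)  # the two gradients agree
```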
Baselines reduce variance without biasing the policy gradient. The reason is that
$$\sum_a b(s)\, \nabla \pi(a\mid s,\boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a\mid s,\boldsymbol{\theta}) = b(s)\, \nabla 1 = 0.$$
Thus subtracting $b(s)$ changes the sample variance but not the expected gradient.
Softmax policies over action preferences are common for discrete actions:
$$\pi(a\mid s,\boldsymbol{\theta}) = \frac{\exp(h(s,a,\boldsymbol{\theta}))}{\sum_b \exp(h(s,b,\boldsymbol{\theta}))}.$$
For continuous actions, policies can be parameterized as distributions such as Gaussians, with neural networks outputting means and sometimes variances.
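A minimal sketch of the continuous case, with a Gaussian policy whose mean and log standard deviation are the parameters (all values illustrative):

```python
import torch

mean = torch.tensor(0.0, requires_grad=True)
log_std = torch.tensor(0.0, requires_grad=True)

dist = torch.distributions.Normal(mean, log_std.exp())
a = dist.sample()            # sampling sits outside the gradient path
log_prob = dist.log_prob(a)  # differentiable in mean and log_std

advantage = 1.0  # illustrative G_t - b(s)
(-log_prob * advantage).backward()  # negative sign for minimizing optimizers

# Since d ln N(a; mu, sigma) / d mu = (a - mu) / sigma^2, a descent step
# on this loss moves the mean toward the sampled action when advantage > 0.
print(mean.grad, log_std.grad)
```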
Actor-critic methods trade some bias for lower variance and online learning. A one-step actor-critic update can use
$$\delta_t = R_{t+1} + \gamma\, \hat v(S_{t+1},\mathbf{w}) - \hat v(S_t,\mathbf{w})$$
as an estimate of advantage, then update
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \delta_t\, \nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t).$$
Policy parameterization shapes what improvement is possible. A softmax over preferences can represent any stochastic policy over a finite action set if the preferences are free enough. A Gaussian policy for continuous actions represents a family of probability densities, and learning may adjust the mean, the variance, or both. If the policy class cannot express a good behavior, gradient ascent can only find the best behavior inside that class.
The policy gradient theorem is powerful because it avoids differentiating the state distribution directly. Changing the policy changes which states will be visited in the future, so a naive derivative of expected return appears to require differentiating through the entire Markov chain. The theorem packages those effects into action-value weighting under the on-policy distribution. This is why sampled trajectories can produce usable gradient estimates.
Entropy and stochasticity are not just implementation details. Sutton and Barto emphasize stochastic policies as first-class objects, and policy gradient methods naturally support them. Stochastic policies can explore, represent mixed strategies, and handle action preferences smoothly, whereas greedy value methods often need a separate exploration rule.
Policy gradient methods also make constraints on actions easier to encode. If the policy distribution is defined only over legal actions, the gradient changes probabilities inside that legal set instead of learning values for impossible choices.
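One common way to implement this (a sketch, not a construction from the book) is to mask illegal actions by setting their logits to negative infinity before the softmax, so they receive zero probability and contribute no gradient:

```python
import torch

logits = torch.tensor([0.3, -0.5, 1.2])
legal = torch.tensor([True, False, True])  # action 1 is illegal here

masked = logits.masked_fill(~legal, float('-inf'))
dist = torch.distributions.Categorical(logits=masked)

print(dist.probs)  # probability mass only on legal actions
a = dist.sample()  # never produces the illegal action
```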
Visual
| Method | Policy update signal | Needs value function? | Bias/variance pattern |
|---|---|---|---|
| REINFORCE | Full return | No | Unbiased, high variance |
| REINFORCE with baseline | Return minus baseline | Optional (often learned) | Unbiased, lower variance |
| Actor-critic | TD error or advantage estimate | Yes | Lower variance, may introduce bias |
| Continuing actor-critic | Differential TD error | Yes | Fits average-reward tasks |
| Gaussian policy gradient | Return times score of density | Often | Handles continuous actions |
Worked example 1: Softmax policy gradient direction
Problem: In one state, a softmax policy has two action preferences $h(s,1)=0$ and $h(s,2)=1$. Action 2 is selected and the return advantage is $G_t - b(s) = 1$. With step size $\alpha = 0.1$, compute the preference update for a tabular softmax policy.
Step 1: Compute action probabilities: $\pi(1\mid s) = \frac{e^{0}}{e^{0}+e^{1}} \approx 0.269$ and $\pi(2\mid s) = \frac{e^{1}}{e^{0}+e^{1}} \approx 0.731$.
Step 2: For the selected action $a=2$, the tabular softmax score gradients are
$$\frac{\partial \ln \pi(2\mid s)}{\partial h(s,2)} = 1 - \pi(2\mid s) \approx 0.269$$
and
$$\frac{\partial \ln \pi(2\mid s)}{\partial h(s,1)} = -\pi(1\mid s) \approx -0.269.$$
Step 3: Multiply by $\alpha\,(G_t - b(s)) = 0.1$: $\Delta h(s,2) \approx 0.027$ and $\Delta h(s,1) \approx -0.027$.
Check: Because the selected action had positive advantage, its preference increases and the other preference decreases.
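Assuming illustrative preferences $h(s,1)=0$ and $h(s,2)=1$ with action 2 selected, autograd reproduces the tabular score gradients:

```python
import torch

h = torch.tensor([0.0, 1.0], requires_grad=True)  # preferences (illustrative)
alpha, advantage, a = 0.1, 1.0, 1  # action 2 is index 1

log_pi = torch.log_softmax(h, dim=0)[a]
log_pi.backward()

# h.grad is [-pi(1|s), 1 - pi(2|s)]: the chosen preference's
# gradient is positive, the other's is negative, and they sum to zero.
pi = torch.softmax(h.detach(), dim=0)
print(h.grad)                       # approx [-0.269, 0.269]
print(alpha * advantage * h.grad)   # preference updates
```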
Worked example 2: Baseline leaves expected gradient unchanged
Problem: A state has two actions with probabilities $\pi(1\mid s)=0.7$ and $\pi(2\mid s)=0.3$. A baseline $b(s)=0.5$ is subtracted. Show that the expected baseline contribution to the policy gradient is zero.
Step 1: The baseline contribution is
$$\mathbb{E}_\pi\!\left[\, b(s)\, \nabla \ln \pi(A\mid s,\boldsymbol{\theta}) \,\right] = \sum_a \pi(a\mid s,\boldsymbol{\theta})\, b(s)\, \nabla \ln \pi(a\mid s,\boldsymbol{\theta}).$$
Step 2: Use $\pi(a\mid s,\boldsymbol{\theta})\, \nabla \ln \pi(a\mid s,\boldsymbol{\theta}) = \nabla \pi(a\mid s,\boldsymbol{\theta})$:
$$= \sum_a b(s)\, \nabla \pi(a\mid s,\boldsymbol{\theta}).$$
Step 3: Move the gradient outside the sum:
$$= b(s)\, \nabla \sum_a \pi(a\mid s,\boldsymbol{\theta}).$$
Step 4: Since probabilities sum to one:
$$= b(s)\, \nabla 1 = 0.$$
Check: The numerical probabilities do not matter as long as they form a normalized differentiable policy and the baseline does not depend on the sampled action.
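A quick numerical check of that conclusion, summing the baseline term over actions for an arbitrary softmax policy (parameter and baseline values illustrative):

```python
import torch

theta = torch.tensor([0.4, -0.7], requires_grad=True)
b = 0.5  # any action-independent baseline

# Sum over actions of pi(a|s) * b * grad ln pi(a|s, theta).
log_pi = torch.log_softmax(theta, dim=0)
pi = torch.softmax(theta.detach(), dim=0)
total = torch.zeros(2)
for a in range(2):
    g = torch.autograd.grad(log_pi[a], theta, retain_graph=True)[0]
    total += pi[a] * b * g

print(total)  # approximately zero, whatever theta and b are
```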
Code
```python
import torch

torch.manual_seed(2)

n_states, n_actions = 4, 2
policy = torch.nn.Linear(n_states, n_actions, bias=False)
optimizer = torch.optim.Adam(policy.parameters(), lr=0.05)
gamma = 0.9

def one_hot(s):
    x = torch.zeros(n_states)
    x[s] = 1.0
    return x

def step(s, a):
    # Chain environment: action 1 moves right, action 0 moves left;
    # reaching state 3 gives reward 1 and ends the episode.
    ns = min(3, s + 1) if a == 1 else max(0, s - 1)
    reward = 1.0 if ns == 3 else -0.02
    done = ns == 3
    return ns, reward, done

for episode in range(200):
    # Sample one episode (capped at 20 steps) with the current policy.
    log_probs, rewards = [], []
    s, done = 0, False
    while not done and len(rewards) < 20:
        logits = policy(one_hot(s))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        ns, r, done = step(s, int(a.item()))
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        s = ns

    # Compute discounted returns G_t backward through the episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns)

    # Mean episode return as a simple baseline; the optimizer minimizes,
    # so negate the REINFORCE objective.
    baseline = returns.mean()
    loss = -sum(lp * (G - baseline) for lp, G in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    for s in range(n_states):
        probs = torch.softmax(policy(one_hot(s)), dim=0)
        print(s, torch.round(probs * 1000) / 1000)
```
Common pitfalls
- Differentiating through the sampled action as if it were a continuous deterministic output. Policy gradient uses the score $\nabla \ln \pi(a\mid s,\boldsymbol{\theta})$.
- Using a baseline that depends on the action. That can bias the gradient unless handled as an action-dependent control variate with extra care.
- Forgetting the negative sign when implementing gradient ascent with optimizers that minimize losses.
- Expecting REINFORCE to be low variance. Full returns can be noisy, especially for long episodes.
- Treating the critic as ground truth. Actor-critic methods depend on critic quality and can become biased if the critic is poor.
- Collapsing exploration too early by allowing policy probabilities to become nearly deterministic before learning is reliable.