
Dynamic Programming

Dynamic programming solves finite MDPs when a complete model is available. In Sutton and Barto's structure, DP is less a practical universal method than a conceptual reference point. It shows what exact Bellman backups look like, why policy evaluation and policy improvement fit together, and how generalized policy iteration underlies many later model-free algorithms.

The main assumption is strong: the agent knows $p(s',r \mid s,a)$ for every relevant state and action. With that model, the agent can update value estimates by sweeping over states instead of waiting for sampled experience. Monte Carlo and temporal-difference methods later replace these full expected backups with sample backups.

Definitions

Policy evaluation, also called prediction, computes $V_\pi$ for a fixed policy $\pi$. Iterative policy evaluation starts from an arbitrary $V_0$ and applies

$$V_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r+\gamma V_k(s')\right].$$
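
As a minimal sketch of this update, the loop below runs iterative policy evaluation on a made-up two-state chain under a uniform random policy; the transitions, rewards, and stopping threshold are illustrative assumptions, not an example from the text.

import numpy as np

# Hypothetical 2-state MDP: in each state, action 0 stays (reward 0),
# action 1 moves to the other state (reward 1). Policy is uniform random.
n_states, n_actions, gamma = 2, 2, 0.9
pi = np.full((n_states, n_actions), 0.5)   # pi[s, a]

def expected_backup(V, s):
    # sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
    # (transitions here are deterministic, so the inner sum has one term)
    total = 0.0
    for a in range(n_actions):
        s_next = s if a == 0 else 1 - s
        r = 0.0 if a == 0 else 1.0
        total += pi[s, a] * (r + gamma * V[s_next])
    return total

V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([expected_backup(V, s) for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)   # both states converge to the same value by symmetry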

Policy improvement constructs a better or equal policy by acting greedily with respect to the current value function:

$$\pi'(s) \in \arg\max_a \sum_{s',r} p(s',r \mid s,a)\left[r+\gamma V_\pi(s')\right].$$

Policy iteration alternates complete or approximate policy evaluation with policy improvement until the policy is stable. Value iteration combines the two ideas by repeatedly applying the Bellman optimality backup:

$$V_{k+1}(s) = \max_a \sum_{s',r} p(s',r \mid s,a) \left[r+\gamma V_k(s')\right].$$

Asynchronous DP updates states in any order, sometimes one state at a time, as long as all states continue to receive attention. Generalized policy iteration is the broader idea that evaluation and improvement processes interact, each pushing the other: value estimates move toward the value of the current policy, and the policy moves toward greediness with respect to the current values.
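
A small sketch of the asynchronous idea: update one randomly chosen state at a time, in place, with a Bellman optimality backup. The three-state chain and the random selection scheme are assumptions made only for illustration.

import numpy as np

# Hypothetical 3-state chain: from state s, action 0 stays (reward 0),
# action 1 moves right (reward 1); state 2 is terminal.
gamma = 0.9
V = np.zeros(3)

def backup(s):
    stay = 0.0 + gamma * V[s]
    right = 1.0 + gamma * V[min(s + 1, 2)]
    return max(stay, right)

rng = np.random.default_rng(0)
for _ in range(500):
    s = rng.integers(0, 2)   # pick a nonterminal state in arbitrary order
    V[s] = backup(s)         # in-place update, no full synchronous sweep
print(V[:2])                 # converges as long as both states keep being visited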

Key results

The policy improvement theorem is the hinge. If for every state

$$Q_\pi(s,\pi'(s)) \ge V_\pi(s),$$

then $\pi'$ is at least as good as $\pi$: $V_{\pi'}(s) \ge V_\pi(s)$ for all $s$. The greedy policy satisfies this condition because it chooses an action maximizing one-step lookahead from $V_\pi$.

Policy iteration is guaranteed to find an optimal policy in a finite discounted MDP. The argument is a counting one: each improvement step either leaves the policy unchanged, in which case it is already optimal, or produces a policy that is strictly better in at least one state. Since there are only finitely many deterministic policies, improvement must terminate.

Value iteration converges to $V_*$ for discounted finite MDPs because the Bellman optimality operator is a contraction in the max norm. Intuitively, applying the operator reduces the effect of value errors by at least a factor of $\gamma$. The formal statement is that if $T_*$ is the optimality backup, then

$$\|T_*u - T_*v\|_\infty \le \gamma \|u-v\|_\infty.$$

The contraction view clarifies why discounting matters and why repeated backups can converge from arbitrary initial values.
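
A quick numerical check of this property on an assumed tiny deterministic MDP (terminal values are pinned to zero, since the backup is only applied to nonterminal states):

import numpy as np

gamma = 0.9
# Hypothetical deterministic MDP: P[(s, a)] = (next_state, reward); state 2 is terminal.
P = {(0, 0): (1, 0.0), (0, 1): (2, 1.0), (1, 0): (0, 0.0), (1, 1): (2, 2.0)}

def T(V):
    # Bellman optimality operator applied to the nonterminal states 0 and 1.
    out = V.copy()
    for s in [0, 1]:
        out[s] = max(r + gamma * V[sp] for (sp, r) in [P[(s, a)] for a in [0, 1]])
    return out

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
u[2] = v[2] = 0.0                          # terminal values fixed at zero
before = np.max(np.abs(u - v))
after = np.max(np.abs(T(u) - T(v)))
print(after <= gamma * before + 1e-12)     # True: the gap shrinks by at least gamma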

DP backups can be expected backups or sample backups. DP uses expected backups because it sums over all possible next states and rewards. Later methods such as TD(0) use sample backups from observed transitions. The tradeoff is computational: expected backups use more model information per update, while sample backups can operate without enumerating all next outcomes.
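
The difference can be made concrete. The sketch below contrasts one expected backup with one sample (TD(0)-style) backup for a single assumed state-action pair; the two-outcome transition model, the stored values, and the step size are all illustrative assumptions.

import numpy as np

gamma, alpha = 0.9, 0.1
V = {"s": 0.0, "s1": 1.0, "s2": 3.0}
# Assumed model for one state-action pair: two possible outcomes.
outcomes = [("s1", 0.0, 0.7), ("s2", 1.0, 0.3)]   # (next_state, reward, prob)

# Expected backup: sum over all next states and rewards.
V_expected = sum(p * (r + gamma * V[sp]) for (sp, r, p) in outcomes)

# Sample backup: draw one transition and move the estimate toward its target.
rng = np.random.default_rng(0)
sp, r, _ = outcomes[rng.choice(2, p=[0.7, 0.3])]
V_sample = V["s"] + alpha * (r + gamma * V[sp] - V["s"])

print(V_expected, V_sample)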

The sweep structure of DP also exposes the cost of exactness. A single full backup of one state may require looking at every action and every possible next state. A full sweep multiplies that cost by the number of states. This is acceptable for small gridworlds and useful for theory, but it becomes impossible in large or continuous domains. Approximate dynamic programming, simulation-based planning, and reinforcement learning can all be read as attempts to keep the Bellman idea while reducing the cost or weakening the need for a complete model.

Policy iteration and value iteration are not competitors in a narrow sense; they are endpoints on a spectrum. Policy iteration performs strong evaluation steps followed by explicit improvement. Value iteration performs very shallow evaluation because each optimality backup immediately includes maximization. Modified policy iteration lies between them, doing only a few evaluation sweeps before improving. This spectrum is the clearest early example of generalized policy iteration: values and policies can chase each other toward consistency and greediness without either process finishing completely at every round.
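
A minimal sketch of modified policy iteration on the same kind of small deterministic MDP used in the Code section below; the choice of three evaluation sweeps per round is arbitrary and only illustrates the truncation.

import numpy as np

gamma = 0.9
# Hypothetical deterministic MDP: P[(s, a)] = (next_state, reward); state 2 is terminal.
P = {(0, 0): (1, 0.0), (0, 1): (2, 1.0), (1, 0): (0, 0.0), (1, 1): (2, 2.0)}
V = np.zeros(3)
policy = {0: 0, 1: 0}

for _ in range(20):                      # outer GPI rounds
    for _ in range(3):                   # partial evaluation: only a few sweeps
        for s in [0, 1]:
            sp, r = P[(s, policy[s])]
            V[s] = r + gamma * V[sp]
    for s in [0, 1]:                     # greedy improvement
        scores = [P[(s, a)][1] + gamma * V[P[(s, a)][0]] for a in [0, 1]]
        policy[s] = int(np.argmax(scores))

print(policy, V[:2])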

In implementation, the Bellman residual is often more informative than the number of sweeps. A small residual means the current values nearly satisfy the relevant Bellman equation. However, a small value residual and a stable greedy policy are not identical checks. In control, small value changes usually imply the greedy policy is close to stable, but ties, approximation, or loose thresholds can still produce policy changes.
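
One way to monitor both signals, sketched on an assumed tiny deterministic MDP: track the max-norm residual of each sweep and, separately, whether the greedy policy changed.

import numpy as np

gamma = 0.9
# Hypothetical deterministic MDP: P[(s, a)] = (next_state, reward); state 2 is terminal.
P = {(0, 0): (1, 0.0), (0, 1): (2, 1.0), (1, 0): (0, 0.0), (1, 1): (2, 2.0)}
V = np.zeros(3)

def greedy(V):
    return {s: int(np.argmax([P[(s, a)][1] + gamma * V[P[(s, a)][0]]
                              for a in [0, 1]]))
            for s in [0, 1]}

prev_policy = greedy(V)
for sweep in range(100):
    residual = 0.0
    for s in [0, 1]:
        new = max(r + gamma * V[sp] for (sp, r) in [P[(s, a)] for a in [0, 1]])
        residual = max(residual, abs(new - V[s]))
        V[s] = new
    policy = greedy(V)
    # The two checks usually agree, but need not at loose thresholds.
    print(sweep, residual, policy == prev_policy)
    if residual < 1e-8 and policy == prev_policy:
        break
    prev_policy = policy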

Visual

| Algorithm | Backup | Model required | Policy changes | Typical stopping rule |
| --- | --- | --- | --- | --- |
| Iterative policy evaluation | Bellman expectation | Yes | No | Value change below threshold |
| Policy iteration | Expectation plus greedy improvement | Yes | After evaluation sweeps | Policy stable |
| Value iteration | Bellman optimality | Yes | Extracted from final values | Value change below threshold |
| Asynchronous DP | Any selected Bellman backup | Yes | Depends on variant | Coverage plus small residual |
| Generalized policy iteration | Evaluation and improvement together | Model optional in later chapters | Continual | Convergence or performance target |

Worked example 1: One sweep of policy evaluation

Problem: There are two states, $A$ and $B$, and one action in each. From $A$ the agent receives $0$ and moves to $B$. From $B$ it receives $1$ and terminates. Let $\gamma=0.9$ and initial values $V_0(A)=0$, $V_0(B)=0$. Perform two synchronous policy-evaluation sweeps.

Step 1: First sweep for $A$:

$$V_1(A) = 0 + 0.9\,V_0(B) = 0.$$

Step 2: First sweep for $B$:

$$V_1(B) = 1 + 0.9(0) = 1.$$

The zero after termination is used because no future reward remains.

Step 3: Second sweep for $A$ uses $V_1(B)$:

$$V_2(A) = 0 + 0.9\,V_1(B) = 0.9.$$

Step 4: Second sweep for $B$ remains

$$V_2(B) = 1.$$

Check: The exact values are $V_\pi(B)=1$ and $V_\pi(A)=0+0.9(1)=0.9$. After two synchronous sweeps, the computed values match the exact solution for this short chain.
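
A few lines to verify the two sweeps (the state labels and synchronous update order are taken directly from the example):

gamma = 0.9
V = {"A": 0.0, "B": 0.0, "terminal": 0.0}
for sweep in range(2):
    V_new = dict(V)
    V_new["A"] = 0.0 + gamma * V["B"]         # A -> B, reward 0
    V_new["B"] = 1.0 + gamma * V["terminal"]  # B -> terminal, reward 1
    V = V_new
print(V["A"], V["B"])   # 0.9 and 1.0, matching the hand calculation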

Worked example 2: Policy improvement by one-step lookahead

Problem: In state $s$, two actions are available. Action left gives reward $1$ and transitions to $s_1$. Action right gives reward $0$ and transitions to $s_2$. Suppose a policy evaluation step has produced $V(s_1)=3$ and $V(s_2)=6$, with $\gamma=0.5$. Which action is greedy?

Step 1: Compute the one-step lookahead value for left:

$$\begin{aligned} q(s,\text{left}) &= 1 + 0.5\,V(s_1) \\ &= 1 + 0.5(3) \\ &= 2.5. \end{aligned}$$

Step 2: Compute the one-step lookahead value for right:

$$\begin{aligned} q(s,\text{right}) &= 0 + 0.5\,V(s_2) \\ &= 0 + 0.5(6) \\ &= 3. \end{aligned}$$

Step 3: Compare:

$$3 > 2.5.$$

The greedy improved policy chooses right. Check the interpretation: right has lower immediate reward, but it leads to a state with sufficiently larger estimated future value. This is the reason value functions matter for sequential decision making.
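
The same comparison in code (the dictionary layout is only an illustrative choice):

gamma = 0.5
V = {"s1": 3.0, "s2": 6.0}
q = {
    "left": 1.0 + gamma * V["s1"],    # 2.5
    "right": 0.0 + gamma * V["s2"],   # 3.0
}
greedy_action = max(q, key=q.get)
print(q, greedy_action)   # right is greedy despite the lower immediate reward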

Code

import numpy as np

# Small deterministic MDP:
# states 0,1 are nonterminal; state 2 is terminal.
# actions 0,1. P[s, a] = (next_state, reward)
P = {
    (0, 0): (1, 0.0),
    (0, 1): (2, 1.0),
    (1, 0): (0, 0.0),
    (1, 1): (2, 2.0),
}
gamma = 0.9
V = np.zeros(3)

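# Value iteration: sweep the nonterminal states with Bellman optimality backups.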
for sweep in range(50):
    delta = 0.0
    for s in [0, 1]:
        old = V[s]
        V[s] = max(r + gamma * V[sp] for (sp, r) in [P[(s, a)] for a in [0, 1]])
        delta = max(delta, abs(old - V[s]))
    if delta < 1e-10:
        break

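# Extract the greedy policy from the converged values.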
policy = {}
for s in [0, 1]:
    scores = [P[(s, a)][1] + gamma * V[P[(s, a)][0]] for a in [0, 1]]
    policy[s] = int(np.argmax(scores))

print("Optimal values:", V[:2])
print("Greedy policy:", policy)

Common pitfalls

  • Applying DP when the model is not actually known. If transition probabilities are estimated from data, the result is planning with an approximate model, not exact DP.
  • Confusing policy evaluation with control. Evaluation predicts returns for a fixed policy; it does not by itself choose better actions.
  • Updating values in place without realizing it changes the algorithm from synchronous to asynchronous-style updates. This is often fine, but it affects intermediate values.
  • Stopping policy evaluation too early in a way that breaks policy improvement. Approximate policy iteration is useful, but evaluation error should be understood.
  • Assuming value iteration needs full policy evaluation between improvements. It intentionally uses a truncated evaluation idea with maximization inside each backup.
  • Ignoring terminal states. Their continuation value is normally zero in episodic return calculations.

Connections