Monte Carlo Methods

Monte Carlo methods learn value functions and policies from complete sampled episodes. They do not require a model of the environment, and they do not bootstrap from current value estimates. Instead, they wait until an episode has ended, compute actual returns, and average those returns for states or state-action pairs. This makes Monte Carlo learning conceptually simple and statistically direct.

The price is that Monte Carlo methods apply most naturally to episodic tasks, and they can have high variance because a return contains all of the later random rewards in an episode. Sutton and Barto present Monte Carlo methods as their first major model-free solution method, then contrast them with temporal-difference learning, which learns online from incomplete experience by bootstrapping.

Definitions

First-visit Monte Carlo prediction estimates $V_\pi(s)$ by averaging the returns following the first visit to $s$ in each episode. Every-visit Monte Carlo prediction averages the returns following every visit to $s$. Both converge to $V_\pi(s)$ under standard sampling assumptions in episodic tasks.

For action values, Monte Carlo estimates

$$Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

by averaging returns after visits to the state-action pair $(s,a)$. Action values are especially important when no model is available, because policy improvement can be performed by greedifying with respect to $Q$ without a one-step model lookahead.

Monte Carlo control alternates policy evaluation and policy improvement using sampled episodes. Exploring starts guarantee that every state-action pair has a nonzero chance of being the starting pair. More practical on-policy methods instead use soft policies, such as $\epsilon$-greedy policies, so that all actions continue to be sampled.
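
As an illustration, a minimal on-policy control loop with an $\epsilon$-greedy policy might look like the sketch below; the two-state chain environment and all constants are assumptions made for the example, not part of the text.

import numpy as np

N_STATES, N_ACTIONS = 2, 2
EPS, GAMMA = 0.1, 1.0
rng = np.random.default_rng(1)

def step(s, a):
    # Toy dynamics (assumed): action 0 terminates with reward 0;
    # action 1 moves right, earning reward 1 when leaving state 1.
    if a == 0:
        return None, 0.0
    return (None, 1.0) if s == 1 else (1, 0.0)

Q = np.zeros((N_STATES, N_ACTIONS))
N = np.zeros((N_STATES, N_ACTIONS))

for _ in range(2000):
    # Generate one episode with the current epsilon-greedy policy.
    s, trajectory = 0, []
    while s is not None:
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        trajectory.append((s, a, r))
        s = s_next

    # First-visit evaluation: backward pass; the earliest visit overwrites later ones.
    G, first_return = 0.0, {}
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        first_return[(s, a)] = G
    for (s, a), G in first_return.items():
        N[s, a] += 1
        Q[s, a] += (G - Q[s, a]) / N[s, a]   # incremental sample average

print(np.round(Q, 2))   # greedy policy: action 1 in both states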

Off-policy Monte Carlo learning evaluates or improves a target policy $\pi$ using data generated by a behavior policy $b$. Importance sampling corrects for the mismatch:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

Ordinary importance sampling averages the weighted returns $\rho G$. Weighted importance sampling normalizes by the sum of the weights and often has lower variance, though it can be biased initially.
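
As a sketch of the difference, consider a one-step task where each episode yields a single return and a single ratio; the action probabilities and rewards below are assumed for illustration.

import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1.0, 0.0])        # target policy: always action 0 (assumed)
b = np.array([0.3, 0.7])         # behavior policy (assumed)
reward = np.array([1.0, 0.0])    # deterministic reward per action (assumed)

actions = rng.choice(2, size=10_000, p=b)
G = reward[actions]
rho = pi[actions] / b[actions]   # per-episode importance ratios

ordinary = np.mean(rho * G)               # averages rho * G over all episodes
weighted = np.sum(rho * G) / np.sum(rho)  # normalizes by the total weight
print(round(ordinary, 3), round(weighted, 3))   # both near the true value 1.0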

Key results

Monte Carlo prediction is unbiased for the value being estimated when episodes are sampled under the policy of interest. The target for a visit is the actual return $G_t$, not an estimate built from other estimates. This is why MC prediction from complete returns does not depend on the Markov property in the strong sense that bootstrapping does, although the MDP framework remains the setting for control.

Incremental Monte Carlo updates use the same error-correction form as bandits:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right).$$

With $\alpha = 1/N(S_t)$, this is an incremental sample average. With constant $\alpha$, it tracks nonstationary returns or emphasizes recent episodes.
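
In code the two choices differ by a single line; the array sizes and step size here are illustrative.

import numpy as np

V = np.zeros(7)     # value estimates, one per state
N = np.zeros(7)     # visit counts
alpha = 0.1

def update_sample_average(s, G):
    # alpha = 1/N(s): V[s] converges to the sample mean of observed returns.
    N[s] += 1
    V[s] += (G - V[s]) / N[s]

def update_constant_alpha(s, G):
    # Constant alpha: exponential recency weighting for nonstationary returns.
    V[s] += alpha * (G - V[s])

update_sample_average(3, 1.0)
update_constant_alpha(3, 0.0)
print(V[3])   # 0.9 after the two updates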

On-policy Monte Carlo control commonly maintains an $\epsilon$-soft policy. After estimating $Q$, the greedy action gets most of the probability, but all actions keep at least $\epsilon/|\mathcal{A}(s)|$ probability:

$$\pi(a \mid s) = \begin{cases} 1-\epsilon+\epsilon/|\mathcal{A}(s)|, & a \in \arg\max_{a'} Q(s,a'),\\ \epsilon/|\mathcal{A}(s)|, & \text{otherwise}. \end{cases}$$
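
A direct translation of this distribution (assuming a single maximizing action, with ties broken by np.argmax):

import numpy as np

def epsilon_soft_probs(q, eps):
    # Exploration floor eps/|A(s)| for every action;
    # the greedy action receives the remaining 1 - eps.
    probs = np.full(len(q), eps / len(q))
    probs[np.argmax(q)] += 1.0 - eps
    return probs

print(epsilon_soft_probs(np.array([0.0, 1.0, 0.5]), 0.1))
# [0.0333... 0.9333... 0.0333...], summing to 1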

Off-policy methods require coverage: if $\pi(a \mid s) > 0$, then $b(a \mid s) > 0$. Otherwise the behavior policy never generates some target-policy behavior, making evaluation impossible from that data.
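
Coverage is cheap to verify when both policies are explicit tables; the `pi` and `b` arrays below (states by actions) are made-up numbers.

import numpy as np

pi = np.array([[1.0, 0.0],
               [0.5, 0.5]])   # target policy (assumed)
b = np.array([[0.5, 0.5],
              [0.9, 0.1]])    # behavior policy (assumed)

# Every action with positive target probability needs positive behavior probability.
assert np.all(b[pi > 0] > 0), "coverage violated"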

Importance sampling is correct in expectation but can be unstable. A long product of probability ratios can become zero when the target would not have selected an observed action, or enormous when the behavior policy selected a target-likely trajectory with small probability. This variance problem motivates later off-policy TD methods, tree-backup methods, and gradient TD methods.
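
A small simulation makes the horizon effect concrete; the near-deterministic target policy and uniform behavior policy are assumed numbers.

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.9, 0.1])   # target action probabilities (assumed)
b = np.array([0.5, 0.5])    # behavior action probabilities (assumed)

for T in (1, 5, 20):
    # rho is a product of T per-step ratios along each sampled trajectory.
    a = rng.choice(2, size=(100_000, T), p=b)
    rho = np.prod(pi[a] / b[a], axis=1)
    print(T, round(rho.mean(), 2), round(rho.std(), 2))
# The mean stays near 1 (noisily for long horizons), while the spread explodes.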

Monte Carlo learning is also the first place where the difference between episode generation and update logic becomes prominent. The same episode can be processed backward to compute all returns efficiently, then filtered according to first-visit or every-visit rules. This separation matters in code because it prevents accidental leakage: the return following a visit must include only rewards that arrive after that visit, never rewards that occurred before it. Backward accumulation, as in the Code section below, is usually the simplest way to avoid indexing mistakes.

Exploring starts are mathematically convenient but operationally artificial. They assume the learner can begin episodes from any state-action pair, which is rarely true in physical or user-facing systems. Sutton and Barto therefore move quickly toward soft on-policy methods. The cost is that the optimal policy within the restricted class of $\epsilon$-soft policies is not exactly the deterministic optimal policy. As $\epsilon$ decreases, behavior can approach greediness, but learning still needs enough exploration to keep estimates meaningful.

Monte Carlo control also highlights why action values are often preferred in model-free control. If a model is absent, knowing $V(s)$ alone does not tell the agent which action leads to better next states. Estimating $Q(s,a)$ makes policy improvement local: choose the action with the largest estimated return from the current state.

Visual

| Monte Carlo variant | Data source | Target | Main advantage | Main risk |
| --- | --- | --- | --- | --- |
| First-visit prediction | Episodes from $\pi$ | First return after the state | Simple, independent episode accounting | Wastes repeated visits |
| Every-visit prediction | Episodes from $\pi$ | Every return after the state | Uses more data | Correlated returns within an episode |
| On-policy control | Episodes from current soft policy | $Q_\pi$, then improve | No model required | Must keep exploring |
| Off-policy ordinary IS | Behavior policy $b$ | $\rho G$ | Unbiased in many settings | Very high variance |
| Off-policy weighted IS | Behavior policy $b$ | Normalized weighted returns | Lower variance | Initial bias and denominator issues |

Worked example 1: First-visit Monte Carlo value update

Problem: An episode visits states and receives rewards:

$$S_0 = A,\quad R_1 = 0,\quad S_1 = B,\quad R_2 = 2,\quad S_2 = A,\quad R_3 = 4,\quad \text{terminal}.$$

Let $\gamma = 1$. Compute first-visit returns for $A$ and $B$, then update sample-average estimates from no prior data.

Step 1: Compute returns backward:

$$G_2 = R_3 = 4, \qquad G_1 = R_2 + G_2 = 2 + 4 = 6, \qquad G_0 = R_1 + G_1 = 0 + 6 = 6.$$

Step 2: Identify first visits. State $A$ appears at $t = 0$ and $t = 2$, so the first visit is $t = 0$. State $B$ first appears at $t = 1$.

Step 3: Assign first-visit returns:

$$\text{return for } A = G_0 = 6, \qquad \text{return for } B = G_1 = 6.$$

Step 4: Since there is no prior data, sample averages equal these returns:

$$V(A) = 6, \qquad V(B) = 6.$$

Check: Every-visit MC would also record $G_2 = 4$ for the second visit to $A$, giving $V(A) = (6 + 4)/2 = 5$ after this one episode. First-visit MC intentionally ignores that second visit.
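
The bookkeeping can be checked in a few lines; iterating backward and overwriting means the earliest visit's return survives.

# Episode from the example: S_t = A, B, A with R_{t+1} = 0, 2, 4 and gamma = 1.
states, rewards, gamma = ["A", "B", "A"], [0.0, 2.0, 4.0], 1.0

G, first_return, all_returns = 0.0, {}, {}
for s, r in zip(reversed(states), reversed(rewards)):
    G = r + gamma * G
    first_return[s] = G                       # overwritten, so earliest visit wins
    all_returns.setdefault(s, []).append(G)   # every-visit bookkeeping

print(first_return)   # {'A': 6.0, 'B': 6.0}
print({s: sum(g) / len(g) for s, g in all_returns.items()})   # {'A': 5.0, 'B': 6.0}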

Worked example 2: Off-policy importance sampling ratio

Problem: A target policy $\pi$ chooses left with probability $1$ in both visited states. A behavior policy $b$ chooses left with probability $0.5$ in the first state and $0.25$ in the second. An episode takes left both times and has return $G_0 = 8$. Compute the ordinary importance-sampling target for time $0$.

Step 1: Write the trajectory probability ratio:

$$\rho_{0:1} = \frac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)} \, \frac{\pi(A_1 \mid S_1)}{b(A_1 \mid S_1)}.$$

Step 2: Substitute probabilities:

$$\rho_{0:1} = \frac{1}{0.5} \cdot \frac{1}{0.25}.$$

Step 3: Multiply:

$$\rho_{0:1} = 2 \times 4 = 8.$$

Step 4: Multiply by the return:

$$\rho_{0:1} G_0 = 8 \times 8 = 64.$$

The ordinary importance-sampling target is $64$. The large value is not a mistake: it reflects that the behavior policy sampled, with probability $0.5 \times 0.25 = 0.125$, a trajectory the deterministic target policy takes with probability $1$.
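
The same arithmetic in one line:

rho = (1 / 0.5) * (1 / 0.25)   # per-step ratios multiply: 2 * 4 = 8
print(rho * 8)                  # ordinary IS target rho * G_0 = 64.0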

Code
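
The following program runs first-visit Monte Carlo prediction on a five-state random walk. With $\gamma = 1$, the true values of states $1$ through $5$ are $1/6, 2/6, 3/6, 4/6, 5/6$, so the printed estimates should land near those fractions.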

import numpy as np

def episode(rng):
    # Random walk: states 1..5, terminal at 0 and 6, start at 3.
    # Reward is 1 only when exiting to the right (reaching state 6).
    s = 3
    states, rewards = [s], []
    while 0 < s < 6:
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == 6 else 0.0)
        if 0 < s < 6:
            states.append(s)
    return states, rewards

rng = np.random.default_rng(0)
V = np.zeros(7)
returns = {s: [] for s in range(1, 6)}
gamma = 1.0

for _ in range(5000):
    states, rewards = episode(rng)

    # Accumulate returns backward so G_t is available for every time step.
    G = 0.0
    returns_from_t = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns_from_t.append(G)
    returns_from_t.reverse()

    # First-visit filter: record the return only at the earliest visit to s.
    seen = set()
    for s, Gt in zip(states, returns_from_t):
        if s not in seen:
            returns[s].append(Gt)
            V[s] = np.mean(returns[s])
            seen.add(s)

print(np.round(V[1:6], 3))

Common pitfalls

  • Trying to update Monte Carlo values before knowing the return. Basic MC prediction waits until the episode ends.
  • Using state values for control when no model is available. Without transition probabilities, greedy one-step improvement requires action values.
  • Forgetting exploring starts or soft policies. If some actions are never tried, their action values cannot be learned.
  • Mixing up ordinary and weighted importance sampling. Ordinary IS averages the weighted returns; weighted IS normalizes by the sum of the weights.
  • Ignoring coverage. Off-policy evaluation fails if the behavior policy cannot generate target-policy actions.
  • Underestimating variance. Importance sampling ratios multiply across time, so long episodes can be numerically and statistically difficult.

Connections