Monte Carlo Methods
Monte Carlo methods learn value functions and policies from complete sampled episodes. They do not require a model of the environment, and they do not bootstrap from current value estimates. Instead, they wait until an episode has ended, compute actual returns, and average those returns for states or state-action pairs. This makes Monte Carlo learning conceptually simple and statistically direct.
The price is that Monte Carlo methods apply most naturally to episodic tasks. They can have high variance because a return contains all later random rewards in an episode. Sutton and Barto use Monte Carlo methods as the first major model-free solution method, then contrast them with temporal-difference learning, which learns online from incomplete experience by bootstrapping.
Definitions
First-visit Monte Carlo prediction estimates $v_\pi(s)$ by averaging the returns following the first visit to $s$ in each episode. Every-visit Monte Carlo prediction averages the returns following every visit to $s$. Both converge to $v_\pi(s)$ under standard sampling assumptions in episodic tasks.
For action values, Monte Carlo estimates $q_\pi(s, a)$ by averaging returns after visits to the state-action pair $(s, a)$. Action values are especially important when no model is available, because policy improvement can be performed by greedifying with respect to $q$ without one-step model lookahead.
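To make the greedification point concrete, here is a minimal sketch (the Q-table values are invented for illustration): policy improvement from action values is just a per-state argmax, with no transition model in sight.

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [0.1, 0.5],   # state 0: action 1 currently looks better
    [0.7, 0.2],   # state 1: action 0 currently looks better
])

# Greedy policy improvement needs no transition probabilities:
# just pick the argmax action in each state.
greedy_policy = Q.argmax(axis=1)
print(greedy_policy)  # [1 0]
```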
Monte Carlo control alternates policy evaluation and policy improvement using sampled episodes. Exploring starts guarantee that every state-action pair has a nonzero chance of being the starting pair. More practical on-policy methods instead use soft policies, such as $\varepsilon$-greedy policies, so that all actions continue to be sampled.
Off-policy Monte Carlo learning evaluates or improves a target policy $\pi$ using data generated by a behavior policy $b$. Importance sampling corrects for the mismatch with the ratio

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Ordinary importance sampling averages the weighted returns $\rho_{t:T-1} G_t$ directly. Weighted importance sampling normalizes by the sum of the weights and often has lower variance, though it can be biased initially.
Key results
Monte Carlo prediction is unbiased for the value being estimated when episodes are sampled under the policy of interest. The target for a visit is the actual return $G_t$, not an estimate. This is why MC prediction from complete returns does not lean on the Markov property in the strong sense that bootstrapping methods do, though the MDP framework remains the setting for control.
Incremental Monte Carlo updates use the same error-correction form as bandits:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ G_t - V(S_t) \big].$$
With $\alpha = 1/n$, where $n$ counts visits, this is an incremental sample average. With constant $\alpha$, it tracks nonstationary returns or emphasizes recent episodes.
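A quick numerical check of the incremental form, using invented returns: with $\alpha_n = 1/n$, the error-correction update reproduces the plain sample average.

```python
# Illustrative returns observed for one state across episodes.
returns = [3.0, 1.0, 4.0, 2.0]

V, n = 0.0, 0
for G in returns:
    n += 1
    V += (1 / n) * (G - V)   # error-correction update with alpha = 1/n

print(V)                      # equals the plain average
print(sum(returns) / len(returns))
```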
On-policy Monte Carlo control commonly maintains an $\varepsilon$-soft policy. After estimating $Q(s, a)$, the greedy action gets most of the probability, but all actions keep at least probability $\varepsilon / |\mathcal{A}(s)|$:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \varepsilon / |\mathcal{A}(s)| & \text{if } a = \arg\max_{a'} Q(s, a'), \\ \varepsilon / |\mathcal{A}(s)| & \text{otherwise.} \end{cases}$$
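A minimal sketch of turning one row of Q-values into $\varepsilon$-greedy action probabilities (the Q-values and $\varepsilon$ here are illustrative):

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities for one state under an epsilon-greedy policy."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)  # exploration floor for every action
    probs[np.argmax(q_row)] += 1.0 - epsilon         # extra mass on the greedy action
    return probs

probs = epsilon_greedy_probs(np.array([0.2, 0.8, 0.5]), epsilon=0.3)
print(probs)  # [0.1 0.8 0.1], sums to 1
```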
Off-policy methods require coverage: if $\pi(a \mid s) > 0$, then $b(a \mid s) > 0$. Otherwise the behavior policy never generates some target-policy behavior, making evaluation impossible from that data.
Importance sampling is correct in expectation but can be unstable. A long product of probability ratios can become zero when the target would not have selected an observed action, or enormous when the behavior policy selected a target-likely trajectory with small probability. This variance problem motivates later off-policy TD methods, tree-backup methods, and gradient TD methods.
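The variance contrast between the two estimators shows up even on a handful of invented episodes; note how one large ratio dominates the ordinary estimate.

```python
import numpy as np

# Hypothetical per-episode returns and trajectory importance ratios rho.
G   = np.array([1.0, 2.0, 0.0, 1.0])
rho = np.array([20.0, 0.0, 0.0, 2.0])  # zeros: target policy avoids those trajectories

ordinary = np.sum(rho * G) / len(G)       # divide by the number of episodes
weighted = np.sum(rho * G) / np.sum(rho)  # divide by the total weight

print(ordinary)  # 5.5
print(weighted)  # 1.0
```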
Monte Carlo learning is also the first place where the difference between episode generation and update logic becomes prominent. The same episode can be processed backward to compute all returns efficiently, then filtered according to first-visit or every-visit rules. This separation matters in code because it prevents accidental leakage: the return following a visit should include rewards after that visit, not rewards that occurred before it. Backward accumulation is usually the simplest way to avoid indexing mistakes.
Exploring starts are mathematically convenient but operationally artificial. They assume the learner can begin episodes from any state-action pair, which is rarely true in physical or user-facing systems. Sutton and Barto therefore move quickly toward soft on-policy methods. The cost is that the optimal policy within the restricted class of $\varepsilon$-soft policies is not exactly the deterministic optimal policy. As $\varepsilon$ decreases, behavior can approach greediness, but learning still needs enough exploration to keep estimates meaningful.
Monte Carlo control also highlights why action values are often preferred in model-free control. If a model is absent, knowing $v_\pi$ alone does not tell the agent which action leads to better next states. Estimating $q_\pi$ makes policy improvement local: choose the action with the largest estimated return from the current state.
Visual
| Monte Carlo variant | Data source | Target | Main advantage | Main risk |
|---|---|---|---|---|
| First-visit prediction | Episodes from $\pi$ | First return after $s$ per episode | Simple independent episode accounting | Wastes repeated visits |
| Every-visit prediction | Episodes from $\pi$ | Every return after $s$ | Uses more data | Correlated returns within an episode |
| On-policy control | Episodes from current soft policy | $Q(s, a)$, then improve $\pi$ | No model required | Must keep exploring |
| Off-policy ordinary IS | Behavior policy $b$ | $\rho_{t:T-1} G_t$ averaged over episodes | Unbiased in many settings | Very high variance |
| Off-policy weighted IS | Behavior policy $b$ | Normalized weighted returns | Lower variance | Initial bias and denominator issues |
Worked example 1: First-visit Monte Carlo value update
Problem: An episode visits states $A, B, A$ at times $t = 0, 1, 2$ and receives rewards $R_1 = 2$, $R_2 = 0$, $R_3 = 4$, after which it terminates.
Let $\gamma = 0.5$. Compute first-visit returns for $A$ and $B$, then update sample-average estimates from no prior data.
Step 1: Compute returns backward: $G_2 = R_3 = 4$, $G_1 = R_2 + \gamma G_2 = 0 + 0.5 \cdot 4 = 2$, $G_0 = R_1 + \gamma G_1 = 2 + 0.5 \cdot 2 = 3$.
Step 2: Identify first visits. State $A$ appears at $t = 0$ and $t = 2$, so the first visit is $t = 0$. State $B$ first appears at $t = 1$.
Step 3: Assign first-visit returns: $G_0 = 3$ for $A$ and $G_1 = 2$ for $B$.
Step 4: Since there is no prior data, sample averages equal these returns: $V(A) = 3$, $V(B) = 2$.
Check: Every-visit MC would also record $G_2 = 4$ for the second visit to $A$, producing the estimate $V(A) = (3 + 4)/2 = 3.5$ after this one episode. First-visit MC intentionally ignores that second visit.
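This bookkeeping can be checked mechanically. The sketch below uses a small illustrative episode and the backward-pass-then-filter pattern discussed earlier.

```python
# Small illustrative episode: states visited and rewards received.
states  = ["A", "B", "A"]
rewards = [2.0, 0.0, 4.0]
gamma = 0.5

# Backward pass computes the return following each time step.
G, returns_from_t = 0.0, []
for r in reversed(rewards):
    G = r + gamma * G
    returns_from_t.append(G)
returns_from_t.reverse()

# First-visit filter: keep only the first return per state.
first_visit = {}
for s, Gt in zip(states, returns_from_t):
    first_visit.setdefault(s, Gt)

print(returns_from_t)  # [3.0, 2.0, 4.0]
print(first_visit)     # {'A': 3.0, 'B': 2.0}
```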
Worked example 2: Off-policy importance sampling ratio
Problem: A target policy $\pi$ deterministically chooses left (probability 1) in both visited states. A behavior policy $b$ chooses left with probability 0.5 in the first state and 0.1 in the second. An episode takes left both times and has return $G_0 = 2$. Compute the ordinary importance-sampling target for time $t = 0$.
Step 1: Write the trajectory probability ratio:
$$\rho_{0:1} = \frac{\pi(\text{left} \mid S_0)}{b(\text{left} \mid S_0)} \cdot \frac{\pi(\text{left} \mid S_1)}{b(\text{left} \mid S_1)}.$$
Step 2: Substitute probabilities:
$$\rho_{0:1} = \frac{1}{0.5} \cdot \frac{1}{0.1}.$$
Step 3: Multiply: $\rho_{0:1} = 2 \cdot 10 = 20$.
Step 4: Multiply by the return: $\rho_{0:1} G_0 = 20 \cdot 2 = 40$.
The ordinary importance-sampling target is $40$. The large target is not a mistake; it reflects that the behavior policy generated a trajectory the deterministic target policy always takes but the behavior policy samples with probability only $0.5 \cdot 0.1 = 0.05$.
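The same computation in code, with illustrative probabilities for a deterministic target policy:

```python
# Illustrative two-step episode: the deterministic target policy always
# chooses the observed actions; the behavior policy chose them with
# probabilities 0.5 and 0.1.
pi = [1.0, 1.0]
b  = [0.5, 0.1]
G0 = 2.0

rho = 1.0
for p, q in zip(pi, b):
    rho *= p / q        # per-step probability ratio

print(rho)       # 20.0
print(rho * G0)  # ordinary importance-sampling target: 40.0
```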
Code
```python
import numpy as np

def episode(rng):
    # Random walk: states 1..5, terminal at 0 and 6.
    # Reward is 1 only when exiting right.
    s = 3
    states, rewards = [s], []
    while 0 < s < 6:
        s += rng.choice([-1, 1])
        rewards.append(1.0 if s == 6 else 0.0)
        if 0 < s < 6:
            states.append(s)
    return states, rewards

rng = np.random.default_rng(0)
V = np.zeros(7)
returns = {s: [] for s in range(1, 6)}
gamma = 1.0

for _ in range(5000):
    states, rewards = episode(rng)
    # Backward pass: accumulate the return following each time step.
    G = 0.0
    returns_from_t = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns_from_t.append(G)
    returns_from_t.reverse()
    # First-visit filter: record only the first return per state per episode.
    seen = set()
    for s, Gt in zip(states, returns_from_t):
        if s not in seen:
            returns[s].append(Gt)
            V[s] = np.mean(returns[s])
            seen.add(s)

print(np.round(V[1:6], 3))
```
Common pitfalls
- Trying to update Monte Carlo values before knowing the return. Basic MC prediction waits until the episode ends.
- Using state values for model-free control without a model. Without transition probabilities, greedy one-step improvement requires action values.
- Forgetting exploring starts or soft policies. If some actions are never tried, their action values cannot be learned.
- Mixing up ordinary and weighted importance sampling. Ordinary averages weighted returns; weighted normalizes by total weights.
- Ignoring coverage. Off-policy evaluation fails if the behavior policy cannot generate target-policy actions.
- Underestimating variance. Importance sampling ratios multiply across time, so long episodes can be numerically and statistically difficult.