Reinforcement Learning Problem and Finite MDPs
Reinforcement learning studies how an agent can learn to choose actions through interaction. The agent is not handed labeled correct actions. Instead, it receives scalar rewards after acting, observes the next situation, and must improve behavior from experience. Sutton and Barto use this setup to separate reinforcement learning from supervised learning and from pure planning: the learner is active, its actions influence later data, and the main objective is long-run return rather than immediate prediction accuracy.
The mathematical home for this problem is the finite Markov decision process, or finite MDP. An MDP turns informal ideas such as "state," "action," "reward," and "future consequence" into a compact stochastic model. Once the MDP is defined, later chapters can ask more focused questions: how to estimate values, how to improve a policy, when bootstrapping helps, and what changes when the state space is too large for a table.
Definitions
A reinforcement learning problem is organized around an agent-environment interface. At time $t$, the agent observes a state $S_t$, selects an action $A_t$, and receives a reward $R_{t+1}$ and next state $S_{t+1}$. The interaction continues either forever in a continuing task or until a terminal state is reached in an episodic task.
A finite MDP is specified by finite state and action sets and one-step dynamics:

$$p(s', r \mid s, a) \doteq \Pr\{S_{t+1} = s',\; R_{t+1} = r \mid S_t = s,\; A_t = a\}.$$
The Markov property says this conditional distribution depends on the present state and action, not on the complete earlier history. This is not a claim that the world has no hidden causes. It is a modeling requirement: the chosen state representation should contain enough information for predicting the next state and reward distribution relevant to control.
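To make the interface concrete, the sketch below runs the agent-environment loop for a tiny finite MDP, with the one-step dynamics stored as a table indexed by state-action pairs. The `dynamics` table, state names, and reward numbers are illustrative assumptions, not an example from the text.

```python
import random

# dynamics[(s, a)] lists (probability, next_state, reward) triples,
# a tabular form of p(s', r | s, a). The values here are made up.
dynamics = {
    ("A", "go"): [(1.0, "B", 1.0)],
    ("B", "go"): [(1.0, "A", 2.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the one-step dynamics."""
    triples = dynamics[(state, action)]
    weights = [p for p, _, _ in triples]
    _, s_next, r = random.choices(triples, weights=weights)[0]
    return s_next, r

state = "A"
for t in range(5):
    action = "go"  # a trivial fixed policy with one action per state
    next_state, reward = step(state, action)
    print(f"t={t}: S={state}, A={action}, R={reward}, S'={next_state}")
    state = next_state
```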
A policy is a decision rule. In the stochastic case,

$$\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}.$$
The return is the discounted sum of future rewards:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma \in [0, 1]$ is the discount rate. In episodic tasks, the sum stops at termination. In continuing tasks, $\gamma < 1$ keeps the infinite sum finite whenever rewards are bounded. Sutton and Barto also introduce average-reward formulations later for continuing control.
The state-value function under policy $\pi$ is

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s],$$

and the action-value function is

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s,\; A_t = a].$$

Optimal values compare all policies:

$$v_*(s) \doteq \max_\pi v_\pi(s), \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a).$$
A reward signal defines the goal. A value function predicts long-run reward. A model predicts consequences, usually through samples or through the probabilities $p(s', r \mid s, a)$. Model-free methods learn directly from experience, while model-based methods also use a learned or supplied model for planning.
Key results
The central recursive fact is that returns decompose into immediate reward plus discounted future return:

$$G_t = R_{t+1} + \gamma G_{t+1}.$$
Taking conditional expectations gives the Bellman expectation equation for a fixed policy:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr].$$
For action values,

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr].$$
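These expectation equations double as update rules: sweep over the states, replacing each value with the right-hand side. The sketch below performs iterative policy evaluation on the deterministic two-state chain used in the Code section later ($A$ gives reward 1 and moves to $B$; $B$ gives reward 2 and moves to $A$); the sweep loop and tolerance are my assumptions about a reasonable stopping rule.

```python
import numpy as np

# Iterative policy evaluation: repeatedly replace v(s) with the
# right-hand side of the Bellman expectation equation. Deterministic
# transitions collapse the sums over a and (s', r) to a single term.
gamma = 0.9
next_state = {0: 1, 1: 0}   # 0 = A, 1 = B
reward = {0: 1.0, 1: 2.0}

v = np.zeros(2)
for sweep in range(10_000):
    delta = 0.0
    for s in (0, 1):
        new_v = reward[s] + gamma * v[next_state[s]]  # Bellman backup
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:  # values have stopped changing
        break

print(v)  # approaches [14.7368..., 15.2631...]
```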
The optimality equations replace policy averaging with maximization:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_*(s')\bigr]$$

and

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr].$$
These equations justify the later algorithmic pattern of bootstrapping: update a current estimate toward a target that uses another current estimate. They also explain why value is not the same as immediate reward. An action with small immediate reward can be best if it tends to lead to valuable future states.
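Turning the optimality equation into a fixed-point iteration gives value iteration. The two-state, two-action MDP below is invented for illustration: action 0 moves between the states for rewards 1 and 2, while action 1 stays put for smaller rewards, so the patient cycling behavior wins even though no single step is decisive.

```python
import numpy as np

# Value iteration on an invented deterministic MDP.
# next_s[a, s] is the successor state; reward[a, s] is r(s, a).
gamma = 0.9
next_s = np.array([[1, 0],     # action 0: 0 -> 1, 1 -> 0
                   [0, 1]])    # action 1: stay in place
reward = np.array([[1.0, 2.0],   # action 0 rewards from states 0, 1
                   [0.0, 0.5]])  # action 1 rewards from states 0, 1

v = np.zeros(2)
for _ in range(10_000):
    q = reward + gamma * v[next_s]   # q[a, s] = r + gamma * v(s')
    new_v = q.max(axis=0)            # Bellman optimality backup
    if np.max(np.abs(new_v - v)) < 1e-10:
        v = new_v
        break
    v = new_v

print("v*:", v, "greedy actions:", q.argmax(axis=0))
```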
For finite discounted MDPs, at least one deterministic optimal policy exists. If $q_*$ is known, an optimal policy can choose any action in

$$\arg\max_a q_*(s, a).$$
This result is powerful because it reduces the control problem to estimating values accurately enough. Many RL algorithms differ mainly in what target they use, how they sample experience, and whether they evaluate the policy being followed or a different target policy.
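Acting on this result requires nothing more than a per-state argmax over a table of action values. A minimal sketch, with an invented `q_table`:

```python
import numpy as np

# q_table[s, a] holds estimates of q*(s, a); the numbers are made up.
q_table = np.array([
    [0.2, 1.5, 0.9],   # state 0
    [2.0, 2.0, 0.1],   # state 1: a tie between actions 0 and 1
])

greedy = q_table.argmax(axis=1)  # one optimal action per state
print(greedy)  # [1 0]; argmax breaks ties by taking the first maximizer
```

Any tie-breaking rule is acceptable here, which is exactly what "choose any action in the argmax set" permits.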
Visual
| Concept | Symbol | Role in the MDP | Common confusion |
|---|---|---|---|
| State | $S_t$ | Information used for prediction and control | Treating a raw observation as automatically Markov |
| Action | $A_t$ | Agent's controllable choice | Ignoring state-dependent action sets |
| Reward | $R_{t+1}$ | Immediate scalar objective signal | Designing reward as a hint instead of the real goal |
| Return | $G_t$ | Total discounted future reward | Forgetting that the first reward in $G_t$ is $R_{t+1}$ |
| Policy | $\pi(a \mid s)$ | Behavior rule | Mixing behavior policy and target policy |
| Value | $v_\pi$, $q_\pi$ | Expected return prediction | Equating high immediate reward with high value |
| Model | $p(s', r \mid s, a)$ | Predicts next consequences | Assuming a model is required for all RL |
Worked example 1: Computing returns in an episode
Problem: An episode has rewards $R_1 = 1$, $R_2 = 2$, $R_3 = 3$, then terminates. Let $\gamma = 0.5$. Compute $G_0$, $G_1$, and $G_2$.
Step 1: Start from the last nonterminal time. At $t = 2$, only $R_3$ remains: $G_2 = R_3 = 3$.
Step 2: At $t = 1$, use $G_t = R_{t+1} + \gamma G_{t+1}$: $G_1 = R_2 + \gamma G_2 = 2 + 0.5 \cdot 3 = 3.5$.
Step 3: At $t = 0$, use the same recursion: $G_0 = R_1 + \gamma G_1 = 1 + 0.5 \cdot 3.5 = 2.75$.
Check: Expanding the full sum directly gives $G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 = 1 + 1 + 0.75 = 2.75$.
The checked answer is $G_0 = 2.75$, $G_1 = 3.5$, and $G_2 = 3$.
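The same backward recursion takes only a few lines of code; the reward list matches the episode above.

```python
# Recompute the returns from worked example 1 by backward recursion.
gamma = 0.5
rewards = [1.0, 2.0, 3.0]   # R_1, R_2, R_3

G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()            # returns[t] is now G_t

print(returns)  # [2.75, 3.5, 3.0]
```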
Worked example 2: Solving a two-state Bellman system
Problem: A policy always takes the only available action in two nonterminal states $A$ and $B$. From $A$, the environment gives reward $1$ and moves to $B$. From $B$, it gives reward $2$ and moves to $A$. Let $\gamma = 0.9$. Solve for $v(A)$ and $v(B)$.
Step 1: Write Bellman equations from the transitions:

$$v(A) = 1 + 0.9\, v(B), \qquad v(B) = 2 + 0.9\, v(A).$$

Step 2: Substitute the second equation into the first:

$$v(A) = 1 + 0.9\bigl(2 + 0.9\, v(A)\bigr) = 2.8 + 0.81\, v(A).$$

Step 3: Isolate $v(A)$:

$$(1 - 0.81)\, v(A) = 2.8 \quad\Rightarrow\quad v(A) = \frac{2.8}{0.19} \approx 14.737.$$

Step 4: Plug $v(A)$ back into the second equation:

$$v(B) = 2 + 0.9 \cdot \frac{2.8}{0.19} = \frac{2.9}{0.19} \approx 15.263.$$

Check: Put these values into the first equation: $1 + 0.9 \cdot 15.263 \approx 14.737 = v(A)$.
The checked answer is $v(A) \approx 14.737$ and $v(B) \approx 15.263$.
Code
```python
import numpy as np

# Two-state Markov reward process under a fixed policy.
# States: 0 = A, 1 = B
gamma = 0.9
P = np.array([
    [0.0, 1.0],  # A -> B
    [1.0, 0.0],  # B -> A
])
r = np.array([1.0, 2.0])

# Bellman equation: v = r + gamma * P v
# Rearranged: (I - gamma * P) v = r
I = np.eye(2)
v = np.linalg.solve(I - gamma * P, r)

print({"V(A)": v[0], "V(B)": v[1]})
print("Bellman residual:", v - (r + gamma * P @ v))
```
Common pitfalls
- Treating the reward as a training label for the last action only. In RL, the reward is part of a sequential objective, and the consequences of an action can arrive much later.
- Forgetting the time indexing: after choosing $A_t$ in state $S_t$, the reward is $R_{t+1}$ and the next state is $S_{t+1}$.
- Assuming every observation is a Markov state. A camera frame, sensor reading, or database row may omit information needed to predict future rewards.
- Confusing $v_\pi$ with $v_*$. The first evaluates a particular policy; the second is the best value achievable by any policy.
- Using $\gamma$ as a vague preference dial for "short-term thinking." Discounting also controls mathematical convergence in continuing tasks.
- Designing rewards that encode advice rather than the real goal. If reward gives points for intermediate behavior that can be exploited, the agent may optimize the proxy.