Reinforcement Learning Problem and Finite MDPs

Reinforcement learning studies how an agent can learn to choose actions through interaction. The agent is not handed labeled correct actions. Instead, it receives scalar rewards after acting, observes the next situation, and must improve behavior from experience. Sutton and Barto use this setup to separate reinforcement learning from supervised learning and from pure planning: the learner is active, its actions influence later data, and the main objective is long-run return rather than immediate prediction accuracy.

The mathematical home for this problem is the finite Markov decision process, or finite MDP. An MDP turns informal ideas such as "state," "action," "reward," and "future consequence" into a compact stochastic model. Once the MDP is defined, later chapters can ask more focused questions: how to estimate values, how to improve a policy, when bootstrapping helps, and what changes when the state space is too large for a table.

Definitions

A reinforcement learning problem is organized around an agent-environment interface. At time t, the agent observes a state S_t \in \mathcal{S}, selects an action A_t \in \mathcal{A}(S_t), and receives a reward R_{t+1} \in \mathbb{R} and next state S_{t+1}. The interaction continues either forever in a continuing task or until a terminal state is reached in an episodic task.

A finite MDP is specified by finite state and action sets and one-step dynamics:

p(s', r \mid s, a) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a).

The Markov property says this conditional distribution depends only on the present state and action, not on the complete earlier history. This is not a claim that the world has no hidden causes. It is a modeling requirement: the chosen state representation should contain enough information to predict the next state and reward distribution relevant to control.
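As a concrete illustration, the one-step dynamics of a finite MDP can be stored as a table indexed by state-action pairs and sampled to drive the agent-environment loop. This is only a sketch: the states, actions, rewards, and probabilities below are invented for illustration, not taken from the text.

import random

# Hypothetical finite MDP: p[(s, a)] maps each outcome (s', r) to its probability.
p = {
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "go"):   {("s0", 2.0): 1.0},
    ("s1", "stay"): {("s1", 0.0): 1.0},
}

def env_step(s, a):
    """Sample (S_{t+1}, R_{t+1}) from p(s', r | s, a)."""
    outcomes = list(p[(s, a)].keys())
    probs = list(p[(s, a)].values())
    return random.choices(outcomes, weights=probs, k=1)[0]

# One turn of the agent-environment interface: in S_t choose A_t, receive R_{t+1} and S_{t+1}.
s_next, r = env_step("s0", "go")
print("next state:", s_next, "reward:", r)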

A policy \pi is a decision rule. In the stochastic case,

\pi(a \mid s) = \Pr(A_t=a \mid S_t=s).

The return is the discounted sum of future rewards:

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},

where 0 \le \gamma \le 1 is the discount rate. In episodic tasks, the sum stops at termination. In continuing tasks, \gamma < 1 keeps the infinite sum finite whenever rewards are bounded. Sutton and Barto also introduce average-reward formulations later for continuing control.
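As a minimal sketch of this definition, the following computes a discounted return directly from a finite reward list (the rewards happen to match Worked example 1 below; in an episodic task the sum simply stops at termination):

def discounted_return(rewards, gamma):
    """G_0 = sum over k of gamma^k * R_{k+1}, where rewards[k] holds R_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([2.0, -1.0, 4.0], gamma=0.5))  # 2.5
print(discounted_return([2.0, -1.0, 4.0], gamma=1.0))  # undiscounted sum: 5.0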

The state-value function under policy \pi is

V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s],

and the action-value function is

Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a].

Optimal values compare all policies:

V_*(s) = \max_\pi V_\pi(s), \qquad Q_*(s,a) = \max_\pi Q_\pi(s,a).

A reward signal defines the goal. A value function predicts long-run reward. A model predicts consequences, usually through samples (s', r) or probabilities p(s', r \mid s, a). Model-free methods learn directly from experience, while model-based methods also use a learned or supplied model for planning.
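To make "learn directly from experience" concrete, here is a minimal model-free sketch on an assumed toy problem (one recurring state whose per-step reward is 0 or 1 with equal probability; none of this comes from the text). It estimates the state's value by averaging sampled, truncated returns, without ever consulting p(s', r \mid s, a).

import random

gamma = 0.9
horizon = 200       # truncate the infinite sum; gamma**200 is negligible
num_returns = 2000

def sample_return():
    """One sampled (truncated) return from the single recurring state."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        reward = random.choice([0.0, 1.0])   # experience sampled from the environment
        g += discount * reward
        discount *= gamma
    return g

# Monte Carlo value estimate: average of sampled returns.
estimate = sum(sample_return() for _ in range(num_returns)) / num_returns
print("estimated value:", estimate)          # true value is 0.5 / (1 - 0.9) = 5.0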

Key results

The central recursive fact is that returns decompose into immediate reward plus discounted future return:

G_t = R_{t+1} + \gamma G_{t+1}.

Taking conditional expectations gives the Bellman expectation equation for a fixed policy:

V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma V_\pi(s')\right].

For action values,

Q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s',a') \right].
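The expectation equation can also be used directly as an update rule (iterative policy evaluation): repeatedly replace the estimate of V_\pi with the right-hand side until it stops changing. The sketch below applies this to the deterministic two-state chain that Worked example 2 below solves exactly, so the iterates should settle near V_\pi(A) \approx 14.74 and V_\pi(B) \approx 15.26.

# Iterative policy evaluation on the two-state chain of Worked example 2:
# from A the only action gives reward 1 and moves to B; from B, reward 2 and back to A.
gamma = 0.9
V = {"A": 0.0, "B": 0.0}                 # arbitrary initial guess
for _ in range(1000):
    V_new = {
        "A": 1.0 + gamma * V["B"],       # Bellman expectation backup for A
        "B": 2.0 + gamma * V["A"],       # Bellman expectation backup for B
    }
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-12:
        V = V_new
        break
    V = V_new

print(V)   # approaches V(A) ~ 14.74, V(B) ~ 15.26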

The optimality equations replace policy averaging with maximization:

V_*(s) = \max_a \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma V_*(s')\right],

and

Q_*(s,a) = \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma \max_{a'} Q_*(s',a')\right].

These equations justify the later algorithmic pattern of bootstrapping: update a current estimate toward a target that uses another current estimate. They also explain why value is not the same as immediate reward. An action with small immediate reward can be best if it tends to lead to valuable future states.
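One sketch of that bootstrapping pattern is value iteration: sweep over the states, replacing each value estimate with the right-hand side of the optimality equation, which itself uses the current estimates. The two-state, two-action MDP below is hypothetical and chosen so that the action with zero immediate reward is the optimal one.

# Value iteration on a hypothetical deterministic MDP (not from the text).
# In s0, "cash" pays 1 and stays put, while "invest" pays 0 but leads to s1,
# where "collect" pays 3 and returns to s0.
gamma = 0.9
dynamics = {                     # (state, action) -> (reward, next state)
    ("s0", "cash"):    (1.0, "s0"),
    ("s0", "invest"):  (0.0, "s1"),
    ("s1", "collect"): (3.0, "s0"),
}
actions = {"s0": ["cash", "invest"], "s1": ["collect"]}

V = {s: 0.0 for s in actions}
for _ in range(1000):
    V_new = {}
    for s in actions:
        backups = []
        for a in actions[s]:
            r, s_next = dynamics[(s, a)]
            backups.append(r + gamma * V[s_next])   # target built from current estimates
        V_new[s] = max(backups)                     # Bellman optimality backup
    V = V_new

# Always choosing "cash" would only be worth 1 / (1 - 0.9) = 10 from s0;
# the optimal values reflect taking "invest" despite its zero immediate reward.
print(V)   # roughly {'s0': 14.21, 's1': 15.79}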

For finite discounted MDPs, at least one deterministic optimal policy exists. If Q_* is known, an optimal policy can choose any action in

\arg\max_a Q_*(s,a).

This result is powerful because it reduces the control problem to estimating values accurately enough. Many RL algorithms differ mainly in what target they use, how they sample experience, and whether they evaluate the policy being followed or a different target policy.
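A minimal sketch of that reduction, assuming Q_* has already been estimated and stored in a table (the numbers are the optimal action values of the hypothetical MDP sketched above):

# Hypothetical Q_* table: Q[s][a] holds the estimated optimal action value.
Q = {
    "s0": {"cash": 13.79, "invest": 14.21},
    "s1": {"collect": 15.79},
}

# Acting greedily with respect to Q_*: any action attaining the max is optimal.
policy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(policy)   # {'s0': 'invest', 's1': 'collect'}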

Visual

| Concept | Symbol | Role in the MDP | Common confusion |
| --- | --- | --- | --- |
| State | s | Information used for prediction and control | Treating raw observation as automatically Markov |
| Action | a | Agent's controllable choice | Ignoring state-dependent action sets |
| Reward | r | Immediate scalar objective signal | Designing reward as a hint instead of the real goal |
| Return | G_t | Total discounted future reward | Forgetting the first reward is R_{t+1} |
| Policy | \pi(a \mid s) | Behavior rule | Mixing behavior policy and target policy |
| Value | V_\pi, Q_\pi | Expected return prediction | Equating high immediate reward with high value |
| Model | p(s', r \mid s, a) | Predicts next consequences | Assuming a model is required for all RL |

Worked example 1: Computing returns in an episode

Problem: An episode has rewards R_1=2, R_2=-1, R_3=4, then terminates. Let \gamma=0.5. Compute G_0, G_1, and G_2.

Step 1: Start from the last nonterminal time. At t=2, only R_3 remains:

G_2 = R_3 = 4.

Step 2: At t=1, use G_1 = R_2 + \gamma G_2:

\begin{aligned} G_1 &= -1 + 0.5(4) \\ &= -1 + 2 \\ &= 1. \end{aligned}

Step 3: At t=0, use G_0 = R_1 + \gamma G_1:

\begin{aligned} G_0 &= 2 + 0.5(1) \\ &= 2.5. \end{aligned}

Check: Expanding the full sum directly gives

G_0 = 2 + 0.5(-1) + 0.5^2(4) = 2 - 0.5 + 1 = 2.5.

The checked answer is G_0=2.5, G_1=1, and G_2=4.
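The same answer can be checked in code by applying the recursion G_t = R_{t+1} + \gamma G_{t+1} backward from termination:

# Backward recursion for the returns of Worked example 1.
rewards = [2.0, -1.0, 4.0]   # R_1, R_2, R_3
gamma = 0.5

G = 0.0                      # the return after termination is zero
returns = []
for r in reversed(rewards):
    G = r + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()

print(returns)               # [2.5, 1.0, 4.0], i.e. G_0, G_1, G_2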

Worked example 2: Solving a two-state Bellman system

Problem: A policy \pi always takes the only available action in two nonterminal states A and B. From A, the environment gives reward 1 and moves to B. From B, it gives reward 2 and moves to A. Let \gamma=0.9. Solve for V_\pi(A) and V_\pi(B).

Step 1: Write Bellman equations from the transitions:

\begin{aligned} V_\pi(A) &= 1 + 0.9 V_\pi(B), \\ V_\pi(B) &= 2 + 0.9 V_\pi(A). \end{aligned}

Step 2: Substitute the second equation into the first:

\begin{aligned} V_\pi(A) &= 1 + 0.9(2 + 0.9 V_\pi(A)) \\ &= 1 + 1.8 + 0.81 V_\pi(A) \\ &= 2.8 + 0.81 V_\pi(A). \end{aligned}

Step 3: Isolate V_\pi(A):

\begin{aligned} V_\pi(A) - 0.81 V_\pi(A) &= 2.8 \\ 0.19 V_\pi(A) &= 2.8 \\ V_\pi(A) &= \frac{2.8}{0.19} \approx 14.7368. \end{aligned}

Step 4: Plug back into the second equation:

\begin{aligned} V_\pi(B) &= 2 + 0.9(14.7368) \\ &= 2 + 13.2631 \\ &\approx 15.2631. \end{aligned}

Check: Put these values into the first equation:

1 + 0.9(15.2631) = 1 + 13.7368 = 14.7368.

The checked answer is V_\pi(A) \approx 14.7368 and V_\pi(B) \approx 15.2631.

Code

import numpy as np

# Two-state Markov reward process under a fixed policy.
# States: 0 = A, 1 = B
gamma = 0.9
P = np.array([
    [0.0, 1.0],  # A -> B
    [1.0, 0.0],  # B -> A
])
r = np.array([1.0, 2.0])

# Bellman equation: v = r + gamma P v
# Rearranged: (I - gamma P) v = r
I = np.eye(2)
v = np.linalg.solve(I - gamma * P, r)

print({"V(A)": v[0], "V(B)": v[1]})
print("Bellman residual:", v - (r + gamma * P @ v))
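Running this reproduces the values from Worked example 2, V(A) \approx 14.74 and V(B) \approx 15.26, and the printed Bellman residual should be at the level of floating-point rounding: solving the linear system (I - \gamma P)v = r is the same substitution carried out by hand above.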

Common pitfalls

  • Treating the reward as a training label for the last action only. In RL, the reward is part of a sequential objective, and the consequences of an action can arrive much later.
  • Forgetting the time indexing: after choosing A_t in state S_t, the reward is R_{t+1} and the next state is S_{t+1}.
  • Assuming every observation is a Markov state. A camera frame, sensor reading, or database row may omit information needed to predict future rewards.
  • Confusing V_\pi with V_*. The first evaluates a particular policy; the second is the best value achievable by any policy.
  • Using \gamma as a vague preference for "short-term thinking." Discounting also controls mathematical convergence in continuing tasks.
  • Designing rewards that encode advice rather than the real goal. If reward gives points for intermediate behavior that can be exploited, the agent may optimize the proxy.

Connections