Reinforcement Learning

These notes cover reinforcement learning following Richard S. Sutton and Andrew G. Barto's Reinforcement Learning: An Introduction, 2nd edition. The subject studies agents that learn by interacting with an environment: they observe state, choose actions, receive rewards, and improve behavior to maximize long-run return. The early material builds the finite Markov decision process framework and tabular solution methods. The middle material replaces tables with function approximation. The closing material connects RL to psychology, neuroscience, applications, and open research directions.
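The interaction loop described above (observe state, choose an action, receive a reward, repeat) can be sketched in a few lines. This is a minimal illustration, not code from the book; the toy environment and its `reset`/`step` interface are my own simplified stand-ins.

```python
class TwoStateEnv:
    """A toy episodic environment (hypothetical, for illustration):
    the agent starts in state 0, and action 1 moves it right;
    reaching state 2 ends the episode with reward +1."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 advances toward the terminal state; action 0 stays put.
        if action == 1:
            self.state += 1
        reward = 1.0 if self.state == 2 else 0.0
        done = self.state == 2
        return self.state, reward, done


def run_episode(env, policy):
    """The agent-environment loop: the agent observes the state,
    the policy chooses an action, the environment returns the next
    state and a reward, until the episode terminates."""
    state = env.reset()
    total_return = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_return += reward
    return total_return


# A fixed "always move right" policy earns the terminal reward.
ret = run_episode(TwoStateEnv(), policy=lambda s: 1)
```

Real environments and learning agents are far richer, but every tabular method in the early chapters plugs into a loop of exactly this shape.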

The organizing thread is value: how to predict return, how to improve policies from value estimates, and when direct policy optimization or planning is more appropriate. Read the pages in order if you want the Sutton-Barto progression from bandits to dynamic programming, Monte Carlo methods, temporal-difference learning, approximation, and policy gradients.
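As a concrete instance of "predicting return": the discounted return is the reward sum G = r_1 + γ·r_2 + γ²·r_3 + …, and it is naturally computed backward via the recursion G_t = r_{t+1} + γ·G_{t+1}. A minimal sketch (the function name is mine):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    by sweeping backward with G_t = r_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# With rewards [1, 1, 1] and gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # → 1.75
```

Value functions are expectations of exactly this quantity under a policy; the chapters below differ mainly in how that expectation is estimated and used.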

  1. Reinforcement Learning Problem and Finite MDPs
  2. Multi-armed Bandits
  3. Dynamic Programming
  4. Monte Carlo Methods
  5. Temporal-Difference Learning
  6. n-step Bootstrapping
  7. Planning and Learning with Tabular Methods
  8. On-policy Prediction with Approximation
  9. On-policy Control with Approximation
  10. Off-policy Methods with Approximation
  11. Eligibility Traces
  12. Policy Gradient Methods
  13. Psychology Connections
  14. Neuroscience Connections
  15. Applications and Frontiers