Learning Problems and System Design
Mitchell's first chapter frames machine learning as the engineering problem of making a program improve through experience. That sounds broad, but the value of the definition is that it forces every learning claim to name the task being improved, the evidence used for improvement, and the measurement that decides whether improvement actually occurred. This is a pre-deep-learning formulation: it is less concerned with giant model families and more concerned with the precise relation among task, representation, training signal, and search.
The chapter's checkers example is also a compact design pattern for the whole book. A learner does not directly receive the perfect strategy. It receives experience, converts experience into training examples, searches a hypothesis space, and updates a performance system. The same pattern reappears in decision trees, neural networks, Bayesian classifiers, instance-based methods, explanation-based learning, and reinforcement learning.
Definitions
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
A well-posed learning problem therefore specifies three pieces.
| Symbol | Meaning | Example for checkers |
|---|---|---|
| $T$ | Task class | Playing legal games of checkers |
| $P$ | Performance measure | Percent of games won in a tournament |
| $E$ | Training experience | Games played against itself |
A target function is the function the learner is trying to approximate. In the checkers example, one possible target is $ChooseMove : B \to M$, mapping board states to moves. Mitchell instead uses an evaluation function $V : B \to \mathbb{R}$, where $B$ is the set of legal board states. Good states get high values; bad states get low values.
An operational definition is a definition that can actually be computed within the resource limits of the performance system. The ideal minimax value of a board may be mathematically clear but nonoperational if it requires searching the game tree to the end of play.
A hypothesis is the learner's current approximation to the target function. Mitchell often writes the learned approximation to $V$ as $\hat{V}$. A hypothesis space is the set of all hypotheses representable by the chosen representation.
For the checkers design, a simple linear representation is:

$$\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6,$$

where the $x_i$ are hand-designed board features such as number of pieces, number of kings, and number of threatened pieces.
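A minimal sketch of this representation, assuming a board summarized as precomputed counts (the feature set mirrors Mitchell's description, but the extraction and the weights here are illustrative stand-ins):

```python
def board_features(board):
    """Extract six hand-designed features from a board summary.

    `board` is assumed to be a dict of precomputed counts; a real
    implementation would derive these from the piece layout.
    """
    return [
        board["black_pieces"], board["red_pieces"],
        board["black_kings"], board["red_kings"],
        board["black_threatened"], board["red_threatened"],
    ]

def v_hat(weights, board):
    """Linear evaluation: w0 + w1*x1 + ... + w6*x6."""
    xs = board_features(board)
    return weights[0] + sum(w * x for w, x in zip(weights[1:], xs))

# Opening position: 12 pieces each, no kings, no threats.
board = {"black_pieces": 12, "red_pieces": 12, "black_kings": 0,
         "red_kings": 0, "black_threatened": 0, "red_threatened": 0}
print(v_hat([0.5, 1.0, -1.0, 3.0, -3.0, -1.0, 1.0], board))  # 0.5
```

Because the two piece counts cancel at the opening position, only the bias weight survives, which is one way to see how few degrees of freedom this representation has.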
Key results
The central reduction in the chapter is from performance improvement to function approximation. Instead of asking "How can a program learn to play checkers?", the designer asks:
- What target function would allow good play if it were known?
- What representation can approximate that target?
- What training examples can be extracted from experience?
- What search or update rule will improve the hypothesis?
The checkers learner estimates training values for intermediate states by bootstrapping from successor states:

$$V_{train}(b) \leftarrow \hat{V}(Successor(b))$$
This is a historical bridge to reinforcement learning. The learner updates a state estimate using an estimate of a later state, which is the same broad idea behind temporal-difference methods and Q-learning.
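The training-value rule can be sketched as follows; the linear `v_hat` and the feature layout are illustrative assumptions, not Mitchell's exact code:

```python
import numpy as np

def v_hat(weights, features):
    """Linear evaluation: w0 + w1*x1 + ... over board features."""
    return weights[0] + float(np.dot(weights[1:], features))

def training_value(weights, successor_features, final_outcome=None):
    """Estimate V_train(b) for a board b.

    Terminal boards get their true outcome (e.g. +100 for a win);
    intermediate boards borrow the current estimate of the board that
    follows the opponent's reply: V_train(b) <- V_hat(Successor(b)).
    """
    if final_outcome is not None:
        return float(final_outcome)
    return v_hat(weights, successor_features)

weights = np.array([0.0, 4.0, -3.0])        # current hypothesis
successor = np.array([5.0, 3.0])            # features after the reply
print(training_value(weights, successor))   # 11.0
```

The circularity is deliberate: early in training these targets are poor, but as $\hat{V}$ improves near the end of games, the improvement propagates backward to earlier states.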
The least mean squares update for a linear evaluation function adjusts each weight in proportion to the prediction error and the feature value:

$$w_i \leftarrow w_i + \eta \left( V_{train}(b) - \hat{V}(b) \right) x_i$$

Here $\eta$ is the learning rate. If the prediction is too low, weights attached to active positive-valued features increase. If it is too high, they decrease. This rule is a stochastic gradient step on the squared error

$$E = \tfrac{1}{2} \left( V_{train}(b) - \hat{V}(b) \right)^2.$$
Mitchell also separates a learning system into four modules:
| Module | Input | Output | Role |
|---|---|---|---|
| Performance system | New problem and current hypothesis | Behavior trace | Uses what has been learned |
| Critic | Behavior trace | Training examples | Interprets experience as feedback |
| Generalizer | Training examples | Revised hypothesis | Searches for a better general rule |
| Experiment generator | Current hypothesis | New practice problem | Chooses what experience to gather next |
This decomposition matters because learning failures can occur in any module. A poor critic can assign misleading labels. A narrow representation can make the right function impossible to express. A weak experiment generator can avoid important regions of the state space. A powerful generalizer can still overfit bad training examples.
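The four modules can be wired into a loop like the following sketch; the module bodies are placeholders (a real generalizer would run LMS or another learner), and only the interfaces follow the table above:

```python
def performance_system(problem, hypothesis):
    """Solve the problem with the current hypothesis; return a trace."""
    return [(state, hypothesis(state)) for state in problem]

def critic(trace):
    """Interpret the behavior trace as (state, value) training examples."""
    return list(trace)

def generalizer(examples, hypothesis):
    """Search for a revised hypothesis (placeholder: keep the old one)."""
    return hypothesis

def experiment_generator(hypothesis):
    """Choose the next practice problem (placeholder: a fixed problem)."""
    return [0, 1, 2]

hypothesis = lambda state: 0.0   # initial guess: every state is worth 0
for _ in range(3):               # each pass is one cycle of the loop
    problem = experiment_generator(hypothesis)
    trace = performance_system(problem, hypothesis)
    examples = critic(trace)
    hypothesis = generalizer(examples, hypothesis)
print(hypothesis(0))             # 0.0: the placeholder never learns
```

Even this skeleton makes the failure modes concrete: swap in a critic that mislabels traces, or an experiment generator that always returns the same problem, and the loop runs without ever improving.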
A useful reading of Mitchell's design is that every learning system contains both an external problem and an internal surrogate problem. The external problem is to win games, recognize handwriting, drive a vehicle, or classify patients. The internal surrogate may be to fit a value function, infer a decision tree, estimate posterior probabilities, or tune network weights. Good machine-learning design makes the surrogate faithful enough that improving it improves the external performance measure. Bad design creates a surrogate that is easy to optimize but detached from the real task.
The training-experience choice is especially subtle. Direct instruction, labeled examples, passive observation, self-play, and active experimentation give different information. A learner that receives legal moves and final game outcomes faces a harder credit-assignment problem than a learner that receives the best move in every position. Similarly, a medical classifier trained only on historical treatment decisions may inherit the biases of past decisions. Mitchell's framework does not solve these problems, but it gives a vocabulary for locating them: the problem is not "machine learning failed" in general; it may be that $E$ is weakly related to $P$, or that the distribution generating $E$ differs from the distribution used to evaluate $P$.
The representation choice also controls what can be learned. A linear checkers evaluator over six hand-designed features cannot express every useful board pattern. It may miss tactics involving piece placement, tempo, or forced captures. The benefit is that only a few weights must be estimated, so small amounts of experience can improve the program. This tradeoff between expressiveness and sample requirements recurs throughout the book: decision trees can represent disjunctions but overfit, neural networks can represent nonlinear functions but require careful optimization, and Bayesian networks can encode structured uncertainty but demand modeling assumptions.
Visual
The diagram is deliberately circular. Learning is not a single pass from data to model; the current hypothesis influences future experience, especially in active learning, game learning, robotics, and reinforcement learning.
Worked example 1: Specify a well-posed learning problem
Problem: A hospital wants a program that predicts whether a pneumonia patient is at high risk of complications. Specify $T$, $P$, and $E$ in Mitchell's format.
Method:
- Identify the task. The program receives a patient record and must assign a risk label such as "high" or "not high".
- Identify the performance measure. A reasonable measure is accuracy on future patients, but in medicine false negatives and false positives may have different costs. Suppose the hospital wants high-risk cases caught reliably. A better performance measure might be area under the ROC curve or sensitivity at a fixed specificity.
- Identify the experience. The learner needs historical records containing features and outcomes.
- Check whether the problem is well posed. The three parts are concrete enough to test improvement. If a new model has higher sensitivity at the same specificity on a held-out sample, it improved under $P$.
Answer: The well-posed problem is: learn from historical labeled pneumonia records ($E$) to classify new patient records as high or not high risk ($T$), with performance measured by sensitivity at 90 percent specificity ($P$). The checked answer is valid because it names the future task, the source of training information, and the exact measurement rule.
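The proposed measure, sensitivity at a fixed specificity, can be computed by sweeping a decision threshold over risk scores. This sketch uses made-up labels and scores, not hospital data:

```python
import numpy as np

def sensitivity_at_specificity(y_true, scores, target_specificity=0.90):
    """Best sensitivity achievable at or above the target specificity."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best = 0.0
    for thresh in np.unique(scores):          # candidate decision cutoffs
        pred = scores >= thresh               # predict "high risk"
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best

y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]            # 3 high-risk, 7 not
s = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.1, 0.05, 0.0]
print(sensitivity_at_specificity(y, s))       # 2/3 of high-risk caught
```

Because a new model is judged on the same held-out sample with the same rule, improvement under $P$ becomes a concrete, testable claim rather than a vague one.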
Worked example 2: One LMS update for a board evaluator
Problem: A checkers learner represents a board as two features, the number of black pieces $x_1$ and the number of red pieces $x_2$. Its hypothesis is

$$\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2.$$

Suppose $w_0 = 0$, $w_1 = 4$, $w_2 = -3$, the current board has $x_1 = 5$, $x_2 = 4$, the training value is $V_{train}(b) = 20$, and $\eta = 0.1$. Compute the update.
Method:
- Compute the current prediction: $\hat{V}(b) = 0 + 4 \cdot 5 + (-3) \cdot 4 = 8$.
- Compute the prediction error: $V_{train}(b) - \hat{V}(b) = 20 - 8 = 12$.
- Update the bias weight. Treat $x_0 = 1$: $w_0 \leftarrow 0 + 0.1 \cdot 12 \cdot 1 = 1.2$.
- Update $w_1$: $w_1 \leftarrow 4 + 0.1 \cdot 12 \cdot 5 = 10$.
- Update $w_2$: $w_2 \leftarrow -3 + 0.1 \cdot 12 \cdot 4 = 1.8$.
- Check the new prediction on the same board: $\hat{V}(b) = 1.2 + 10 \cdot 5 + 1.8 \cdot 4 = 58.4$.
Answer: The updated weights are $w_0 = 1.2$, $w_1 = 10$, $w_2 = 1.8$. The check reveals that with $\eta = 0.1$ and large feature values, the update overshoots the target value of 20 on this single example. That does not make the rule wrong, but it shows why feature scaling and learning-rate choice matter.
Code
```python
import numpy as np

def lms_update(weights, features, target, eta=0.01):
    """One LMS update for a linear value function."""
    x = np.r_[1.0, np.asarray(features, dtype=float)]  # prepend bias input x0 = 1
    prediction = float(weights @ x)
    error = target - prediction
    return weights + eta * error * x, prediction, error

weights = np.array([0.0, 4.0, -3.0])
features = np.array([5.0, 4.0])
new_weights, pred, err = lms_update(weights, features, target=20.0, eta=0.1)
print("prediction:", pred)
print("error:", err)
print("new weights:", new_weights)
```
Common pitfalls
- Calling a system "learning" without specifying $T$, $P$, and $E$. Without all three, there is no precise claim to evaluate.
- Confusing the target function with the learned hypothesis. $V$ is the ideal function; $\hat{V}$ is what the learner currently represents.
- Assuming the training distribution matches the future test distribution. Mitchell notes that much theory assumes this, while practice often violates it.
- Choosing a representation before deciding what information the performance system needs. A representation is useful only relative to the target function and the final task.
- Treating self-play as automatically representative. Self-play can generate many examples, but it may miss strategies used by external opponents.
- Forgetting that a nonoperational definition can still guide learning. The exact minimax value may be too expensive to compute, but it can motivate a learnable approximation.