Learning Problems and System Design

Mitchell's first chapter frames machine learning as the engineering problem of making a program improve through experience. That sounds broad, but the value of the definition is that it forces every learning claim to name the task being improved, the evidence used for improvement, and the measurement that decides whether improvement actually occurred. This is a pre-deep-learning formulation: it is less concerned with giant model families and more concerned with the precise relation among task, representation, training signal, and search.

The chapter's checkers example is also a compact design pattern for the whole book. A learner does not directly receive the perfect strategy. It receives experience, converts experience into training examples, searches a hypothesis space, and updates a performance system. The same pattern reappears in decision trees, neural networks, Bayesian classifiers, instance-based methods, explanation-based learning, and reinforcement learning.

Definitions

A computer program learns from experience $E$ with respect to a class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

A well-posed learning problem therefore specifies three pieces.

| Symbol | Meaning | Example for checkers |
|---|---|---|
| $T$ | Task class | Playing legal games of checkers |
| $P$ | Performance measure | Percent of games won in a tournament |
| $E$ | Training experience | Games played against itself |

A target function is the function the learner is trying to approximate. In the checkers example, one possible target is ChooseMove, mapping board states to moves. Mitchell instead uses an evaluation function $V : B \to \mathbb{R}$, where $B$ is the set of legal board states. Good states get high values; bad states get low values.

An operational definition is a definition that can actually be computed within the resource limits of the performance system. The ideal minimax value of a board may be mathematically clear but nonoperational if it requires searching the game tree to the end of play.

A hypothesis is the learner's current approximation to the target function. Mitchell often writes the learned approximation to $V$ as $\hat{V}$. A hypothesis space $H$ is the set of all hypotheses representable by the chosen representation.

For the checkers design, a simple linear representation is:

$$\hat{V}(b) = w_0 + w_1 x_1(b) + w_2 x_2(b) + \cdots + w_6 x_6(b)$$

where the $x_i$ are hand-designed board features such as number of pieces, number of kings, and number of threatened pieces.
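
As a concrete companion to this formula, here is a minimal numpy sketch of such a linear evaluator. The board encoding and the six feature names are hypothetical stand-ins; the point is only that some fixed feature extractor turns a board into numbers.

import numpy as np

# Hypothetical feature extractor: any function mapping a board to six
# numeric features (piece counts, king counts, threatened pieces) fits
# the template V_hat(b) = w_0 + sum_i w_i * x_i(b).
def extract_features(board):
    keys = ("black_pieces", "red_pieces", "black_kings",
            "red_kings", "black_threatened", "red_threatened")
    return np.array([board[k] for k in keys], dtype=float)

def v_hat(weights, board):
    """Evaluate the linear value function V_hat(b)."""
    x = np.r_[1.0, extract_features(board)]  # prepend x_0 = 1 for the bias w_0
    return float(weights @ x)

weights = np.zeros(7)
board = {"black_pieces": 12, "red_pieces": 12, "black_kings": 0,
         "red_kings": 0, "black_threatened": 1, "red_threatened": 2}
print(v_hat(weights, board))  # 0.0 before any learning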

Key results

The central reduction in the chapter is from performance improvement to function approximation. Instead of asking "How can a program learn to play checkers?", the designer asks:

  1. What target function would allow good play if it were known?
  2. What representation can approximate that target?
  3. What training examples can be extracted from experience?
  4. What search or update rule will improve the hypothesis?

The checkers learner estimates training values for intermediate states by bootstrapping from successor states:

$$V_{\text{train}}(b) \leftarrow \hat{V}(\text{Successor}(b)).$$

This is a historical bridge to reinforcement learning. The learner updates a state estimate using an estimate of a later state, which is the same broad idea behind temporal-difference methods and Q-learning.
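
A minimal sketch of this estimation step, where value_fn stands for the current $\hat{V}$ and successor is supplied by the performance system (both names are hypothetical):

def bootstrap_training_value(value_fn, board, successor, outcome=None):
    """Estimate V_train(b) for a board encountered during play.

    Terminal boards get the known game outcome; intermediate boards
    borrow the current estimate of the successor state, which is the
    V_train(b) <- V_hat(Successor(b)) rule from the text.
    """
    if outcome is not None:            # end of game: the true value is known
        return outcome
    return value_fn(successor(board))  # otherwise bootstrap from the successor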

The least mean squares update for a linear evaluation function adjusts each weight in proportion to the prediction error and the feature value:

$$w_i \leftarrow w_i + \eta\left(V_{\text{train}}(b) - \hat{V}(b)\right)x_i(b).$$

Here $\eta$ is the learning rate. If the prediction is too low, weights attached to active positive-valued features increase. If it is too high, they decrease. This rule is a stochastic gradient step on squared error:

$$E = \sum_{b \in D}\left(V_{\text{train}}(b) - \hat{V}(b)\right)^2.$$
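
To verify the gradient claim, differentiate one example's squared error with respect to $w_i$, using the fact that $\hat{V}$ is linear, so $\partial \hat{V}(b) / \partial w_i = x_i(b)$:

$$\frac{\partial}{\partial w_i}\left(V_{\text{train}}(b) - \hat{V}(b)\right)^2 = -2\left(V_{\text{train}}(b) - \hat{V}(b)\right)x_i(b).$$

Stepping a small amount against this gradient reproduces the LMS rule above, with the constant factor 2 absorbed into $\eta$.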

Mitchell also separates a learning system into four modules:

| Module | Input | Output | Role |
|---|---|---|---|
| Performance system | New problem and current hypothesis | Behavior trace | Uses what has been learned |
| Critic | Behavior trace | Training examples | Interprets experience as feedback |
| Generalizer | Training examples | Revised hypothesis | Searches for a better general rule |
| Experiment generator | Current hypothesis | New practice problem | Chooses what experience to gather next |

This decomposition matters because learning failures can occur in any module. A poor critic can assign misleading labels. A narrow representation can make the right function impossible to express. A weak experiment generator can avoid important regions of the state space. A powerful generalizer can still overfit bad training examples.
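
The decomposition maps naturally onto a training loop. The sketch below is schematic, assuming hypothetical experiment_generator, performance_system, critic, and generalizer callables with the inputs and outputs listed in the table:

def learn(hypothesis, experiment_generator, performance_system, critic,
          generalizer, n_rounds=100):
    """Schematic loop over Mitchell's four modules."""
    for _ in range(n_rounds):
        problem = experiment_generator(hypothesis)       # choose next practice problem
        trace = performance_system(problem, hypothesis)  # act and record behavior
        examples = critic(trace)                         # convert trace to training examples
        hypothesis = generalizer(hypothesis, examples)   # revise the hypothesis
    return hypothesis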

A useful reading of Mitchell's design is that every learning system contains both an external problem and an internal surrogate problem. The external problem is to win games, recognize handwriting, drive a vehicle, or classify patients. The internal surrogate may be to fit a value function, infer a decision tree, estimate posterior probabilities, or tune network weights. Good machine-learning design makes the surrogate faithful enough that improving it improves the external performance measure. Bad design creates a surrogate that is easy to optimize but detached from the real task.

The training-experience choice is especially subtle. Direct instruction, labeled examples, passive observation, self-play, and active experimentation give different information. A learner that receives legal moves and final game outcomes faces a harder credit-assignment problem than a learner that receives the best move in every position. Similarly, a medical classifier trained only on historical treatment decisions may inherit the biases of past decisions. Mitchell's framework does not solve these problems, but it gives a vocabulary for locating them: the problem is not "machine learning failed" in general; it may be that $E$ is weakly related to $P$ or that the distribution generating $E$ differs from the distribution used to evaluate $T$.

The representation choice also controls what can be learned. A linear checkers evaluator over six hand-designed features cannot express every useful board pattern. It may miss tactics involving piece placement, tempo, or forced captures. The benefit is that only a few weights must be estimated, so small amounts of experience can improve the program. This tradeoff between expressiveness and sample requirements recurs throughout the book: decision trees can represent disjunctions but overfit, neural networks can represent nonlinear functions but require careful optimization, and Bayesian networks can encode structured uncertainty but demand modeling assumptions.

Visual

[Diagram: the four modules arranged in a cycle, experiment generator → performance system → critic → generalizer → experiment generator.]

The diagram is deliberately circular. Learning is not a single pass from data to model; the current hypothesis influences future experience, especially in active learning, game learning, robotics, and reinforcement learning.

Worked example 1: Specify a well-posed learning problem

Problem: A hospital wants a program that predicts whether a pneumonia patient is at high risk of complications. Specify $T$, $P$, and $E$ in Mitchell's format.

Method:

  1. Identify the task.

    The program receives a patient record and must assign a risk label such as high or not high.

    $$T = \text{classifying pneumonia patient records by complication risk}.$$

  2. Identify the performance measure.

    A reasonable measure is accuracy on future patients, but in medicine false negatives and false positives may have different costs. Suppose the hospital wants high-risk cases caught reliably. A better performance measure might be area under the ROC curve or sensitivity at a fixed specificity.

    $$P = \text{sensitivity at 90 percent specificity on held-out future cases}.$$

  3. Identify the experience.

    The learner needs historical records containing features and outcomes.

    $$E = \text{past patient records with observed complications}.$$

  4. Check whether the problem is well posed.

    The three parts are concrete enough to test improvement. If a new model has higher sensitivity at the same specificity on a held-out sample, it improved under $P$.

Answer: The well-posed problem is: learn from historical labeled pneumonia records ($E$) to classify new records ($T$), with performance ($P$) measured by sensitivity at 90 percent specificity. The checked answer is valid because it names the future task, the source of training information, and the exact measurement rule.
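
A minimal numpy sketch of this performance measure; the function name, the threshold sweep, and the data are all illustrative constructions, not from the chapter:

import numpy as np

def sensitivity_at_specificity(y_true, scores, min_specificity=0.90):
    """Best sensitivity over all thresholds whose specificity >= min_specificity."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best = 0.0
    for t in np.unique(scores):  # every observed score is a candidate threshold
        pred = scores >= t
        sensitivity = np.sum(pred & y_true) / max(y_true.sum(), 1)
        specificity = np.sum(~pred & ~y_true) / max((~y_true).sum(), 1)
        if specificity >= min_specificity:
            best = max(best, sensitivity)
    return best

# Illustrative data: 1 = complication observed; scores from some risk model.
y = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
s = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.3, 0.5, 0.6, 0.2]
print(sensitivity_at_specificity(y, s))  # 1.0 on this toy data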

Worked example 2: One LMS update for a board evaluator

Problem: A checkers learner represents a board as two features, $x_1 =$ number of black pieces and $x_2 =$ number of red pieces. Its hypothesis is

$$\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2.$$

Suppose $w_0 = 0$, $w_1 = 4$, $w_2 = -3$, the current board has $x_1 = 5$, $x_2 = 4$, the training value is $V_{\text{train}}(b) = 20$, and $\eta = 0.1$. Compute the update.

Method:

  1. Compute the current prediction.

    $$\hat{V}(b) = 0 + 4(5) + (-3)(4) = 20 - 12 = 8.$$

  2. Compute the prediction error.

    $$V_{\text{train}}(b) - \hat{V}(b) = 20 - 8 = 12.$$

  3. Update the bias weight, treating $x_0 = 1$.

    $$w_0 \leftarrow 0 + 0.1(12)(1) = 1.2.$$

  4. Update $w_1$.

    $$w_1 \leftarrow 4 + 0.1(12)(5) = 4 + 6 = 10.$$

  5. Update $w_2$.

    $$w_2 \leftarrow -3 + 0.1(12)(4) = -3 + 4.8 = 1.8.$$

  6. Check the new prediction on the same board.

    $$\hat{V}_{\text{new}}(b) = 1.2 + 10(5) + 1.8(4) = 1.2 + 50 + 7.2 = 58.4.$$

Answer: The updated weights are $(1.2, 10, 1.8)$. The check reveals that with $\eta = 0.1$ and large feature values, the update overshoots the target value of 20 on this single example. That does not make the rule wrong, but it shows why feature scaling and learning-rate choice matter.

Code

import numpy as np

def lms_update(weights, features, target, eta=0.01):
    """One LMS update for a linear value function."""
    x = np.r_[1.0, np.asarray(features, dtype=float)]  # prepend x_0 = 1 for the bias
    prediction = float(weights @ x)
    error = target - prediction                        # V_train(b) - V_hat(b)
    return weights + eta * error * x, prediction, error

# Reproduce worked example 2: prediction 8.0, error 12.0, weights (1.2, 10, 1.8).
weights = np.array([0.0, 4.0, -3.0])
features = np.array([5.0, 4.0])
new_weights, pred, err = lms_update(weights, features, target=20.0, eta=0.1)

print("prediction:", pred)
print("error:", err)
print("new weights:", new_weights)

Common pitfalls

  • Calling a system "learning" without specifying $T$, $P$, and $E$. Without all three, there is no precise claim to evaluate.
  • Confusing the target function with the learned hypothesis. $V$ is the ideal function; $\hat{V}$ is what the learner currently represents.
  • Assuming the training distribution matches the future test distribution. Mitchell notes that much theory assumes this, while practice often violates it.
  • Choosing a representation before deciding what information the performance system needs. A representation is useful only relative to the target function and the final task.
  • Treating self-play as automatically representative. Self-play can generate many examples, but it may miss strategies used by external opponents.
  • Forgetting that a nonoperational definition can still guide learning. The exact minimax value may be too expensive to compute, but it can motivate a learnable approximation.

Connections