
Bayesian Learning

Bayesian learning is both an algorithmic family and an analytical lens. Mitchell uses it to describe classifiers that explicitly compute probabilities, but also to explain why some non-Bayesian procedures can be interpreted as optimizing a probabilistic objective. This chapter links prior beliefs, observed data, likelihood, posterior probability, maximum a posteriori hypotheses, maximum likelihood hypotheses, and minimum description length.

Historically, this treatment is important because it places machine learning inside probability theory rather than only symbolic search or numerical optimization. Modern probabilistic modeling, Bayesian networks, topic models, Gaussian mixtures, and calibrated classifiers all build on the same foundations, even when implemented with newer algorithms.

Definitions

Bayes' theorem states:

P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}.

Here h is a hypothesis and D is the observed data.

| Term | Meaning |
| --- | --- |
| P(h) | Prior probability of the hypothesis before observing D |
| P(D \mid h) | Likelihood of observing D if h were true |
| P(D) | Evidence, or marginal likelihood |
| P(h \mid D) | Posterior probability after observing D |

A maximum a posteriori hypothesis is:

h_{MAP} = \arg\max_{h \in H} P(h \mid D).

Because P(D) is constant with respect to h:

h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h).

A maximum likelihood hypothesis is:

h_{ML} = \arg\max_{h \in H} P(D \mid h).

Maximum likelihood is the special case of MAP when all hypotheses have equal prior probability.

The Bayes optimal classifier predicts the class with greatest posterior probability after averaging across all hypotheses:

v^* = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D).
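A minimal sketch of this averaging, using illustrative posterior values rather than numbers from the text: the MAP hypothesis predicts one class while the posterior-weighted vote favors the other.

```python
import numpy as np

# Illustrative posteriors over three hypotheses (assumed values).
posteriors = np.array([0.4, 0.3, 0.3])          # P(h_i | D)

# P(v_j | h_i): rows are hypotheses, columns are classes (-, +).
class_given_h = np.array([[0.0, 1.0],           # h_1 predicts +
                          [1.0, 0.0],           # h_2 predicts -
                          [1.0, 0.0]])          # h_3 predicts -

# Bayes optimal: weight each hypothesis's vote by its posterior.
class_scores = posteriors @ class_given_h       # [0.6, 0.4]
print("class scores:", class_scores)
print("Bayes optimal class:", ["-", "+"][int(np.argmax(class_scores))])
print("MAP hypothesis predicts:",
      ["-", "+"][int(np.argmax(class_given_h[np.argmax(posteriors)]))])
```

Here the most probable single hypothesis says +, but the two less probable hypotheses jointly outweigh it, so the Bayes optimal prediction is -.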

The Gibbs algorithm samples one hypothesis from the posterior and uses it to classify. Mitchell presents it because, under suitable assumptions, it has an expected error at most twice that of the Bayes optimal classifier.
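A sketch of the Gibbs algorithm's classification step, reusing the same illustrative posterior as above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative posterior over three hypotheses (assumed values).
posteriors = np.array([0.4, 0.3, 0.3])
predictions = ["+", "-", "-"]   # each hypothesis's class for a query point

# Gibbs algorithm: draw one hypothesis from the posterior, classify with it.
sampled = rng.choice(len(posteriors), p=posteriors)
print("sampled hypothesis:", f"h{sampled + 1}", "->", predictions[sampled])
```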

Key results

Bayesian analysis clarifies concept learning. If the learner assumes deterministic, noise-free labels and assigns equal prior probability to each hypothesis, then all consistent hypotheses have equal posterior probability and all inconsistent hypotheses have posterior probability zero. Under those assumptions, the version space is precisely the set of hypotheses with nonzero posterior probability.
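A small sketch under those assumptions, with a hypothetical hypothesis space of threshold functions on integer inputs: consistent hypotheses share the posterior equally, inconsistent ones get zero.

```python
import numpy as np

# Hypothetical hypothesis space: h_k(x) = 1 if x >= k, for thresholds k.
thresholds = [1, 2, 3, 4, 5]
data = [(2, 0), (4, 1)]   # (x, label) pairs, assumed noise-free

def consistent(k):
    return all((x >= k) == bool(label) for x, label in data)

prior = 1.0 / len(thresholds)               # uniform prior over H
scores = np.array([prior if consistent(k) else 0.0 for k in thresholds])
posteriors = scores / scores.sum()          # normalize by P(D)

for k, p in zip(thresholds, posteriors):
    print(f"threshold {k}: posterior {p:.3f}")
# Consistent hypotheses (the version space) each get 1/|VS|; the rest get 0.
```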

MAP and ML also connect to common loss functions. Suppose examples have real-valued targets and independent Gaussian noise with constant variance:

t_i = h(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2).

Maximizing the likelihood is equivalent to minimizing the sum of squared errors:

\sum_i (t_i - h(x_i))^2.

This gives a probabilistic justification for the squared-error objectives used with linear units and many neural-network examples.

For classification with Bernoulli outputs, maximizing likelihood leads to cross-entropy-style objectives. Mitchell's treatment anticipates modern logistic classification, although the notation and applications are smaller in scale.
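A brief sketch of that link, under an assumed Bernoulli model: the negative log-likelihood of binary targets is exactly the cross-entropy objective, so maximizing one minimizes the other.

```python
import numpy as np

# Assumed binary targets and one hypothesis's predicted probabilities P(t=1|x).
t = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Bernoulli log-likelihood of the labels under this hypothesis.
log_lik = np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

# The cross-entropy objective is the negative log-likelihood.
cross_entropy = -log_lik
print("log-likelihood:", round(log_lik, 4))
print("cross-entropy :", round(cross_entropy, 4))
```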

Minimum description length (MDL) is another face of Bayesian preference. It chooses the hypothesis that gives the shortest combined encoding of the hypothesis and the data given the hypothesis:

h_{MDL} = \arg\min_{h} \left[ L(h) + L(D \mid h) \right].

This connects Occam's razor to probability: simpler hypotheses are preferred when they compress the data well, not simply because they are short.
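A toy sketch of that correspondence: if code lengths are chosen as L(h) = -log2 P(h) and L(D \mid h) = -log2 P(D \mid h) (the optimal-coding assumption), minimizing total description length picks the same hypothesis as MAP. The numbers below reuse worked example 1.

```python
import numpy as np

# Assumed probabilities for two hypotheses (same numbers as worked example 1).
priors = np.array([0.70, 0.30])
likelihoods = np.array([0.20, 0.60])

# Code lengths in bits under optimal coding: L = -log2(probability).
L_h = -np.log2(priors)               # bits to describe the hypothesis
L_D_given_h = -np.log2(likelihoods)  # bits to describe the data given h
total = L_h + L_D_given_h

for i, bits in enumerate(total, start=1):
    print(f"h{i}: total description length {bits:.3f} bits")
print("MDL choice: h", np.argmin(total) + 1, sep="")
# Matches the MAP choice: minimizing bits = maximizing P(D|h)P(h).
```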

Bayesian learning also clarifies what it means to combine prior knowledge with data. The prior P(h) can encode a preference for simpler hypotheses, smoother functions, smaller weights, known causal structures, or expert rules. The likelihood P(D \mid h) measures how well each hypothesis predicts the observed evidence. A strong prior can dominate when data are scarce; abundant data can overwhelm a weak prior when the likelihood strongly favors another hypothesis.
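A sketch of that interplay, using a Beta-Bernoulli coin model chosen purely for illustration (it goes beyond the chapter's notation): a strong prior dominates a handful of observations but is overwhelmed as data accumulate.

```python
# Beta(a, b) prior on a coin's heads probability; the posterior after
# `heads` heads in n flips is Beta(a + heads, b + n - heads).
def posterior_mean(a, b, heads, n):
    return (a + heads) / (a + b + n)

a, b = 50, 50              # strong prior belief that the coin is fair
for n in [10, 100, 10000]:
    heads = int(0.8 * n)   # observed data suggest bias toward heads
    print(f"n={n:>6}: posterior mean = {posterior_mean(a, b, heads, n):.3f}")
# With few flips the prior holds the mean near 0.5; with many flips the
# data pull it toward the observed frequency of 0.8.
```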

The difference between MAP and Bayes optimal prediction is a difference between choosing and averaging. MAP chooses the single most probable hypothesis and predicts with it. Bayes optimal prediction averages over the posterior distribution of hypotheses. Averaging accounts for uncertainty, but it is often computationally expensive because it requires summing or integrating over many hypotheses. This is why MAP and approximations such as the Gibbs algorithm appear in practical discussions.

Mitchell's connection between likelihood and squared error is a recurring bridge across chapters. The LMS rule from Chapter 1 and the delta rule from Chapter 4 can be interpreted as optimizing a likelihood under Gaussian noise assumptions. This does not mean squared error is always right. It means each loss function carries an implicit noise model. Choosing a loss is partly choosing a belief about how targets are generated around the ideal function.

Bayesian classifiers also invite a distinction between decision making and probability estimation. A classifier may choose the right class even if its probabilities are poorly calibrated. Conversely, a well-calibrated model may sometimes choose the wrong class because the evidence is genuinely ambiguous. Mitchell's presentation of Bayes optimal classification focuses on minimizing expected classification error, but the posterior probabilities can also support decisions with unequal costs, such as medical screening where false negatives may be more expensive than false positives.
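A minimal sketch of cost-sensitive decisions from posterior class probabilities, with an invented cost matrix: the action with lowest expected cost can differ from the most probable class.

```python
import numpy as np

# Assumed posterior over classes: P(healthy|x), P(diseased|x).
posterior = np.array([0.7, 0.3])

# Invented costs: rows are actions (clear, treat), columns are true classes.
# A false negative (clearing a diseased patient) is made far more expensive.
costs = np.array([[0.0, 50.0],
                  [1.0,  0.0]])

expected_cost = costs @ posterior
print("expected cost per action:", expected_cost)        # [15.0, 0.7]
print("most probable class:", int(np.argmax(posterior)))       # 0 (healthy)
print("cost-minimizing action:", int(np.argmin(expected_cost)))  # 1 (treat)
```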

The assumption that the hypothesis space is known is another important simplification. In practice, model structure, features, and priors are designed by the engineer. A Bayesian calculation can be exact relative to those choices and still miss the real data-generating process. That is why the Bayesian view complements, rather than replaces, evaluation on held-out data.

This is also why Bayesian results in the book often serve two roles at once. They give algorithms for learners that explicitly manipulate probabilities, and they give explanations for why other algorithms behave sensibly under certain assumptions. Reading the chapter both ways helps connect symbolic consistency, squared-error training, probabilistic classification, and Occam-style simplicity.

Visual

| Principle | Optimization form | Intuition | Mitchell connection |
| --- | --- | --- | --- |
| MAP | \max_h P(D \mid h) P(h) | Fit data while respecting prior belief | Bayesian concept learning |
| ML | \max_h P(D \mid h) | Choose the hypothesis that makes the data most probable | Squared-error and likelihood derivations |
| Bayes optimal | Average predictions over all hypotheses | Use posterior uncertainty instead of one winner | Theoretical ideal classifier |
| Gibbs | Sample one hypothesis from the posterior | Cheaper randomized approximation | Error bound relative to Bayes optimal |
| MDL | \min_h [L(h) + L(D \mid h)] | Prefer concise total explanations | Occam-style bias and pruning |

Worked example 1: Compute a MAP hypothesis

Problem: Two hypotheses can explain a dataset D.

| Hypothesis | Prior P(h) | Likelihood P(D \mid h) |
| --- | --- | --- |
| h_1 | 0.70 | 0.20 |
| h_2 | 0.30 | 0.60 |

Find the MAP hypothesis and normalized posterior probabilities.

Method:

  1. Compute unnormalized posterior scores.
score(h_1) = P(D \mid h_1)\,P(h_1) = 0.20 \times 0.70 = 0.14, \qquad score(h_2) = P(D \mid h_2)\,P(h_2) = 0.60 \times 0.30 = 0.18.
  2. Compute the evidence by summing the scores over the two hypotheses.
P(D) = 0.14 + 0.18 = 0.32.
  3. Normalize.
P(h_1 \mid D) = 0.14 / 0.32 = 0.4375, \qquad P(h_2 \mid D) = 0.18 / 0.32 = 0.5625.
  4. Choose the hypothesis with the largest posterior.
h_{MAP} = h_2.

Answer: h_2 is the MAP hypothesis, with posterior probability 0.5625. The data favor h_2 enough to overcome its smaller prior.

Worked example 2: Show why Gaussian noise gives squared error

Problem: Suppose D = \{(x_i, t_i)\}_{i=1}^n and t_i = h(x_i) + \epsilon_i with independent Gaussian noise \epsilon_i \sim \mathcal{N}(0, \sigma^2). Show why maximizing the likelihood is equivalent to minimizing squared error.

Method:

  1. Write the likelihood of a single example.
P(t_i \mid h, x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(t_i - h(x_i))^2}{2\sigma^2} \right).
  2. Independence makes the full likelihood a product.
P(D \mid h) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(t_i - h(x_i))^2}{2\sigma^2} \right).
  3. Take the log likelihood.
\log P(D \mid h) = \sum_{i=1}^n \left[ -\log\sqrt{2\pi\sigma^2} - \frac{(t_i - h(x_i))^2}{2\sigma^2} \right].
  4. Separate the constants from the hypothesis-dependent part.
\log P(D \mid h) = C - \frac{1}{2\sigma^2} \sum_i (t_i - h(x_i))^2.
  5. Maximize the log likelihood.

     Since C and 1/(2\sigma^2) do not depend on h, maximizing this expression is equivalent to minimizing:

\sum_i (t_i - h(x_i))^2.

Answer: Under independent constant-variance Gaussian noise, maximum likelihood learning is least-squares learning.

Code

```python
import numpy as np

# Worked example 1: priors and likelihoods for the two hypotheses.
priors = np.array([0.70, 0.30])
likelihoods = np.array([0.20, 0.60])

# Unnormalized posterior scores P(D|h)P(h), then normalize by P(D).
scores = priors * likelihoods
posteriors = scores / scores.sum()
map_index = int(np.argmax(posteriors))

print("posterior probabilities:", posteriors)
print("MAP hypothesis:", f"h{map_index + 1}")
```
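A second sketch, checking worked example 2 numerically on synthetic data (an illustration with assumed values, not from the text): a hypothesis that ranks higher by Gaussian log-likelihood always ranks lower by sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: targets are a true linear function plus Gaussian noise.
x = np.linspace(0.0, 1.0, 50)
sigma = 0.1
t = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.shape)

def log_likelihood(pred, t, sigma):
    # Sum of per-example Gaussian log densities.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (t - pred) ** 2 / (2 * sigma**2))

# Two candidate hypotheses (assumed lines): the truth and a worse fit.
candidates = {"h_a": 2.0 * x + 1.0, "h_b": 1.5 * x + 1.2}
for name, pred in candidates.items():
    sse = np.sum((t - pred) ** 2)
    print(name, "log-likelihood:", round(log_likelihood(pred, t, sigma), 2),
          "SSE:", round(sse, 2))
# The higher-likelihood hypothesis is exactly the lower-SSE hypothesis.
```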

Common pitfalls

  • Confusing P(h \mid D) with P(D \mid h). The posterior and likelihood answer different questions.
  • Dropping the prior without noticing. Maximum likelihood is not always appropriate when prior knowledge is meaningful.
  • Treating the evidence P(D) as irrelevant in all contexts. It is irrelevant for comparing fixed hypotheses by MAP, but important for normalized probabilities and model comparison.
  • Saying "Bayesian" when only a point estimate is used. MAP is Bayesian in its objective, but it still returns one hypothesis.
  • Interpreting MDL as "shortest hypothesis wins." MDL minimizes the code length of hypothesis plus remaining data, so an overly simple hypothesis can lose if it fails to explain the data.
  • Assuming all posterior computations are tractable. Bayes optimal classification is often a theoretical standard rather than a practical algorithm.

Connections