
Evaluating Hypotheses

Mitchell's evaluation chapter gives the statistical machinery needed to say whether a learned hypothesis is actually good. Earlier chapters build hypotheses; this chapter asks how accurately their future performance can be estimated from finite samples. The key warning is simple: training accuracy is not enough, and even test accuracy is an estimate with uncertainty.

The chapter is intentionally practical. It introduces sample error, true error, confidence intervals, tests for comparing hypotheses, and paired tests for comparing learning algorithms. These tools connect directly to modern model evaluation, even though modern practice often uses larger datasets, cross-validation, bootstrap procedures, and benchmark suites.

Definitions

Let h be a hypothesis and D be a sample of examples drawn independently from some distribution. For a discrete-valued target function, the sample error is:

error_D(h) = \frac{1}{n}\sum_{x \in D} I(h(x) \neq c(x)),

where n = \vert D \vert and I is 1 when its condition is true and 0 otherwise.

The true error is the probability that h misclassifies a new instance drawn from the same distribution:

error_{\mathcal{D}}(h) = Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)].

A confidence interval gives a range that contains the unknown true value with a specified probability under repeated sampling. For a measured sample error \hat{p} = error_D(h), Mitchell uses normal approximations for large enough samples:

\hat{p} \pm z_N \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.

For an approximate 95 percent interval, z_N \approx 1.96.

Hypothesis testing compares observed differences against what might occur by sampling variation alone. The null hypothesis usually says that no real difference exists.

A paired t-test compares two learning algorithms across matched trials, such as the same train/test splits. Pairing reduces noise because each pair shares the same data conditions.

Key results

If a hypothesis is tested on n independent examples and commits r errors, then the number of errors behaves like a binomial random variable with parameter p = error_{\mathcal{D}}(h):

r \sim Binomial(n, p).

The observed sample error is:

\hat{p} = r/n.

For sufficiently large samples, the binomial can be approximated by a normal distribution. That is why the standard error of the sample proportion is:

SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.
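
A quick simulation can make this concrete: draw many hypothetical test sets for an assumed true error and compare the observed spread of the sample error to the analytic standard error. This is only an illustrative sketch; the values of p_true and n below are arbitrary, not from the chapter.

import math
import random

# Simulate repeated test sets of size n for a classifier whose assumed true
# error is p_true, and compare the empirical spread of the sample error
# to the analytic standard error SE(p_hat).
random.seed(0)
p_true = 0.15      # assumed true error (hypothetical)
n = 200            # test-set size
trials = 10_000

sample_errors = [
    sum(random.random() < p_true for _ in range(n)) / n  # r/n for one test set
    for _ in range(trials)
]

mean_p_hat = sum(sample_errors) / trials
empirical_sd = math.sqrt(
    sum((p - mean_p_hat) ** 2 for p in sample_errors) / (trials - 1)
)
analytic_se = math.sqrt(p_true * (1 - p_true) / n)

print(f"empirical sd of p_hat: {empirical_sd:.4f}")
print(f"analytic SE(p_hat):    {analytic_se:.4f}")  # both come out near 0.025 here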

When comparing two hypotheses h_1 and h_2 on independent test sets, estimate the difference:

d = error_D(h_1) - error_D(h_2).

The uncertainty of this difference depends on both sample sizes. If both hypotheses are tested on the same examples, however, the errors are correlated, and paired methods are more appropriate.
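
A small sketch of the independent-test-set case, assuming the normal approximation holds for both sample errors so that the variance of the difference is the sum of the two variances. The example counts are hypothetical.

import math

def difference_interval(r1, n1, r2, n2, z=1.96):
    # Approximate interval for d = error(h1) - error(h2) when the two
    # hypotheses are evaluated on two independent test sets.
    p1, p2 = r1 / n1, r2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z * se, d + z * se

# Hypothetical counts: h1 makes 30 errors on 200 examples, h2 makes 22 on 250.
print(difference_interval(30, 200, 22, 250))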

For comparing algorithms A and B over k matched trials, compute differences:

\delta_i = error_i(A) - error_i(B).

Then compute:

\bar{\delta} = \frac{1}{k}\sum_{i=1}^k \delta_i, \qquad s_{\delta} = \sqrt{\frac{1}{k-1}\sum_{i=1}^k (\delta_i - \bar{\delta})^2}.

The t statistic is:

t = \frac{\bar{\delta}}{s_{\delta}/\sqrt{k}}.

The interpretation depends on the t distribution with k-1 degrees of freedom.

Mitchell's evaluation framework depends on the same distributional assumption that appears in computational learning theory: the examples used to estimate performance should be drawn from the same distribution as future cases of interest. If the test set is easier, cleaner, or differently sampled than deployment cases, the confidence interval can be numerically correct for the wrong population. This is a statistical version of the training-experience problem from Chapter 1.

The chapter also distinguishes estimating a single hypothesis from comparing algorithms. Once a learning algorithm has used data to choose among many hypotheses, the selected hypothesis is no longer independent of the training sample. That is why validation and test separation matters. A model selected because it happened to perform well on a validation set may have a slightly optimistic validation estimate, so a final untouched test set is still valuable.

In small-sample settings, the normal approximation should be treated as an approximation rather than a law. Mitchell introduces it because it is simple and widely useful, but the underlying binomial structure matters. When errors are rare or test sets are small, exact intervals or resampling methods may be preferable. The main lesson is not the specific constant 1.96; it is that observed error is a random variable.
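
One such resampling alternative is a percentile bootstrap over the per-example right/wrong outcomes. The sketch below is an illustration with hypothetical data, not the chapter's prescription, and it remains only an estimate for very small test sets.

import random

def bootstrap_error_interval(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    # Percentile-bootstrap interval for the error rate.
    # `outcomes` is a list of 0/1 flags, one per test example (1 = misclassified).
    rng = random.Random(seed)
    n = len(outcomes)
    boot_errors = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = boot_errors[int((alpha / 2) * n_boot)]
    hi = boot_errors[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical small test set: 25 examples, 4 errors.
outcomes = [1] * 4 + [0] * 21
print(bootstrap_error_interval(outcomes))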

Cross-validation is a practical response to limited data. Instead of reserving one large test subset during development, the data are partitioned into folds, and each fold is used as a test fold while the others train the model. The resulting errors are averaged. This gives a more stable estimate than a single split when data are scarce, although it is still an estimate and can still be biased if used repeatedly for model selection.
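
A minimal k-fold sketch, with train_fn and error_fn as hypothetical stand-ins for whatever learner and evaluation function are in use:

def k_fold_error(examples, train_fn, error_fn, k=10):
    # Estimate error by k-fold cross-validation: hold out each fold once,
    # train on the remaining folds, and average the held-out error rates.
    folds = [examples[i::k] for i in range(k)]  # simple interleaved partition
    fold_errors = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        hypothesis = train_fn(train_folds)
        fold_errors.append(error_fn(hypothesis, test_fold))
    return sum(fold_errors) / k

In practice the examples should be shuffled before partitioning so that fold membership does not follow the original ordering of the data.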

The paired t-test section is best read as a warning against overinterpreting isolated wins. If algorithm A beats algorithm B on one split, the difference may reflect the split rather than the algorithm. Matched trials ask whether the difference persists across comparable conditions. Even then, the conclusion depends on the number of trials, the variability of the differences, and whether the assumptions behind the test are reasonable.

Evaluation choices should be made before looking at the final results whenever possible. If the metric, test subset, or comparison procedure is changed after seeing outcomes, the reported number can become an artifact of selection. Mitchell's formulas assume a clean sampling story; disciplined experimental design is what makes that story plausible.

The practical habit is to write down the evaluation protocol before tuning begins, then treat deviations as new experiments rather than as the same estimate.

Evaluation question | Suitable tool | Main assumption
How accurate is one hypothesis? | Confidence interval for sample error | Independent test examples
Is one hypothesis better than another? | Difference in error, hypothesis test | Comparable evaluation distributions
Is one algorithm better across datasets or splits? | Paired t-test | Matched trials, approximately normal differences
Is the model overfitting? | Training vs validation/test curves | Validation data represent future cases

Visual

The central discipline is separation of roles. Training fits parameters, validation guides choices, and test data estimate final performance.
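
A minimal sketch of that separation, with hypothetical split fractions:

import random

def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then split into train / validation / test.
    # Training data fit parameters, validation data guide choices,
    # and the test split is held back for the final estimate only.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test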

Worked example 1: Confidence interval for error

Problem: A classifier is tested on n = 200 independent examples and makes r = 30 errors. Estimate a 95 percent confidence interval for its true error.

Method:

  1. Compute sample error.
     \hat{p} = r/n = 30/200 = 0.15.
  2. Compute the standard error.
     SE = \sqrt{\frac{0.15(1-0.15)}{200}} = \sqrt{\frac{0.1275}{200}} = \sqrt{0.0006375} \approx 0.02525.
  3. Use z = 1.96 for an approximate 95 percent interval.
     1.96 \cdot SE \approx 1.96(0.02525) \approx 0.0495.
  4. Form the interval.
     0.15 \pm 0.0495 = [0.1005, 0.1995].
  5. Check reasonableness.
     The interval is centered at the observed 15 percent error and spans about 5 percentage points on each side. With 200 examples, this degree of uncertainty is plausible.

Answer: The approximate 95 percent confidence interval is [0.101, 0.200]. In words, the true error is estimated to be between about 10.1 percent and 20.0 percent.

Worked example 2: Paired t-test calculation

Problem: Algorithms A and B are compared on five matched train/test splits. The observed error differences \delta_i = error_i(A) - error_i(B) are:

[0.03, 0.01, 0.04, -0.01, 0.02].

Compute the paired t statistic.

Method:

  1. Compute the mean difference.
     \bar{\delta} = \frac{0.03+0.01+0.04-0.01+0.02}{5} = \frac{0.09}{5} = 0.018.
  2. Compute deviations from the mean.
     [0.012, -0.008, 0.022, -0.028, 0.002].
  3. Square and sum the deviations.
     0.012^2 + (-0.008)^2 + 0.022^2 + (-0.028)^2 + 0.002^2 = 0.000144 + 0.000064 + 0.000484 + 0.000784 + 0.000004 = 0.00148.
  4. Compute the sample standard deviation.
     s_{\delta} = \sqrt{\frac{0.00148}{5-1}} = \sqrt{0.00037} \approx 0.01924.
  5. Compute t.
     t = \frac{0.018}{0.01924/\sqrt{5}} = \frac{0.018}{0.00860} \approx 2.09.

Answer: The paired t statistic is about 2.09 with 4 degrees of freedom. This suggests A has higher error than B on average, but the sample is small, so the final significance judgment must be made against a t table or software.
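
As a software check, a paired t-test is equivalent to a one-sample t test on the per-trial differences. The sketch below assumes SciPy is available, which the chapter itself does not require.

from scipy import stats

# One-sample t test on the differences reproduces the paired t statistic.
diffs = [0.03, 0.01, 0.04, -0.01, 0.02]
result = stats.ttest_1samp(diffs, popmean=0.0)
print(result.statistic)  # about 2.09, matching the hand calculation
print(result.pvalue)     # two-sided p-value, roughly 0.10 for these numbers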

Code

import math
from statistics import mean, stdev

def error_confidence_interval(errors, n, z=1.96):
    # Normal-approximation confidence interval for the true error, given the
    # number of misclassified examples and the size of the test set.
    p_hat = errors / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def paired_t_statistic(differences):
    # Paired t statistic from per-trial error differences; interpret it against
    # the t distribution with len(differences) - 1 degrees of freedom.
    k = len(differences)
    avg = mean(differences)
    sd = stdev(differences)
    return avg / (sd / math.sqrt(k))

print(error_confidence_interval(errors=30, n=200))
print(paired_t_statistic([0.03, 0.01, 0.04, -0.01, 0.02]))
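
Run as-is, the first call reproduces the interval from worked example 1 (about [0.10, 0.20]) and the second reproduces the paired t statistic of about 2.09 from worked example 2.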

Common pitfalls

  • Evaluating on the training set and reporting the result as future accuracy. This estimates memorization plus generalization, not generalization alone.
  • Reusing the test set for model selection. Once the test set guides choices, it is no longer an unbiased final test.
  • Reporting one accuracy number without uncertainty. Small test sets can make large differences appear by chance.
  • Applying the normal approximation when nn is too small or the error rate is too close to 0 or 1. Exact binomial intervals or other methods may be better.
  • Comparing algorithms on different splits without pairing. Paired comparisons are usually more sensitive because they control for split difficulty.
  • Treating statistical significance as practical significance. A tiny improvement can be statistically detectable but operationally irrelevant.

Connections