Concept Learning and Version Spaces
Concept learning is Mitchell's cleanest introduction to machine learning as search. The learner sees labeled examples and tries to infer a boolean-valued concept, such as whether a day is suitable for enjoying a sport. The chapter is intentionally symbolic: hypotheses are logical descriptions, examples are attribute vectors, and learning is movement through a partially ordered hypothesis space.
This treatment feels different from modern large-scale statistical learning, but it remains useful because it exposes assumptions that are often hidden. Every learner has a bias. Every hypothesis language excludes some possible target concepts. Every generalization from observed examples to unseen examples requires an assumption beyond the data itself.
Definitions
An instance is an object described by attributes. In the classic EnjoySport example, attributes include sky, air temperature, humidity, wind, water, and forecast.
A concept is a boolean-valued function over the instance space: $c : X \to \{0, 1\}$.
A positive example has $c(x) = 1$; a negative example has $c(x) = 0$.
A hypothesis is the learner's candidate description of the concept. A hypothesis is consistent with a training set if it classifies every training example correctly.
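The consistency check can be sketched in a few lines, assuming the tuple encoding of hypotheses introduced below (each constraint is a specific value or `"?"`; function names here are illustrative):

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """True if h classifies every (instance, label) pair correctly."""
    return all(covers(h, x) == label for x, label in examples)

examples = [(("Sunny", "Weak"), True), (("Rainy", "Weak"), False)]
print(consistent(("Sunny", "?"), examples))  # True: covers the positive, rejects the negative
print(consistent(("?", "?"), examples))      # False: also covers the negative
```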
In the conjunctive representation used by Mitchell, each hypothesis is a tuple of constraints. A constraint may be:
| Symbol | Meaning |
|---|---|
| A specific value, such as `Sunny` | The attribute must equal that value |
| `?` | Any value is allowed |
| `0` (the empty constraint) | No value is allowed, so no instance matches |
The more-general-than relation orders hypotheses. A hypothesis $h_j$ is more general than or equal to $h_k$, written $h_j \geq_g h_k$, if every instance classified positive by $h_k$ is also classified positive by $h_j$: $(\forall x \in X)\,[(h_k(x) = 1) \rightarrow (h_j(x) = 1)]$.
The version space $VS_{H,D}$ is the subset of hypotheses in $H$ that remain consistent with all observed training examples $D$: $VS_{H,D} = \{ h \in H \mid \text{Consistent}(h, D) \}$.
The specific boundary $S$ contains the maximally specific hypotheses in the version space. The general boundary $G$ contains the maximally general hypotheses in the version space. Together, $S$ and $G$ compactly represent all consistent hypotheses between them.
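For the conjunctive language, the ordering need not quantify over all instances; it can be checked constraint by constraint. A minimal sketch, using `None` as an encoding of the empty constraint (an implementation choice, not Mitchell's notation):

```python
def more_general_or_equal(hj, hk):
    """True if hj >=_g hk: every instance hk labels positive,
    hj labels positive as well."""
    # A hypothesis containing the empty constraint matches no instance,
    # so every hypothesis is more general than or equal to it.
    if any(c is None for c in hk):
        return True
    # Otherwise each constraint of hj must subsume the one in hk.
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

print(more_general_or_equal(("?", "Weak"), ("Sunny", "Weak")))  # True
print(more_general_or_equal(("Sunny", "?"), ("?", "Weak")))     # False: misses (Rainy, Weak)
print(more_general_or_equal(("Sunny", "Weak"), (None, None)))   # True, vacuously
```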
Key results
Concept learning can be formulated as search through $H$. The training examples do not merely score hypotheses; in the noiseless formulation they eliminate inconsistent hypotheses. Positive examples remove hypotheses that are too specific to cover them. Negative examples remove hypotheses that are too general because they cover an instance they should reject.
The FIND-S algorithm searches from the most specific hypothesis upward. It begins with the hypothesis that matches nothing. Each positive example forces the current hypothesis to become just general enough to include that example. Negative examples are ignored because FIND-S assumes the target concept is in $H$ and the examples are noise-free. Under those assumptions, the result is a maximally specific consistent hypothesis.
Candidate-elimination maintains both $S$ and $G$. The update rules are symmetric:

| Training example | Update to $S$ | Update to $G$ |
|---|---|---|
| Positive | Minimally generalize members that do not cover it | Remove members that do not cover it |
| Negative | Remove members that cover it | Minimally specialize members that cover it so they exclude it |
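The update rules above can be sketched for the two-attribute conjunctive language. This is a simplified illustration, not a full CANDIDATE-ELIMINATION implementation: it skips pruning of non-maximal `G` members and non-minimal `S` members, which is enough for small noise-free examples:

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """True if hj >=_g hk in the conjunctive language."""
    if any(c is None for c in hk):   # the empty constraint matches nothing
        return True
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

def min_generalize(h, x):
    """Minimally generalize h so it covers the positive instance x."""
    return tuple(v if c is None else (c if c == v else "?")
                 for c, v in zip(h, x))

def min_specializations(g, x, domains):
    """Minimal specializations of g that exclude the negative instance x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]

def update(S, G, x, label, domains):
    """One candidate-elimination step (simplified, see lead-in)."""
    if label:                                 # positive example
        G = [g for g in G if covers(g, x)]
        S = [min_generalize(s, x) for s in S]
    else:                                     # negative example
        S = [s for s in S if not covers(s, x)]
        kept = [g for g in G if not covers(g, x)]
        specs = [spec
                 for g in G if covers(g, x)
                 for spec in min_specializations(g, x, domains)
                 # keep only specializations above some member of S
                 if any(more_general_or_equal(spec, s) for s in S)]
        G = kept + specs
    return S, G

# One positive, then one negative example (matches worked example 2 below).
domains = [("Sunny", "Rainy"), ("Weak", "Strong")]
S, G = [(None, None)], [("?", "?")]
S, G = update(S, G, ("Sunny", "Weak"), True, domains)
S, G = update(S, G, ("Sunny", "Strong"), False, domains)
print(S, G)   # [('Sunny', 'Weak')] [('?', 'Weak')]
```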
The version space representation theorem says that the version space is exactly the set of hypotheses more general than or equal to some member of $S$ and more specific than or equal to some member of $G$: $VS_{H,D} = \{ h \in H \mid (\exists s \in S)(\exists g \in G)\,[\,g \geq_g h \geq_g s\,] \}$.
This theorem matters because $VS_{H,D}$ can be large, but the boundary sets can be much smaller. The cost is that the representation works cleanly only when the hypothesis language, labels, and ordering behave as assumed.
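On a tiny language the theorem can be checked exhaustively. The sketch below enumerates every non-empty conjunctive hypothesis over two small attribute domains (domains and data are illustrative, matching the worked examples later), extracts the boundary sets by brute force, and asserts that every consistent hypothesis lies between them:

```python
from itertools import product

def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

domains = [("Sunny", "Rainy"), ("Weak", "Strong")]
examples = [(("Sunny", "Weak"), True), (("Sunny", "Strong"), False)]

# Every non-empty conjunctive hypothesis: each attribute is a value or "?".
H = list(product(*[d + ("?",) for d in domains]))

# The version space: hypotheses consistent with every example.
VS = [h for h in H if all(covers(h, x) == y for x, y in examples)]

# S: members with no strictly more specific consistent hypothesis below them.
S = [h for h in VS
     if not any(more_general_or_equal(h, k) and h != k for k in VS)]
# G: members with no strictly more general consistent hypothesis above them.
G = [h for h in VS
     if not any(more_general_or_equal(k, h) and k != h for k in VS)]

# Representation theorem: every member of VS sits between S and G.
assert all(any(more_general_or_equal(h, s) for s in S) and
           any(more_general_or_equal(g, h) for g in G) for h in VS)
print(S, G, len(VS))   # [('Sunny', 'Weak')] [('?', 'Weak')] 2
```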
Inductive bias is unavoidable. Mitchell defines the inductive bias of a learner as the set of assumptions that, together with the training data, deductively justify the learner's classifications of unseen instances. A bias-free learner over the full power set of $X$ cannot classify any unseen instance with confidence, because both labels remain consistent with the observed data.
The version-space view also separates two questions that are easy to mix together. The first question is consistency: which hypotheses have not yet been contradicted by the observed examples? The second question is preference: among those hypotheses, which one should be used when the system must act now? Candidate-elimination answers the first question by maintaining all consistent hypotheses. It does not, by itself, always answer the second question with a single classifier. Mitchell discusses voting among version-space members as one possible use of a partially learned concept, but that requires an additional rule for aggregating disagreement.
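Voting can be sketched directly. After the single positive example `(Sunny, Weak)` in the two-attribute language, the version space contains four hypotheses; all agree on the training instance but split on unseen ones. The aggregation rule below (count positive vs. negative votes) is one possible choice, not the only one:

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def vote(version_space, x):
    """Count positive vs. negative votes for instance x across all
    consistent hypotheses; only a unanimous vote is fully justified
    by the observed data."""
    pos = sum(covers(h, x) for h in version_space)
    return pos, len(version_space) - pos

# Version space after the single positive example (Sunny, Weak).
vs = [("Sunny", "Weak"), ("Sunny", "?"), ("?", "Weak"), ("?", "?")]
print(vote(vs, ("Sunny", "Weak")))    # (4, 0): unanimous positive
print(vote(vs, ("Rainy", "Strong")))  # (1, 3): not unanimous, data alone do not decide
```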
FIND-S illustrates the danger of returning one boundary point as though it were the whole state of knowledge. Its final hypothesis is maximally specific among consistent hypotheses, so it tends to say "negative" for unseen instances unless forced to generalize. This can be appropriate when false positives are costly, but it is not a neutral conclusion from the data. A maximally general consistent hypothesis would behave differently on the same unseen instance. The training examples alone do not decide between them.
Noise is the main practical fracture in the clean theory. A single mislabeled positive example can force to generalize incorrectly. A single mislabeled negative example can force to specialize incorrectly or even collapse the version space to empty. Later chapters handle this by replacing exact consistency with statistical preference, likelihood, pruning, margins, or validation error. That progression is one of the book's important arcs: exact logical consistency is elegant, but real learning systems usually need controlled tolerance for imperfect data.
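The collapse can be seen by brute force: add one mislabeled copy of a positive example and the set of consistent conjunctive hypotheses becomes empty (domains and data are illustrative):

```python
from itertools import product

domains = [("Sunny", "Rainy"), ("Weak", "Strong")]

def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def version_space(examples):
    """All non-empty conjunctive hypotheses consistent with the data."""
    H = product(*[d + ("?",) for d in domains])
    return [h for h in H
            if all(covers(h, x) == y for x, y in examples)]

clean = [(("Sunny", "Weak"), True), (("Sunny", "Strong"), False)]
# Mislabel (Sunny, Weak) as negative in a noisy copy of the data.
noisy = clean + [(("Sunny", "Weak"), False)]

print(len(version_space(clean)))  # 2 consistent hypotheses remain
print(len(version_space(noisy)))  # 0: the version space has collapsed
```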
The chapter is therefore best read as a conceptual laboratory. The EnjoySport hypothesis language is small, the labels are clean, and the ordering is easy to inspect. Those simplifications make it possible to see the deeper structure of supervised learning: examples rule out hypotheses, representation controls what can be ruled in, and bias is what permits any prediction beyond the observed sample.
Visual
The diagram shows the version space as a band in the hypothesis lattice. The boundaries move inward as evidence accumulates.
Worked example 1: Run FIND-S on simple weather data
Problem: Learn the concept PlayTennis from attributes (Sky, Wind). Hypotheses use specific values, `?`, and the empty constraint `0`. Use the following examples:
| Example | Instance | Label |
|---|---|---|
| 1 | (Sunny, Weak) | positive |
| 2 | (Sunny, Strong) | positive |
| 3 | (Rainy, Weak) | negative |
Method:

- Initialize with the most specific hypothesis, `h = (0, 0)`. It matches no instance.
- Process example 1, `(Sunny, Weak)`, positive. The hypothesis must cover this instance, so replace each empty constraint by the observed value: `h = (Sunny, Weak)`.
- Process example 2, `(Sunny, Strong)`, positive. Compare attribute by attribute:
  - Sky: current constraint `Sunny`, observed `Sunny`; keep `Sunny`.
  - Wind: current constraint `Weak`, observed `Strong`; generalize to `?`.

  The result is `h = (Sunny, ?)`.
- Process example 3, `(Rainy, Weak)`, negative. FIND-S ignores negative examples. As a check, the current hypothesis does not cover it: `h((Rainy, Weak)) = 0`, because `Rainy` does not satisfy `Sunny`.
Answer: FIND-S returns `h = (Sunny, ?)`. The checked answer is consistent with all examples: both positive sunny instances are covered, and the rainy negative instance is rejected.
Worked example 2: One candidate-elimination update
Problem: Same attributes (Sky, Wind) with domains Sky={Sunny,Rainy} and Wind={Weak,Strong}. Start after one positive example `(Sunny, Weak)`. Then process a negative example `(Sunny, Strong)`. What happens to $S$ and $G$?
Method:

- After the positive example, the specific boundary is `S = {(Sunny, Weak)}`.
- The general boundary initially contains the most general hypothesis, which covers the positive example: `G = {(?, ?)}`.
- Process the negative example `(Sunny, Strong)`. Any member of $S$ that covers the negative example must be removed. The current member requires weak wind, so it does not cover `(Sunny, Strong)`. Thus $S$ is unchanged: `S = {(Sunny, Weak)}`.
- Any member of $G$ that covers the negative example is too general. `(?, ?)` covers it, so specialize it minimally. Candidate specializations of `(?, ?)` that exclude `(Sunny, Strong)`:
  - `(Rainy, ?)` excludes the negative by sky.
  - `(?, Weak)` excludes the negative by wind.
- Keep only specializations that are more general than some member of $S$. Check each:
  - `(Rainy, ?)` does not cover `(Sunny, Weak)`, so it is not more general than the member of $S$; discard it.
  - `(?, Weak)` covers `(Sunny, Weak)`, so it remains.

Answer:

`S = {(Sunny, Weak)}` and `G = {(?, Weak)}`. The version space now contains only hypotheses that require weak wind and still cover the positive example.
Code
```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


def find_s(examples, n_attrs):
    """FIND-S: start from the most specific hypothesis and minimally
    generalize it to cover each positive example."""
    h = [None] * n_attrs          # None plays the role of the empty constraint
    for x, label in examples:
        if not label:             # FIND-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:      # first positive: adopt the observed value
                h[i] = value
            elif h[i] != value:   # conflicting values: generalize to "?"
                h[i] = "?"
    return tuple(h)


examples = [
    (("Sunny", "Weak"), True),
    (("Sunny", "Strong"), True),
    (("Rainy", "Weak"), False),
]

h = find_s(examples, n_attrs=2)
print(h)                               # ('Sunny', '?')
print(covers(h, ("Sunny", "Weak")))    # True
print(covers(h, ("Rainy", "Weak")))    # False
```
Common pitfalls
- Treating FIND-S as a general-purpose classifier. It is fragile when there is noise, when the target concept is not in $H$, or when negative examples are essential.
- Forgetting that `?` means "any value", not "unknown value". It is a constraint in the hypothesis language.
- Confusing a maximally specific consistent hypothesis with the true concept. It is only one boundary point of the remaining version space.
- Assuming candidate-elimination always converges to one hypothesis. It can end with many hypotheses, or an empty version space if the data are noisy or the target is outside $H$.
- Ignoring inductive bias. Without restrictions on $H$ or preferences among hypotheses, no learner can justify predictions on unseen cases.
- Applying the version-space theorem after changing the representation to one without a clean general-to-specific partial order.