Instance-Based Learning
Instance-based learning postpones generalization until a query arrives. Instead of eagerly building a global model during training, the learner stores examples and uses nearby cases to answer new questions. Mitchell uses this chapter to introduce k-nearest neighbor, distance-weighted voting, locally weighted regression, radial basis functions, case-based reasoning, and the distinction between lazy and eager learning.
The central idea is locality. If similar instances tend to have similar target values, then nearby training examples are useful evidence. This is simple and powerful, but it makes the choice of distance metric, feature scaling, and local weighting extremely important.
Definitions
The k-nearest neighbor algorithm stores all training examples. To classify a query $x_q$, it finds the $k$ training examples closest to $x_q$ and predicts by majority vote:

$$\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)),$$

where $\delta(a, b) = 1$ if $a = b$ and $0$ otherwise.
For regression, it averages the target values of the $k$ nearest neighbors:

$$\hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$$
Distance-weighted nearest neighbor gives closer examples more influence:

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i \, f(x_i)}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)^2}$$
Practical implementations often add a small constant to the denominator of $w_i$ to avoid division by zero when the query coincides with a stored example.
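As a minimal sketch of distance-weighted voting (the function name and the `eps` stabilizer are illustrative, not from the chapter):

```python
import numpy as np

def weighted_vote(X, y, query, k=3, eps=1e-8):
    """Distance-weighted k-NN classification with a small stabilizer."""
    d = np.linalg.norm(X - query, axis=1)   # distance to every stored example
    idx = np.argsort(d)[:k]                 # the k nearest neighbors
    w = 1.0 / (d[idx] ** 2 + eps)           # closer neighbors get larger weights
    scores = {}
    for i, wi in zip(idx, w):
        scores[y[i]] = scores.get(y[i], 0.0) + wi
    return max(scores, key=scores.get)      # class with the largest weighted vote

X = np.array([[1, 1], [2, 1], [4, 4], [2, 3]], dtype=float)
y = np.array(["A", "A", "B", "B"])
print(weighted_vote(X, y, np.array([2.0, 2.0])))  # prints "A"
```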
Locally weighted regression fits a local model around the query point, often by minimizing a kernel-weighted squared error:

$$E(x_q) = \frac{1}{2} \sum_{x \in D} K(d(x_q, x)) \left( f(x) - \hat{f}(x) \right)^2$$
Here $K$ is a kernel or weighting function that decreases with distance from $x_q$, and $\hat{f}$ is the local model being fit.
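A minimal locally weighted linear regression sketch, assuming a Gaussian kernel with an illustrative bandwidth parameter `tau`:

```python
import numpy as np

def lwr_predict(X, y, query, tau=1.0):
    """Fit a weighted linear model around the query, then evaluate it there."""
    d = np.linalg.norm(X - query, axis=1)
    w = np.exp(-d**2 / (2 * tau**2))            # kernel weights decay with distance
    A = np.hstack([np.ones((len(X), 1)), X])    # design matrix with intercept column
    W = np.sqrt(w)[:, None]
    # Weighted least squares, done as ordinary least squares on rescaled rows
    beta, *_ = np.linalg.lstsq(W * A, W[:, 0] * y, rcond=None)
    return np.concatenate(([1.0], query)) @ beta

X = np.array([[2.0], [4.0], [7.0]])
y = np.array([10.0, 14.0, 30.0])
print(lwr_predict(X, y, np.array([3.0]), tau=2.0))
```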
Radial basis functions use localized functions centered at selected points $x_u$:

$$\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u \, K_u(d(x_u, x)),$$

where each kernel $K_u$ is commonly a Gaussian, $K_u(d(x_u, x)) = e^{-d^2(x_u, x) / 2\sigma_u^2}$.
The prediction is a weighted combination of these basis functions.
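A small sketch of an RBF predictor with Gaussian units; the centers and width here are assumptions for illustration, not a prescribed choice:

```python
import numpy as np

def rbf_design(X, centers, sigma=1.0):
    """Each column is one Gaussian unit's response over all inputs."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-d**2 / (2 * sigma**2))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
centers = np.array([[0.5], [2.5]])            # assumed centers, e.g. from clustering

Phi = rbf_design(X, centers)                  # hidden-layer activations
Phi = np.hstack([np.ones((len(X), 1)), Phi])  # add a bias unit for w_0
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # eager fit of the output weights

q = np.array([[1.5]])
pred = np.hstack([np.ones((1, 1)), rbf_design(q, centers)]) @ w
print(pred)
```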
Key results
Instance-based methods can approximate complex target functions by combining simple local decisions. k-NN has almost no training cost beyond storing examples, but prediction can be expensive because distances to many stored examples must be computed. This is the classic lazy-learning tradeoff.
The choice of distance function is part of the hypothesis bias. Euclidean distance implies that attributes are numeric, scaled comparably, and meaningful under straight-line geometry. If one feature has a much larger numeric range than others, it can dominate the neighbor relation.
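A contrived illustration of scale dominance, with z-score standardization as one common remedy:

```python
import numpy as np

X = np.array([[1000.0, 0.1], [2000.0, 0.9], [1500.0, 0.5]])
q = np.array([1000.0, 0.9])

# Raw distances: the first feature's range swamps the second entirely.
print(np.linalg.norm(X - q, axis=1))

# After standardization both features contribute comparably.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs, qs = (X - mu) / sigma, (q - mu) / sigma
print(np.linalg.norm(Xs - qs, axis=1))
```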
High-dimensional spaces create difficulty. As dimensions increase, distances often become less informative because points become sparse. This is one form of the curse of dimensionality. Mitchell's chapter predates many modern approximate-nearest-neighbor systems, but the underlying issue remains.
Locally weighted regression differs from k-NN voting by fitting a local approximation at query time. The local model can be constant, linear, or more complex. A local linear model can capture slope near the query even when the global target function is nonlinear.
Case-based reasoning extends nearest-neighbor thinking to structured problems. Instead of merely voting over labels, the system retrieves similar past cases, adapts their solutions, and stores the new solved case for future use.
Lazy learning changes when computation and commitment happen. An eager learner such as ID3 or backpropagation commits to a global hypothesis before seeing the query. A lazy learner waits, then forms a local approximation tailored to the query. This can be an advantage when the target function has different behavior in different regions. The learner does not need one global formula to fit all regions equally well.
The price is that storage and indexing become part of the learning system. A naive k-NN implementation compares the query with every stored example. For small datasets this is fine; for large datasets it can dominate runtime. Data structures such as k-d trees, ball trees, and approximate nearest-neighbor indexes can help, but their effectiveness depends on dimension and metric structure.
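One way to avoid the full scan, assuming SciPy is available, is a k-d tree index; as noted, its benefit shrinks as dimension grows:

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10000, 3)        # many stored examples, low dimension
tree = cKDTree(X)                   # built once, before any queries arrive

query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=3)   # neighbor lookup without a full scan
print(distances, indices)
```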
The bias-variance tradeoff appears through $k$ and kernel bandwidth. Small $k$ or narrow kernels produce low-bias, high-variance predictions that follow local quirks. Large $k$ or wide kernels produce smoother, higher-bias predictions. Locally weighted regression makes this tradeoff continuous: the kernel width decides how quickly examples lose influence with distance.
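A tiny synthetic illustration of this tradeoff: $k = 1$ chases noise, while $k$ equal to the dataset size collapses to a global average:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy 1-D regression data

def knn_regress(xq, k):
    """Average the targets of the k points nearest to xq."""
    idx = np.argsort(np.abs(x - xq))[:k]
    return y[idx].mean()

for k in (1, 5, 40):
    # k=1 tracks individual noisy points; k=40 is just the global mean
    print(k, knn_regress(5.0, k))
```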
Irrelevant features are especially harmful for nearest-neighbor methods. If many attributes have nothing to do with the target, they still contribute to distance unless the metric ignores or down-weights them. Two examples can appear far apart because of irrelevant coordinates even when they are similar in the dimensions that matter. Feature selection, metric learning, and domain-specific similarity functions are therefore natural companions to instance-based learning.
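A small synthetic demonstration of irrelevant coordinates pushing genuinely similar points apart:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 1.0])              # similar in the two relevant dimensions
b = np.array([1.1, 0.9])
print(np.linalg.norm(a - b))          # small: the points are close

# Append 50 irrelevant noisy coordinates to each point.
a_big = np.concatenate([a, rng.normal(size=50)])
b_big = np.concatenate([b, rng.normal(size=50)])
print(np.linalg.norm(a_big - b_big))  # much larger: noise dominates the distance
```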
Locally weighted regression can be seen as building a new model for every query. At one query point, the nearby examples might support a rising linear trend; at another query point, they might support a flat or falling trend. This is more flexible than fitting one global line, but it means the learner must solve a weighted fitting problem repeatedly. Mitchell's discussion highlights the general theme: lazy methods save effort during training by spending it at prediction time.
Radial basis function networks sit between lazy and eager learning. Once centers and widths are chosen, prediction is an eager weighted combination of basis functions. But the intuition is still local: each radial unit responds most strongly near its center. This makes RBFs a bridge from nearest-neighbor locality to neural-network-style parametric prediction.
Tie-breaking and class imbalance are small details that can change predictions. With even $k$, a query may receive equal votes from two classes. With imbalanced data, the majority class may dominate neighborhoods even when minority-class examples are closer. Weighted voting, odd $k$, prior adjustment, or class-sensitive distance rules can make the behavior match the task better.
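One possible tie-handling sketch: fall back to distance-weighted votes when the raw count is tied (the fallback rule is an illustrative choice, not the chapter's prescription):

```python
import numpy as np
from collections import Counter

def knn_with_tiebreak(X, y, query, k=4, eps=1e-8):
    d = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(d)[:k]
    counts = Counter(y[i] for i in idx).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]                    # clear majority, no tie
    # Tie: let the closer neighbors decide via inverse-square weights.
    weights = {}
    for i in idx:
        weights[y[i]] = weights.get(y[i], 0.0) + 1.0 / (d[i] ** 2 + eps)
    return max(weights, key=weights.get)

X = np.array([[1, 1], [2, 1], [4, 4], [2, 3]], dtype=float)
y = np.array(["A", "A", "B", "B"])
print(knn_with_tiebreak(X, y, np.array([2.0, 2.0]), k=4))  # tie broken toward "A"
```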
The chapter's terminology of lazy and eager learning is also a reminder that there is no free generalization. A lazy learner may look assumption-light because it stores examples, but its distance metric and weighting rule are strong assumptions about the target function. It generalizes by saying nearby points should behave similarly. That assumption is powerful when the feature space is meaningful and weak when similarity has been poorly encoded.
For this reason, instance-based learning often benefits from careful preprocessing. Normalization, removal of irrelevant attributes, and domain-specific distance functions can matter more than the choice between several neighbor-voting variants.
When the distance notion is wrong, storing more examples can simply make the learner confidently localize around the wrong neighbors.
| Method | Training time | Prediction time | Main design choice | Output type |
|---|---|---|---|---|
| k-NN | Low | High | $k$ and distance metric | Classification or regression |
| Distance-weighted k-NN | Low | High | Weight function | Classification or regression |
| Locally weighted regression | Low to moderate | High | Kernel and local model | Regression |
| RBF network | Moderate | Moderate | Centers and widths | Classification or regression |
| Case-based reasoning | Domain dependent | Domain dependent | Retrieval and adaptation rules | Structured solutions |
Visual
Feature space around query q:

```
  x2
  ^
  |
A |        B
  |     o
  |    o
  |  q *
  |      o
A |       o
  +-----------------> x1
```
Nearest points around q vote or fit a local model.
Farther points may be ignored or down-weighted.
The figure emphasizes that the model used for $x_q$ is built from the neighborhood of $x_q$, not from a single global partition learned in advance.
Worked example 1: k-NN classification
Problem: A query point is $x_q = (2, 2)$. Four labeled training examples are:

| Point | Class |
|---|---|
| (1, 1) | A |
| (2, 1) | A |
| (4, 4) | B |
| (2, 3) | B |

Use Euclidean distance and $k = 3$.

Method:

- Compute the distance to (1, 1): $\sqrt{1^2 + 1^2} \approx 1.414$.
- Compute the distance to (2, 1): $\sqrt{0^2 + 1^2} = 1.000$.
- Compute the distance to (4, 4): $\sqrt{2^2 + 2^2} \approx 2.828$.
- Compute the distance to (2, 3): $\sqrt{0^2 + 1^2} = 1.000$.
- Sort by distance:

| Point | Class | Distance |
|---|---|---|
| (2, 1) | A | 1.000 |
| (2, 3) | B | 1.000 |
| (1, 1) | A | 1.414 |
| (4, 4) | B | 2.828 |

- Take the three nearest neighbors: A, B, A.
Answer: The majority vote is class A, with two votes out of three. The checked distances show that the more distant B point is excluded.
Worked example 2: Distance-weighted regression
Problem: Estimate a real-valued target at query $x_q = 3$ from three examples:

| $x_i$ | $f(x_i)$ |
|---|---|
| 2 | 10 |
| 4 | 14 |
| 7 | 30 |

Use weights $w_i = 1 / d_i^2$, where $d_i = |x_q - x_i|$.

Method:

- Compute distances: $d = (1, 1, 4)$.
- Compute weights: $w = (1, 1, 1/16) = (1, 1, 0.0625)$.
- Compute the weighted numerator: $1 \cdot 10 + 1 \cdot 14 + 0.0625 \cdot 30 = 25.875$.
- Compute the sum of weights: $1 + 1 + 0.0625 = 2.0625$.
- Divide: $25.875 / 2.0625 \approx 12.55$.

Answer: The distance-weighted estimate is approximately $12.55$. The far point at $x = 7$ influences the answer only weakly.
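The arithmetic can be checked in a few lines:

```python
import numpy as np

x = np.array([2.0, 4.0, 7.0])
f = np.array([10.0, 14.0, 30.0])
xq = 3.0

d = np.abs(x - xq)         # distances 1, 1, 4
w = 1.0 / d**2             # weights 1, 1, 0.0625
print((w @ f) / w.sum())   # about 12.55
```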
Code
```python
import numpy as np
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to each example
    neighbor_ids = np.argsort(distances)[:k]       # indices of the k closest
    votes = Counter(y[i] for i in neighbor_ids)    # count class labels among neighbors
    return votes.most_common(1)[0][0], distances[neighbor_ids]

X = np.array([[1, 1], [2, 1], [4, 4], [2, 3]], dtype=float)
y = np.array(["A", "A", "B", "B"])

label, distances = knn_predict(X, y, np.array([2, 2]), k=3)
print(label)      # "A", matching worked example 1
print(distances)  # [1.0, 1.0, 1.414...]
```
Common pitfalls
- Forgetting to scale features. A feature measured in thousands can dominate one measured in decimals.
- Choosing $k = 1$ without considering noise. One noisy neighbor can control the prediction.
- Choosing very large $k$ and washing out local structure. The method becomes less local as $k$ grows.
- Treating Euclidean distance as neutral. It encodes a strong assumption about similarity.
- Ignoring prediction-time cost. Lazy learning can be expensive when many queries or many stored examples are present.
- Using distance-weighted formulas without handling zero distance. If the query exactly matches a training point, return that target or add a small stabilizer, as in the sketch below.
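A guard for the exact-match case might look like this sketch (the `eps` tolerance is an assumed threshold):

```python
import numpy as np

def weighted_regress(X, y, query, k=3, eps=1e-12):
    d = np.linalg.norm(X - query, axis=1)
    if d.min() < eps:                      # query coincides with a stored example
        return y[int(np.argmin(d))]        # return that target directly
    idx = np.argsort(d)[:k]
    w = 1.0 / d[idx] ** 2                  # safe now: all distances are nonzero
    return float(w @ y[idx] / w.sum())

X = np.array([[2.0], [4.0], [7.0]])
y = np.array([10.0, 14.0, 30.0])
print(weighted_regress(X, y, np.array([2.0])))  # exact match: returns 10.0
print(weighted_regress(X, y, np.array([3.0])))  # weighted estimate, about 12.55
```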