Softmax Classification and Generalization
Classification changes the target from a real number to a discrete class. D2L introduces softmax regression as the linear neural network for this setting: it is still a single affine transformation, but its outputs are mapped through softmax and interpreted as class probabilities. This makes it the natural baseline for image classification, text classification, and any problem where the model must choose among mutually exclusive labels.
The chapter also introduces a more careful view of generalization. Good training accuracy is not the goal; good performance on future data from the intended environment is the goal. Classification makes this distinction visible because accuracy, cross-entropy, class imbalance, distribution shift, and repeated test-set use can all tell different stories about the same model.
Definitions
For $q$ classes, a classifier produces a vector of logits $\mathbf{o} = (o_1, \dots, o_q)$. Logits are unconstrained scores, not probabilities. The softmax function maps logits to probabilities:
$$\hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)}.$$
The probabilities are nonnegative and sum to $1$. The predicted class is often
$$\hat{c} = \operatorname*{argmax}_j \hat{y}_j,$$
which is the same as $\operatorname*{argmax}_j o_j$ because softmax preserves score order.
For a one-hot label vector $\mathbf{y} = (y_1, \dots, y_q)$, the cross-entropy loss is
$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j.$$
If the true class is $c$, this reduces to $-\log \hat{y}_c$.
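A minimal PyTorch sketch of these definitions (the logit values and class index below are illustrative, not taken from the text): compute softmax by hand, pick out the $-\log \hat{y}_c$ term, and check that `torch.nn.functional.cross_entropy` applied to the raw logits gives the same number.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.0, -1.0]])  # illustrative logits for one example, q = 3
label = torch.tensor([0])                  # true class c = 0

probs = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)  # softmax by hand
loss_manual = -torch.log(probs[0, label])                               # -log(y_hat_c)
loss_builtin = F.cross_entropy(logits, label)                           # expects raw logits

print(probs.sum().item())                        # probabilities sum to 1
print(loss_manual.item(), loss_builtin.item())   # the two losses agree
```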
Accuracy is the fraction of examples whose predicted class equals the true class. Top-$k$ accuracy counts a prediction as correct when the true class appears among the $k$ highest-scoring classes.
Generalization error is the expected error on new data from the target distribution. Training error is measured on the examples used for fitting. A gap between them can indicate overfitting, an underspecified evaluation, distribution shift, or some combination of these.
Distribution shift occurs when training and deployment data differ. D2L distinguishes covariate shift (the input distribution $P(\mathbf{x})$ changes while $P(y \mid \mathbf{x})$ stays fixed), label shift (the label distribution $P(y)$ changes while $P(\mathbf{x} \mid y)$ stays fixed), and concept shift (the relationship $P(y \mid \mathbf{x})$ itself changes).
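A toy sketch of covariate shift under these definitions (the labeling rule, shift direction, and all constants below are invented for illustration): the conditional $P(y \mid \mathbf{x})$ is held fixed, but deployment inputs come from a region the training data barely covered, and a linear softmax classifier that fits the training region degrades there.

```python
import torch
from torch import nn

torch.manual_seed(0)

def true_label(x):
    # fixed labeling rule P(y | x): class 1 above a parabola (hypothetical rule)
    return (x[:, 1] > 0.3 * x[:, 0] ** 2).long()

x_train = torch.randn(4000, 2)                              # training inputs around the origin
x_deploy = torch.randn(4000, 2) + torch.tensor([3.0, 0.0])  # deployment inputs: shifted P(x)

model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for _ in range(500):
    loss = loss_fn(model(x_train), true_label(x_train))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# the rule mapping x to y never changed, only where the x values come from
for name, x in [("train region", x_train), ("shifted region", x_deploy)]:
    acc = (model(x).argmax(dim=1) == true_label(x)).float().mean().item()
    print(f"{name}: accuracy {acc:.2f}")   # accuracy is lower on the shifted inputs
```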
Key results
Softmax is invariant to adding the same constant to every logit:
$$\mathrm{softmax}(\mathbf{o} + c\mathbf{1}) = \mathrm{softmax}(\mathbf{o}) \quad \text{for any constant } c.$$
In practice, one chooses $c = -\max_k o_k$ before exponentiating to avoid overflow. This numerical detail is important enough that PyTorch combines softmax and cross-entropy in `nn.CrossEntropyLoss`, which expects raw logits and applies a stable log-softmax internally.
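A short sketch of why the shift matters (the logit values are chosen only to overflow float32): exponentiating large logits directly produces nan, while subtracting the maximum first, as `torch.softmax` does internally, gives finite probabilities.

```python
import torch

logits = torch.tensor([1000.0, 999.0, 998.0])  # large logits chosen to overflow exp() in float32

naive = torch.exp(logits) / torch.exp(logits).sum()  # exp(1000) overflows to inf -> nan
stable = torch.softmax(logits, dim=0)                # numerically stable built-in
by_hand = torch.exp(logits - logits.max())
by_hand = by_hand / by_hand.sum()                    # the max-subtraction trick, written out

print(naive)    # tensor([nan, nan, nan])
print(stable)   # tensor([0.6652, 0.2447, 0.0900])
print(by_hand)  # matches the stable version
```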
Softmax regression is a linear classifier. With input $\mathbf{x} \in \mathbb{R}^d$, weight matrix $\mathbf{W} \in \mathbb{R}^{q \times d}$, and bias $\mathbf{b} \in \mathbb{R}^q$, the logits are
$$\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}.$$
The decision boundaries are linear in input space because class changes occur where two logits are equal:
$$\mathbf{w}_i^\top \mathbf{x} + b_i = \mathbf{w}_j^\top \mathbf{x} + b_j \iff (\mathbf{w}_i - \mathbf{w}_j)^\top \mathbf{x} + (b_i - b_j) = 0.$$
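A quick numerical check of the linear-boundary claim, using a randomly initialized `nn.Linear` layer purely for illustration: project an arbitrary point onto the hyperplane where the logits of classes 0 and 1 should tie, and confirm the layer scores them equally there.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(2, 3)  # d = 2 inputs, q = 3 classes
W, b = layer.weight.detach(), layer.bias.detach()

# boundary between classes 0 and 1: (w0 - w1)^T x + (b0 - b1) = 0
dw, db = W[0] - W[1], b[0] - b[1]

# start from an arbitrary point, then move along dw to land exactly on the hyperplane
x = torch.tensor([1.0, -2.0])
x = x - dw * ((dw @ x + db) / (dw @ dw))

logits = layer(x)
print(logits)                                          # logits 0 and 1 are equal on the boundary
print(torch.isclose(logits[0], logits[1], atol=1e-6))
```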
For cross-entropy with softmax, the gradient with respect to the logits has a compact form:
$$\frac{\partial\, l(\mathbf{y}, \hat{\mathbf{y}})}{\partial o_j} = \mathrm{softmax}(\mathbf{o})_j - y_j = \hat{y}_j - y_j.$$
This explains the characteristic behavior of classification training: the gradient pushes down the probabilities assigned to wrong classes and pushes up the probability assigned to the true class, in proportion to how far each predicted probability is from its target.
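The identity is easy to verify with autograd (the logits below are arbitrary illustrative values): the gradient PyTorch computes for the logits equals softmax of the logits minus the one-hot label.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.5, -0.3, 0.7]], requires_grad=True)  # arbitrary logits, q = 3
label = torch.tensor([2])                                      # true class

loss = F.cross_entropy(logits, label)
loss.backward()

one_hot = F.one_hot(label, num_classes=3).float()
print(logits.grad)                                        # gradient from autograd
print(torch.softmax(logits.detach(), dim=1) - one_hot)    # softmax(o) - y: same values
```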
The test set should be used sparingly. If many models are trained and repeatedly selected by test-set performance, the test set becomes part of the model-selection process. A separate validation set is used for selection; the test set is reserved for final reporting.
Cross-entropy also rewards calibrated confidence more than accuracy does. If two classifiers both predict the correct class, the one assigning higher probability to that class receives the lower loss. This is useful during training because gradients remain informative before the argmax decision changes. It is also why a model can improve cross-entropy while accuracy stays flat.
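For instance (the probability values below are chosen arbitrarily), two predictions with the same argmax can have very different losses:

```python
import torch

# both distributions put the most mass on class 0, so both are "correct" by accuracy
confident = torch.tensor([0.90, 0.05, 0.05])
hesitant = torch.tensor([0.60, 0.25, 0.15])

print(-torch.log(confident[0]).item())  # ~0.105: low loss for a confident correct prediction
print(-torch.log(hesitant[0]).item())   # ~0.511: higher loss despite the same predicted class
```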
Distribution shift should be considered before deployment, not only after failure. A clothing classifier trained on catalog photos may fail on user-uploaded images because lighting, pose, background, and camera quality differ. The labels may be the same, but the input distribution has changed. D2L's taxonomy gives names to these failures so they can be tested and mitigated deliberately.
The information-theoretic view is useful beyond terminology. Cross-entropy can be read as the expected code length when using the model's predicted distribution to encode labels from the true distribution. Minimizing it encourages the predicted distribution to place mass where the data distribution places mass. KL divergence then measures the extra cost of using one distribution when another is correct. This connects classification loss to probability modeling rather than only to decision accuracy.
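A small numerical check of this reading (the two distributions are arbitrary): the cross-entropy between a data distribution p and a model distribution q equals the entropy of p plus $\mathrm{KL}(p \,\|\, q)$, so minimizing cross-entropy minimizes the extra coding cost.

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])  # "true" label distribution (illustrative)
q = torch.tensor([0.5, 0.3, 0.2])  # model's predicted distribution (illustrative)

cross_entropy = -(p * torch.log(q)).sum()  # expected code length using q for data from p
entropy = -(p * torch.log(p)).sum()        # unavoidable code length under p itself
kl = (p * torch.log(p / q)).sum()          # extra cost of coding with q instead of p

print(cross_entropy.item())                # ~0.887
print((entropy + kl).item())               # same value: H(p) + KL(p || q)
```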
Softmax temperature changes confidence without changing logit order when applied uniformly. Dividing logits by a temperature greater than $1$ softens the probabilities; using a temperature below $1$ sharpens them. Temperature scaling is often used after training for calibration, while training-time label smoothing changes targets to reduce overconfidence. These tools are separate from the linear softmax-regression model, but they address the same probability-output interpretation.
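A sketch of temperature scaling on arbitrary illustrative logits: dividing by a temperature above 1 flattens the distribution, a temperature below 1 sharpens it, and the argmax never changes.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # illustrative logits

for T in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=0)  # temperature-scaled softmax
    print(T, probs, probs.argmax().item())
# probabilities sharpen (T < 1) or soften (T > 1), but the predicted class stays the same
```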
Visual
| Evaluation idea | Measures | Strength | Weakness |
|---|---|---|---|
| Cross-entropy | Probability assigned to true class | Sensitive to confidence | Harder to interpret than accuracy |
| Accuracy | Correct class decisions | Simple and task-facing | Ignores confidence and class imbalance |
| Validation error | Model-selection performance | Helps tune hyperparameters | Can be overused too |
| Test error | Final held-out estimate | Best report of chosen model | Invalid if used repeatedly for selection |
| Calibration | Probability reliability | Important for risk decisions | Not guaranteed by high accuracy |
| Shift analysis | Train-deploy mismatch | Explains failures beyond overfitting | Requires knowledge of deployment data |
Worked example 1: cross-entropy from logits
Problem: a three-class model outputs logits $\mathbf{o} = (2.0,\ 0.0,\ -1.0)$, and the true class is the first class. Compute the softmax probabilities and cross-entropy loss.
Method:
- Exponentiate logits: $e^{2.0} \approx 7.389$, $e^{0.0} = 1.000$, $e^{-1.0} \approx 0.368$
- Sum exponentials: $7.389 + 1.000 + 0.368 \approx 8.757$
- Divide by the sum: $\hat{\mathbf{y}} \approx (0.844,\ 0.114,\ 0.042)$
- Since the true class is the first class, the cross-entropy is $-\log 0.844 \approx 0.170$
- Compute the logit gradient: $\hat{\mathbf{y}} - \mathbf{y} \approx (0.844 - 1,\ 0.114,\ 0.042) = (-0.156,\ 0.114,\ 0.042)$
Checked answer: the loss is about $0.170$. The gradient is negative on the true-class logit and positive on the two wrong-class logits, so gradient descent increases the true-class logit and decreases the other two.
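The same numbers can be reproduced in PyTorch: `cross_entropy` on the raw logits and autograd give the hand-computed loss and gradient.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.0, -1.0]], requires_grad=True)
label = torch.tensor([0])  # the first class is the true class

loss = F.cross_entropy(logits, label)
loss.backward()

print(torch.softmax(logits.detach(), dim=1))  # ~[0.844, 0.114, 0.042]
print(loss.item())                            # ~0.170
print(logits.grad)                            # ~[-0.156, 0.114, 0.042]
```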
Worked example 2: accuracy under class imbalance
Problem: a dataset has 900 examples of class A and 100 examples of class B. Classifier 1 predicts A for every example. Classifier 2 correctly predicts 80% of the A examples and 90% of the B examples. Compare accuracy and balanced accuracy.
Method:
- Classifier 1 correct predictions: 900 (all of class A) out of 1000.
Accuracy is 900 / 1000 = 0.90.
- Classifier 1 recall by class: 1.00 for A, 0.00 for B.
Balanced accuracy is (1.00 + 0.00) / 2 = 0.50.
- Classifier 2 correct predictions: 0.80 × 900 + 0.90 × 100 = 720 + 90 = 810 out of 1000.
Accuracy is 810 / 1000 = 0.81.
- Classifier 2 recall by class: 0.80 for A, 0.90 for B.
Balanced accuracy is (0.80 + 0.90) / 2 = 0.85.
Checked answer: classifier 1 has higher raw accuracy (0.90 versus 0.81), but classifier 2 is much better when both classes matter (balanced accuracy 0.85 versus 0.50). This is why D2L treats metrics as part of problem formulation, not decoration after training.
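A short sketch of the same comparison in code; the class sizes and per-class recalls mirror the worked example, and the specific flipped indices are chosen only to produce those recall rates.

```python
import torch

# simulated labels: 900 of class A (0) and 100 of class B (1)
y = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])

pred1 = torch.zeros_like(y)  # classifier 1: always predicts A
pred2 = y.clone()            # classifier 2: start perfect, then introduce errors
pred2[:180] = 1              # 180 of 900 A examples wrong -> 80% recall on A
pred2[900:910] = 0           # 10 of 100 B examples wrong  -> 90% recall on B

def report(pred, y):
    accuracy = (pred == y).float().mean().item()
    recalls = [(pred[y == c] == c).float().mean().item() for c in (0, 1)]
    balanced = sum(recalls) / len(recalls)
    return accuracy, balanced

print(report(pred1, y))  # (0.90, 0.50)
print(report(pred2, y))  # (0.81, 0.85)
```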
Code
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

# three Gaussian clusters in 2-D, one per class
n = 600
X0 = torch.randn(n // 3, 2) + torch.tensor([-2.0, 0.0])
X1 = torch.randn(n // 3, 2) + torch.tensor([2.0, 0.0])
X2 = torch.randn(n // 3, 2) + torch.tensor([0.0, 2.5])
X = torch.cat([X0, X1, X2], dim=0)
y = torch.cat(
    [
        torch.zeros(n // 3, dtype=torch.long),
        torch.ones(n // 3, dtype=torch.long),
        2 * torch.ones(n // 3, dtype=torch.long),
    ]
)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# softmax regression: one affine map from 2 features to 3 logits;
# CrossEntropyLoss takes raw logits and applies log-softmax internally
model = nn.Linear(2, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    for xb, yb in loader:
        logits = model(xb)
        loss = loss_fn(logits, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate on the training set; a held-out set would be needed to estimate generalization
with torch.no_grad():
    pred = model(X).argmax(dim=1)
    accuracy = (pred == y).float().mean().item()

print(f"training accuracy: {accuracy:.3f}")
```
Common pitfalls
- Applying `softmax` before `nn.CrossEntropyLoss`. The PyTorch loss expects logits and applies a stable log-softmax internally.
- Reporting accuracy alone on imbalanced data. Per-class recall, macro averages, or task-specific costs may matter more.
- Treating logits as probabilities. Logits can be any real numbers and do not sum to one.
- Reusing the test set for model selection. This produces an optimistic estimate of future performance.
- Forgetting that high softmax confidence does not guarantee calibrated probabilities.
- Assuming a classifier trained under one data distribution will remain reliable after covariate, label, or concept shift.