Multilayer Perceptrons and Regularization
Multilayer perceptrons extend linear models by composing affine transformations with nonlinear activation functions. This is the first point in D2L where the model can represent genuinely nonlinear relationships without manually designing nonlinear features. A linear model can only carve input space with hyperplanes; an MLP can bend those boundaries by learning hidden representations.
The same flexibility creates new training issues. Deep networks can overfit, gradients can vanish or explode, initialization can make learning slow, and regularizers can change both optimization and generalization. D2L treats these as practical engineering concerns rather than isolated theory: activation choice, initialization, dropout, weight decay, and early stopping all shape the behavior of the same training loop.
Definitions
An MLP with one hidden layer maps an input $\mathbf{x}$ to
$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \qquad \mathbf{o} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2,$$
where $\sigma$ is an elementwise activation function. For regression, $\mathbf{o}$ may be a scalar prediction. For classification, $\mathbf{o}$ is usually a vector of logits.
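As a minimal sketch, the same two-layer map can be written directly in PyTorch; the input and layer sizes below are arbitrary choices for illustration.

import torch
from torch import nn

# One hidden layer: affine map, elementwise nonlinearity, affine map.
mlp = nn.Sequential(
    nn.Linear(4, 8),  # W1 x + b1
    nn.ReLU(),        # sigma, applied elementwise
    nn.Linear(8, 3),  # W2 h + b2: a vector of three logits
)
logits = mlp(torch.randn(2, 4))  # a batch of two inputs -> shape (2, 3)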
Common activation functions include
$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}},$$
and
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$
Forward propagation computes outputs layer by layer. Backpropagation applies the chain rule in reverse to compute gradients. A computational graph records which operations produced which tensors.
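A short autograd sketch of this: the forward pass records the computational graph, and backward() applies the chain rule through it in reverse.

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([0.5, 0.5], requires_grad=True)
y = torch.relu(w @ x)  # forward pass: the graph records a dot product, then ReLU
loss = y ** 2
loss.backward()        # reverse pass through the recorded graph
print(x.grad, w.grad)  # ReLU is active here, so dloss/dx = 2y*w, dloss/dw = 2y*x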
Weight initialization chooses initial parameter values before training. Xavier initialization aims to preserve activation variance through layers with roughly symmetric activations. Kaiming initialization is commonly paired with ReLU.
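In PyTorch these schemes are available as nn.init helpers; a minimal sketch (the layer sizes are arbitrary, and the second call simply overwrites the first):

import torch
from torch import nn

layer = nn.Linear(256, 128)
# Xavier: variance set from both fan-in and fan-out (symmetric activations).
nn.init.xavier_uniform_(layer.weight)
# Kaiming: variance set from fan-in with a ReLU gain; overwrites the above.
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)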
Weight decay adds an $\ell_2$ penalty to discourage large weights. Dropout randomly sets hidden activations to zero during training and rescales the survivors so their expectation is preserved. Early stopping halts training when validation performance stops improving.
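A minimal early-stopping sketch around a toy training loop; the data, model, patience, and learning rate below are arbitrary illustration choices.

import torch
from torch import nn

torch.manual_seed(0)
# Toy regression data split into training and validation halves.
X, y = torch.randn(200, 4), torch.randn(200, 1)
Xtr, ytr, Xva, yva = X[:100], y[:100], X[100:], y[100:]
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
best_val, patience, bad = float("inf"), 5, 0
for epoch in range(200):
    loss = ((model(Xtr) - ytr) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        val = ((model(Xva) - yva) ** 2).mean().item()
    if val < best_val:  # validation improved: remember this checkpoint
        best_val, bad = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:  # no improvement for `patience` epochs
            break
model.load_state_dict(best_state)  # restore the best checkpoint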
Key results
Without nonlinear activations, stacking affine layers does not increase expressive power. If
$$\mathbf{h} = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1 \quad \text{and} \quad \mathbf{o} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2,$$
then
$$\mathbf{o} = \mathbf{W}_2 \mathbf{W}_1 \mathbf{x} + (\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2),$$
which is just another affine function of $\mathbf{x}$. The activation function is what lets hidden layers learn nonlinear features.
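A quick numerical check of this collapse: two stacked nn.Linear layers equal a single affine map with $\mathbf{W} = \mathbf{W}_2 \mathbf{W}_1$ and $\mathbf{b} = \mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2$.

import torch
from torch import nn

torch.manual_seed(0)
f1, f2 = nn.Linear(3, 5), nn.Linear(5, 2)
x = torch.randn(4, 3)
stacked = f2(f1(x))
# Collapse the two affine maps into one.
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias
collapsed = x @ W.T + b
print(torch.allclose(stacked, collapsed, atol=1e-6))  # True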
The ReLU activation is popular because it is simple, cheap, and has derivative $1$ for positive inputs:
$$\frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0, \\ 0 & x < 0. \end{cases}$$
At $x = 0$ the derivative is undefined, but frameworks choose a subgradient convention. ReLU reduces the saturation problem that affects sigmoid and tanh when inputs have large magnitude.
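For instance, PyTorch's autograd uses $0$ as the subgradient at exactly $x = 0$, which a one-line check confirms:

import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.]): the subgradient 0 is chosen at x = 0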
Dropout with drop probability $p$ keeps an activation $h$ with probability $1 - p$. In inverted dropout, the training activation is
$$h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1 - p} & \text{with probability } 1 - p. \end{cases}$$
Then
$$\mathbb{E}[h'] = p \cdot 0 + (1 - p) \cdot \frac{h}{1 - p} = h.$$
This means no extra scaling is needed at evaluation time.
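An empirical check of the expectation, averaging many inverted-dropout masks over a fixed activation vector (the values of $h$ and $p$ are arbitrary):

import torch

torch.manual_seed(0)
h = torch.tensor([1.0, 2.0, 3.0])
p = 0.2
# Sample many keep masks and apply the inverted-dropout scaling 1/(1-p).
keep = (torch.rand(100_000, 3) > p).float()
print((keep * h / (1 - p)).mean(dim=0))  # close to tensor([1., 2., 3.])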
Weight decay for loss $L(\mathbf{w})$ optimizes
$$L(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$
The gradient contribution from the penalty is $\lambda \mathbf{w}$, which shrinks weights toward zero during updates.
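For plain SGD this is what the optimizer's weight_decay argument does: it adds $\lambda \mathbf{w}$ to the gradient before the update, as a small sketch shows.

import torch

w = torch.ones(3, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.5)
loss = (w ** 2).sum()  # dloss/dw = 2w
opt.zero_grad()
loss.backward()
opt.step()
# Update: w <- w - lr * (2w + 0.5 * w) = 1 - 0.1 * 2.5 = 0.75
print(w)  # tensor([0.7500, 0.7500, 0.7500], requires_grad=True)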
Capacity is not determined by parameter count alone, but parameter count is still a useful warning sign. A wide MLP can memorize small datasets, especially when labels contain noise. Regularization methods do not make overfitting impossible; they bias training toward simpler or more robust solutions. D2L's generalization discussion emphasizes that modern deep networks can generalize in regimes where classical parameter-count intuition is incomplete, but validation data remains essential.
Initialization interacts with activation functions. If weights are too small, signals and gradients may shrink as they pass through layers. If weights are too large, activations may explode or saturate. Xavier-style initialization balances fan-in and fan-out for roughly symmetric activations, while Kaiming initialization accounts for ReLU zeroing roughly half of its inputs.
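A small simulation of this effect (depth, width, and scales chosen arbitrarily): the standard deviation of the signal after many random linear layers shrinks, stays roughly stable, or explodes depending on the weight scale.

import torch

torch.manual_seed(0)
d, depth = 512, 50
for scale in (0.5, 1.0, 2.0):
    x = torch.randn(64, d)
    for _ in range(depth):
        W = torch.randn(d, d) * scale / d ** 0.5  # entries with std scale/sqrt(d)
        x = x @ W
    print(f"scale {scale}: output std {x.std().item():.3e}")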
Dropout and weight decay regularize in different ways. Weight decay directly penalizes large weights in the objective. Dropout injects multiplicative noise into activations during training, making the network less dependent on exact co-adaptations among hidden units. They can be used together, but their strengths and tuning knobs are not interchangeable.
Activation distributions are worth monitoring. If a ReLU unit's preactivation is negative for most inputs, it outputs zero there and receives no gradient on those examples. If sigmoid or tanh units operate in saturated regions, their derivatives are tiny. Initialization, normalization, learning rate, and input scaling all influence where activations live. This is why preprocessing and initialization are part of the MLP story rather than separate housekeeping.
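One way to watch for this in PyTorch is a forward hook on the activation module; the sketch below records the fraction of exactly-zero ReLU outputs per batch (the model shape and data are arbitrary).

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
dead_fracs = []
# Forward hook: runs after the ReLU, recording how many outputs are zero.
model[1].register_forward_hook(
    lambda mod, inp, out: dead_fracs.append((out == 0).float().mean().item())
)
model(torch.randn(256, 2))
print(f"zero-activation fraction this batch: {dead_fracs[-1]:.2f}")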
Generalization controls should be evaluated on held-out data. A lower training loss after adding width may simply mean higher capacity. A lower validation loss after adding weight decay, dropout, or early stopping is stronger evidence of improved generalization. D2L's regularization sections are best read as a toolkit for changing the bias of training, not as guarantees that a model will generalize.
The MLP is also a reference model for tabular data. Even when specialized architectures dominate vision and language, a well-regularized MLP remains a useful baseline for dense numeric and categorical features after appropriate preprocessing.
Visual
| Technique | Main purpose | Training-time behavior | Evaluation-time behavior |
|---|---|---|---|
| ReLU | Nonlinear representation | Clips negative preactivations | Same as training |
| Xavier init | Stable variance | Sets scale from fan-in and fan-out | Only affects starting point |
| Kaiming init | ReLU-friendly variance | Sets scale from fan-in | Only affects starting point |
| Weight decay | Penalize large weights | Adds to gradients | No direct runtime change |
| Dropout | Reduce co-adaptation | Randomly masks activations | Uses full network |
| Early stopping | Avoid late overfitting | Stops by validation signal | Selects saved checkpoint |
Worked example 1: forward pass through a tiny MLP
Problem: compute the output of a one-hidden-layer MLP. Let
$$\mathbf{W}_1 = \begin{bmatrix} 1 & -1 \\ 2 & 1 \end{bmatrix}, \quad \mathbf{b}_1 = \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \quad \mathbf{W}_2 = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad b_2 = 1,$$
use ReLU activation, and let $\mathbf{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$.
Method:
- Compute the hidden preactivation: $\mathbf{z} = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1 = \begin{bmatrix} 1 \cdot 1 + (-1) \cdot 2 - 1 \\ 2 \cdot 1 + 1 \cdot 2 + 0 \end{bmatrix} = \begin{bmatrix} -2 \\ 4 \end{bmatrix}$.
- Apply ReLU: $\mathbf{h} = \mathrm{ReLU}(\mathbf{z}) = \begin{bmatrix} 0 \\ 4 \end{bmatrix}$.
- Compute output: $o = \mathbf{W}_2 \mathbf{h} + b_2 = 1 \cdot 0 + 2 \cdot 4 + 1 = 9$.
Checked answer: the MLP output is $9$. The first hidden unit is inactive because its preactivation $-2$ is negative.
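The same computation in PyTorch, with the tensors mirroring the worked example:

import torch

W1 = torch.tensor([[1.0, -1.0], [2.0, 1.0]])
b1 = torch.tensor([-1.0, 0.0])
W2 = torch.tensor([[1.0, 2.0]])
b2 = torch.tensor([1.0])
x = torch.tensor([1.0, 2.0])
h = torch.relu(W1 @ x + b1)  # tensor([0., 4.]): first unit is inactive
o = W2 @ h + b2              # tensor([9.])
print(h, o)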
Worked example 2: dropout preserves expectation
Problem: a hidden activation vector is $\mathbf{h} = (2, 4)$. Apply inverted dropout with drop probability $p = \tfrac{1}{2}$. List all possible masked outputs and show that the expected output equals $\mathbf{h}$.
Method:
- The keep probability is $1 - p = \tfrac{1}{2}$.
- Each unit is kept independently. If kept, it is scaled by $\frac{1}{1 - p} = 2$.
- The four possible masks are $(1,1)$, $(1,0)$, $(0,1)$, and $(0,0)$, each with probability $\tfrac{1}{4}$.
- The corresponding outputs are $(4, 8)$, $(4, 0)$, $(0, 8)$, and $(0, 0)$.
- Compute the expectation: $\mathbb{E}[\mathbf{h}'] = \tfrac{1}{4}\left[(4,8) + (4,0) + (0,8) + (0,0)\right] = (2, 4) = \mathbf{h}$.
Checked answer: inverted dropout changes individual training passes but preserves the activation expectation. This is why PyTorch dropout layers automatically disable masking in eval() mode rather than rescaling outputs again.
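A brute-force version of the same enumeration, averaging over all four masks:

import itertools
import torch

h = torch.tensor([2.0, 4.0])
p = 0.5
# Each keep/drop mask has probability (1/2) * (1/2) = 1/4.
outputs = [torch.tensor(m) * h / (1 - p)
           for m in itertools.product([0.0, 1.0], repeat=2)]
print(sum(outputs) / len(outputs))  # tensor([2., 4.]) == h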
Code
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(2)

# Synthetic 2-D dataset: the label marks points outside a circle of
# squared radius 1.5, so the decision boundary is genuinely nonlinear.
n = 1000
X = 4 * torch.rand(n, 2) - 2
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 1.5).long()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# Two hidden layers with ReLU activations; dropout after the first.
model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

# Kaiming initialization for the ReLU layers; biases start at zero.
for module in model:
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

loss_fn = nn.CrossEntropyLoss()
# weight_decay applies the l2 penalty inside the optimizer update.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()  # enables dropout masking
for epoch in range(30):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model.eval()  # disables dropout; the full network is used
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy: {accuracy:.3f}")
Common pitfalls
- Stacking linear layers without nonlinear activations and expecting a more expressive model.
- Leaving dropout active during evaluation by forgetting model.eval().
- Applying dropout directly to logits in a simple classifier without a specific reason.
- Using sigmoid in deep hidden layers without considering saturation and vanishing gradients.
- Initializing all weights to zero, which makes hidden units learn identical features.
- Assuming lower training loss always means a better model. Validation behavior is the relevant generalization signal.