
Multilayer Perceptrons and Regularization

Multilayer perceptrons extend linear models by composing affine transformations with nonlinear activation functions. This is the first point in D2L where the model can represent genuinely nonlinear relationships without manually designing nonlinear features. A linear model can only carve input space with hyperplanes; an MLP can bend those boundaries by learning hidden representations.

The same flexibility creates new training issues. Deep networks can overfit, gradients can vanish or explode, initialization can make learning slow, and regularizers can change both optimization and generalization. D2L treats these as practical engineering concerns rather than isolated theory: activation choice, initialization, dropout, weight decay, and early stopping all shape the behavior of the same training loop.

Definitions

An MLP with one hidden layer maps an input $x \in \mathbb{R}^d$ to

$$h = \phi(W_1 x + b_1), \qquad o = W_2 h + b_2,$$

where $\phi$ is an elementwise activation function. For regression, $o$ may be a scalar prediction. For classification, $o$ is usually a vector of logits.

Common activation functions include

$$\mathrm{ReLU}(x)=\max(x,0), \qquad \sigma(x)=\frac{1}{1+\exp(-x)},$$

and

$$\tanh(x)=\frac{1-\exp(-2x)}{1+\exp(-2x)}.$$

Forward propagation computes outputs layer by layer. Backpropagation applies the chain rule in reverse to compute gradients. A computational graph records which operations produced which tensors.

Weight initialization chooses initial parameter values before training. Xavier initialization aims to preserve activation variance through layers with roughly symmetric activations. Kaiming initialization is commonly paired with ReLU.

Weight decay adds an $L_2$ penalty to discourage large weights. Dropout randomly sets hidden activations to zero during training and rescales the survivors so their expectation is preserved. Early stopping halts training when validation performance stops improving.

Key results

Without nonlinear activations, stacking affine layers does not increase expressive power. If

$$h = W_1 x + b_1, \qquad o = W_2 h + b_2,$$

then

$$o = W_2 W_1 x + W_2 b_1 + b_2,$$

which is just another affine function of $x$. The activation function is what lets hidden layers learn nonlinear features.
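This collapse is easy to verify numerically. A small sketch with randomly chosen weights, checking that two stacked linear layers equal a single affine map:

```python
import torch

torch.manual_seed(0)

# Two stacked affine layers with no activation in between.
W1, b1 = torch.randn(3, 2), torch.randn(3)
W2, b2 = torch.randn(1, 3), torch.randn(1)

x = torch.randn(2)

# Layer-by-layer computation.
o_stacked = W2 @ (W1 @ x + b1) + b2

# Single collapsed affine map with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
o_collapsed = W @ x + b

print(torch.allclose(o_stacked, o_collapsed))  # True
```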

The ReLU activation is popular because it is simple, cheap, and has derivative $1$ for positive inputs:

$$\frac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1, & x > 0, \\ 0, & x < 0. \end{cases}$$

At $x=0$ the derivative is undefined, but frameworks choose a subgradient convention. ReLU reduces the saturation problem that affects sigmoid and tanh when inputs have large magnitude.
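The subgradient convention can be inspected directly with autograd. In PyTorch, the gradient of ReLU at exactly zero is taken to be zero:

```python
import torch

# Autograd evaluates the ReLU derivative at a few points, including x = 0,
# where PyTorch uses the subgradient convention d/dx ReLU(0) = 0.
x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.])
```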

Dropout with drop probability $p$ keeps an activation with probability $q=1-p$. In inverted dropout, the training activation is

$$\tilde{h}_j = \frac{m_j h_j}{q}, \qquad m_j \sim \mathrm{Bernoulli}(q).$$

Then

$$\mathbb{E}[\tilde{h}_j] = h_j.$$

This means no extra scaling is needed at evaluation time.
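This expectation property can be confirmed empirically; a sketch that averages many masked copies of an activation vector:

```python
import torch
from torch import nn

torch.manual_seed(0)

h = torch.tensor([2.0, 4.0])
drop = nn.Dropout(p=0.5)  # inverted dropout: survivors are scaled by 1/(1-p)

# Average many masked versions of h; the mean should approach h itself.
drop.train()
samples = torch.stack([drop(h) for _ in range(100_000)])
print(samples.mean(dim=0))  # close to tensor([2., 4.])

# In eval mode dropout is the identity, with no extra rescaling.
drop.eval()
print(drop(h))  # tensor([2., 4.])
```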

Weight decay for loss $L(w)$ optimizes

$$L(w) + \frac{\lambda}{2}\|w\|_2^2.$$

The gradient contribution from the penalty is $\lambda w$, which shrinks weights toward zero during updates.
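The $\lambda w$ gradient can be checked with autograd on the penalty term alone:

```python
import torch

# Verify that the gradient of (lambda/2) * ||w||^2 is lambda * w.
lam = 0.1
w = torch.randn(5, requires_grad=True)
penalty = 0.5 * lam * (w ** 2).sum()
penalty.backward()
print(torch.allclose(w.grad, lam * w.detach()))  # True
```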

Capacity is not determined by parameter count alone, but parameter count is still a useful warning sign. A wide MLP can memorize small datasets, especially when labels contain noise. Regularization methods do not make overfitting impossible; they bias training toward simpler or more robust solutions. D2L's generalization discussion emphasizes that modern deep networks can generalize in regimes where classical parameter-count intuition is incomplete, but validation data remains essential.

Initialization interacts with activation functions. If weights are too small, signals and gradients may shrink as they pass through layers. If weights are too large, activations may explode or saturate. Xavier-style initialization balances fan-in and fan-out for roughly symmetric activations, while Kaiming initialization accounts for ReLU zeroing roughly half of its inputs.
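The two initialization scales can be compared directly. A sketch with assumed fan-in 256 and fan-out 128, checking the empirical standard deviations against the variance formulas above:

```python
import torch
from torch import nn

torch.manual_seed(0)

fan_in, fan_out = 256, 128
w = torch.empty(fan_out, fan_in)

# Xavier: variance ~ 2 / (fan_in + fan_out), for roughly symmetric activations.
nn.init.xavier_normal_(w)
xavier_std = w.std().item()
print(f"Xavier std:  {xavier_std:.4f} (theory {(2 / (fan_in + fan_out)) ** 0.5:.4f})")

# Kaiming: variance ~ 2 / fan_in, compensating for ReLU zeroing ~half its inputs.
nn.init.kaiming_normal_(w, nonlinearity="relu")
kaiming_std = w.std().item()
print(f"Kaiming std: {kaiming_std:.4f} (theory {(2 / fan_in) ** 0.5:.4f})")
```

Kaiming's larger scale compensates for the variance that ReLU removes by zeroing negative preactivations.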

Dropout and weight decay regularize in different ways. Weight decay directly penalizes large weights in the objective. Dropout injects multiplicative noise into activations during training, making the network less dependent on exact co-adaptations among hidden units. They can be used together, but their strengths and tuning knobs are not interchangeable.

Activation distributions are worth monitoring. If most ReLU units are always negative, they output zero and receive no gradient on those examples. If sigmoid or tanh units operate in saturated regions, their derivatives are tiny. Initialization, normalization, learning rate, and input scaling all influence where activations live. This is why preprocessing and initialization are part of the MLP story rather than separate housekeeping.
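One way to monitor this is a forward hook on a hidden layer. A minimal sketch (the model and batch are hypothetical) that measures the fraction of ReLU units that are zero for every example in a batch:

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small model whose hidden ReLU activations we want to inspect.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

activations = {}

def save_output(module, inputs, output):
    # Store the ReLU output for later inspection.
    activations["relu"] = output.detach()

model[1].register_forward_hook(save_output)

x = torch.randn(256, 10)
model(x)

# A unit that is zero on every example in the batch gets no gradient from it.
dead_fraction = (activations["relu"] == 0).all(dim=0).float().mean()
print(f"fraction of always-inactive units: {dead_fraction:.3f}")
```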

Generalization controls should be evaluated on held-out data. A lower training loss after adding width may simply mean higher capacity. A lower validation loss after adding weight decay, dropout, or early stopping is stronger evidence of improved generalization. D2L's regularization sections are best read as a toolkit for changing the bias of training, not as guarantees that a model will generalize.

The MLP is also a reference model for non-image tabular features. Even when specialized architectures dominate vision and language, a well-regularized MLP remains a useful baseline for dense numeric and categorical features after appropriate preprocessing.

Visual

| Technique | Main purpose | Training-time behavior | Evaluation-time behavior |
|---|---|---|---|
| ReLU | Nonlinear representation | Clips negative preactivations | Same as training |
| Xavier init | Stable variance | Sets scale from fan-in and fan-out | Only affects starting point |
| Kaiming init | ReLU-friendly variance | Sets scale from fan-in | Only affects starting point |
| Weight decay | Penalize large weights | Adds $\lambda w$ to gradients | No direct runtime change |
| Dropout | Reduce co-adaptation | Randomly masks activations | Uses full network |
| Early stopping | Avoid late overfitting | Stops by validation signal | Selects saved checkpoint |

Worked example 1: forward pass through a tiny MLP

Problem: compute the output of a one-hidden-layer MLP. Let

$$x = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad W_1 = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$

use ReLU activation, and let

$$W_2 = \begin{bmatrix} 1 & -2 \end{bmatrix}, \quad b_2 = 0.5.$$

Method:

  1. Compute the hidden preactivation:
$$z_1 = W_1 x + b_1 = \begin{bmatrix} 1(1)+1(-2) \\ 2(1)+(-1)(-2) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 4 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 5 \end{bmatrix}.$$
  2. Apply ReLU:
$$h = \mathrm{ReLU}(z_1) = \begin{bmatrix} 0 \\ 5 \end{bmatrix}.$$
  3. Compute the output:
$$o = W_2 h + b_2 = \begin{bmatrix} 1 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ 5 \end{bmatrix} + 0.5 = -10 + 0.5 = -9.5.$$

Checked answer: the MLP output is $-9.5$. The first hidden unit is inactive because its preactivation is negative.
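The hand computation can be replayed in a few lines of tensor code:

```python
import torch

# The tiny MLP from the worked example, checked numerically.
x = torch.tensor([1.0, -2.0])
W1 = torch.tensor([[1.0, 1.0], [2.0, -1.0]])
b1 = torch.tensor([0.0, 1.0])
W2 = torch.tensor([[1.0, -2.0]])
b2 = torch.tensor(0.5)

h = torch.relu(W1 @ x + b1)  # hidden activations: tensor([0., 5.])
o = W2 @ h + b2              # output: tensor([-9.5000])
print(h, o)
```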

Worked example 2: dropout preserves expectation

Problem: a hidden activation is $h=[2,4]^T$. Apply inverted dropout with drop probability $p=0.5$. List all possible masked outputs and show that the expected output equals $h$.

Method:

  1. The keep probability is $q=1-p=0.5$.
  2. Each unit is kept independently. If kept, it is scaled by $1/q=2$.
  3. The four possible masks are $[0,0]$, $[1,0]$, $[0,1]$, and $[1,1]$, each with probability $0.25$.
  4. The corresponding outputs are
$$[0,0], \quad [4,0], \quad [0,8], \quad [4,8].$$
  5. Compute the expectation:
$$\mathbb{E}[\tilde{h}] = 0.25[0,0]+0.25[4,0]+0.25[0,8]+0.25[4,8] = [2,4].$$

Checked answer: inverted dropout changes individual training passes but preserves the activation expectation. This is why PyTorch dropout layers automatically disable masking in eval() mode rather than rescaling outputs again.
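The enumeration can be checked mechanically; a sketch using the same $h$ and $p$:

```python
import torch
from itertools import product

# Enumerate all inverted-dropout outcomes for h = [2, 4] with p = 0.5.
h = torch.tensor([2.0, 4.0])
q = 0.5  # keep probability

# Each mask element is 0 (dropped) or 1 (kept, scaled by 1/q).
outputs = [torch.tensor(mask) / q * h for mask in product([0.0, 1.0], repeat=2)]
expectation = sum(0.25 * out for out in outputs)
print(outputs)      # [0,0], [0,8], [4,0], [4,8]
print(expectation)  # tensor([2., 4.])
```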

Code

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(2)

# Toy binary classification: label points by whether they fall outside a circle.
n = 1000
X = 4 * torch.rand(n, 2) - 2
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 1.5).long()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

# Kaiming initialization for the ReLU layers; biases start at zero.
for module in model:
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()
for epoch in range(30):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model.eval()  # disables dropout masking at evaluation time
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy: {accuracy:.3f}")
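Early stopping appears in the table but not in the loop above. A minimal sketch of the stopping logic on a toy regression problem (the data, patience value, and split are hypothetical; only the checkpoint-and-stop pattern matters):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data with a held-out validation split.
X = torch.randn(200, 2)
y = X.sum(dim=1, keepdim=True)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
best_state = {k: v.clone() for k, v in model.state_dict().items()}

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()
    if val < best_val:
        # Validation improved: save a checkpoint and reset the patience counter.
        best_val, bad_epochs = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stop once validation has not improved for `patience` epochs

model.load_state_dict(best_state)  # restore the best checkpoint, not the last
print(f"stopped after epoch {epoch}, best val loss {best_val:.4f}")
```

Restoring the saved checkpoint is what makes early stopping "select" a model, rather than merely truncating training.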

Common pitfalls

  • Stacking linear layers without nonlinear activations and expecting a deeper model.
  • Leaving dropout active during evaluation by forgetting model.eval().
  • Applying dropout to the output logits of a simple classifier without a specific reason to do so.
  • Using sigmoid in deep hidden layers without considering saturation and vanishing gradients.
  • Initializing all weights to zero, which makes hidden units learn identical features.
  • Assuming lower training loss always means a better model. Validation behavior is the relevant generalization signal.

Connections