Optimization Algorithms
Training a neural network is an optimization problem, but D2L is careful to separate optimization from generalization. Optimization asks how to reduce the training objective. Generalization asks whether the learned model works on new data. Deep learning needs both, and the optimizer often determines whether a reasonable architecture trains at all.
The D2L optimization chapters move from gradient descent and convexity to stochastic gradients, minibatches, momentum, adaptive learning rates, and schedules. These methods are not just implementation variants. Each one makes a different tradeoff among gradient noise, curvature, memory, compute, and hyperparameter sensitivity. Understanding the update equations makes debugging training curves much easier.
Definitions
An objective function $f$ maps parameters $\mathbf{x}$ to a scalar value $f(\mathbf{x})$. In supervised learning, $f$ is usually the empirical risk plus optional regularization.
Gradient descent updates parameters by
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \nabla f(\mathbf{x}_t),$$
where $\eta > 0$ is the learning rate.
Stochastic gradient descent replaces the full gradient with a gradient computed from one example or a minibatch $\mathcal{B}_t$:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla f_i(\mathbf{x}_t).$$
Momentum keeps a velocity:
$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_t, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} - \eta \mathbf{v}_t,$$
where $\mathbf{g}_t$ is the (stochastic) gradient and $0 \le \beta < 1$ is the momentum coefficient.
AdaGrad accumulates squared gradients coordinatewise:
$$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.$$
RMSProp uses an exponential moving average of squared gradients:
$$\mathbf{s}_t = \gamma \mathbf{s}_{t-1} + (1 - \gamma) \mathbf{g}_t^2, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.$$
Adam combines momentum and RMSProp-style second moments, with bias correction.
A learning-rate schedule changes $\eta_t$ over time. Common schedules include step decay, exponential decay, cosine decay, warmup, and inverse square-root decay.
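The update rules above can be sketched in plain Python for a single scalar parameter. The function names (`sgd_step`, `momentum_step`, `rmsprop_step`) are illustrative, not from any library:

```python
import math

def sgd_step(x, g, lr):
    # Plain (stochastic) gradient descent: x <- x - lr * g.
    return x - lr * g

def momentum_step(x, v, g, lr, beta=0.9):
    # Velocity accumulates gradients that keep pointing the same way.
    v = beta * v + g
    return x - lr * v, v

def rmsprop_step(x, s, g, lr, gamma=0.9, eps=1e-8):
    # Exponential moving average of squared gradients rescales the step.
    s = gamma * s + (1 - gamma) * g ** 2
    return x - lr * g / (math.sqrt(s) + eps), s

# Two SGD steps on f(x) = x**2, whose gradient is 2 * x.
x = 2.0
for _ in range(2):
    x = sgd_step(x, 2 * x, lr=0.25)
print(x)  # 0.5
```

These scalar versions drop the coordinatewise vectors of the equations, but the arithmetic per coordinate is identical.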
Key results
For smooth objectives, a learning rate that is too large can diverge even when the gradient direction is correct. A learning rate that is too small may make training appear stuck. The useful range depends on curvature, batch size, normalization, initialization, and optimizer choice.
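A one-dimensional sketch of this effect: on $f(x) = x^2$ each gradient-descent step multiplies $x$ by $1 - 2\eta$, so any $\eta > 1$ makes the iterate flip sign and grow. The step counts and rates below are illustrative:

```python
def gd_on_square(x0, lr, steps):
    # Gradient descent on f(x) = x**2; each step is x -= lr * 2 * x,
    # i.e. x is multiplied by (1 - 2 * lr) per step.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

small = gd_on_square(1.0, lr=0.1, steps=20)   # |1 - 0.2| = 0.8: contracts
large = gd_on_square(1.0, lr=1.1, steps=20)   # |1 - 2.2| = 1.2: diverges
print(abs(small), abs(large))
```

The same gradient direction is correct in both runs; only the step size differs.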
Convexity provides clean guarantees, but deep networks are nonconvex. D2L still studies convexity because it gives language for local minima, saddle points, curvature, and the difference between optimization difficulty and model expressiveness.
Minibatch size controls gradient noise and hardware efficiency. Larger batches produce less noisy gradient estimates and better matrix-kernel utilization, but they can require learning-rate adjustment and more memory. Very small batches are noisy and inefficient on parallel hardware.
Momentum damps oscillations in directions where gradients keep changing sign and accelerates progress in directions where gradients agree. It is especially useful in long, narrow valleys where plain gradient descent zigzags.
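A minimal sketch of this behavior on the ill-conditioned quadratic $f(x, y) = \tfrac{1}{2}(x^2 + 25y^2)$, a long valley that is shallow in $x$ and steep in $y$. The step sizes and momentum coefficient below are illustrative hand-tuned values:

```python
def gd(steps, lr):
    # Plain gradient descent on f(x, y) = 0.5 * (x**2 + 25 * y**2);
    # the gradient is (x, 25 * y).
    x, y = 10.0, 1.0
    for _ in range(steps):
        x, y = x - lr * x, y - lr * 25 * y
    return x, y

def heavy_ball(steps, lr, beta):
    # Same objective with a velocity term: v = beta * v + grad.
    x, y, vx, vy = 10.0, 1.0, 0.0, 0.0
    for _ in range(steps):
        vx, vy = beta * vx + x, beta * vy + 25 * y
        x, y = x - lr * vx, y - lr * vy
    return x, y

# Plain GD's learning rate is capped by the steep y-direction, so the
# shallow x-direction crawls; momentum makes both directions contract fast.
gx, gy = gd(30, lr=0.07)
mx, my = heavy_ball(30, lr=4 / 36, beta=(4 / 6) ** 2)
print((gx**2 + gy**2) ** 0.5, (mx**2 + my**2) ** 0.5)
```

After 30 steps plain GD is still far from the minimum along $x$, while the momentum run is orders of magnitude closer.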
Adaptive methods rescale each coordinate by past gradient magnitude. This helps when features or parameters have different gradient scales. However, adaptive methods still need learning-rate tuning and may generalize differently from SGD with momentum in some vision settings.
Adam's common update is
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2,$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta \, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}.$$
Learning-rate schedules are often as important as the optimizer family. A high initial learning rate may make early progress fast but unstable. Warmup starts with a smaller rate and increases it over early steps, which is common for very deep networks and transformers. Decay then reduces update size so training can settle into a lower-loss region. A schedule changes the optimization trajectory even when every other hyperparameter is fixed.
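One common combination, linear warmup followed by cosine decay, can be written as a pure function of the step index. The function name and constants here are illustrative:

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=1e-5):
    # Linear warmup: ramp from near 0 up to base_lr over warmup_steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: fall from base_lr down to min_lr over the rest.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(99), lr_at(550), lr_at(999))
```

Plotting `lr_at` over all steps shows the characteristic ramp-then-anneal shape; the same function can be handed to a framework scheduler that accepts a per-step multiplier.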
Weight decay should be interpreted carefully with adaptive methods. Classical $\ell_2$ regularization adds $\lambda \mathbf{x}$ to the gradient, after which an adaptive optimizer rescales it along with the loss gradient. Decoupled weight decay, used by optimizers such as AdamW, applies shrinkage separately from the adaptive gradient step. D2L's weight-decay treatment gives the mathematical penalty; modern optimizer APIs may expose both coupled and decoupled variants.
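The difference shows up even in a single-parameter sketch of one Adam-style step with a zero loss gradient: coupled decay is rescaled by the adaptive denominator, while decoupled decay shrinks the weight directly. All names and values below are illustrative, not a library implementation:

```python
import math

def adam_like_step(x, g, lr=1e-3, b1=0.9, b2=0.999, wd=0.01,
                   decoupled=False, eps=1e-8):
    # One Adam-style step from zero moment estimates (t = 1), with weight
    # decay applied either coupled (added to g) or decoupled (separate shrink).
    if not decoupled:
        g = g + wd * x                       # coupled: decay enters the rescaling
    m_hat = (1 - b1) * g / (1 - b1)          # bias-corrected first moment, t = 1
    v_hat = (1 - b2) * g * g / (1 - b2)      # bias-corrected second moment, t = 1
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        x = x - lr * wd * x                  # decoupled: plain shrinkage (AdamW-style)
    return x

coupled = adam_like_step(1.0, g=0.0)
decoupled = adam_like_step(1.0, g=0.0, decoupled=True)
print(coupled, decoupled)
```

With a zero loss gradient, the coupled variant normalizes the tiny decay term into a full-sized step, while the decoupled variant shrinks the weight by only `lr * wd`; this is exactly why the two regularize differently.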
Stochastic gradients also act as noise. That noise can help escape sharp or poor regions, but too much noise prevents convergence. Batch size, data order, augmentation, dropout, and distributed training all affect the noise scale. This is why training recipes specify the optimizer, batch size, schedule, and regularization together rather than independently.
Training curves should be interpreted jointly. If training loss does not decrease, suspect optimization: learning rate, initialization, gradients, architecture, or data pipeline. If training loss decreases but validation loss worsens, suspect overfitting or distribution mismatch. If both losses are noisy, inspect batch size, label noise, and optimizer settings. Optimizers provide update rules, but diagnostics decide which rule or hyperparameter needs attention.
Gradient norms are a compact health signal. Norms near zero can indicate saturation, excessive regularization, or broken graph connections. Extremely large norms can indicate exploding gradients, too large a learning rate, or unstable recurrent dynamics. Clipping changes the update, so it should be used deliberately and monitored rather than added blindly to hide numerical problems.
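In PyTorch, the total gradient norm can be logged and capped in one place; `torch.nn.utils.clip_grad_norm_` clips in place and returns the norm measured before clipping, which makes it a convenient health signal. The model and target below are illustrative:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
x = torch.randn(4, 10)

# A deliberately huge target produces large gradients.
loss = ((model(x) - 1000.0) ** 2).mean()
loss.backward()

# Returns the pre-clipping total norm; gradients are rescaled in place.
pre_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
post_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(float(pre_norm), float(post_norm))
```

Logging `pre_norm` every step gives the health signal; the clipped update is what the optimizer actually applies, which is why clipping should be a deliberate choice.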
The optimizer state should be treated as part of training. Resuming from a checkpoint with model weights but without momentum or Adam moments changes the next updates. For serious experiments, save model state, optimizer state, scheduler state, and epoch or step counters together.
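A checkpoint that keeps updates reproducible therefore bundles all the pieces together. The dictionary keys below are a common convention, not a fixed API:

```python
import os
import tempfile
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so the optimizer has a nonzero momentum buffer to save.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),  # includes momentum buffers
    # A learning-rate scheduler's state_dict() would also belong here.
    "step": 1,
}, path)

# Restoring into fresh objects reproduces both weights and optimizer state.
model2 = nn.Linear(4, 1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1, momentum=0.9)
ckpt = torch.load(path)
model2.load_state_dict(ckpt["model"])
optimizer2.load_state_dict(ckpt["optimizer"])
print(ckpt["step"])
```

Loading only `"model"` and skipping `"optimizer"` would zero the momentum buffer, so the first resumed update would differ from an uninterrupted run.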
Visual
| Optimizer | Extra state | Strength | Main tuning concern |
|---|---|---|---|
| Full GD | None | Exact training gradient | Expensive on large data |
| SGD | None | Cheap noisy updates | Learning-rate schedule |
| Minibatch SGD | None | Hardware efficient compromise | Batch size and learning rate |
| Momentum | Velocity | Faster in valleys | Momentum coefficient |
| AdaGrad | Sum of squared gradients | Rare feature adaptation | Learning rate can decay too much |
| RMSProp | Moving squared average | Nonstationary adaptation | Decay rate |
| Adam | First and second moments | Strong default optimizer | Weight decay and schedule |
Worked example 1: gradient descent on a quadratic
Problem: minimize $f(x) = x^2$ starting from $x_0 = 2$ with learning rate $\eta = 0.25$. Compute two gradient-descent steps.
Method:
- Compute the derivative: $f'(x) = 2x$
- Step 1 gradient: $f'(x_0) = f'(2) = 4$
- Step 1 update: $x_1 = x_0 - \eta f'(x_0) = 2 - 0.25 \cdot 4 = 1$
- Step 2 gradient: $f'(x_1) = f'(1) = 2$
- Step 2 update: $x_2 = x_1 - \eta f'(x_1) = 1 - 0.25 \cdot 2 = 0.5$
- Check objective decrease: $f(x_0) = 4$, $f(x_1) = 1$, $f(x_2) = 0.25$
Checked answer: after two steps, $x_2 = 0.5$ and the objective has decreased monotonically from $4$ to $1$ to $0.25$.
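The two steps can be checked numerically, using $f(x) = x^2$, $x_0 = 2$, and $\eta = 0.25$ as illustrative values:

```python
def f(x):
    return x ** 2

x, lr = 2.0, 0.25
trajectory = [f(x)]
for _ in range(2):
    x -= lr * 2 * x          # gradient of x**2 is 2 * x
    trajectory.append(f(x))
print(x, trajectory)  # 0.5 [4.0, 1.0, 0.25]
```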
Worked example 2: first Adam update in one dimension
Problem: compute the first Adam update for one parameter. Let $g_1 = 2$, $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $m_0 = 0$, $v_0 = 0$, and ignore $\epsilon$ for simplicity.
Method:
- First moment: $m_1 = \beta_1 m_0 + (1 - \beta_1) g_1 = 0.1 \cdot 2 = 0.2$
- Second moment: $v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2 = 0.001 \cdot 4 = 0.004$
- Bias-correct first moment: $\hat{m}_1 = m_1 / (1 - \beta_1^1) = 0.2 / 0.1 = 2$
- Bias-correct second moment: $\hat{v}_1 = v_1 / (1 - \beta_2^1) = 0.004 / 0.001 = 4$
- Update amount: $\eta \, \hat{m}_1 / \sqrt{\hat{v}_1} = 0.001 \cdot 2 / 2 = 0.001$
Checked answer: the first Adam step subtracts approximately $0.001$, i.e. $\eta$ itself, from the parameter. Bias correction makes the first step behave like a normalized gradient step rather than a tiny moment estimate.
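The arithmetic can be verified directly, using $g_1 = 2$, $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ as illustrative values:

```python
import math

g, lr, b1, b2 = 2.0, 0.001, 0.9, 0.999

m = (1 - b1) * g                  # first moment from m0 = 0
v = (1 - b2) * g ** 2             # second moment from v0 = 0
m_hat = m / (1 - b1 ** 1)         # bias correction at t = 1 recovers g
v_hat = v / (1 - b2 ** 1)         # bias correction at t = 1 recovers g**2
update = lr * m_hat / math.sqrt(v_hat)   # epsilon ignored for simplicity
print(update)
```

Because $\hat{m}_1 = g_1$ and $\sqrt{\hat{v}_1} = |g_1|$, the first step is $\eta \cdot \mathrm{sign}(g_1)$ regardless of the gradient's magnitude, which is the normalized-step behavior described above.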
Code
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(7)

# Synthetic linear-regression data: y = X @ true_w + noise.
n, d = 512, 5
X = torch.randn(n, d)
true_w = torch.arange(1.0, d + 1).reshape(d, 1)
y = X @ true_w + 0.1 * torch.randn(n, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

def train(optimizer_name):
    model = nn.Linear(d, 1)
    loss_fn = nn.MSELoss()
    if optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(20):
        for xb, yb in loader:
            loss = loss_fn(model(xb), yb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Report the final full-data training loss.
    with torch.no_grad():
        return loss_fn(model(X), y).item()

print("SGD with momentum:", train("sgd"))
print("Adam:", train("adam"))
```
Common pitfalls
- Treating the optimizer as a magic default instead of checking update scale and schedules.
- Comparing optimizers with different effective learning-rate tuning effort.
- Forgetting to exclude biases or normalization parameters from weight decay when the recipe calls for it.
- Increasing batch size without reconsidering learning rate, warmup, and memory limits.
- Ignoring gradient clipping in recurrent or very deep models with unstable gradients.
- Reading training loss alone; optimization success still needs validation checks for generalization.