Math for Deep Learning
The mathematical preliminaries in D2L are intentionally compact: enough linear algebra to express neural networks, enough calculus to understand gradient-based learning, enough automatic differentiation to implement training, and enough probability to reason about data, uncertainty, loss functions, and generalization. The point is not to turn every model into a theorem, but to make the symbols in later chapters operational.
Deep learning repeatedly applies the same pattern. A model maps inputs to predictions through differentiable tensor operations. A loss function turns predictions into a scalar objective. Automatic differentiation computes gradients of that scalar with respect to model parameters. An optimizer uses those gradients to change the parameters. Linear algebra describes the computation; calculus supplies the local direction of improvement; probability explains why losses such as squared error and cross-entropy are natural.
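To make this pattern concrete, here is a minimal sketch of one such training loop in PyTorch; the synthetic data, the linear model, the squared-error loss, and the SGD settings are illustrative choices rather than anything prescribed by the text.

```python
import torch

# Illustrative only: a tiny synthetic regression problem.
X = torch.randn(64, 3)                    # minibatch of 64 examples, 3 features
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 0.01 * torch.randn(64, 1)

model = torch.nn.Linear(3, 1)             # differentiable map from inputs to predictions
loss_fn = torch.nn.MSELoss()              # turns predictions into a scalar objective
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    opt.zero_grad()                       # clear gradients accumulated from the last step
    loss = loss_fn(model(X), y)           # forward pass: predictions -> scalar loss
    loss.backward()                       # autograd: gradients of the loss w.r.t. parameters
    opt.step()                            # optimizer: use gradients to change the parameters
```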
Definitions
A scalar is a single number, often written $x$. A vector is an ordered list of numbers $\mathbf{x} = [x_1, \ldots, x_n]^\top$. A matrix is a rectangular array $\mathbf{A} \in \mathbb{R}^{m \times n}$. A tensor is an array with any number of axes.
The dot product of two vectors is
$$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i.$$
The matrix-vector product $\mathbf{A}\mathbf{x}$ combines each row of $\mathbf{A}$ with $\mathbf{x}$ by a dot product. The matrix-matrix product composes linear transformations when their inner dimensions match.
A norm measures size. The common vector norms are the $\ell_2$ norm $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$ and the $\ell_1$ norm $\|\mathbf{x}\|_1 = \sum_i |x_i|$.
For a scalar-valued function $f: \mathbb{R} \to \mathbb{R}$, the derivative $f'(x)$ measures the instantaneous rate of change. For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is
$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top.$$
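The following snippet checks these definitions numerically in PyTorch; the particular vectors and the function $f(\mathbf{x}) = \sum_i x_i^2$ are chosen only for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

print(torch.dot(x, y))              # dot product: 1*4 + 2*5 + 3*6 = 32
print(A @ x)                        # matrix-vector product: one dot product per row of A
print(torch.linalg.norm(x))         # l2 norm: sqrt(1 + 4 + 9)
print(torch.linalg.norm(x, ord=1))  # l1 norm: 1 + 2 + 3

# Gradient of the scalar-valued f(x) = sum(x**2), whose gradient is 2x.
x.requires_grad_(True)
f = (x ** 2).sum()
f.backward()
print(x.grad)                       # tensor([2., 4., 6.])
```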
Automatic differentiation records a computation graph during the forward pass and applies the chain rule during the backward pass. PyTorch stores gradients in the .grad field of tensors with requires_grad=True.
A random variable assigns numeric values to uncertain outcomes. The expectation $E[X]$ is a probability-weighted average. Conditional probability $P(A \mid B)$ measures the probability of $A$ after assuming $B$ occurred.
Key results
Linear models use matrix multiplication to express many predictions at once:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b.$$
This single equation covers an entire minibatch. It also reveals why matching dimensions matters: $\mathbf{X}$ has shape $n \times d$, $\mathbf{w}$ has shape $d \times 1$, and $\hat{\mathbf{y}}$ has shape $n \times 1$.
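A quick shape check, with an arbitrary minibatch size and feature count chosen just to make the dimensions visible:

```python
import torch

n, d = 4, 3                 # illustrative minibatch size and feature count
X = torch.randn(n, d)       # one example per row
w = torch.randn(d, 1)       # weight column vector
b = torch.tensor(0.5)       # scalar bias, broadcast across the minibatch
y_hat = X @ w + b           # predictions for the whole minibatch at once
print(X.shape, w.shape, y_hat.shape)  # (4, 3), (3, 1), (4, 1)
```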
The chain rule is the main calculus result behind backpropagation. If $y = f(u)$ and $u = g(x)$, then
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}.$$
For vector-valued intermediate quantities, the same idea applies through Jacobians, but deep learning frameworks avoid explicitly materializing most Jacobian matrices. They propagate vector-Jacobian products efficiently from the scalar loss back to parameters.
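A small scalar example of the chain rule checked against autograd; the composition of sine and a cube is an arbitrary illustration, not one used elsewhere in the text.

```python
import torch

# y = u**3 with u = sin(x), so the chain rule gives dy/dx = 3 * sin(x)**2 * cos(x).
x = torch.tensor(0.7, requires_grad=True)
y = torch.sin(x) ** 3
y.backward()

print(x.grad)                                    # autograd's answer
with torch.no_grad():
    print(3 * torch.sin(x) ** 2 * torch.cos(x))  # hand-applied chain rule, same value
```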
A local first-order approximation explains gradient descent:
$$f(\mathbf{x} + \boldsymbol{\epsilon}) \approx f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}).$$
Choosing $\boldsymbol{\epsilon} = -\eta \nabla f(\mathbf{x})$ gives
$$f\bigl(\mathbf{x} - \eta \nabla f(\mathbf{x})\bigr) \approx f(\mathbf{x}) - \eta\,\|\nabla f(\mathbf{x})\|^2,$$
so a sufficiently small positive learning rate should reduce the objective locally.
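One explicit step makes this concrete; the quadratic objective and the learning rate below are illustrative.

```python
import torch

# f(x) = sum(x**2); a single step x - eta * grad f(x) should lower f for small eta.
x = torch.tensor([3.0, -2.0], requires_grad=True)
f = (x ** 2).sum()
f.backward()

eta = 0.1
with torch.no_grad():
    x_new = x - eta * x.grad          # epsilon = -eta * grad f(x)
    f_new = (x_new ** 2).sum()
print(f.item(), f_new.item())         # 13.0 -> 8.32
```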
Probability connects losses to statistical assumptions. If targets follow
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$$
then maximizing the Gaussian likelihood is equivalent to minimizing squared error. If class labels follow a categorical distribution predicted by softmax probabilities, maximizing likelihood is equivalent to minimizing cross-entropy.
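A numeric sanity check of the categorical case, with arbitrary logits: PyTorch's cross-entropy loss equals the negative log of the softmax probability assigned to the true class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # arbitrary scores for three classes
label = torch.tensor([0])                   # the true class index

ce = F.cross_entropy(logits, label)                    # built-in cross-entropy
nll = -torch.log(F.softmax(logits, dim=1)[0, label])   # negative log-likelihood by hand
print(ce.item(), nll.item())                           # the two values agree
```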
Automatic differentiation should be understood as exact differentiation of the executed program, not symbolic algebra over the mathematical expression in a textbook. If Python control flow chooses one branch, autograd differentiates the operations that actually ran. This is powerful because models can include loops and conditionals, but it also means that converting tensors to Python numbers can break the graph.
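A minimal sketch of differentiation through Python control flow; the branching function is an arbitrary example.

```python
import torch

def f(x):
    # Ordinary Python control flow; autograd records only the branch that runs.
    if x.sum() > 0:
        return (x ** 2).sum()
    return x.sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
f(x).backward()
print(x.grad)   # gradient of the x**2 branch that actually executed: tensor([2., 4.])
```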
Most deep learning objectives are scalar because reverse-mode automatic differentiation is efficient for many parameters and one output. The framework propagates gradients from the scalar loss backward through the graph. When the output is not scalar, PyTorch asks for the upstream gradient because it needs to know which vector-Jacobian product to compute.
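The sketch below shows the non-scalar case: calling backward() on a vector output requires an explicit upstream gradient, and supplying a vector of ones reproduces the gradient of the sum.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                      # non-scalar output; y.backward() alone would raise an error

v = torch.ones_like(y)          # upstream gradient defining the vector-Jacobian product
y.backward(gradient=v)
print(x.grad)                   # same as the gradient of y.sum(): tensor([2., 4., 6.])
```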
Probability also explains why empirical averages appear everywhere. The true risk is an expectation over the data distribution, but the distribution is unknown. Training replaces it with a sample average over the dataset or minibatch. Generalization asks whether this finite-sample approximation led to parameters that work beyond the observed examples.
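A tiny simulation of this finite-sample approximation, using a standard normal whose exact second moment is 1; the distribution and sample sizes are illustrative.

```python
import torch

torch.manual_seed(0)
# The "true risk" here is the exact expectation E[z**2] = 1 for z ~ N(0, 1);
# a minibatch can only offer a sample average that approximates it.
for batch_size in (10, 1_000, 100_000):
    z = torch.randn(batch_size)
    print(batch_size, (z ** 2).mean().item())   # drifts toward 1 as the sample grows
```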
Notation consistency reduces cognitive load in deep learning. D2L usually reserves uppercase letters for matrices or tensors, bold lowercase symbols for vectors, and plain lowercase symbols for scalars. Code does not enforce this distinction, so the reader must keep track of whether a tensor represents one example, a minibatch, a parameter matrix, or a scalar loss. Many derivation mistakes are shape mistakes in disguise.
The link between likelihood and loss is a recurring modeling choice. Squared error corresponds to Gaussian noise with fixed variance. Cross-entropy corresponds to categorical likelihood. Other data assumptions lead to other losses, such as Poisson losses for counts or quantile losses for asymmetric prediction. D2L focuses on common losses, but the principle is broader: choose objectives that match the data-generating story and task metric.
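Two examples of such alternatives, sketched under the assumption that they fit the task: PyTorch ships a Poisson negative log-likelihood loss for counts, while the pinball_loss helper below is a hypothetical hand-rolled quantile loss.

```python
import torch

def pinball_loss(pred, target, quantile=0.9):
    # Hypothetical quantile (pinball) loss: penalizes under-prediction more when quantile > 0.5.
    diff = target - pred
    return torch.mean(torch.maximum(quantile * diff, (quantile - 1) * diff))

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 1.0, 4.0])
print(pinball_loss(pred, target))

# Built-in Poisson negative log-likelihood for count targets (input is the log rate).
poisson_nll = torch.nn.PoissonNLLLoss(log_input=True)
print(poisson_nll(torch.zeros(3), torch.tensor([0.0, 1.0, 2.0])))
```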
Visual
| Mathematical tool | Deep learning role | Typical failure when misunderstood |
|---|---|---|
| Matrix multiplication | Batches, linear layers, attention scores | Inner dimensions do not match |
| Norms | Regularization, gradient clipping, distance | Penalizing the wrong parameter group |
| Gradients | Direction of steepest local increase | Updating with the wrong sign |
| Chain rule | Backpropagation through layers | Detaching tensors accidentally |
| Expectation | Risk, average loss, sampling | Confusing sample mean with exact expectation |
| Conditional probability | Supervised prediction and Bayes reasoning | Ignoring what is being conditioned on |
Worked example 1: gradient of a quadratic loss
Problem: compute the gradient of
$$L(\mathbf{w}) = \tfrac{1}{2}\,\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
for
$$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$
Method:
- Define the residual $\mathbf{r} = \mathbf{X}\mathbf{w} - \mathbf{y}$.
- Subtract $\mathbf{y}$: $\mathbf{X}\mathbf{w} = [-1, -1]^\top$, so $\mathbf{r} = [-1, -2]^\top$.
- Use the standard gradient result: $\nabla_{\mathbf{w}} L = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y})$.
- Compute: $\nabla_{\mathbf{w}} L = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} -7 \\ -10 \end{bmatrix}$.
Checked answer: $\nabla_{\mathbf{w}} L = [-7, -10]^\top$. A gradient-descent step subtracts this gradient, so it increases both components of $\mathbf{w}$ for a small positive learning rate. The Code section below reproduces this computation with autograd.
Worked example 2: Bayes rule for a classifier signal
Problem: a detector flags an image as containing a rare class $C$. The class has a small prior probability $P(C)$. The detector has true positive rate $P(F \mid C)$ and false positive rate $P(F \mid \neg C)$, where $F$ is the event that the detector flags the image. Find $P(C \mid F)$.
Method:
- Write Bayes rule: $P(C \mid F) = \dfrac{P(F \mid C)\,P(C)}{P(F)}$.
- Expand the denominator by total probability: $P(F) = P(F \mid C)\,P(C) + P(F \mid \neg C)\,P(\neg C)$.
- Substitute: $P(C \mid F) = \dfrac{P(F \mid C)\,P(C)}{P(F \mid C)\,P(C) + P(F \mid \neg C)\,\bigl(1 - P(C)\bigr)}$.
- Compute the posterior: when $P(C)$ is small, the false-positive term dominates the denominator, so the posterior stays well below 1 even for an accurate detector.
Checked answer: even with a strong detector, a positive flag gives only a modest posterior probability because the class is rare. This is the base-rate effect, and it is one reason accuracy alone can be misleading.
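To see the effect numerically, the snippet below plugs in assumed illustrative values (a 1% prior, a 90% true positive rate, and a 5% false positive rate); these numbers are assumptions, not values from the example above.

```python
# Assumed illustrative numbers, not values from the worked example above.
prior = 0.01            # P(C): the class is rare
tpr = 0.90              # P(F | C): true positive rate
fpr = 0.05              # P(F | not C): false positive rate

p_flag = tpr * prior + fpr * (1 - prior)   # P(F) by total probability
posterior = tpr * prior / p_flag           # Bayes rule: P(C | F)
print(posterior)                           # about 0.15, far from certain
```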
Code
```python
import torch

# Same quadratic loss as worked example 1: L(w) = 0.5 * ||Xw - y||^2.
X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[0.0], [1.0]])
w = torch.tensor([[1.0], [-1.0]], requires_grad=True)

loss = 0.5 * torch.sum((X @ w - y) ** 2)
loss.backward()                              # populates w.grad with dL/dw
print("loss:", loss.item())                  # 2.5
print("autograd gradient:", w.grad)          # tensor([[-7.], [-10.]])

# Closed-form gradient X^T (Xw - y) for comparison; detach w to leave the graph alone.
manual_grad = X.T @ (X @ w.detach() - y)
print("manual gradient:", manual_grad)
```
Common pitfalls
- Confusing a row vector, column vector, and rank-1 tensor. PyTorch often accepts rank-1 tensors, but derivations require explicit orientation.
- Calling backward() repeatedly without clearing accumulated gradients. Optimizers usually need zero_grad() each iteration; see the sketch after this list.
- Backpropagating through non-scalar outputs without supplying a gradient argument.
- Detaching a tensor or converting it to NumPy before all needed gradients have been computed.
- Treating probability estimates as calibrated merely because they came from a softmax.
- Forgetting that gradient descent moves opposite the gradient, while the gradient points toward steepest local increase.
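A short sketch of the gradient-accumulation pitfall mentioned above; the toy function is arbitrary.

```python
import torch

# Gradients accumulate across backward() calls until they are cleared.
w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()
print(w.grad)            # tensor([2.])

(2 * w).sum().backward()
print(w.grad)            # tensor([4.]): accumulated, not replaced

w.grad.zero_()           # what optimizer.zero_grad() does for each parameter
(2 * w).sum().backward()
print(w.grad)            # tensor([2.]) again
```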