
Math for Deep Learning

The mathematical preliminaries in D2L are intentionally compact: enough linear algebra to express neural networks, enough calculus to understand gradient-based learning, enough automatic differentiation to implement training, and enough probability to reason about data, uncertainty, loss functions, and generalization. The point is not to turn every model into a theorem, but to make the symbols in later chapters operational.

Deep learning repeatedly applies the same pattern. A model maps inputs to predictions through differentiable tensor operations. A loss function turns predictions into a scalar objective. Automatic differentiation computes gradients of that scalar with respect to model parameters. An optimizer uses those gradients to change the parameters. Linear algebra describes the computation; calculus supplies the local direction of improvement; probability explains why losses such as squared error and cross-entropy are natural.

Definitions

A scalar is a single number, often written $x \in \mathbb{R}$. A vector is an ordered list $x \in \mathbb{R}^d$. A matrix is a rectangular array $X \in \mathbb{R}^{m \times n}$. A tensor is an array with any number of axes.

The dot product of two vectors $x, y \in \mathbb{R}^d$ is

$$x^T y = \sum_{j=1}^d x_j y_j.$$

The matrix-vector product $Xw$ combines each row $x_i$ of $X$ with $w$ by a dot product. The matrix-matrix product $AB$ composes linear transformations when their inner dimensions match.

A norm measures size. The common vector norms are

$$\|x\|_1 = \sum_j |x_j|, \qquad \|x\|_2 = \sqrt{\sum_j x_j^2}.$$
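These definitions map directly onto tensor operations. A minimal sketch in PyTorch (not from the book's code) checking the dot product, matrix-vector product, and both norms on small hand-computable inputs:

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0])
y = torch.tensor([4.0, 0.0, -1.0])

# Dot product: sum of elementwise products = 1*4 + (-2)*0 + 3*(-1) = 1.
dot = torch.dot(x, y)

# Matrix-vector product: each output entry is a row of X dotted with w.
X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
w = torch.tensor([1.0, -1.0])
Xw = X @ w  # tensor([-1., -1.])

# L1 norm: sum of absolute values; L2 norm: square root of sum of squares.
l1 = x.abs().sum()                # 6.0
l2 = torch.linalg.vector_norm(x)  # sqrt(14)

print(dot.item(), Xw.tolist(), l1.item(), l2.item())
```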

For a scalar-valued function $f(x)$, the derivative $f'(x)$ measures the instantaneous rate of change. For $f: \mathbb{R}^d \to \mathbb{R}$, the gradient is

$$\nabla_x f = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_d} \right]^T.$$

Automatic differentiation records a computation graph during the forward pass and applies the chain rule during the backward pass. PyTorch stores gradients in the .grad field of tensors with requires_grad=True.
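The forward-record, backward-chain-rule pattern fits in a few lines. A minimal sketch differentiating $f(x) = x^2 + 3x$ at $x = 2$, where the hand-computed derivative is $2x + 3 = 7$:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
f = x ** 2 + 3 * x
f.backward()   # applies the chain rule through the recorded graph
print(x.grad)  # df/dx = 2x + 3 = 7 at x = 2
```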

A random variable assigns numeric values to uncertain outcomes. The expectation $\mathbb{E}[X]$ is a probability-weighted average. The conditional probability $P(A \mid B)$ measures the probability of $A$ given that $B$ occurred.

Key results

Linear models use matrix multiplication to express many predictions at once:

$$\hat{y} = Xw + b.$$

This single equation covers an entire minibatch. It also reveals why matching dimensions matters: $X$ has shape $n \times d$, $w$ has shape $d$, and $\hat{y}$ has shape $n$.
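The shape bookkeeping can be verified mechanically. A small sketch (sizes $n = 4$, $d = 3$ chosen arbitrarily) showing that one matrix multiply produces one prediction per example, with the scalar bias broadcast over the batch:

```python
import torch

n, d = 4, 3            # minibatch size and feature dimension
X = torch.randn(n, d)  # one example per row
w = torch.randn(d)     # weight vector of shape (d,)
b = 0.5                # scalar bias, broadcast across the batch

y_hat = X @ w + b      # shape (n,): one prediction per example
print(y_hat.shape)
```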

The chain rule is the main calculus result behind backpropagation. If $z = g(x)$ and $y = f(z)$, then

$$\frac{dy}{dx} = \frac{dy}{dz}\frac{dz}{dx}.$$

For vector-valued intermediate quantities, the same idea applies through Jacobians, but deep learning frameworks avoid explicitly materializing most Jacobian matrices. They propagate vector-Jacobian products efficiently from the scalar loss back to parameters.
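A vector-Jacobian product can be requested explicitly. In this sketch $y = x^2$ elementwise, so the Jacobian is $\mathrm{diag}(2x)$; with upstream vector $v = [1, 1, 1]$, the product $v^T J$ is just $2x$, and the full $3 \times 3$ Jacobian is never materialized:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                         # elementwise, so J = diag(2x)
v = torch.tensor([1.0, 1.0, 1.0])  # the "upstream" gradient vector

# autograd returns v^T J directly, without building J.
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp)  # 2x = [2., 4., 6.]
```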

A local first-order approximation explains gradient descent:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x.$$

Choosing $\Delta x = -\eta \nabla f(x)$ gives

$$f(x + \Delta x) \approx f(x) - \eta \|\nabla f(x)\|_2^2,$$

so a sufficiently small positive learning rate $\eta$ should reduce the objective locally.
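One step of this argument can be run directly. A sketch on $f(x) = (x - 3)^2$ starting at $x = 0$: the gradient is $2(x - 3) = -6$, so with $\eta = 0.1$ the step moves $x$ to $0.6$ and the loss drops from $9$ to $5.76$:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
f = (x - 3.0) ** 2
f.backward()  # x.grad = 2 * (x - 3) = -6

eta = 0.1
with torch.no_grad():
    x_new = x - eta * x.grad  # Delta x = -eta * grad f(x)

f_new = (x_new - 3.0) ** 2
print(f.item(), "->", f_new.item())  # 9.0 -> 5.76
```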

Probability connects losses to statistical assumptions. If targets follow

$$y = x^T w + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

then maximizing the Gaussian likelihood is equivalent to minimizing squared error. If class labels follow a categorical distribution predicted by softmax probabilities, maximizing likelihood is equivalent to minimizing cross-entropy.
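The Gaussian case can be checked numerically. For fixed $\sigma$, the negative log-likelihood of a residual $r$ is $\tfrac{1}{2}\log(2\pi\sigma^2) + r^2/(2\sigma^2)$, a constant plus a multiple of the squared error, so the two objectives rank candidate weights identically. A sketch with arbitrary residuals and $\sigma = 1$:

```python
import math

def gauss_nll(residual, sigma=1.0):
    # Negative log-density of N(0, sigma^2) evaluated at the residual.
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + residual ** 2 / (2 * sigma ** 2)

residuals = [0.5, -1.0, 2.0]
nll = sum(gauss_nll(r) for r in residuals)
sse = sum(r ** 2 for r in residuals)
const = len(residuals) * 0.5 * math.log(2 * math.pi)

# The NLL equals a data-independent constant plus half the squared error.
print(abs(nll - (const + 0.5 * sse)) < 1e-12)
```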

Automatic differentiation should be understood as exact differentiation of the executed program, not symbolic algebra over the mathematical expression in a textbook. If Python control flow chooses one branch, autograd differentiates the operations that actually ran. This is powerful because models can include loops and conditionals, but it also means that converting tensors to Python numbers can break the graph.
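A small sketch of this point: with a Python `if` in the model, autograd differentiates only the branch that executed. At $x = 2$ the positive branch runs, so the gradient is $2x = 4$, not a derivative of a symbolic piecewise expression:

```python
import torch

def f(x):
    # Python control flow: only one branch is recorded in the graph.
    if x.item() > 0:
        return x ** 2
    return -x

x = torch.tensor(2.0, requires_grad=True)
f(x).backward()
print(x.grad)  # 2x = 4 at x = 2
```

Note that `x.item()` is safe here only because it steers control flow; using the resulting Python number inside the computation itself would break the graph.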

Most deep learning objectives are scalar because reverse-mode automatic differentiation is efficient for many parameters and one output. The framework propagates gradients from the scalar loss backward through the graph. When the output is not scalar, PyTorch asks for the upstream gradient because it needs to know which vector-Jacobian product to compute.
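A sketch of the non-scalar case: `y.backward()` alone would raise an error for a vector output, while supplying the upstream gradient tells PyTorch which vector-Jacobian product to compute:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = 3 * x  # non-scalar output; Jacobian is 3 * I

# The gradient argument is the upstream vector v in v^T J.
y.backward(gradient=torch.ones_like(y))
print(x.grad)  # [3., 3.]
```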

Probability also explains why empirical averages appear everywhere. The true risk is an expectation over the data distribution, but the distribution is unknown. Training replaces it with a sample average over the dataset or minibatch. Generalization asks whether this finite-sample approximation led to parameters that work beyond the observed examples.
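This substitution can be illustrated with a toy example (not from the book's code): take the "loss" to be $U^2$ for $U$ uniform on $[0, 1]$, whose true expectation is $1/3$, and compare sample averages of different sizes against it:

```python
import random

random.seed(0)

def sample_mean(n):
    # Empirical average of U^2 over n draws; estimates E[U^2] = 1/3.
    return sum(random.random() ** 2 for _ in range(n)) / n

small, large = sample_mean(100), sample_mean(100_000)
print(abs(small - 1/3), abs(large - 1/3))  # larger samples tend to be closer
```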

Notation consistency reduces cognitive load in deep learning. D2L usually reserves uppercase letters for matrices or tensors, lowercase bold-like symbols for vectors, and plain lowercase symbols for scalars. Code does not enforce this distinction, so the reader must keep track of whether a tensor represents one example, a minibatch, a parameter matrix, or a scalar loss. Many derivation mistakes are shape mistakes in disguise.

The link between likelihood and loss is a recurring modeling choice. Squared error corresponds to Gaussian noise with fixed variance. Cross-entropy corresponds to categorical likelihood. Other data assumptions lead to other losses, such as Poisson losses for counts or quantile losses for asymmetric prediction. D2L focuses on common losses, but the principle is broader: choose objectives that match the data-generating story and task metric.

Visual

| Mathematical tool | Deep learning role | Typical failure when misunderstood |
| --- | --- | --- |
| Matrix multiplication | Batches, linear layers, attention scores | Inner dimensions do not match |
| Norms | Regularization, gradient clipping, distance | Penalizing the wrong parameter group |
| Gradients | Direction of steepest local increase | Updating in the wrong direction |
| Chain rule | Backpropagation through layers | Detaching tensors accidentally |
| Expectation | Risk, average loss, sampling | Confusing sample mean with exact expectation |
| Conditional probability | Supervised prediction and Bayes reasoning | Ignoring what is being conditioned on |

Worked example 1: gradient of a quadratic loss

Problem: compute the gradient of

$$f(w) = \frac{1}{2}\|Xw - y\|_2^2$$

for

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Method:

  1. Define the residual $r = Xw - y$ and compute $Xw$:
     $$Xw = \begin{bmatrix} 1(1)+2(-1) \\ 3(1)+4(-1) \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}.$$
  2. Subtract $y$:
     $$r = \begin{bmatrix} -1 \\ -1 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ -2 \end{bmatrix}.$$
  3. Use the standard gradient result:
     $$\nabla_w f = X^T(Xw - y) = X^T r.$$
  4. Compute:
     $$X^T r = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} -1 - 6 \\ -2 - 8 \end{bmatrix} = \begin{bmatrix} -7 \\ -10 \end{bmatrix}.$$

Checked answer: $\nabla_w f = [-7, -10]^T$. A gradient-descent step subtracts this gradient, so it increases both components of $w$ for a small positive learning rate.

Worked example 2: Bayes rule for a classifier signal

Problem: a detector flags an image as containing a rare class. The prior probability of the class is $P(C) = 0.01$. The detector has true positive rate $P(F \mid C) = 0.95$ and false positive rate $P(F \mid \neg C) = 0.05$. Find $P(C \mid F)$.

Method:

  1. Write Bayes rule:
     $$P(C \mid F) = \frac{P(F \mid C)P(C)}{P(F)}.$$
  2. Expand the denominator by total probability:
     $$P(F) = P(F \mid C)P(C) + P(F \mid \neg C)P(\neg C).$$
  3. Substitute values:
     $$P(F) = 0.95(0.01) + 0.05(0.99) = 0.0095 + 0.0495 = 0.059.$$
  4. Compute the posterior:
     $$P(C \mid F) = \frac{0.95(0.01)}{0.059} = \frac{0.0095}{0.059} \approx 0.161.$$

Checked answer: even with a strong detector, a positive flag gives only about a $16.1\%$ posterior probability because the class is rare. This is the base-rate effect, and it is one reason accuracy alone can be misleading.
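The arithmetic in this example can be verified in plain Python; the variable names below mirror the quantities in the derivation:

```python
p_c = 0.01             # prior P(C)
p_f_given_c = 0.95     # true positive rate P(F | C)
p_f_given_not_c = 0.05 # false positive rate P(F | not C)

# Total probability of a flag, then Bayes rule for the posterior.
p_f = p_f_given_c * p_c + p_f_given_not_c * (1 - p_c)
posterior = p_f_given_c * p_c / p_f
print(round(posterior, 3))  # 0.161
```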

Code

```python
import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[0.0], [1.0]])
w = torch.tensor([[1.0], [-1.0]], requires_grad=True)

# f(w) = 0.5 * ||Xw - y||^2, as in worked example 1.
loss = 0.5 * torch.sum((X @ w - y) ** 2)
loss.backward()

print("loss:", loss.item())
print("autograd gradient:", w.grad)

# Analytic gradient X^T (Xw - y); detach w to keep this check out of the graph.
manual_grad = X.T @ (X @ w.detach() - y)
print("manual gradient:", manual_grad)
```

Common pitfalls

  • Confusing a row vector, column vector, and rank-1 tensor. PyTorch often accepts rank-1 tensors, but derivations require explicit orientation.
  • Calling backward() repeatedly without clearing accumulated gradients. Optimizers usually need zero_grad() each iteration.
  • Backpropagating through non-scalar outputs without supplying a gradient argument.
  • Detaching a tensor or converting it to NumPy before all needed gradients have been computed.
  • Treating probability estimates as calibrated merely because they came from a softmax.
  • Forgetting that gradient descent moves opposite the gradient, while the gradient points toward steepest local increase.
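The accumulation pitfall in the list above takes only a few lines to reproduce: `backward()` adds into `.grad`, so gradients from successive calls sum unless cleared.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()   # 2.0

(2 * w).backward()
second = w.grad.item()  # 4.0 -- accumulated, not replaced

w.grad.zero_()          # what optimizer.zero_grad() does for each parameter
print(first, second, w.grad.item())
```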

Connections