
Least Squares

Least squares solves inconsistent systems by changing the goal. If $A\mathbf{x}=\mathbf{b}$ has no exact solution, choose $\hat{\mathbf{x}}$ so that $A\hat{\mathbf{x}}$ is as close as possible to $\mathbf{b}$. Geometrically, this means projecting $\mathbf{b}$ onto the column space of $A$.

This is one of the most important applied ideas in linear algebra. Data rarely fit a model perfectly. Least squares gives a principled way to choose the model parameters whose predictions have the smallest Euclidean error. It is the algebraic foundation of linear regression, curve fitting, signal approximation, and many inverse problems.

Definitions

For an $m\times n$ matrix $A$ and a vector $\mathbf{b}\in\mathbb{R}^m$, a least-squares solution is a vector $\hat{\mathbf{x}}$ minimizing

$$\|A\mathbf{x}-\mathbf{b}\|.$$

The residual is

$$\mathbf{r}=\mathbf{b}-A\hat{\mathbf{x}}.$$

The normal equations are

$$A^TA\hat{\mathbf{x}}=A^T\mathbf{b}.$$

If the columns of $A$ are linearly independent, then $A^TA$ is invertible and the least-squares solution is unique:

$$\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}.$$

The fitted vector is $\hat{\mathbf{b}}=A\hat{\mathbf{x}}$, which lies in the column space of $A$. The residual is the part of $\mathbf{b}$ that the model cannot explain.

Key results

Projection theorem: if $W$ is a finite-dimensional subspace of an inner product space and $\mathbf{b}$ is any vector, then there is a unique vector $\mathbf{p}\in W$ closest to $\mathbf{b}$. The error $\mathbf{b}-\mathbf{p}$ is orthogonal to $W$.

Least-squares theorem: $\hat{\mathbf{x}}$ minimizes $\|A\mathbf{x}-\mathbf{b}\|$ if and only if the residual is orthogonal to every column of $A$. In matrix form:

$$A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}.$$

Rearranging gives the normal equations:

$$A^TA\hat{\mathbf{x}}=A^T\mathbf{b}.$$

If the columns of $A$ are independent, $A^TA$ is invertible. To see why, suppose $A^TA\mathbf{x}=\mathbf{0}$. Then

$$\mathbf{x}^TA^TA\mathbf{x}=\|A\mathbf{x}\|^2=0,$$

so $A\mathbf{x}=\mathbf{0}$. Independent columns then force $\mathbf{x}=\mathbf{0}$, so $A^TA$ has a trivial null space and is invertible.

Least squares can also be solved by QR factorization. If $A=QR$ with independent columns, then the normal equations reduce to

$$R\hat{\mathbf{x}}=Q^T\mathbf{b},$$

which avoids squaring the condition number as the normal equations do.
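As a minimal sketch of the QR route, using the same 3×2 line-fitting system solved by hand in the worked examples below:

```python
import numpy as np

# Same 3x2 line-fitting matrix as the worked examples in this section.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Reduced QR factorization: A = QR with Q^T Q = I, R upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b; no A^T A is ever formed.
x_hat = np.linalg.solve(R, Q.T @ b)
print(x_hat)  # intercept and slope; should match (2/3, 1/2)
```

The key point is that $A^TA$ is never formed, which is what avoids squaring the condition number.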

The phrase "least squares" comes from minimizing the square of the residual norm:

$$\|A\mathbf{x}-\mathbf{b}\|^2.$$

Minimizing the norm or its square gives the same $\hat{\mathbf{x}}$, but the squared expression is easier to differentiate. Expanding gives

$$(A\mathbf{x}-\mathbf{b})^T(A\mathbf{x}-\mathbf{b}) = \mathbf{x}^TA^TA\mathbf{x}-2\mathbf{x}^TA^T\mathbf{b}+\mathbf{b}^T\mathbf{b}.$$

Taking the gradient with respect to $\mathbf{x}$ gives

$$2A^TA\mathbf{x}-2A^T\mathbf{b},$$

and setting this equal to zero gives the normal equations. This calculus derivation and the projection derivation reach the same condition: the error must be perpendicular to the model space.
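The closed-form gradient can be sanity-checked numerically. This sketch compares it against a central finite difference at an arbitrary test point (the point and step size are illustrative choices):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

def f(x):
    # Squared residual norm ||Ax - b||^2.
    return np.sum((A @ x - b) ** 2)

x = np.array([0.3, -0.7])  # arbitrary test point
grad_formula = 2 * A.T @ A @ x - 2 * A.T @ b

# Central-difference approximation of the gradient.
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(2)])
print(grad_formula, grad_fd)  # should agree closely
```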

In data fitting, the columns of $A$ are often called features or basis functions evaluated at the data points. For a line $y=c+mx$, the first column is all ones because the intercept contributes $c$ to every predicted value, and the second column stores the input values because the slope contributes $mx_i$. For a quadratic model $y=a+bx+cx^2$, the design matrix has columns $1$, $x$, and $x^2$.
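Building these design matrices is mechanical. A small sketch with hypothetical input values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data inputs

# Line y = c + m x: columns are 1 and x.
A_line = np.column_stack([np.ones_like(x), x])

# Quadratic y = a + b x + c x^2: columns are 1, x, x^2.
A_quad = np.column_stack([np.ones_like(x), x, x**2])

print(A_line.shape, A_quad.shape)  # (4, 2) and (4, 3)
```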

Residual analysis is part of the method. A small residual norm means the fitted values are close to the observed values in aggregate, but it does not prove the model is appropriate. If residuals show a pattern, such as all positive values at small $x$ and all negative values at large $x$, then the model may be structurally wrong even if the least-squares calculation was correct.

Rank deficiency changes the solution story. If the columns of $A$ are dependent, there may be infinitely many parameter vectors that give the same fitted vector. The fitted vector in the column space is still unique, but the coefficient vector may not be. The SVD and pseudoinverse are the standard tools for choosing the minimum-norm least-squares solution in that case.
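A sketch of the rank-deficient case, using a hypothetical matrix whose second column is exactly twice the first:

```python
import numpy as np

# Second column is twice the first, so rank(A) = 1.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

# The pseudoinverse (computed via SVD) selects the minimum-norm
# least-squares solution among the infinitely many solutions.
x_min = np.linalg.pinv(A) @ b

print(x_min)        # minimum-norm coefficient vector
print(A @ x_min)    # the fitted vector, which is unique
```

Any coefficient vector of the form `x_min + t * [2, -1]` gives the same fitted vector; the pseudoinverse picks the one with the smallest Euclidean norm.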

Visual

| Method | Equation solved | Strength | Caution |
| --- | --- | --- | --- |
| Normal equations | $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ | simple derivation | can amplify conditioning issues |
| QR | $R\hat{\mathbf{x}}=Q^T\mathbf{b}$ | stable and efficient | requires factorization |
| SVD | uses singular values | best for rank-deficient problems | more expensive |

Worked example 1: Solve a least-squares system

Problem: find the least-squares solution of

$$\begin{bmatrix} 1&1\\ 1&2\\ 1&3 \end{bmatrix} \begin{bmatrix} c\\ m \end{bmatrix} \approx \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix}.$$

This fits a line $y=c+mx$ to the points $(1,1)$, $(2,2)$, and $(3,2)$.

Step 1: compute $A^TA$.

$$A^TA = \begin{bmatrix} 1&1&1\\ 1&2&3 \end{bmatrix} \begin{bmatrix} 1&1\\ 1&2\\ 1&3 \end{bmatrix} = \begin{bmatrix} 3&6\\ 6&14 \end{bmatrix}.$$

Step 2: compute $A^T\mathbf{b}$.

$$A^T\mathbf{b} = \begin{bmatrix} 1&1&1\\ 1&2&3 \end{bmatrix} \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} = \begin{bmatrix} 5\\ 11 \end{bmatrix}.$$

Step 3: solve

$$\begin{bmatrix} 3&6\\ 6&14 \end{bmatrix} \begin{bmatrix} c\\ m \end{bmatrix} = \begin{bmatrix} 5\\ 11 \end{bmatrix}.$$

The equations are

$$3c+6m=5, \qquad 6c+14m=11.$$

Double the first equation:

$$6c+12m=10.$$

Subtract from the second:

$$2m=1 \quad\Longrightarrow\quad m=\tfrac12.$$

Then

$$3c+3=5 \quad\Longrightarrow\quad c=\tfrac23.$$

Checked answer: the least-squares line is

$$y=\tfrac23+\tfrac12 x.$$

Worked example 2: Check residual orthogonality

Use the fitted line from the previous example.

Step 1: compute the fitted vector.

$$A\hat{\mathbf{x}} = \begin{bmatrix} 1&1\\ 1&2\\ 1&3 \end{bmatrix} \begin{bmatrix} 2/3\\ 1/2 \end{bmatrix} = \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix}.$$

Step 2: compute the residual.

$$\mathbf{r} = \mathbf{b}-A\hat{\mathbf{x}} = \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} - \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix} = \begin{bmatrix} -1/6\\ 1/3\\ -1/6 \end{bmatrix}.$$

Step 3: dot the residual with each column of $A$. The first column is $\begin{bmatrix}1&1&1\end{bmatrix}^T$:

$$-\tfrac16+\tfrac13-\tfrac16=0.$$

The second column is $\begin{bmatrix}1&2&3\end{bmatrix}^T$:

$$-\tfrac16+2\left(\tfrac13\right)+3\left(-\tfrac16\right) = -\tfrac16+\tfrac46-\tfrac36=0.$$

Checked answer: the residual is orthogonal to the column space, confirming the least-squares condition.

Code

```python
import numpy as np

x_data = np.array([1, 2, 3], dtype=float)
y_data = np.array([1, 2, 2], dtype=float)

# Design matrix for the line y = c + m x: a column of ones and a column of x.
A = np.column_stack([np.ones_like(x_data), x_data])
coef, residuals, rank, s = np.linalg.lstsq(A, y_data, rcond=None)

print(coef)                       # intercept, slope
print(A @ coef)                   # fitted values
print(y_data - A @ coef)          # residual
print(A.T @ (y_data - A @ coef))  # normal-equation check: should be ~0
```

The last line checks the normal-equation condition $A^T\mathbf{r}=\mathbf{0}$. In practical regression, one should also inspect residual patterns, not only minimize their squared length.

Common pitfalls

  • Trying to solve an inconsistent system exactly instead of minimizing the residual.
  • Forgetting that the residual is $\mathbf{b}-A\hat{\mathbf{x}}$, not $A\hat{\mathbf{x}}-\mathbf{b}$; the norm is the same, but orthogonality formulas should be used consistently.
  • Assuming $A^TA$ is always invertible. Invertibility requires linearly independent columns.
  • Forming normal equations blindly for ill-conditioned data.
  • Confusing the fitted vector $A\hat{\mathbf{x}}$ with the parameter vector $\hat{\mathbf{x}}$.
  • Reading least squares as minimizing absolute errors. It minimizes the sum of squared errors, equivalently the Euclidean norm of the residual.

A dependable least-squares solution should pass three checks. First, the dimensions must match: $A\hat{\mathbf{x}}$ must live in the same space as $\mathbf{b}$. Second, the residual should be orthogonal to the columns of $A$, so $A^T(\mathbf{b}-A\hat{\mathbf{x}})$ should be zero in exact arithmetic. Third, the fitted vector should lie in the column space of $A$ because it is a linear combination of the columns.

Normal equations are attractive because they produce a square system, but they can be numerically fragile. The matrix $A^TA$ can be much more ill-conditioned than $A$. This is why serious numerical least squares often uses QR or SVD. The normal equations are still valuable for theory and for small exact problems, but they should not be treated as the only method.
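The squaring of the condition number is easy to observe numerically. This sketch uses a hypothetical matrix with nearly collinear columns:

```python
import numpy as np

# Nearly collinear columns make A ill-conditioned.
eps = 1e-6
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps],
              [1.0, 1.0 + 2 * eps]])

kA = np.linalg.cond(A)          # condition number of A
kAtA = np.linalg.cond(A.T @ A)  # condition number of A^T A

print(kA, kAtA)  # cond(A^T A) is roughly cond(A)^2
```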

Interpretation matters as much as computation. A least-squares line gives the best line within the chosen model class, not proof that the relationship is linear. Residual plots, units, outliers, and domain knowledge should be checked. Linear algebra can identify the best parameters for a model; it cannot decide by itself whether the model is scientifically appropriate.

If a design matrix has dependent columns, some parameters are not separately identifiable. For instance, if one feature column is twice another, many coefficient pairs can produce the same fitted values. In that case the prediction may be unique while the coefficients are not. The pseudoinverse resolves this by selecting the minimum-norm solution, but the modeling issue remains.

Least-squares problems can be weighted. If some measurements are more reliable than others, one may minimize

$$\|W(A\mathbf{x}-\mathbf{b})\|^2,$$

where $W$ is a diagonal matrix of weights. Larger weights penalize errors more strongly. This changes the geometry: the best approximation is no longer measured by the ordinary Euclidean length of the residual, but by a weighted length chosen to match the data assumptions.
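As a sketch, the weighted problem reduces to an ordinary one by scaling the rows of the system. The weights here are hypothetical, chosen to trust the third measurement most:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Hypothetical weights: the third measurement is trusted most.
W = np.diag([1.0, 1.0, 10.0])

# Minimizing ||W(Ax - b)||^2 is ordinary least squares on (WA, Wb).
x_w, *_ = np.linalg.lstsq(W @ A, W @ b, rcond=None)

print(x_w)
print(A @ x_w - b)  # residual at the heavily weighted point shrinks
```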

The normal-equation matrix $A^TA$ is symmetric and positive semidefinite. It is positive definite exactly when the columns of $A$ are independent. This connects least squares to quadratic forms: minimizing $\|A\mathbf{x}-\mathbf{b}\|^2$ means minimizing a quadratic function of the parameters. The minimum is unique precisely when that quadratic curves upward in every parameter direction.

In regression language, adding a column to $A$ expands the model space. The residual norm cannot increase when the model space gets larger, because the old fitted vector is still available. But a smaller training residual does not automatically mean a better model; extra columns can fit noise. That modeling issue sits beyond the linear algebra calculation but depends on understanding column spaces.
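The monotonicity of the residual under nested models can be demonstrated directly. A sketch with synthetic noisy line data (the data-generating line and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1 + 2 * x + rng.normal(scale=0.1, size=x.size)  # noisy line data

def residual_norm(A, b):
    # Least-squares fit, then the norm of the resulting residual.
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.linalg.norm(A @ coef - b)

A1 = np.column_stack([np.ones_like(x), x])        # line model
A2 = np.column_stack([np.ones_like(x), x, x**2])  # adds an x^2 column

r1 = residual_norm(A1, y)
r2 = residual_norm(A2, y)
print(r1, r2)  # r2 <= r1: the larger model space cannot do worse in-sample
```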

Connections