Least Squares
Least squares solves inconsistent systems by changing the goal. If $Ax = b$ has no exact solution, choose $\hat{x}$ so that $A\hat{x}$ is as close as possible to $b$. Geometrically, this means projecting $b$ onto the column space of $A$.
This is one of the most important applied ideas in linear algebra. Data rarely fit a model perfectly. Least squares gives a principled way to choose the model parameters whose predictions have the smallest Euclidean error. It is the algebraic foundation of linear regression, curve fitting, signal approximation, and many inverse problems.
Definitions
For an $m \times n$ matrix $A$ and a vector $b$ in $\mathbb{R}^m$, a least-squares solution is a vector $\hat{x}$ minimizing $\|b - Ax\|$.
The residual is $r = b - A\hat{x}$.
The normal equations are $A^T A \hat{x} = A^T b$.
If the columns of $A$ are linearly independent, then $A^T A$ is invertible and the least-squares solution is unique: $\hat{x} = (A^T A)^{-1} A^T b$.
The fitted vector is $\hat{b} = A\hat{x}$, which lies in the column space of $A$. The residual is the part of $b$ that the model cannot explain.
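As a minimal NumPy sketch of these definitions, the snippet below forms $A^T A$ and $A^T b$ explicitly and solves the normal equations with `np.linalg.solve` rather than an explicit inverse. The small matrix and right-hand side are the same line-fitting data used in the worked examples later in this section.

```python
import numpy as np

# Form and solve the normal equations A^T A x = A^T b.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

AtA = A.T @ A                      # 2x2 Gram matrix
Atb = A.T @ b                      # right-hand side of the normal equations
x_hat = np.linalg.solve(AtA, Atb)  # least-squares coefficients

print(x_hat)      # parameter vector (intercept, slope)
print(A @ x_hat)  # fitted vector, which lies in the column space of A
```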
Key results
Projection theorem: if $W$ is a finite-dimensional subspace of an inner product space and $b$ is any vector, then there is a unique vector $\hat{b}$ in $W$ closest to $b$. The error $b - \hat{b}$ is orthogonal to $W$.
Least-squares theorem: $\hat{x}$ minimizes $\|b - Ax\|$ if and only if the residual $b - A\hat{x}$ is orthogonal to every column of $A$. In matrix form: $A^T(b - A\hat{x}) = 0$.
Rearranging gives the normal equations: $A^T A \hat{x} = A^T b$.
If the columns of $A$ are independent, $A^T A$ is invertible. To see why, suppose $A^T A x = 0$. Then $x^T A^T A x = \|Ax\|^2 = 0$,
so $Ax = 0$. Independent columns imply $x = 0$, so $A^T A$ has a trivial null space and is invertible.
Least squares can also be solved by QR. If $A = QR$ with independent columns, then the normal equations reduce to $R\hat{x} = Q^T b$,
which avoids squaring the condition number as the normal equations do.
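The reduction is easy to verify numerically. The sketch below, on the same illustrative data as before, factors $A$ with `np.linalg.qr` and solves the resulting small square system; `np.linalg.solve` is used to keep the example self-contained, though a dedicated triangular solver would exploit the upper-triangular structure of $R$.

```python
import numpy as np

# Least squares via QR: with A = QR (reduced), the problem reduces to R x = Q^T b.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)               # reduced QR: Q is 3x2, R is 2x2 upper triangular
x_hat = np.linalg.solve(R, Q.T @ b)  # small square solve

print(x_hat)                                    # same coefficients as the normal equations
print(np.allclose(A.T @ (b - A @ x_hat), 0.0))  # residual orthogonal to the columns of A
```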
The phrase "least squares" comes from minimizing the square of the residual norm: $\|b - Ax\|^2 = \sum_{i=1}^{m} \big(b_i - (Ax)_i\big)^2$.
Minimizing the norm or its square gives the same $\hat{x}$, but the squared expression is easier to differentiate. Expanding gives $\|b - Ax\|^2 = b^T b - 2x^T A^T b + x^T A^T A x$.
Taking the gradient with respect to $x$ gives $2A^T A x - 2A^T b$,
and setting this equal to zero gives the normal equations. This calculus derivation and the projection derivation reach the same condition: the error must be perpendicular to the model space.
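As a quick sanity check of the calculus derivation, the sketch below compares the gradient formula $2A^T A x - 2A^T b$ against central finite differences at a random point; the data are random and purely illustrative.

```python
import numpy as np

# Finite-difference check of the gradient formula 2 A^T A x - 2 A^T b.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(v):
    """Squared residual norm ||b - A v||^2."""
    return float(np.sum((b - A @ v) ** 2))

grad_formula = 2 * A.T @ A @ x - 2 * A.T @ b

eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))  # True
```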
In data fitting, the columns of $A$ are often called features or basis functions evaluated at the data points. For a line $y = c_0 + c_1 x$, the first column is all ones because the intercept contributes $c_0$ to every predicted value, and the second column stores the input values $x_i$ because the slope contributes $c_1 x_i$. For a quadratic model $y = c_0 + c_1 x + c_2 x^2$, the design matrix has columns $1$, $x_i$, and $x_i^2$.
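As a concrete sketch with made-up inputs, the design matrices for the line and quadratic models can be assembled column by column; `np.vander` produces the polynomial case directly.

```python
import numpy as np

# Building design matrices from raw inputs.
x_data = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # illustrative inputs

# Line y = c0 + c1*x: columns [1, x].
A_line = np.column_stack([np.ones_like(x_data), x_data])

# Quadratic y = c0 + c1*x + c2*x^2: columns [1, x, x^2].
A_quad = np.column_stack([np.ones_like(x_data), x_data, x_data**2])

# np.vander builds the same polynomial design matrix; increasing=True orders columns 1, x, x^2.
assert np.allclose(A_quad, np.vander(x_data, 3, increasing=True))
print(A_line.shape, A_quad.shape)  # (5, 2) (5, 3)
```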
Residual analysis is part of the method. A small residual norm means the fitted values are close to the observed values in aggregate, but it does not prove the model is appropriate. If residuals show a pattern, such as all positive values at small $x$ and all negative values at large $x$, then the model may be structurally wrong even if the least-squares calculation was correct.
Rank deficiency changes the solution story. If the columns of $A$ are dependent, there may be infinitely many parameter vectors that give the same fitted vector. The fitted vector in the column space is still unique, but the coefficient vector may not be. The SVD and pseudoinverse are the standard tools for choosing the minimum-norm least-squares solution in that case.
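The sketch below constructs a deliberately rank-deficient design matrix, with one column exactly twice the other, and shows that `np.linalg.pinv` returns the minimum-norm least-squares solution while other coefficient vectors give the same fitted values. The numbers are illustrative.

```python
import numpy as np

# Rank-deficient design matrix: the second column is exactly twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

print(np.linalg.matrix_rank(A.T @ A))  # 1, so the normal equations are singular

# The pseudoinverse selects the minimum-norm least-squares solution.
x_min_norm = np.linalg.pinv(A) @ b

# Any coefficient vector of the form x_min_norm + t * (-2, 1) gives the same fitted
# vector, because (-2, 1) spans the null space of A.
t = 0.7
x_other = x_min_norm + t * np.array([-2.0, 1.0])
print(np.allclose(A @ x_min_norm, A @ x_other))              # True: predictions agree
print(np.linalg.norm(x_min_norm) < np.linalg.norm(x_other))  # True: pinv solution has smaller norm
```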
Visual
| Method | Equation solved | Strength | Caution |
|---|---|---|---|
| Normal equations | $A^T A \hat{x} = A^T b$ | simple derivation | can amplify conditioning issues |
| QR | $R\hat{x} = Q^T b$ | stable and efficient | requires factorization |
| SVD | $\hat{x} = A^{+} b$ (uses singular values) | best for rank-deficient problems | more expensive |
Worked example 1: Solve a least-squares system
Problem: find the least-squares solution of $Ax = b$ with

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}.$$

This fits a line $y = c_0 + c_1 x$ to the points $(1, 1)$, $(2, 2)$, and $(3, 2)$.
Step 1: compute $A^T A = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$.
Step 2: compute $A^T b = \begin{bmatrix} 5 \\ 11 \end{bmatrix}$.
Step 3: solve $\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 5 \\ 11 \end{bmatrix}$.
The equations are $3c_0 + 6c_1 = 5$ and $6c_0 + 14c_1 = 11$.
Double the first equation: $6c_0 + 12c_1 = 10$.
Subtract from the second: $2c_1 = 1$, so $c_1 = \tfrac{1}{2}$.
Then $3c_0 + 6 \cdot \tfrac{1}{2} = 5$, so $c_0 = \tfrac{2}{3}$.
Checked answer: the least-squares line is $y = \tfrac{2}{3} + \tfrac{1}{2}x$.
Worked example 2: Check residual orthogonality
Use the fitted line from the previous example.
Step 1: compute the fitted vector $A\hat{x} = \begin{bmatrix} \tfrac{2}{3} + \tfrac{1}{2} \\ \tfrac{2}{3} + 1 \\ \tfrac{2}{3} + \tfrac{3}{2} \end{bmatrix} = \begin{bmatrix} 7/6 \\ 5/3 \\ 13/6 \end{bmatrix}$.
Step 2: compute the residual $r = b - A\hat{x} = \begin{bmatrix} 1 - 7/6 \\ 2 - 5/3 \\ 2 - 13/6 \end{bmatrix} = \begin{bmatrix} -1/6 \\ 1/3 \\ -1/6 \end{bmatrix}$.
Step 3: dot the residual with each column of $A$. The first column is $(1, 1, 1)$: $-\tfrac{1}{6} + \tfrac{1}{3} - \tfrac{1}{6} = 0$.
The second column is $(1, 2, 3)$: $1 \cdot \left(-\tfrac{1}{6}\right) + 2 \cdot \tfrac{1}{3} + 3 \cdot \left(-\tfrac{1}{6}\right) = 0$.
Checked answer: the residual is orthogonal to the column space, confirming the least-squares condition.
Code
```python
import numpy as np

x_data = np.array([1, 2, 3], dtype=float)
y_data = np.array([1, 2, 2], dtype=float)

# Design matrix for the line y = c0 + c1*x: a column of ones and a column of x values.
A = np.column_stack([np.ones_like(x_data), x_data])

# Least-squares fit; lstsq also handles rank-deficient A.
coef, residuals, rank, s = np.linalg.lstsq(A, y_data, rcond=None)

print(coef)                       # intercept, slope
print(A @ coef)                   # fitted values
print(y_data - A @ coef)          # residual vector
print(A.T @ (y_data - A @ coef))  # should be numerically zero
```
The last line checks the normal-equation condition $A^T(b - A\hat{x}) = 0$. In practical regression, one should also inspect residual patterns, not only minimize their squared length.
Common pitfalls
- Trying to solve an inconsistent system exactly instead of minimizing the residual.
- Forgetting that the residual is $b - A\hat{x}$, not $A\hat{x} - b$; the norm is the same, but orthogonality formulas should be used consistently.
- Assuming $A^T A$ is always invertible. It requires independent columns.
- Forming normal equations blindly for ill-conditioned data.
- Confusing the fitted vector $A\hat{x}$ with the parameter vector $\hat{x}$.
- Reading least squares as minimizing absolute errors. It minimizes the sum of squared errors, equivalently the Euclidean norm.
A dependable least-squares solution should pass three checks. First, the dimensions must match: $A\hat{x}$ must live in the same space as $b$. Second, the residual should be orthogonal to the columns of $A$, so $A^T(b - A\hat{x})$ should be zero in exact arithmetic. Third, the fitted vector $A\hat{x}$ should lie in the column space of $A$ because it is a linear combination of the columns.
Normal equations are attractive because they produce a square system, but they can be numerically fragile. The matrix $A^T A$ can be much more ill-conditioned than $A$. This is why serious numerical least squares often uses QR or SVD. The normal equations are still valuable for theory and for small exact problems, but they should not be treated as the only method.
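A small experiment makes the fragility concrete. In the 2-norm, the condition number of $A^T A$ is the square of the condition number of $A$; the sketch below checks this with `np.linalg.cond` on a deliberately ill-conditioned polynomial design matrix (illustrative data).

```python
import numpy as np

# Monomial columns up to degree 6 give a deliberately ill-conditioned design matrix.
x_data = np.linspace(0.0, 1.0, 50)
A = np.vander(x_data, 7, increasing=True)

cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

print(f"cond(A)     = {cond_A:.3e}")
print(f"cond(A^T A) = {cond_AtA:.3e}")
print(f"cond(A)^2   = {cond_A**2:.3e}")  # matches cond(A^T A) up to rounding
```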
Interpretation matters as much as computation. A least-squares line gives the best line within the chosen model class, not proof that the relationship is linear. Residual plots, units, outliers, and domain knowledge should be checked. Linear algebra can identify the best parameters for a model; it cannot decide by itself whether the model is scientifically appropriate.
If a design matrix has dependent columns, some parameters are not separately identifiable. For instance, if one feature column is twice another, many coefficient pairs can produce the same fitted values. In that case the prediction may be unique while the coefficients are not. The pseudoinverse resolves this by selecting the minimum-norm solution, but the modeling issue remains.
Least-squares problems can be weighted. If some measurements are more reliable than others, one may minimize $(b - Ax)^T W (b - Ax)$,
where $W$ is a diagonal matrix of weights. Larger weights penalize errors more strongly. This changes the geometry: the best approximation is no longer measured by the ordinary Euclidean length of the residual, but by a weighted length chosen to match the data assumptions.
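One way to compute a weighted fit, sketched below with illustrative weights, is to scale each row of $A$ and $b$ by the square root of its weight and then solve an ordinary least-squares problem; this agrees with solving the weighted normal equations $A^T W A \hat{x} = A^T W b$ directly.

```python
import numpy as np

# Weighted least squares by rescaling rows.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])
w = np.array([1.0, 4.0, 1.0])  # the middle measurement is trusted more

# Minimizing (b - Ax)^T W (b - Ax) is an ordinary least-squares problem
# after scaling each row of A and b by sqrt(w_i).
sqrt_w = np.sqrt(w)
x_weighted, *_ = np.linalg.lstsq(A * sqrt_w[:, None], b * sqrt_w, rcond=None)

# Equivalent check: solve the weighted normal equations A^T W A x = A^T W b.
W = np.diag(w)
x_check = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

print(np.allclose(x_weighted, x_check))  # True
```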
The normal-equation matrix $A^T A$ is symmetric and positive semidefinite. It is positive definite exactly when the columns of $A$ are independent. This connects least squares to quadratic forms: minimizing $\|b - Ax\|^2$ means minimizing a quadratic function of the parameters. The minimum is unique precisely when that quadratic curves upward in every parameter direction.
In regression language, adding a column to $A$ expands the model space. The residual norm cannot increase when the model space gets larger, because the old fitted vector is still available. But a smaller training residual does not automatically mean a better model; extra columns can fit noise. That modeling issue sits beyond the linear algebra calculation but depends on understanding column spaces.