Covariance, Correlation, and Conditional Expectation
Variance measures how one random variable spreads around its mean. Covariance measures how two variables move together. Conditional expectation measures the best average prediction after observing information. These ideas are conceptually different, but they meet in formulas such as the law of total variance and in examples where one random variable is used to predict another.
Figure: Correlation examples show how association strength and visual pattern are related but not identical. Image: Wikimedia Commons, DenisBoigelot and Imagecreator, public domain.
MIT 18.440 develops covariance and correlation before moving to conditional expectation. The lectures emphasize two warnings: independence implies zero covariance, but zero covariance does not imply independence; and conditional expectation can behave counterintuitively when expectations are infinite or when the conditioning information is misunderstood.
Definitions
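For random variables $X$ and $Y$ with finite second moments, the covariance is
$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].$$
Equivalently,
$$\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y].$$
The correlation coefficient is
$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}},$$
provided both variances are positive and finite.
For jointly discrete random variables, the conditional mass function of $X$ given $Y$ is
$$p_{X \mid Y}(x \mid y) = P(X = x \mid Y = y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}$$
when $p_Y(y) > 0$. The conditional expectation is
$$E[X \mid Y = y] = \sum_x x\, p_{X \mid Y}(x \mid y).$$
The expression $E[X \mid Y]$ is itself a random variable: it is the function of $Y$ that takes value $E[X \mid Y = y]$ when $Y = y$.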
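The definitions above can be checked on a small table. The following is a minimal sketch, assuming a hypothetical joint mass function for a pair $(X, Y)$ that was chosen only for illustration; the names `joint`, `expect`, and `cond_mean_x` are not from the lectures. It computes the covariance both ways and the conditional mean of $X$ for each value of $Y$, using exact fractions.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y), chosen only for illustration.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 8),
    (2, 1): Fraction(3, 8),
}

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Covariance from the centered definition and from the shortcut formula.
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
cov_shortcut = expect(lambda x, y: x * y) - mean_x * mean_y

def cond_mean_x(y0):
    """E[X | Y = y0], defined only when P(Y = y0) > 0."""
    p_y = sum(p for (x, y), p in joint.items() if y == y0)
    if p_y == 0:
        raise ValueError("P(Y = y0) must be positive")
    return sum(x * p for (x, y), p in joint.items() if y == y0) / p_y

print(cov_xy, cov_shortcut)
print(cond_mean_x(0), cond_mean_x(1))
```

The two conditional means differ, which is the sense in which the conditional expectation is a function of the conditioning value rather than a single number.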
Key results
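Covariance is bilinear:
$$\operatorname{Cov}(aX + bY, Z) = a\,\operatorname{Cov}(X, Z) + b\,\operatorname{Cov}(Y, Z),$$
and similarly in the second argument. Variance of a sum satisfies
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,\operatorname{Cov}(X, Y).$$
If $X$ and $Y$ are independent, then
$$E[XY] = E[X]\,E[Y],$$
so $\operatorname{Cov}(X, Y) = 0$. The converse is false.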
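A standard counterexample for the false converse, sketched here under the assumption that one variable is uniform on three points and the other is its square (this particular pair is an illustration, not taken from the lectures): the covariance is zero, yet the second variable is completely determined by the first.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X**2 is a deterministic function of X.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)
e_y = sum(p * x**2 for x in support)
e_xy = sum(p * x * x**2 for x in support)  # E[XY] = E[X^3]
cov = e_xy - e_x * e_y

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p        # the event {X = 0, Y = 0} is the same as {X = 0}
p_product = p * p  # P(Y = 0) = P(X = 0) = 1/3

print(cov, p_joint, p_product)
```

The covariance vanishes because the dependence is symmetric around zero, so there is no linear trend for covariance to detect.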
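The law of total expectation is
$$E[X] = E\big[E[X \mid Y]\big].$$
It says that averaging the conditional averages over the distribution of $Y$ returns the original average.
The law of total variance is
$$\operatorname{Var}(X) = \operatorname{Var}\big(E[X \mid Y]\big) + E\big[\operatorname{Var}(X \mid Y)\big].$$
Interpretation: uncertainty in $X$ can be decomposed into uncertainty explained by $Y$ and average uncertainty remaining after $Y$ is known.
Conditional expectation is also the best mean-square predictor among functions of the observed variable:
$$E\big[(X - g(Y))^2\big]$$
is minimized by $g(Y) = E[X \mid Y]$.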
Covariance is sensitive to units. If height is measured in centimeters rather than meters, covariance with another variable is multiplied by $100$. Correlation removes this dependence by dividing by standard deviations, so it always lies between $-1$ and $1$. Values near $1$ or $-1$ indicate strong linear association, while values near $0$ indicate little linear association, not necessarily no relationship at all.
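The unit sensitivity can be seen numerically. This is a small sketch with made-up height and weight values (the data and the helper names `cov` and `corr` are assumptions for illustration), treating the four pairs as an equally weighted distribution.

```python
import math

def cov(a, b):
    """Covariance of paired values treated as equally likely outcomes."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def corr(a, b):
    """Correlation: covariance divided by the two standard deviations."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

# Hypothetical data, for illustration only.
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50.0, 58.0, 66.0, 80.0]
heights_cm = [100 * h for h in heights_m]

print(cov(heights_m, weights_kg), cov(heights_cm, weights_kg))
print(corr(heights_m, weights_kg), corr(heights_cm, weights_kg))
```

Changing meters to centimeters rescales the covariance by the conversion factor, while the correlation is unchanged up to floating-point rounding.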
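The identity
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,\operatorname{Cov}(X, Y)$$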
explains why dependence matters for sums. Positive covariance increases the variance of a sum; negative covariance decreases it. In portfolio language, negatively correlated assets can reduce total risk. In probability computations, covariance terms are the price one pays when indicator variables are not independent.
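To make the role of the covariance term concrete, here is a sketch using two indicator variables defined on a single roll of a fair die. The events are hypothetical choices for illustration: one overlapping pair (positive covariance) and one disjoint pair (negative covariance).

```python
from fractions import Fraction

FACES = range(1, 7)  # one roll of a fair die

def expect(f):
    """E[f(W)] for a uniformly distributed face W."""
    return sum(Fraction(f(w)) for w in FACES) / 6

def sum_variance_terms(event_a, event_b):
    """Return Var(I_A), Var(I_B), Cov(I_A, I_B), and Var(I_A + I_B)."""
    def ind_a(w):
        return int(w in event_a)

    def ind_b(w):
        return int(w in event_b)

    mean_a, mean_b = expect(ind_a), expect(ind_b)
    var_a = expect(lambda w: (ind_a(w) - mean_a) ** 2)
    var_b = expect(lambda w: (ind_b(w) - mean_b) ** 2)
    cov = expect(lambda w: (ind_a(w) - mean_a) * (ind_b(w) - mean_b))
    var_sum = expect(lambda w: (ind_a(w) + ind_b(w) - mean_a - mean_b) ** 2)
    return var_a, var_b, cov, var_sum

overlapping = sum_variance_terms({1, 2, 3}, {1, 2, 3, 4})
disjoint = sum_variance_terms({1, 2, 3}, {4, 5, 6})
for var_a, var_b, cov, var_sum in (overlapping, disjoint):
    # Sum of variances, the covariance correction, and the variance of the sum.
    print(var_a + var_b, 2 * cov, var_sum)
```

In the disjoint case the two indicators always add to one, so the negative covariance cancels the individual variances entirely.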
Conditional expectation should be read as an averaging operation under a revised probability law. If $Y = y$ is observed, the distribution of $X$ changes to the conditional distribution, and $E[X \mid Y = y]$ is the mean under that new distribution. As $y$ varies, these means form a function. Evaluating that function at the random value $Y$ gives the random variable $E[X \mid Y]$.
The total expectation formula says that there are two equivalent ways to average: average directly, or first average within each conditional slice and then average the slice means. This is the probability version of computing a class average by first computing section averages and then weighting by section sizes. The law of total variance adds that total spread splits into spread of the slice means plus average spread within slices.
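Both slice-based identities can be verified exactly on two fair dice. In this sketch the quantity being averaged is the first die and the slices are the possible values of the sum of the two dice; that choice, and the helper names, are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Two fair dice: average the first die x within slices of the sum x + y.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

def mean(values):
    """Exact average of a list of equally likely values."""
    return sum(values, Fraction(0)) / len(values)

def variance(values):
    """Exact variance of a list of equally likely values."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

slices = defaultdict(list)
for x, y in outcomes:
    slices[x + y].append(x)

all_x = [x for x, _ in outcomes]
total_mean = mean(all_x)
total_var = variance(all_x)

# Weight each slice by its probability: (slice size) / 36.
weights = {z: Fraction(len(xs), len(outcomes)) for z, xs in slices.items()}
mean_of_slice_means = sum(weights[z] * mean(xs) for z, xs in slices.items())
var_of_slice_means = sum(
    weights[z] * (mean(xs) - total_mean) ** 2 for z, xs in slices.items()
)
mean_of_slice_vars = sum(weights[z] * variance(xs) for z, xs in slices.items())

print(total_mean, mean_of_slice_means)
print(total_var, var_of_slice_means, mean_of_slice_vars)
```

In this symmetric example the spread of the slice means and the average spread within slices each account for exactly half of the total variance.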
The best-predictor property is one reason conditional expectation is central in stochastic processes and statistics. If squared error is the loss function and the available information is $Y$, then no other function of $Y$ beats $E[X \mid Y]$ on average. This interpretation leads naturally to martingales, where successive conditional expectations are revised best guesses as information arrives.
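The best-predictor claim can be spot-checked by comparing mean squared errors. The sketch below predicts the first of two fair dice from their sum; by symmetry the conditional mean of the first die is half the sum. The competing predictors are arbitrary choices for illustration, so this is a sanity check against a few alternatives, not a proof over all functions.

```python
from fractions import Fraction

# Two fair dice: predict the first die x from the observed sum z = x + y.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

def mse(predict):
    """Exact mean squared error of predict(z) as a guess for x."""
    return sum((x - predict(x + y)) ** 2 for x, y in outcomes) / Fraction(36)

# The first entry is the conditional mean; the rest are hypothetical rivals.
candidates = {
    "conditional mean z/2": lambda z: Fraction(z, 2),
    "constant 7/2": lambda z: Fraction(7, 2),
    "z/3": lambda z: Fraction(z, 3),
    "z/2 + 1/4": lambda z: Fraction(z, 2) + Fraction(1, 4),
}

errors = {name: mse(g) for name, g in candidates.items()}
for name, err in errors.items():
    print(name, err)
```

The constant predictor ignores the sum and pays the full variance of one die, while the conditional mean pays only the average variance left within each value of the sum.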
Visual
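| Quantity | Formula | Role |
|---|---|---|
| Covariance | $\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$ | signed joint variation |
| Correlation | $\rho(X, Y) = \operatorname{Cov}(X, Y) / (\sigma_X \sigma_Y)$, covariance divided by standard deviations | unitless linear association |
| Conditional mean | $E[X \mid Y = y]$ | average of $X$ after observing $Y = y$ |
| Total expectation | $E[X] = E[E[X \mid Y]]$ | average the conditional averages |
| Total variance | $\operatorname{Var}(X) = \operatorname{Var}(E[X \mid Y]) + E[\operatorname{Var}(X \mid Y)]$ | explained plus residual spread |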
The diagram should be read as a two-stage experiment. First $Y$ is observed, which selects a conditional distribution for $X$. That conditional distribution has its own mean and variance. Across repeated experiments, the conditional mean changes because $Y$ changes; that changing mean accounts for the explained part of the variance. The conditional variance accounts for the randomness left after $Y$ is known. This is the same decomposition used informally when analysts say that some predictors explain part, but not all, of the variation in an outcome.
Covariance and conditional expectation answer related but distinct questions. Covariance asks for one number summarizing linear co-movement. Conditional expectation asks for an entire function describing the average of $X$ at each observed value of $Y$. When the conditional expectation is linear in $Y$, the two viewpoints are closely aligned. When it is nonlinear, covariance may miss important structure.
This perspective is also the bridge to regression in statistics: a regression curve estimates a conditional mean, while residual variance measures what remains unexplained after conditioning on the predictors.
Worked example 1: covariance with a noisy sum
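Problem: Let $X$ and $Y$ be independent random variables with $E[X] = E[Y] = 0$, variances $\sigma_X^2 > 0$ and $\sigma_Y^2$, and define $Z = X + Y$. Compute $\operatorname{Cov}(X, Z)$ and $\rho(X, Z)$.
Method:
- Use bilinearity: $\operatorname{Cov}(X, Z) = \operatorname{Cov}(X, X + Y) = \operatorname{Cov}(X, X) + \operatorname{Cov}(X, Y)$.
- Since $\operatorname{Cov}(X, X) = \operatorname{Var}(X) = \sigma_X^2$ and independence gives $\operatorname{Cov}(X, Y) = 0$, we get $\operatorname{Cov}(X, Z) = \sigma_X^2$.
- The variance of $Z$ is $\operatorname{Var}(Z) = \operatorname{Var}(X) + \operatorname{Var}(Y) = \sigma_X^2 + \sigma_Y^2$, because the covariance term vanishes.
- Therefore $\rho(X, Z) = \dfrac{\sigma_X^2}{\sigma_X \sqrt{\sigma_X^2 + \sigma_Y^2}} = \dfrac{\sigma_X}{\sqrt{\sigma_X^2 + \sigma_Y^2}}$.
Checked answer: if $\sigma_Y^2$ is large, $Z$ contains much noise unrelated to $X$, so the correlation is small. If $\sigma_Y^2 = 0$, then $Z = X$ with probability one and the correlation is $1$.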
Worked example 2: conditional expectation from a dice sum
Problem: Roll two fair dice. Let $X$ be the first die, $Y$ the second die, and $Z = X + Y$. Find $E[X \mid Z = 5]$.
Method:
- The event $\{Z = 5\}$ consists of outcomes $(1, 4), (2, 3), (3, 2), (4, 1)$.
- Under the fair dice model, these four outcomes are equally likely after conditioning on $Z = 5$.
- Therefore the conditional distribution of $X$ given $Z = 5$ is uniform on $\{1, 2, 3, 4\}$.
- Compute the conditional expectation: $E[X \mid Z = 5] = \dfrac{1 + 2 + 3 + 4}{4} = 2.5$.
Checked answer: by symmetry, if the sum is $5$, each die has the same conditional expectation, and the two conditional expectations must add to $5$. Thus each is $5/2 = 2.5$.
Code
from collections import defaultdict

# Conditional expectation for dice: group first-die values by the sum x + y.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
groups = defaultdict(list)
for x, y in outcomes:
    groups[x + y].append(x)

# The conditional mean of the first die is the average within each group.
for z in [5, 7, 10]:
    print(z, sum(groups[z]) / len(groups[z]))

# Covariance and correlation for X and Z = X + Y.
sigma_x = 2.0
sigma_y = 3.0
cov_xz = sigma_x ** 2
var_z = sigma_x ** 2 + sigma_y ** 2
corr_xz = cov_xz / (sigma_x * (var_z ** 0.5))
print("Cov(X,Z):", cov_xz)
print("Corr(X,Z):", corr_xz)
Common pitfalls
- Concluding independence from zero covariance. Zero covariance only rules out linear association in a second-moment sense.
- Forgetting that correlation is undefined if a variance is zero.
- Treating $E[X \mid Y]$ as a number rather than a random variable depending on $Y$.
- Confusing $E[X \mid Y]$ with $E[Y \mid X]$. Conditioning direction matters.
- Applying conditional-expectation paradoxes without checking whether the relevant expectations are finite.