Discrete Random Variables, Expectation, and Variance

A random variable turns outcomes into numbers. This change of viewpoint is one of the main transitions in probability: instead of asking only which event occurred, we ask for numerical summaries such as the number of heads, the number of fixed points in a shuffled hat problem, or the total payoff from a gamble. Once outcomes have numerical values, expectation and variance become the central quantities.

A Galton box diagram shows balls falling through pegs into bins.

Figure: A Galton box turns repeated random left-right choices into an approximate bell-shaped distribution. Image: Wikimedia Commons, Marcin Floryan, CC BY-SA 3.0.

The MIT lectures introduce random variables as functions on the sample space, then define probability mass functions, cumulative distribution functions, expectation, variance, and the decomposition trick of writing complicated variables as sums of simple indicator variables. Linearity of expectation is the most important early result: it does not require independence and is often the simplest way to compute an expected count.

Definitions

A random variable $X$ is a function from the sample space $S$ to the real numbers. It assigns a numerical value $X(\omega)$ to each outcome $\omega$.

A random variable is discrete if it takes values in a finite or countable set with probability one. Its probability mass function is

$$p_X(x)=P(X=x).$$

The cumulative distribution function is

$$F_X(a)=P(X\le a)=\sum_{x\le a}p_X(x).$$

If the relevant sums converge absolutely, the expectation is

$$E[X]=\sum_x x\,p_X(x).$$

If $g$ is a function, then

$$E[g(X)]=\sum_x g(x)\,p_X(x),$$

which is often called the law of the unconscious statistician.
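The definitions so far can be checked on a small case. The sketch below is only an illustration: the PMF (number of heads in two fair coin tosses) and the helper names `cdf` and `expectation` are assumptions made for this example, not part of the lectures.

```python
from fractions import Fraction

# Hypothetical example PMF: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(pmf, a):
    """F_X(a) = P(X <= a): add up the PMF over all values x <= a."""
    return sum(p for x, p in pmf.items() if x <= a)

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p_X(x); with the default g this is E[X]."""
    return sum(g(x) * p for x, p in pmf.items())

assert sum(pmf.values()) == 1  # a PMF must sum to 1

print("F_X(1) =", cdf(pmf, 1))                        # 3/4
print("E[X]   =", expectation(pmf))                   # 1
print("E[X^2] =", expectation(pmf, lambda x: x * x))  # 3/2
```

Passing `g` as an argument mirrors the law of the unconscious statistician: the PMF of $g(X)$ is never computed, only the PMF of $X$ is used.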

The variance of $X$ is

$$\operatorname{Var}(X)=E[(X-\mu)^2], \qquad \mu=E[X].$$

The standard deviation is $\sqrt{\operatorname{Var}(X)}$.

An indicator random variable for event $A$ is

$$1_A= \begin{cases} 1,& A\text{ occurs},\\ 0,& A\text{ does not occur}. \end{cases}$$

Its expectation is $E[1_A]=P(A)$.

Key results

Expectation is linear:

$$E[aX+bY]=aE[X]+bE[Y],$$

whenever the expectations exist. No independence assumption is needed. For a countable sample space this follows by summing over outcomes:

$$\sum_{\omega\in S}(aX(\omega)+bY(\omega))P(\omega) =a\sum_{\omega\in S}X(\omega)P(\omega) +b\sum_{\omega\in S}Y(\omega)P(\omega).$$
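To see that independence really is not needed, the sum over outcomes can be carried out directly. The following is a minimal sketch under assumptions chosen for illustration: the sample space is two fair dice, `X` is the first die, `Y` is the larger of the two (so `Y` depends on `X`), and the constants `a`, `b` are arbitrary.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair dice, each of the 36 outcomes has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
prob = Fraction(1, len(outcomes))

def expect(f):
    """E[f], computed by summing f(omega) * P(omega) over the sample space."""
    return sum(f(w) * prob for w in outcomes)

def X(w):
    return w[0]        # value of the first die

def Y(w):
    return max(w)      # larger of the two dice; clearly dependent on X

a, b = 2, 3            # arbitrary constants chosen for illustration

lhs = expect(lambda w: a * X(w) + b * Y(w))
rhs = a * expect(X) + b * expect(Y)
print(lhs, rhs)
assert lhs == rhs      # linearity holds despite the dependence
```

Exact fractions are used so that the two sides can be compared with `==` rather than with a floating-point tolerance.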

The computational variance formula is

$$\operatorname{Var}(X)=E[X^2]-(E[X])^2.$$

Proof:

$$\begin{aligned}
E[(X-\mu)^2] &=E[X^2-2\mu X+\mu^2]\\
&=E[X^2]-2\mu E[X]+\mu^2\\
&=E[X^2]-2\mu^2+\mu^2\\
&=E[X^2]-\mu^2.
\end{aligned}$$

Scaling and shifting behave as follows:

$$E[aX+b]=aE[X]+b, \qquad \operatorname{Var}(aX+b)=a^2\operatorname{Var}(X).$$
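Both the computational variance formula and the scaling rules can be checked exactly on a small PMF. This is a sketch, not a proof: the fair-die PMF, the helper names `mean`, `variance`, and `transform`, and the constants `a`, `b` are all assumptions chosen for the example.

```python
from fractions import Fraction

# PMF of a fair six-sided die, used only as a test case.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Variance from the definition E[(X - mu)^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, a, b):
    """PMF of aX + b (assumes a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

# Definition versus computational formula E[X^2] - (E[X])^2.
second_moment = sum(x * x * p for x, p in pmf.items())
assert variance(pmf) == second_moment - mean(pmf) ** 2

a, b = -3, 10  # arbitrary illustration constants
shifted = transform(pmf, a, b)
assert mean(shifted) == a * mean(pmf) + b
assert variance(shifted) == a ** 2 * variance(pmf)
print(mean(shifted), variance(shifted))  # -1/2 105/4
```

The shift `b` moves the mean but drops out of the variance, while the scale `a` enters the variance squared, so a negative `a` still gives a nonnegative variance.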

The indicator decomposition trick writes a count as

$$X=1_{A_1}+\cdots+1_{A_n}.$$

Then

$$E[X]=P(A_1)+\cdots+P(A_n),$$

even when the events are dependent. This explains why expected counts can be much easier than exact distributions.

There are two common ways to compute an expectation. One sums over values of the random variable:

$$E[X]=\sum_x x\,p_X(x).$$

The other sums over the original sample space:

$$E[X]=\sum_{\omega\in S}X(\omega)P(\{\omega\}).$$

These are the same calculation grouped differently. The value-based formula groups together all outcomes with the same $X$ value. The state-space formula is sometimes easier when the sample space is small; the value-based formula is usually easier when the distribution of $X$ is already known.
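The regrouping can be made concrete on a small sample space. The sketch below assumes three fair coin tosses with `X` counting heads; the example and the variable names are illustrative choices, not taken from the lectures.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: three fair coin tosses; X counts the heads.
outcomes = list(product("HT", repeat=3))
prob = Fraction(1, len(outcomes))

def X(w):
    return w.count("H")

# State-space formula: sum X(omega) * P({omega}) over all outcomes.
by_outcome = sum(X(w) * prob for w in outcomes)

# Value-based formula: first group outcomes by X value to build the PMF.
pmf = defaultdict(Fraction)
for w in outcomes:
    pmf[X(w)] += prob
by_value = sum(x * p for x, p in pmf.items())

print(dict(pmf))
print(by_outcome, by_value)  # 3/2 3/2
assert by_outcome == by_value
```

The loop that fills `pmf` is exactly the grouping step: every outcome contributes its probability to the bucket for its $X$ value.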

The expectation need not be a likely value, or even a possible one. For a fair die, the expectation is $3.5$, which is not an outcome. In a lottery, a very large rare payoff can dominate the expectation even though the typical outcome is zero. This is why the lectures introduce variance immediately after expectation: the mean alone does not describe risk, spread, or typical behavior.

Variance depends on squared deviations, so it is sensitive to rare large values. If $X$ is measured in dollars, then $\operatorname{Var}(X)$ is measured in squared dollars, which is one reason the standard deviation is often easier to interpret. Still, variance is algebraically convenient because squares expand cleanly and because the variances of independent random variables add.

Indicator variables turn probability questions into expectation questions. If $X$ counts the number of successes among many possibly dependent events, then $E[X]$ only needs the individual success probabilities. This is why the expected number of fixed points in a random permutation is easy even though the exact distribution of fixed points requires inclusion-exclusion. The method also prepares for binomial variables, where a sum of independent indicators gives both the expectation and variance.

One must be more careful with infinite sums. A discrete random variable with values $1,2,3,\ldots$ may have probabilities summing to $1$ but still fail to have a finite expectation. The sum $\sum_x x\,p_X(x)$ must converge absolutely before linearity and the variance formulas can be applied safely.
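A standard illustration, added here as an example rather than taken from the lectures, is a variable supported on powers of two:

$$P(X=2^k)=2^{-k}, \qquad k=1,2,3,\ldots$$

The probabilities form a valid PMF, but the expectation sum diverges because every term equals $1$:

$$\sum_{k\ge 1}2^{-k}=1, \qquad \sum_{k\ge 1}2^k\cdot 2^{-k}=\sum_{k\ge 1}1=\infty.$$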

Visual

| Quantity | Formula | What it measures |
| --- | --- | --- |
| PMF | $p_X(x)=P(X=x)$ | probability at each value |
| CDF | $F_X(a)=P(X\le a)$ | accumulated probability |
| Mean | $E[X]=\sum_x x\,p_X(x)$ | center or long-run average |
| Second moment | $E[X^2]$ | raw squared size |
| Variance | $E[X^2]-(E[X])^2$ | spread around the mean |
| Indicator mean | $E[1_A]=P(A)$ | probability as expectation |

The table separates distributional information from summary information. A PMF or CDF can determine all probabilities involving $X$. The mean and variance compress that information into two numbers. Compression is useful, but it loses detail. Two random variables can have the same mean and variance while having very different shapes, tail behavior, or most likely values. Later limit theorems explain why mean and variance are often enough for averages, but individual distributions still require more information.
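The loss of detail is easy to exhibit. In the sketch below, the two PMFs `coin_like` and `spiky` are hypothetical distributions constructed for this example so that their means and variances agree while their shapes do not.

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # Computational formula: E[X^2] - (E[X])^2.
    return sum(x * x * p for x, p in pmf.items()) - mean(pmf) ** 2

# Two hypothetical PMFs chosen to match in mean and variance.
coin_like = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
spiky = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

for name, pmf in [("coin_like", coin_like), ("spiky", spiky)]:
    print(name,
          "mean:", mean(pmf),
          "variance:", variance(pmf),
          "P(X=0):", pmf.get(0, 0))
```

Both variables have mean 0 and variance 1, yet one never equals 0 and the other equals 0 three quarters of the time.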

When a problem asks for an expected count, try indicators before trying to find the whole PMF. When a problem asks for the probability that the count equals a specific value, the PMF is unavoidable. This distinction explains why the expected number of fixed hats is much easier than the probability that exactly three people get their own hats.

Worked example 1: expectation and variance of a die roll

Problem: Let $X$ be the result of a fair six-sided die. Compute $E[X]$ and $\operatorname{Var}(X)$.

Method:

  1. The PMF is $p_X(k)=1/6$ for $k=1,\ldots,6$.
  2. The expectation is
$$E[X]=\sum_{k=1}^{6}k\cdot\frac16 =\frac{1+2+3+4+5+6}{6} =\frac{21}{6} =\frac72.$$
  3. Compute the second moment:
$$E[X^2]=\sum_{k=1}^{6}k^2\cdot\frac16 =\frac{1+4+9+16+25+36}{6} =\frac{91}{6}.$$
  4. Use the variance formula:
$$\operatorname{Var}(X) =\frac{91}{6}-\left(\frac72\right)^2 =\frac{91}{6}-\frac{49}{4} =\frac{182-147}{12} =\frac{35}{12}.$$

Checked answer: the standard deviation is $\sqrt{35/12}\approx 1.708$, which is plausible because die values range only from $1$ to $6$.

Worked example 2: expected fixed points in the hat shuffle

Problem: In a random shuffle of $n$ hats among $n$ people, let $X$ be the number of people who get their own hat. Compute $E[X]$.

Method:

  1. Let $A_i$ be the event that person $i$ receives their own hat.
  2. Define indicators
$$X_i=1_{A_i}.$$
  3. Then the total number of fixed points is
$$X=X_1+\cdots+X_n.$$
  4. For each person, symmetry gives
$$P(A_i)=\frac1n.$$
  5. Therefore
$$E[X_i]=\frac1n.$$
  6. By linearity,
$$E[X]=\sum_{i=1}^{n}E[X_i] =n\cdot\frac1n =1.$$

Checked answer: the expected number of people receiving their own hat is always $1$, no matter how large $n$ is. This does not mean exactly one person usually gets their hat; it means the average count over many random shuffles is $1$.
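For small $n$ the answer can be confirmed by brute force. The sketch below enumerates every shuffle, which is only feasible for small $n$ because there are $n!$ of them; the function name `fixed_point_pmf` is a hypothetical helper introduced for this check.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def fixed_point_pmf(n):
    """Exact PMF of the number of fixed points, by enumerating all n! shuffles."""
    counts = Counter(
        sum(1 for person, hat in enumerate(perm) if person == hat)
        for perm in permutations(range(n))
    )
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in sorted(counts.items())}

for n in [3, 4, 5]:
    pmf = fixed_point_pmf(n)
    mean = sum(k * p for k, p in pmf.items())
    print(n, "mean:", mean, "P(X=1):", pmf.get(1, 0))
```

The mean comes out as 1 for each $n$, while the probability of exactly one fixed point is well below 1, which matches the caution in the checked answer.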

Code

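    from fractions import Fraction

    # Worked example 1: exact mean and variance of a fair six-sided die.
    values = range(1, 7)
    mean = sum(Fraction(k, 6) for k in values)
    second_moment = sum(Fraction(k * k, 6) for k in values)
    variance = second_moment - mean * mean  # E[X^2] - (E[X])^2

    print("die mean:", mean)          # 7/2
    print("die variance:", variance)  # 35/12


    def expected_fixed_points(n):
        # Worked example 2: add E[X_i] = 1/n over the n people.
        # Fraction keeps the sum exact; summing the float 1 / n can drift from 1.
        return sum(Fraction(1, n) for _ in range(n))


    for n in [3, 10, 100]:
        print(n, expected_fixed_points(n))  # always 1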

Common pitfalls

  • Thinking a random variable is an event. An event is a subset of outcomes; a random variable is a numerical function on outcomes.
  • Forgetting that a PMF must sum to $1$.
  • Assuming linearity of expectation requires independence. It does not.
  • Assuming variance is linear. In general $\operatorname{Var}(X+Y)$ is not $\operatorname{Var}(X)+\operatorname{Var}(Y)$ unless the covariance is zero.
  • Interpreting expectation as the most likely value. A random variable may have expectation $1$ without ever equaling $1$: with two hats, the number of fixed points is always $0$ or $2$.
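The variance pitfall can be checked numerically. The sketch below assumes a two-dice sample space and compares an independent pair (the two dice) with a fully dependent pair (the first die added to itself); the helper name `var` is introduced for this example only.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice
prob = Fraction(1, len(outcomes))

def var(f):
    """Variance of f over the two-dice sample space, via E[f^2] - (E[f])^2."""
    m1 = sum(f(w) * prob for w in outcomes)
    m2 = sum(f(w) ** 2 * prob for w in outcomes)
    return m2 - m1 ** 2

def first(w):
    return w[0]

def second(w):
    return w[1]

# Independent dice: the variances add.
print(var(lambda w: first(w) + second(w)), var(first) + var(second))  # 35/6 35/6

# Fully dependent pair (X and X again): the variances do not add.
print(var(lambda w: first(w) + first(w)), var(first) + var(first))    # 35/3 35/6
```

In the dependent case the result is $\operatorname{Var}(2X)=4\operatorname{Var}(X)$, twice what naive addition would give.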

Connections