Weak Law, Concentration, and the Central Limit Theorem

Limit theorems explain why probability becomes predictable at large scale. The weak law of large numbers says sample averages concentrate near the mean. The central limit theorem says the remaining fluctuations, after scaling by $\sqrt n$, often look normal. These two statements answer different questions: the weak law gives convergence to a constant, while the central limit theorem describes the shape of the error.

Figure: A central limit theorem simulation shows why sample means often become approximately normal. Image: Wikimedia Commons, Daniel Resende, CC BY-SA 4.0.

MIT 18.440 proves the weak law first through Markov and Chebyshev inequalities, then revisits it with characteristic functions. The central limit theorem is then proved with transform methods. The lecture sequence makes clear why moment hypotheses matter: finite variance gives a short Chebyshev proof of the weak law, while characteristic functions allow more general convergence arguments.

Definitions

For a nonnegative random variable $X$ and $a > 0$, Markov's inequality is

$$P(X\ge a)\le \frac{E[X]}{a}.$$

For a random variable $X$ with mean $\mu$ and finite variance $\sigma^2$, and any $k > 0$, Chebyshev's inequality is

$$P(|X-\mu|\ge k)\le \frac{\sigma^2}{k^2}.$$

Let $X_1, X_2, \ldots$ be independent identically distributed random variables with mean $\mu$, and define the sample average

$$A_n=\frac{X_1+\cdots+X_n}{n}.$$

The weak law of large numbers states that for every $\epsilon > 0$,

$$P(|A_n-\mu|>\epsilon)\to 0 \qquad\text{as } n\to\infty.$$

If the $X_i$ also have finite variance $\sigma^2 > 0$, the normalized sum is

$$B_n=\frac{X_1+\cdots+X_n-n\mu}{\sigma\sqrt n}.$$

The central limit theorem states that

$$B_n\Rightarrow Z,$$

where $Z$ is standard normal and $\Rightarrow$ denotes convergence in distribution.

Key results

Markov's inequality proof: since $X\ge 0$,

$$E[X]\ge E[X\,1_{\{X\ge a\}}]\ge a\,P(X\ge a).$$

Dividing by $a$ gives the result.

Chebyshev follows by applying Markov to the nonnegative random variable $(X-\mu)^2$:

$$P(|X-\mu|\ge k) = P((X-\mu)^2\ge k^2) \le \frac{E[(X-\mu)^2]}{k^2} = \frac{\sigma^2}{k^2}.$$
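Both inequalities can be checked empirically. The sketch below is only an illustration under an assumed distribution, Exponential with rate 1 (mean 1, variance 1), which is not part of the lecture material; the names `fraction`, `markov_rows`, and `chebyshev_rows` are made up for this example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumption for illustration: X ~ Exponential(rate 1), so E[X] = 1 and Var(X) = 1.
MEAN, VAR = 1.0, 1.0
samples = [random.expovariate(1.0) for _ in range(100_000)]

def fraction(condition):
    """Empirical probability: share of samples for which condition(x) is true."""
    return sum(1 for x in samples if condition(x)) / len(samples)

# Markov: P(X >= a) <= E[X] / a
markov_rows = [(a, fraction(lambda x: x >= a), MEAN / a) for a in (2.0, 3.0, 5.0)]

# Chebyshev: P(|X - mu| >= k) <= sigma^2 / k^2
chebyshev_rows = [
    (k, fraction(lambda x: abs(x - MEAN) >= k), VAR / k ** 2)
    for k in (1.5, 2.0, 3.0)
]

for a, empirical, bound in markov_rows:
    print(f"Markov    a={a}: empirical {empirical:.4f} <= bound {bound:.4f}")
for k, empirical, bound in chebyshev_rows:
    print(f"Chebyshev k={k}: empirical {empirical:.4f} <= bound {bound:.4f}")
```

For this distribution the empirical tail frequencies should sit well below the bounds, which is consistent with the bounds being valid but far from sharp.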

Weak law with finite variance: if the $X_i$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then linearity of expectation and independence give

$$E[A_n]=\mu, \qquad \operatorname{Var}(A_n)=\frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i)=\frac{\sigma^2}{n}.$$

Chebyshev gives

$$P(|A_n-\mu|\ge\epsilon) \le \frac{\sigma^2}{n\epsilon^2} \to 0 \qquad\text{as } n\to\infty.$$
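A small simulation can make the Chebyshev proof concrete. This is a sketch under assumed choices that do not come from the lecture: Uniform(0, 1) summands (so $\mu = 1/2$, $\sigma^2 = 1/12$), tolerance $\epsilon = 0.05$, and 2000 repetitions per sample size.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Assumption for illustration: X_i ~ Uniform(0, 1), so mu = 1/2 and sigma^2 = 1/12.
MU, SIGMA2 = 0.5, 1.0 / 12.0
EPS = 0.05
TRIALS = 2000

def sample_average(n):
    """Average of n independent Uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

wlln_rows = []
for n in (10, 100, 1000):
    misses = sum(1 for _ in range(TRIALS) if abs(sample_average(n) - MU) >= EPS)
    empirical = misses / TRIALS
    # Chebyshev bound sigma^2 / (n * eps^2), capped at 1 because it is a probability bound.
    bound = min(1.0, SIGMA2 / (n * EPS ** 2))
    wlln_rows.append((n, empirical, bound))
    print(f"n={n:5d}: empirical P(|A_n - mu| >= eps) = {empirical:.4f}, bound = {bound:.4f}")
```

The empirical frequency should fall with $n$ and stay below the bound, while the bound itself only becomes informative once $n\epsilon^2$ exceeds $\sigma^2$.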

The transform proof of the central limit theorem begins by standardizing so $E[X_i]=0$ and $\operatorname{Var}(X_i)=1$. If $M(t)=E[e^{tX_1}]$ exists near zero, then as $t\to 0$,

$$M(t)=1+\frac{t^2}{2}+o(t^2).$$

For

$$B_n=\frac{X_1+\cdots+X_n}{\sqrt n},$$

independence gives

$$M_{B_n}(t)=\left(M\left(\frac{t}{\sqrt n}\right)\right)^n.$$

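Using the expansion at the point $t/\sqrt n$, for each fixed $t$ we have $M(t/\sqrt n)=1+\frac{t^2}{2n}+o(1/n)$, so

$$M_{B_n}(t)=\left(1+\frac{t^2}{2n}+o\!\left(\frac{1}{n}\right)\right)^n\to e^{t^2/2},$$

the MGF of a standard normal variable. Characteristic functions give the same argument under the finite-variance assumption without requiring the MGF to exist.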
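The limit of the MGFs can be watched numerically in a case where $M$ has a closed form. The sketch below assumes a symmetric $\pm 1$ coin flip (mean 0, variance 1), whose MGF is $\cosh t$; this choice of distribution and the function names are illustrative, not taken from the lecture.

```python
from math import cosh, exp, sqrt

def mgf_rademacher(t):
    """MGF of X with P(X = 1) = P(X = -1) = 1/2: E[e^{tX}] = cosh(t)."""
    return cosh(t)

def mgf_normalized_sum(t, n):
    """MGF of B_n = (X_1 + ... + X_n) / sqrt(n), using independence."""
    return mgf_rademacher(t / sqrt(n)) ** n

t = 1.0
limit = exp(t ** 2 / 2)  # MGF of a standard normal at t

mgf_errors = []
for n in (1, 10, 100, 10_000):
    value = mgf_normalized_sum(t, n)
    mgf_errors.append(abs(value - limit))
    print(f"n={n:6d}: M_Bn(t) = {value:.6f}, limit = {limit:.6f}")
```

The gap to $e^{t^2/2}$ should shrink as $n$ grows, matching the expansion argument for this particular distribution.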

The weak law and the central limit theorem should be compared through the sample average:

$$A_n-\mu=\frac{\sigma}{\sqrt n}B_n.$$

The CLT says $B_n$ has an approximately stable normal distribution for large $n$. Multiplying by $\sigma/\sqrt n$ then says the actual average error is typically of order $1/\sqrt n$. The weak law records only the consequence that this error goes to zero in probability, while the CLT describes the scale and shape of the error.
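The $1/\sqrt n$ scale can be seen by rescaling the root-mean-square error of the average. The following is a rough sketch assuming Exponential(1) summands ($\mu = 1$, $\sigma = 1$), a distribution chosen here only for illustration.

```python
import random
from math import sqrt

random.seed(2)  # fixed seed so the run is repeatable

# Assumption for illustration: X_i ~ Exponential(rate 1), so mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0
TRIALS = 2000

def rms_error(n):
    """Root-mean-square of A_n - mu over TRIALS independent sample averages."""
    total = 0.0
    for _ in range(TRIALS):
        average = sum(random.expovariate(1.0) for _ in range(n)) / n
        total += (average - MU) ** 2
    return sqrt(total / TRIALS)

scale_rows = []
for n in (25, 100, 400):
    rms = rms_error(n)
    scale_rows.append((n, rms, rms * sqrt(n)))
    print(f"n={n:4d}: RMS error = {rms:.4f}, sqrt(n) * RMS error = {rms * sqrt(n):.4f}")
```

The raw error shrinks as $n$ grows, while the rescaled column should hover near $\sigma$, which is the sense in which the error is of order $1/\sqrt n$.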

Markov and Chebyshev inequalities are deliberately crude. They do not assume a particular distribution, so they cannot give sharp normal-tail estimates in general. Their strength is robustness: with only a mean or variance, they still produce valid bounds. In applications, a loose guaranteed bound can be more valuable than a sharper approximation that depends on unverified distributional assumptions.

The finite-variance proof of the weak law also shows why independence matters. If $X_1,\ldots,X_n$ are not independent and remain highly correlated, the variance of the average may fail to shrink like $1/n$. For example, if all $X_i$ are actually the same random variable, then $A_n=X_1$ for every $n$ and averaging does not reduce uncertainty.
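One way to quantify this is a simplified model in which every pair of variables has the same correlation $\rho$. That equicorrelation model is an assumption introduced here for illustration; under it, $\operatorname{Var}(X_1+\cdots+X_n)=n\sigma^2+n(n-1)\rho\sigma^2$, and the helper below (a hypothetical name) just divides by $n^2$.

```python
def variance_of_average(n, sigma2, rho):
    """Var(A_n) for n variables with common variance sigma2 and common pairwise correlation rho.

    Var(sum) = n * sigma2 + n * (n - 1) * rho * sigma2, and Var(A_n) = Var(sum) / n**2.
    """
    return sigma2 * (1 + (n - 1) * rho) / n

sigma2 = 9.0
for rho in (0.0, 0.5, 1.0):
    values = [variance_of_average(n, sigma2, rho) for n in (1, 10, 100, 1000)]
    print(f"rho={rho}: Var(A_n) for n = 1, 10, 100, 1000 ->", [round(v, 4) for v in values])
```

With $\rho = 0$ the variance is $\sigma^2/n$; with $\rho = 1$ it stays at $\sigma^2$ for every $n$, which is the "same random variable" case from the text; for intermediate $\rho$ it levels off near $\rho\sigma^2$ instead of tending to zero.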

The CLT is robust but not universal. Heavy-tailed variables without finite variance can have different stable limits, and variables without finite mean may not satisfy the usual law of large numbers. The Cauchy distribution is the standard warning. Its sample average does not concentrate around a mean because no finite mean exists.
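The Cauchy warning is easy to see in simulation. The sketch below draws standard Cauchy variables by inverse-CDF sampling and records how often the sample average lands outside $[-1, 1]$; the sample sizes and trial count are arbitrary choices for illustration. Since the average of i.i.d. standard Cauchy variables is again standard Cauchy, that frequency should stay near $1/2$ regardless of $n$.

```python
import random
from math import pi, tan

random.seed(3)  # fixed seed so the run is repeatable
TRIALS = 2000

def standard_cauchy():
    """Inverse-CDF sampling: tan(pi * (U - 1/2)) with U uniform on [0, 1)."""
    return tan(pi * (random.random() - 0.5))

def cauchy_average(n):
    """Sample average of n independent standard Cauchy draws."""
    return sum(standard_cauchy() for _ in range(n)) / n

cauchy_rows = []
for n in (1, 100, 1000):
    far = sum(1 for _ in range(TRIALS) if abs(cauchy_average(n)) > 1.0) / TRIALS
    cauchy_rows.append((n, far))
    print(f"n={n:5d}: fraction of averages with |A_n| > 1 = {far:.3f}")
```

Unlike the finite-variance case, averaging more terms does not pull the sample average toward any fixed value.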

The De Moivre-Laplace theorem for coin tosses is the binomial special case of the CLT. If $X\sim\operatorname{Binomial}(n,p)$, then

$$\frac{X-np}{\sqrt{np(1-p)}}$$

is approximately standard normal for large $n$, provided $p$ is not too close to $0$ or $1$. This is the origin of the familiar bell curve in repeated-trial counting problems.
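The caveat about $p$ can be illustrated by comparing the exact binomial CDF with its normal approximation. The sketch below uses a continuity-corrected normal CDF and the illustrative values $n = 100$ with $p = 0.5$ and $p = 0.01$; these parameters and the helper `max_cdf_error` are assumptions made for this example.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def max_cdf_error(n, p):
    """Largest gap over k between P(X <= k) and the continuity-corrected normal CDF."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    cumulative, worst = 0.0, 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p ** k * (1 - p) ** (n - k)
        approx = normal_cdf((k + 0.5 - mean) / sd)
        worst = max(worst, abs(cumulative - approx))
    return worst

err_fair = max_cdf_error(100, 0.5)
err_skewed = max_cdf_error(100, 0.01)
print(f"n=100, p=0.5 : max CDF error = {err_fair:.4f}")
print(f"n=100, p=0.01: max CDF error = {err_skewed:.4f}")
```

With the same $n$, the approximation should be noticeably worse for the small $p$, where $np$ is only $1$ and the binomial is strongly skewed.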

Visual

| Result | Scale | Conclusion | Main tool |
| --- | --- | --- | --- |
| Markov inequality | one variable | large nonnegative values are limited by mean | expectation |
| Chebyshev inequality | one variable | deviations are limited by variance | Markov on square |
| Weak law | $A_n$ | average converges in probability to $\mu$ | Chebyshev or characteristic functions |
| CLT | $\sqrt n(A_n-\mu)/\sigma$ | normalized error tends to normal | MGFs or characteristic functions |

The table separates inequalities from asymptotic theorems. Markov and Chebyshev are finite-$n$ statements: they are true for a single random variable or a single sample average. The weak law and CLT are limiting statements about sequences. In practice, the finite inequalities are often used to prove limits, while the limits explain the behavior of large systems.

The CLT row also clarifies why the normal distribution appears even when the original variables are not normal. The theorem is about normalized sums, not the raw summands. Bernoulli, uniform, and many other finite-variance distributions produce approximately normal centered sums after enough independent addition.
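As a quick check that non-normal summands can still give nearly normal sums, the sketch below standardizes sums of 12 Uniform(0, 1) draws and compares empirical CDF values with $\Phi$. The number of terms, the trial count, and the evaluation points are arbitrary illustrative choices.

```python
import random
from math import erf, sqrt

random.seed(4)  # fixed seed so the run is repeatable

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Assumption for illustration: X_i ~ Uniform(0, 1), so mu = 1/2 and sigma = sqrt(1/12).
N_TERMS, TRIALS = 12, 20_000
MU, SIGMA = 0.5, sqrt(1 / 12)

def standardized_sum():
    """B_n = (X_1 + ... + X_n - n * mu) / (sigma * sqrt(n)) for uniform summands."""
    total = sum(random.random() for _ in range(N_TERMS))
    return (total - N_TERMS * MU) / (SIGMA * sqrt(N_TERMS))

draws = [standardized_sum() for _ in range(TRIALS)]

clt_rows = []
for z in (-1.0, 0.0, 1.0):
    empirical = sum(1 for b in draws if b <= z) / TRIALS
    clt_rows.append((z, empirical, normal_cdf(z)))
    print(f"z={z:+.1f}: empirical P(B_n <= z) = {empirical:.4f}, Phi(z) = {normal_cdf(z):.4f}")
```

Each summand is flat on $[0, 1]$, yet the standardized sum should already track the normal CDF closely at these points.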

The hypotheses should always be kept visible. Independence, identical distribution, finite mean, and finite variance each play a role in the standard statements. Changing those assumptions can lead to different limits or no useful limit at all. This is why heavy-tailed examples are not side issues: they mark the boundary of the theorems, prevent overgeneralization, and show exactly what each hypothesis buys.

Worked example 1: Chebyshev bound for a sample mean

Problem: Suppose $X_1,\ldots,X_{100}$ are i.i.d. with mean $5$ and variance $9$. Bound the probability that their average differs from $5$ by at least $1$.

Method:

  1. Let
$$A_{100}=\frac{X_1+\cdots+X_{100}}{100}.$$
  2. The mean is
$$E[A_{100}]=5.$$
  3. Since the variables are independent,
$$\operatorname{Var}(A_{100}) =\frac{9}{100} =0.09.$$
  4. Apply Chebyshev with $k=1$:
$$P(|A_{100}-5|\ge 1) \le \frac{0.09}{1^2} =0.09.$$

Checked answer: Chebyshev guarantees the probability is at most $9\%$. The true probability may be much smaller, but this bound uses only the mean, variance, and independence.
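To see how loose the bound can be, one can add a distributional assumption that the problem itself does not make. The sketch below supposes, purely hypothetically, that each $X_i$ is normal, so that $A_{100}$ is exactly normal with standard deviation $0.3$, and compares the resulting probability with the Chebyshev bound.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, mu, sigma2, k = 100, 5.0, 9.0, 1.0

sd_of_average = sqrt(sigma2 / n)         # standard deviation of A_100, here 0.3
chebyshev_bound = sigma2 / (n * k ** 2)  # distribution-free bound, here 0.09

# Hypothetical extra assumption (not in the problem): each X_i is normal,
# so A_100 is exactly normal and the two-sided tail has a closed form.
z = k / sd_of_average
exact_if_normal = 2 * (1 - normal_cdf(z))

print(f"Chebyshev bound:             {chebyshev_bound:.4f}")
print(f"Probability if X_i normal:   {exact_if_normal:.6f}")
```

Under that extra assumption the probability is roughly two orders of magnitude below the bound, which illustrates the trade between generality and sharpness.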

Worked example 2: normal approximation for coin tosses

Problem: Toss a fair coin $400$ times. Approximate the probability of seeing between $190$ and $210$ heads, inclusive.

Method:

  1. Let $X\sim\operatorname{Binomial}(400,1/2)$.
  2. The mean and variance are
$$\mu=np=200, \qquad \sigma^2=np(1-p)=100,$$

so $\sigma=10$.

  3. Use the normal approximation with continuity correction:
$$P(190\le X\le 210) \approx P(189.5\le N\le 210.5),$$

where $N$ is normal with mean $200$ and standard deviation $10$.

  4. Standardize endpoints:
$$z_1=\frac{189.5-200}{10}=-1.05, \qquad z_2=\frac{210.5-200}{10}=1.05.$$
  5. Therefore
$$P(190\le X\le 210)\approx \Phi(1.05)-\Phi(-1.05).$$
  6. Using $\Phi(1.05)\approx 0.8531$ and $\Phi(-1.05)\approx 0.1469$,
$$P(190\le X\le 210)\approx 0.7062.$$

Checked answer: the event is within about $1.05$ standard deviations of the mean, so a probability around $70\%$ is plausible.

Code

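from math import erf, sqrt, comb

def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Worked example 1: Chebyshev bound sigma^2 / (n * k^2) with sigma^2 = 9, n = 100, k = 1.
chebyshev_bound = 9 / (100 * 1 ** 2)
print("Chebyshev bound:", chebyshev_bound)

# Worked example 2: normal approximation with continuity correction (z = -1.05 to 1.05).
approx = normal_cdf(1.05) - normal_cdf(-1.05)
print("CLT approximation:", approx)

# Exact binomial probability for comparison.
n = 400
exact = sum(comb(n, k) for k in range(190, 211)) / (2 ** n)
print("Exact binomial probability:", exact)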

Common pitfalls

  • Confusing convergence in probability with almost sure convergence. The weak law is weaker than the strong law.
  • Forgetting the $\sqrt n$ scaling in the central limit theorem.
  • Applying Chebyshev without a finite variance.
  • Treating the CLT as saying the average itself has a nondegenerate normal limit. The average converges to a constant; the scaled error is asymptotically normal.
  • Ignoring continuity correction when approximating a discrete binomial probability by a continuous normal distribution.
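The last pitfall can be made concrete with the coin-toss numbers from worked example 2. The sketch below compares the exact binomial probability with the normal approximation computed with and without the half-unit continuity correction; the variable names are illustrative.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, lo, hi = 400, 190, 210
mean, sd = n / 2, sqrt(n) / 2  # 200 and 10 for a fair coin

# Exact P(lo <= X <= hi) for X ~ Binomial(n, 1/2), using integer arithmetic for the sum.
exact = sum(comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

# Normal approximation with the endpoints widened by 0.5 on each side.
with_correction = normal_cdf((hi + 0.5 - mean) / sd) - normal_cdf((lo - 0.5 - mean) / sd)

# Normal approximation using the integer endpoints directly.
without_correction = normal_cdf((hi - mean) / sd) - normal_cdf((lo - mean) / sd)

print(f"Exact binomial:        {exact:.4f}")
print(f"With correction:       {with_correction:.4f}")
print(f"Without correction:    {without_correction:.4f}")
```

In this example the uncorrected version treats the interval as one standard deviation wide on each side and so understates the probability, while the corrected version should land much closer to the exact value.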

Connections