Moment and Characteristic Functions

Moment generating functions and characteristic functions encode probability laws as functions. The central idea is that multiplying the transforms of independent random variables is easier than convolving their densities or mass functions. This is why transforms are powerful for sums of independent random variables and why MIT 18.440 uses them to prove the weak law of large numbers and the central limit theorem.

Figure: Pierre-Simon de Laplace is a key figure in probability, transforms, and potential theory. Image: Wikimedia Commons, Louis Delaistre after Armand-Charles Guilleminot, public domain.

Moment generating functions are intuitive because their derivatives generate moments, but they may fail to exist for heavy-tailed distributions. Characteristic functions insert the imaginary unit $i$ into the exponent and are always defined, making them more robust. They are Fourier transforms of probability distributions and are the standard tool behind convergence in distribution.

Definitions

The moment generating function (MGF) of a random variable $X$ is

$$M_X(t)=E[e^{tX}],$$

for values of $t$ where the expectation is finite.

If $X$ is discrete,

$$M_X(t)=\sum_x e^{tx}p_X(x).$$

If $X$ has density $f_X$,

$$M_X(t)=\int_{-\infty}^{\infty} e^{tx}f_X(x)\,dx.$$

The characteristic function of $X$ is

$$\varphi_X(t)=E[e^{itX}],$$

where $i^2=-1$. Since $\vert e^{itX}\vert =1$, this expectation always exists.

We say $X_n$ converges in distribution to $X$ if

$$F_{X_n}(x)\to F_X(x)$$

at every continuity point $x$ of $F_X$.

Key results

Moment generating functions generate moments when derivatives exist near zero:

$$M_X'(0)=E[X], \qquad M_X''(0)=E[X^2],$$

and more generally

$$M_X^{(m)}(0)=E[X^m].$$

Proof sketch: differentiate $e^{tX}$ with respect to $t$:

$$\frac{d^m}{dt^m}e^{tX}=X^m e^{tX}.$$

Then evaluate at $t=0$. Justifying interchange of derivative and expectation requires regularity, which is why existence near zero matters.
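The derivative rule can be checked numerically. The sketch below is only an illustration, assuming a Poisson variable with rate 3 (an arbitrary choice) and using central finite differences in place of exact derivatives; the helper name `first_two_moments` is made up for this example.

```python
from math import exp

def poisson_mgf(lam, t):
    """MGF of a Poisson(lam) variable: exp(lam * (e^t - 1))."""
    return exp(lam * (exp(t) - 1.0))

def first_two_moments(mgf, h=1e-4):
    """Approximate M'(0) and M''(0) with central differences of step h."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    return m1, m2

lam = 3.0  # illustrative rate, not taken from the text
m1, m2 = first_two_moments(lambda t: poisson_mgf(lam, t))
variance = m2 - m1 ** 2

print("E[X] approx:", m1)         # exact value is lam
print("E[X^2] approx:", m2)       # exact value is lam + lam^2
print("Var(X) approx:", variance) # exact value is lam
```

The step size trades truncation error against rounding error, so the printed values agree with the exact moments only to several digits.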

If $X$ and $Y$ are independent, then

$$M_{X+Y}(t)=M_X(t)M_Y(t),$$

because

$$E[e^{t(X+Y)}]=E[e^{tX}e^{tY}]=E[e^{tX}]E[e^{tY}].$$

The same identity holds for characteristic functions:

$$\varphi_{X+Y}(t)=\varphi_X(t)\varphi_Y(t).$$

Scaling satisfies

$$M_{aX}(t)=M_X(at), \qquad \varphi_{aX}(t)=\varphi_X(at).$$
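A small numerical check of the product and scaling rules, assuming two illustrative finite distributions (a fair die and a fair coin, chosen only for this sketch). The pmf of the sum is built by direct convolution, so the comparison does not presuppose the identity being tested.

```python
from math import exp, isclose

def mgf(pmf, t):
    """MGF of a finite discrete distribution given as {value: probability}."""
    return sum(p * exp(t * x) for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """pmf of X + Y when X and Y are independent."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

die = {k: 1.0 / 6.0 for k in range(1, 7)}  # illustrative X
coin = {0: 0.5, 1: 0.5}                    # illustrative Y
t = 0.3

mgf_of_sum = mgf(convolve(die, coin), t)
product_of_mgfs = mgf(die, t) * mgf(coin, t)
print("M_{X+Y}(t):", mgf_of_sum, " M_X(t) M_Y(t):", product_of_mgfs)

# Scaling: the MGF of aX at t equals the MGF of X at a*t.
a = 2
scaled_die = {a * x: p for x, p in die.items()}
print("M_{aX}(t):", mgf(scaled_die, t), " M_X(at):", mgf(die, a * t))
```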

Important examples (with $q=1-p$):

| Distribution | MGF |
| --- | --- |
| Bernoulli$(p)$ | $q+pe^t$ |
| Binomial$(n,p)$ | $(q+pe^t)^n$ |
| Poisson$(\lambda)$ | $\exp(\lambda(e^t-1))$ |
| Normal$(\mu,\sigma^2)$ | $\exp(\mu t+\sigma^2t^2/2)$ |
| Exponential$(\lambda)$ | $\lambda/(\lambda-t)$ for $t\lt \lambda$ |

Lévy's continuity theorem, in the form used in the lectures, says that if the characteristic functions $\varphi_{X_n}(t)$ converge for every $t$ to the characteristic function $\varphi_X(t)$ of some random variable $X$, then $X_n$ converges in distribution to $X$. This makes characteristic functions a rigorous route to limit theorems.
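One way to see the continuity theorem at work is the classical Poisson approximation to the binomial, used here purely as an illustration and not taken from the lectures. The sketch assumes Binomial$(n,\lambda/n)$ variables with $\lambda=2$ and evaluates both characteristic functions at one arbitrary point $t$; the gap shrinks as $n$ grows.

```python
import cmath

def binomial_cf(n, p, t):
    """Characteristic function of Binomial(n, p): (q + p e^{it})^n."""
    return (1 - p + p * cmath.exp(1j * t)) ** n

def poisson_cf(lam, t):
    """Characteristic function of Poisson(lam): exp(lam (e^{it} - 1))."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

lam, t = 2.0, 1.3  # illustrative values
sizes = (10, 100, 1000, 10000)
gaps = [abs(binomial_cf(n, lam / n, t) - poisson_cf(lam, t)) for n in sizes]

for n, gap in zip(sizes, gaps):
    print(f"n={n:6d}  |phi_binomial - phi_poisson| = {gap:.2e}")
```

Pointwise convergence at a single $t$ is of course not a proof; the theorem needs convergence at every $t$ and a limit that is itself a characteristic function.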

Transforms are useful because they convert hard operations into easier ones. Convolution of densities becomes multiplication of transforms. Scaling a random variable becomes rescaling the transform argument. Moments become derivatives at zero. These rules allow one to prove distributional identities without performing difficult integrals directly.

The MGF may fail to be finite in two different ways. It may be infinite for all nonzero $t$, as with very heavy-tailed distributions, or it may be finite only on part of the real line. For an exponential random variable with rate $\lambda$, $M(t)=\lambda/(\lambda-t)$ exists only for $t\lt \lambda$. That restricted domain is still enough for many calculations, but one must not plug in arbitrary $t$ values.
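The restricted domain of the exponential MGF can be made concrete with a rough numerical sketch. It assumes rate $\lambda=2$ and approximates the truncated integral $\int_0^L e^{tx}\lambda e^{-\lambda x}\,dx$ with a midpoint rule; the helper names and the choice of returning `inf` outside the domain are conventions of this example only.

```python
from math import exp, inf

def exponential_mgf(lam, t):
    """lam / (lam - t) for t < lam; inf where the expectation diverges."""
    if lam <= 0:
        raise ValueError("rate must be positive")
    return lam / (lam - t) if t < lam else inf

def truncated_integral(lam, t, upper, steps=20000):
    """Midpoint-rule approximation of the integral of e^{tx} lam e^{-lam x} on [0, upper]."""
    dx = upper / steps
    return sum(lam * exp((t - lam) * (k + 0.5) * dx) for k in range(steps)) * dx

lam = 2.0  # illustrative rate
inside = truncated_integral(lam, 1.0, 40.0)
print("t=1.0 (inside domain):", inside, "closed form:", exponential_mgf(lam, 1.0))

# For t >= lam the truncated integral keeps growing with the cutoff.
outside = [truncated_integral(lam, 2.5, upper) for upper in (10.0, 20.0, 40.0)]
print("t=2.5 (outside domain):", outside, "closed form:", exponential_mgf(lam, 2.5))
```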

Characteristic functions avoid this integrability problem because $e^{itX}$ has absolute value $1$. They can be complex-valued, but their real and imaginary parts are simply expectations of cosine and sine:

$$\varphi_X(t)=E[\cos(tX)]+iE[\sin(tX)].$$

This bounded oscillatory structure is why characteristic functions are always defined and why they are closely related to Fourier analysis.
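The cosine and sine decomposition is easy to verify for a small discrete distribution. The pmf below is an arbitrary illustrative choice, and `characteristic_function` is a helper written for this sketch.

```python
import cmath
from math import cos, sin

def characteristic_function(pmf, t):
    """E[e^{itX}] for a finite discrete distribution given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

pmf = {-1: 0.2, 0: 0.5, 3: 0.3}  # illustrative distribution
t = 0.7

phi = characteristic_function(pmf, t)
real_part = sum(p * cos(t * x) for x, p in pmf.items())  # E[cos(tX)]
imag_part = sum(p * sin(t * x) for x, p in pmf.items())  # E[sin(tX)]

print("phi(t):", phi)
print("E[cos(tX)] + i E[sin(tX)]:", complex(real_part, imag_part))
print("|phi(t)|:", abs(phi))  # never exceeds 1
```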

For integer-valued random variables, characteristic functions contain periodic information. Since $e^{i2\pi X}=1$ whenever $X$ is an integer, $\varphi_X(2\pi)=1$. More generally, the pattern of $\varphi_X(t)$ reflects lattice structure in the distribution. This is one reason characteristic functions are more than a technical proof device.
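A brief sketch of the lattice effect, assuming a Binomial$(6, 0.3)$ variable $X$ as an illustration. For $X$ itself the characteristic function returns to $1$ at $2\pi$; for $X/2$, which lives on the half-integers, it returns to $1$ only at $4\pi$.

```python
import cmath
from math import comb, pi

def binomial_pmf(n, p):
    """pmf of Binomial(n, p) as {k: probability}."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def cf(pmf, t):
    """E[e^{itX}] for a finite discrete distribution."""
    return sum(w * cmath.exp(1j * t * x) for x, w in pmf.items())

pmf = binomial_pmf(6, 0.3)                 # integer lattice
half = {k / 2: w for k, w in pmf.items()}  # X/2, lattice of spacing 1/2

at_2pi = cf(pmf, 2 * pi)
half_at_2pi = cf(half, 2 * pi)  # equals phi_X(pi) = (q - p)^6
half_at_4pi = cf(half, 4 * pi)

print("phi_X(2 pi):", at_2pi)
print("phi_{X/2}(2 pi):", half_at_2pi)
print("phi_{X/2}(4 pi):", half_at_4pi)
```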

In limit theorem proofs, the logarithm of a transform often exposes the first few moments. If $E[X]=0$ and $\operatorname{Var}(X)=1$, then near zero the transform behaves like $1+t^2/2$ for MGFs or $1-t^2/2$ for characteristic functions. Raising this expression to the $n$th power at argument $t/\sqrt n$ produces an exponential limit, $e^{t^2/2}$ or $e^{-t^2/2}$ respectively, which is the transform of the standard normal.
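The limit can be watched numerically for one concrete variable. The sketch assumes $X=\pm 1$ with probability $1/2$ each (mean $0$, variance $1$), for which $\varphi_X(t)=\cos t$ and $M_X(t)=\cosh t$; this choice is an illustration, not part of the lectures' argument.

```python
from math import cos, cosh, exp, sqrt

t = 1.5  # illustrative argument
results = []
for n in (1, 10, 100, 10000):
    s = t / sqrt(n)
    # Transform of (X_1 + ... + X_n) / sqrt(n) is the single transform at s, raised to n.
    results.append((n, cos(s) ** n, cosh(s) ** n))

for n, cf_n, mgf_n in results:
    print(f"n={n:6d}  cf={cf_n:.6f}  mgf={mgf_n:.6f}")

print("normal cf  exp(-t^2/2):", exp(-t * t / 2))
print("normal mgf exp(+t^2/2):", exp(t * t / 2))
```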

Visual

| Feature | MGF $M_X(t)$ | Characteristic function $\varphi_X(t)$ |
| --- | --- | --- |
| Definition | $E[e^{tX}]$ | $E[e^{itX}]$ |
| Always exists | no | yes |
| Moments by derivatives | direct when finite | with powers of $i$ |
| Independent sums | products | products |
| Limit theorem use | useful when it exists near zero | more general |
| Heavy-tail behavior | may fail | still defined |

The contrast in the table is the practical reason for learning both transforms. MGFs are friendlier in elementary calculations because derivatives at zero have no complex constants, and common distributions have simple MGFs. Characteristic functions require complex notation, but they work for every probability distribution. The later lectures use this extra generality to prove limit theorems under hypotheses where MGFs might not exist.

Transform uniqueness is the background principle. A distribution is determined by its characteristic function, and an MGF determines the distribution when it is finite in a neighborhood of zero. Therefore, showing two random variables have the same transform is a legitimate way to show they have the same distribution. This is what happens when proving that sums of independent Poisson variables remain Poisson.

One should still keep transforms connected to probability. A transform is not just an algebraic gadget; it is an expectation of a function of $X$. Its value depends on the whole distribution, weighting every possible outcome by an exponential or oscillatory factor.

When using transforms, always record the interval of $t$ values where the calculation is valid. Two expressions that agree only outside the domain of an MGF do not prove anything about the distribution. Characteristic functions avoid this particular domain issue, but they require tracking complex arithmetic carefully. In proofs, this bookkeeping is part of the argument, not a cosmetic detail. It is also where many otherwise plausible transform solutions fail. Always state the transform and its domain together before comparing formulas.

Worked example 1: binomial MGF from Bernoulli trials

Problem: Find the MGF of a binomial$(n,p)$ random variable and use it to compute the mean.

Method:

  1. For a Bernoulli$(p)$ variable $B$, with $q=1-p$,
$$M_B(t)=E[e^{tB}] =e^0P(B=0)+e^tP(B=1) =q+pe^t.$$
  2. If $X\sim\operatorname{Binomial}(n,p)$, write
$$X=B_1+\cdots+B_n$$

with independent Bernoulli$(p)$ variables.

  3. Therefore
$$M_X(t)=\prod_{j=1}^n M_{B_j}(t) =(q+pe^t)^n.$$
  4. Differentiate:
$$M_X'(t)=n(q+pe^t)^{n-1}pe^t.$$
  5. Evaluate at $t=0$, using $q+p=1$:
$$M_X'(0)=n(q+p)^{n-1}p=np.$$

Checked answer: the transform method agrees with the indicator decomposition result $E[X]=np$.
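As a possible extension of the same method (not part of the original problem), differentiating once more gives the second moment and hence the variance. With $q=1-p$:

```latex
\begin{aligned}
M_X''(t) &= n(n-1)(q+pe^t)^{n-2}p^2e^{2t} + n(q+pe^t)^{n-1}pe^t,\\
E[X^2] = M_X''(0) &= n(n-1)p^2 + np,\\
\operatorname{Var}(X) &= E[X^2]-E[X]^2 = n(n-1)p^2 + np - n^2p^2 = np(1-p) = npq.
\end{aligned}
```

This agrees with the familiar binomial variance $npq$.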

Worked example 2: sum of independent Poisson variables

Problem: Use MGFs to show that if $X\sim\operatorname{Poisson}(\lambda_1)$ and $Y\sim\operatorname{Poisson}(\lambda_2)$ are independent, then $X+Y$ is Poisson with parameter $\lambda_1+\lambda_2$.

Method:

  1. The Poisson MGF is
$$M_X(t)=\exp(\lambda_1(e^t-1)), \qquad M_Y(t)=\exp(\lambda_2(e^t-1)).$$
  2. For $Z=X+Y$, independence gives
$$M_Z(t)=M_X(t)M_Y(t).$$
  3. Multiply:
$$\begin{aligned} M_Z(t) &=\exp(\lambda_1(e^t-1))\exp(\lambda_2(e^t-1))\\ &=\exp((\lambda_1+\lambda_2)(e^t-1)). \end{aligned}$$
  4. This is exactly the MGF of a Poisson random variable with parameter $\lambda_1+\lambda_2$.

Checked answer: the result matches the Poisson process interpretation that independent event streams add their rates.
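The conclusion can also be cross-checked without transforms by convolving the two mass functions directly. The sketch below assumes rates $2$ and $5$ (arbitrary illustrative values) and compares the convolution with the Poisson$(\lambda_1+\lambda_2)$ pmf on the first 30 values.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def sum_pmf(lam1, lam2, k):
    """P(X + Y = k) by direct convolution, for independent Poisson X and Y."""
    return sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j) for j in range(k + 1))

lam1, lam2 = 2.0, 5.0  # illustrative rates
max_gap = max(
    abs(sum_pmf(lam1, lam2, k) - poisson_pmf(lam1 + lam2, k)) for k in range(30)
)
print("largest pmf discrepancy for k < 30:", max_gap)
```

The convolution needs a sum for every value of $k$, while the transform argument is a single multiplication, which is the practical point of the worked example.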

Code

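from math import exp

def bernoulli_mgf(p, t):
    # MGF of Bernoulli(p): q + p e^t with q = 1 - p.
    return (1 - p) + p * exp(t)

def binomial_mgf(n, p, t):
    # A binomial is a sum of n independent Bernoulli(p) variables, so the MGFs multiply.
    return bernoulli_mgf(p, t) ** n

def poisson_mgf(lam, t):
    # MGF of Poisson(lam): exp(lam (e^t - 1)).
    return exp(lam * (exp(t) - 1))

# Central-difference approximation of M_X'(0) = E[X] for Binomial(n, p).
p = 0.3
n = 10
h = 1e-5
numerical_mean = (binomial_mgf(n, p, h) - binomial_mgf(n, p, -h)) / (2 * h)
print("numerical binomial mean:", numerical_mean)
print("exact binomial mean:", n * p)

# Product of two independent Poisson MGFs versus the MGF of Poisson(lam1 + lam2).
lam1, lam2, t = 2.0, 5.0, 0.4
product = poisson_mgf(lam1, t) * poisson_mgf(lam2, t)
combined = poisson_mgf(lam1 + lam2, t)
print("Poisson MGF product equals combined:", product, combined)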

Common pitfalls

  • Assuming an MGF exists for every distribution. Heavy-tailed laws such as the Cauchy distribution have no finite MGF at any nonzero $t$.
  • Forgetting independence when multiplying transforms of sums.
  • Confusing $M_X(t)$ and $\varphi_X(t)$. Characteristic functions use $e^{itX}$ and may be complex-valued.
  • Thinking equality of a few moments determines a distribution. A transform, when valid in the required sense, carries much more information.
  • Applying a continuity theorem without checking that the limiting function is the transform of a probability law.
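The last pitfall can be illustrated with a standard counterexample, offered here as a sketch rather than something from the lectures. Assume $X_n\sim\operatorname{Normal}(0,n)$, so $\varphi_{X_n}(t)=e^{-nt^2/2}$. The pointwise limit is $1$ at $t=0$ and $0$ elsewhere, which is discontinuous at zero and therefore not the characteristic function of any probability law; the mass spreads out to infinity instead of converging.

```python
from math import exp

def normal_cf(variance, t):
    """Characteristic function of Normal(0, variance): exp(-variance * t^2 / 2)."""
    return exp(-variance * t * t / 2.0)

variances = (1, 10, 100, 1000)
for t in (0.0, 0.1, 1.0):
    values = [normal_cf(v, t) for v in variances]
    print(f"t={t}:", ["%.3e" % v for v in values])
```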

Connections