Random Variables and Distributions

A random variable converts outcomes into numbers. Instead of listing every detailed outcome in the sample space, we often care about a numerical summary: the number of heads, the waiting time until a machine fails, the maximum of several measurements, or whether a test result is positive. Distributions describe how probability is assigned to those numerical values.

A Galton box diagram shows balls falling through pegs into bins.

Figure: A Galton box turns repeated random left-right choices into an approximate bell-shaped distribution. Image: Wikimedia Commons, Marcin Floryan, CC BY-SA 3.0.

This page is the bridge between event-based probability and the distribution-centered language used in statistics, simulation, and modeling. Lane et al. use probability distributions for binomial, Poisson, normal, and sampling-distribution examples. Here we build the general vocabulary: random variables, probability mass functions, probability density functions, and cumulative distribution functions.

Definitions

A random variable is a function

$$X:\Omega\to \mathbb{R}$$

that assigns a real number to each outcome. The word "variable" is traditional, but formally $X$ is a function.

A random variable is discrete if its possible values are finite or countably infinite. Its probability mass function (PMF) is

$$p_X(x)=P(X=x).$$

The PMF satisfies

$$p_X(x)\ge 0,\quad \sum_x p_X(x)=1.$$
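The two PMF conditions can be checked mechanically. The sketch below uses an illustrative PMF, the number of heads in two fair coin tosses, which is an assumption chosen for the example; the helper name `is_valid_pmf` is hypothetical.

```python
from fractions import Fraction

# Illustrative PMF: X = number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def is_valid_pmf(p):
    """Return True if every mass is nonnegative and the masses sum to exactly 1."""
    return all(mass >= 0 for mass in p.values()) and sum(p.values()) == 1

print(is_valid_pmf(pmf))                                      # True
print(is_valid_pmf({0: Fraction(1, 2), 1: Fraction(1, 3)}))   # False: masses sum to 5/6
```

Exact fractions avoid the rounding issues that can make a floating-point sum differ slightly from one.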

A random variable is continuous if probabilities are described by a probability density function (PDF) $f_X(x)$ such that

$$P(a\le X\le b)=\int_a^b f_X(x)\,dx.$$

The PDF satisfies

$$f_X(x)\ge 0,\quad \int_{-\infty}^{\infty} f_X(x)\,dx=1.$$

The cumulative distribution function (CDF) is defined for every real random variable by

$$F_X(x)=P(X\le x).$$

The CDF is nondecreasing, right-continuous, and satisfies

$$\lim_{x\to -\infty}F_X(x)=0,\quad \lim_{x\to \infty}F_X(x)=1.$$

For a continuous random variable with density $f_X$,

$$F_X(x)=\int_{-\infty}^{x} f_X(t)\,dt,$$

and, where differentiable,

$$f_X(x)=F_X'(x).$$
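The two relations between a density and its CDF can be checked numerically. The sketch below assumes an exponential distribution with a hypothetical rate `lam = 2.0`; the helper `integrate` is a plain midpoint Riemann sum written for this example, not a library routine.

```python
import math

lam = 2.0  # hypothetical rate, chosen only for illustration

def pdf(x):
    """Exponential density: lam * exp(-lam * x) on x >= 0, zero elsewhere."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    """Exponential CDF: 1 - exp(-lam * x) on x >= 0, zero elsewhere."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

x = 0.7
area = integrate(pdf, 0.0, x)                  # approximates F_X(x)
h = 1e-5
slope = (cdf(x + h) - cdf(x - h)) / (2 * h)    # approximates f_X(x)

print(abs(area - cdf(x)) < 1e-8)
print(abs(slope - pdf(x)) < 1e-6)
```

The integral starts at $0$ rather than $-\infty$ because this density is zero on the negative axis.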

The support of a distribution is the set of values where the PMF or PDF is positive. A parameter is a fixed constant that indexes a family of distributions, such as the success probability $p$ in a Bernoulli distribution or the rate $\lambda$ in an exponential distribution.

Key results

The CDF is the most universal way to describe a distribution because it works for discrete, continuous, and mixed random variables. Probability over intervals can be recovered from it:

$$P(a<X\le b)=F_X(b)-F_X(a).$$

For continuous random variables, endpoint choices do not matter because

$$P(X=a)=\int_a^a f_X(x)\,dx=0.$$

For discrete random variables, endpoint choices matter because points can have positive probability.

If $X$ is discrete with PMF $p_X$, then its CDF is a step function:

$$F_X(x)=\sum_{t\le x}p_X(t).$$
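A short sketch of the step-function CDF and the interval formula, assuming a fair six-sided die as the illustrative discrete variable. It also shows how including the left endpoint changes the answer by the point mass there.

```python
from fractions import Fraction

# Illustrative PMF: a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): add up the masses at support points t <= x."""
    return sum((mass for t, mass in pmf.items() if t <= x), Fraction(0))

# The CDF is flat between support points and jumps at each one.
print(cdf(2), cdf(2.9), cdf(3))      # 1/3 1/3 1/2

# P(a < X <= b) = F(b) - F(a); a closed left endpoint adds the mass at a.
a, b = 2, 4
p_half_open = cdf(b) - cdf(a)        # P(2 < X <= 4)
p_closed = p_half_open + pmf[a]      # P(2 <= X <= 4)
print(p_half_open, p_closed)         # 1/3 1/2
```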

If $X$ is continuous with PDF $f_X$, then probabilities are areas under the curve:

$$P(a\le X\le b)=\int_a^b f_X(x)\,dx.$$

Do not interpret density height as probability. A density can exceed $1$ on a short interval; the area, not the height, is the probability.
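As a concrete illustration, a uniform density on $[0, 0.5]$ has height $2$ everywhere on its support, yet its total area is still $1$. The distribution and the midpoint-sum helper `integrate` are assumptions made for this sketch.

```python
def pdf(x):
    """Density of a uniform distribution on [0, 0.5] (illustrative choice)."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

print(pdf(0.25))                    # height 2.0, which is not a probability
print(integrate(pdf, 0.0, 0.5))     # total area, approximately 1
print(integrate(pdf, 0.1, 0.2))     # P(0.1 <= X <= 0.2), approximately 0.2
```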

A quantile is an inverse-CDF value. A number $q_\alpha$ is an $\alpha$-quantile if

$$P(X\le q_\alpha)\ge \alpha$$

and

$$P(X\ge q_\alpha)\ge 1-\alpha.$$

For continuous strictly increasing CDFs, this is simply

$$q_\alpha=F_X^{-1}(\alpha).$$
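Both forms of the quantile definition can be sketched in a few lines. The continuous part assumes an exponential distribution with a hypothetical rate `lam`, whose CDF $1-e^{-\lambda x}$ inverts in closed form. The discrete part assumes a fair die and returns the smallest value whose CDF reaches $\alpha$, which is one valid $\alpha$-quantile but not necessarily the only one. The function names are hypothetical.

```python
import math
from fractions import Fraction

# Continuous case: exponential with hypothetical rate lam, F(x) = 1 - exp(-lam * x).
lam = 2.0

def exp_quantile(alpha):
    """Solve F(q) = alpha for q; valid for 0 <= alpha < 1."""
    return -math.log(1.0 - alpha) / lam

median = exp_quantile(0.5)
print(median, 1.0 - math.exp(-lam * median))   # second value is approximately 0.5

# Discrete case: fair die. Return the smallest x with F(x) >= alpha.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def die_quantile(alpha):
    total = Fraction(0)
    for x in sorted(pmf):
        total += pmf[x]
        if total >= alpha:
            return x
    raise ValueError("alpha must be at most 1")

q = die_quantile(Fraction(1, 2))
p_le = sum(m for x, m in pmf.items() if x <= q)   # P(X <= q)
p_ge = sum(m for x, m in pmf.items() if x >= q)   # P(X >= q)
print(q, p_le >= Fraction(1, 2), p_ge >= Fraction(1, 2))   # 3 True True
```

For the die, every number in $[3,4]$ satisfies both inequalities at $\alpha=1/2$, so the median is not unique.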

Two random variables can have the same distribution even if they are defined on different sample spaces. For instance, the indicator of heads in a coin toss and the indicator of drawing a red card from a balanced red/black deck are different functions on different experiments, but both can be Bernoulli$(1/2)$. Distributional statements describe probabilities of values, not the underlying physical mechanism.

Random variables can also be mixed. A delivery time might have a positive probability of being exactly $0$ if an item is already available, plus a continuous density over positive waiting times if it must be shipped. Such variables are neither purely discrete nor purely continuous, but their CDF still works. This is one reason the CDF is more fundamental than either a PMF or a PDF.
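A minimal sketch of such a mixed CDF, assuming a point mass `p0` at zero and an exponential waiting time with rate `lam` otherwise. Both numbers are hypothetical and chosen only to make the jump visible.

```python
import math

p0 = 0.3    # hypothetical probability that the item is already available
lam = 0.5   # hypothetical rate for the waiting time when the item must be shipped

def cdf(x):
    """Mixed CDF: a jump of size p0 at 0, then a smooth exponential rise."""
    if x < 0:
        return 0.0
    return p0 + (1.0 - p0) * (1.0 - math.exp(-lam * x))

jump_at_zero = cdf(0.0) - cdf(-1e-12)   # recovers the point mass P(X = 0)
print(jump_at_zero)                      # 0.3
print(cdf(2.0) - cdf(0.0))               # P(0 < X <= 2), from the continuous part
```

No single PMF or PDF describes this variable, but the one function `cdf` answers every interval question about it.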

When modeling, the support is as important as the formula. A normal distribution may approximate adult heights well near the center but still gives tiny probability to negative heights. A beta distribution is often better for proportions because its support is exactly $(0,1)$. A Poisson distribution can count events but cannot model negative values or fixed upper limits. Matching support prevents many silent modeling errors.
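The size of the support mismatch can be computed directly. The sketch below uses hypothetical parameters: a height model with mean 170 cm and standard deviation 10 cm, and a proportion model with mean 0.9 and standard deviation 0.1. The normal CDF is written with `math.erfc` so that very small tail probabilities do not round to zero.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution, via the complementary error function."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Hypothetical height model in centimetres: the mass below 0 is positive but negligible.
print(normal_cdf(0.0, mu=170.0, sigma=10.0))

# Hypothetical proportion model: a normal centred at 0.9 puts real mass above 1.
print(1.0 - normal_cdf(1.0, mu=0.9, sigma=0.1))   # about 0.159
```

In the first case the mismatch is harmless in practice; in the second, roughly one draw in six would be an impossible proportion.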

CDFs also make inequalities precise. The event $X\le x$ is always meaningful, but phrases such as "around $x$" or "near $x$" require an interval. In continuous models, changing from $\lt$ to $\le$ makes no difference; in discrete models, it can change the answer by a point mass. When using software, check whether a function returns $P(X\le x)$, $P(X\lt x)$, $P(X\ge x)$, or $P(X\gt x)$.
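The four conventions give four different numbers in a discrete model. This sketch assumes a fair die and evaluates each at $x=3$; the helper `prob` is hypothetical.

```python
from fractions import Fraction

# Illustrative PMF: a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def prob(predicate):
    """Total mass of the support points that satisfy the predicate."""
    return sum((m for v, m in pmf.items() if predicate(v)), Fraction(0))

x = 3
print("P(X <= x) =", prob(lambda v: v <= x))   # 1/2
print("P(X <  x) =", prob(lambda v: v < x))    # 1/3
print("P(X >= x) =", prob(lambda v: v >= x))   # 2/3
print("P(X >  x) =", prob(lambda v: v > x))    # 1/2
```

As one example of a library convention, SciPy's distribution objects use `cdf` for $P(X\le x)$ and `sf` for $P(X\gt x)$, so $P(X\ge x)$ for an integer-valued variable has to be requested as `sf(x - 1)`.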

Visual

| Feature | Discrete random variable | Continuous random variable |
| --- | --- | --- |
| Probability object | PMF $p_X(x)=P(X=x)$ | PDF $f_X(x)$ |
| Total probability | $\sum_x p_X(x)=1$ | $\int f_X(x)\,dx=1$ |
| Point probability | can be positive | always $P(X=x)=0$ |
| Interval probability | sum masses | integrate density |
| CDF shape | step function | often smooth curve |
| Example | number of heads | waiting time |

Worked example 1: distribution of the sum of two dice

Problem. Let $X$ be the sum of two fair six-sided dice. Find the PMF and compute $P(5\le X\le 9)$.

Method.

  1. The sample space has $36$ equally likely ordered outcomes $(i,j)$.

  2. Count outcomes by sum:

     | $x$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
     | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
     | count | 1 | 2 | 3 | 4 | 5 | 6 | 5 | 4 | 3 | 2 | 1 |

  3. Convert counts to probabilities:

     $$p_X(x)=\frac{\text{count at sum }x}{36}.$$

  4. The event $5\le X\le 9$ includes sums $5,6,7,8,9$, so

     $$\begin{aligned} P(5\le X\le 9) &=\frac{4+5+6+5+4}{36}\\ &=\frac{24}{36}\\ &=\frac{2}{3}. \end{aligned}$$

  5. Check by complement. The excluded sums are $2,3,4,10,11,12$, with counts $1+2+3+3+2+1=12$. Thus the desired probability is $1-12/36=2/3$.

Checked answer. $P(5\le X\le 9)=2/3$. The random variable compresses $36$ outcomes into $11$ possible numerical values.

Worked example 2: a continuous density

Problem. Suppose $X$ has density

$$f_X(x)= \begin{cases} 2x, & 0\le x\le 1,\\ 0, & \text{otherwise}. \end{cases}$$

Find $F_X(x)$ and compute $P(0.25\le X\le 0.75)$.

Method.

  1. First verify it is a density:

     $$\int_0^1 2x\,dx = \left[x^2\right]_0^1=1.$$

  2. For $x\lt 0$, no mass has accumulated:

     $$F_X(x)=0.$$

  3. For $0\le x\le 1$,

     $$F_X(x)=\int_0^x 2t\,dt=x^2.$$

  4. For $x\gt 1$, all mass has accumulated:

     $$F_X(x)=1.$$

  5. Now compute the interval probability:

     $$\begin{aligned} P(0.25\le X\le 0.75) &=F_X(0.75)-F_X(0.25)\\ &=(0.75)^2-(0.25)^2\\ &=0.5625-0.0625\\ &=0.5. \end{aligned}$$

  6. Check by integration:

     $$\int_{0.25}^{0.75}2x\,dx =\left[x^2\right]_{0.25}^{0.75} =0.5.$$

Checked answer. $F_X(x)=0$ for $x\lt 0$, $F_X(x)=x^2$ for $0\le x\le 1$, and $F_X(x)=1$ for $x\gt 1$. The interval probability is $0.5$.
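The answer can also be checked by simulation. The sketch below is not part of the worked method: it assumes the inverse-CDF idea, that if $U$ is uniform on $[0,1)$ then $\sqrt{U}$ has CDF $x^2$ on $[0,1]$, because $P(\sqrt{U}\le x)=P(U\le x^2)=x^2$. The seed and sample size are arbitrary choices.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 200_000              # arbitrary sample size

# Draw samples whose CDF is F(x) = x^2 on [0, 1].
samples = [math.sqrt(rng.random()) for _ in range(n)]

# Fraction of samples that land in [0.25, 0.75].
estimate = sum(0.25 <= s <= 0.75 for s in samples) / n
print(estimate)          # should be close to 0.5
```

With this many samples the standard error is about $0.001$, so an estimate within a few thousandths of $0.5$ is consistent with the exact answer.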

Code

import numpy as np

# PMF for the sum of two dice: enumerate all 36 ordered outcomes.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
sums = np.array([i + j for i, j in outcomes])

# Count how many outcomes give each sum, then normalize to probabilities.
values, counts = np.unique(sums, return_counts=True)
pmf = counts / counts.sum()

for value, probability in zip(values, pmf):
    print(f"P(X={value}) = {probability:.4f}")

# Add the masses at the sums 5, 6, 7, 8, 9.
mask = (values >= 5) & (values <= 9)
print("P(5 <= X <= 9) =", pmf[mask].sum())

# Continuous example: F(x) = x^2 on [0, 1], 0 below and 1 above.
def cdf(x):
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, 0.0, np.where(x <= 1, x**2, 1.0))

print("P(0.25 <= X <= 0.75) =", cdf(0.75) - cdf(0.25))

Common pitfalls

  • Forgetting that a random variable is a function on outcomes, not the outcome itself.
  • Treating a density value $f_X(x)$ as if it were $P(X=x)$.
  • Ignoring support. Formulas such as $2x$ are densities only on the interval where they are defined.
  • Assuming endpoint choices never matter. They do not matter for continuous distributions, but they do matter for discrete distributions.
  • Using a PDF when the problem gives a CDF, or differentiating a discrete CDF as if it were smooth.
  • Failing to check that probabilities or densities normalize to one.

Connections