Random Variables and Distributions
A random variable converts outcomes into numbers. Instead of listing every detailed outcome in the sample space, we often care about a numerical summary: the number of heads, the waiting time until a machine fails, the maximum of several measurements, or whether a test result is positive. Distributions describe how probability is assigned to those numerical values.
Figure: A Galton box turns repeated random left-right choices into an approximate bell-shaped distribution. Image: Wikimedia Commons, Marcin Floryan, CC BY-SA 3.0.
This page is the bridge between event-based probability and the distribution-centered language used in statistics, simulation, and modeling. Lane et al. use probability distributions for binomial, Poisson, normal, and sampling-distribution examples. Here we build the general vocabulary: random variables, probability mass functions, probability density functions, and cumulative distribution functions.
Definitions
A random variable is a function
$$X : \Omega \to \mathbb{R}$$
that assigns a real number to each outcome in the sample space $\Omega$. The word "variable" is traditional, but formally $X$ is a function.
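A random variable is discrete if its possible values are finite or countably infinite. Its probability mass function (PMF) is
$$p_X(x) = P(X = x).$$
The PMF satisfies
$$p_X(x) \ge 0 \quad \text{and} \quad \sum_x p_X(x) = 1,$$
where the sum runs over the possible values of $X$.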
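A random variable is continuous if probabilities are described by a probability density function (PDF) $f_X$ such that
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx \quad \text{for all } a \le b.$$
The PDF satisfies
$$f_X(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$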
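The cumulative distribution function (CDF) is defined for every real random variable by
$$F_X(x) = P(X \le x).$$
The CDF is nondecreasing, right-continuous, and satisfies
$$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F_X(x) = 1.$$
For a continuous random variable with density $f_X$,
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$
and, where differentiable,
$$F_X'(x) = f_X(x).$$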
The support of a distribution is the set of values where the PMF or PDF is positive. A parameter is a fixed constant that indexes a family of distributions, such as the success probability in a Bernoulli distribution or the rate in an exponential distribution.
Key results
The CDF is the most universal way to describe a distribution because it works for discrete, continuous, and mixed random variables. Probability over intervals can be recovered from it:
$$P(a < X \le b) = F_X(b) - F_X(a).$$
For continuous random variables, endpoint choices do not matter because
$$P(X = x) = 0 \quad \text{for every } x.$$
For discrete random variables, endpoint choices matter because points can have positive probability.
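The endpoint effect can be checked with exact arithmetic. The sketch below is only an illustration: it assumes a single fair six-sided die (an example chosen here, not taken from the text) and uses the rule that the CDF difference F(b) - F(a) gives P(a < X <= b), so including the left endpoint means adding back the point mass at a.

```python
from fractions import Fraction

# Illustrative PMF: one fair six-sided die (an assumption for this sketch).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(x):
    """F(x) = P(X <= x): add up the mass at every value not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)


a, b = 2, 4
half_open = cdf(b) - cdf(a)   # P(a < X <= b)
closed = half_open + pmf[a]   # P(a <= X <= b) adds back the mass at a

print("P(2 < X <= 4)  =", half_open)  # 1/3
print("P(2 <= X <= 4) =", closed)     # 1/2
```

The two answers differ by exactly the point mass at 2, which is 1/6.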
If $X$ is discrete with PMF $p_X$, then its CDF is a step function:
$$F_X(x) = \sum_{t \le x} p_X(t).$$
If $X$ is continuous with PDF $f_X$, then probabilities are areas under the curve:
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx.$$
Do not interpret density height as probability. A density can exceed $1$ on a short interval; the area, not the height, is the probability.
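A small numerical check makes the height-versus-area distinction concrete. This sketch assumes a uniform density on the interval from 0 to 0.5 (a hypothetical example, not one from the text), whose height is 2 everywhere on its support. The helper `area` is a plain midpoint-rule approximation written for this illustration.

```python
def density(x):
    """Uniform density on [0, 0.5] (hypothetical example): height 2 on the support."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0


def area(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


height = density(0.25)           # a density value, not a probability
total = area(density, 0.0, 0.5)  # total probability, approximately 1
part = area(density, 0.1, 0.2)   # P(0.1 <= X <= 0.2), approximately 0.2

print("density height at 0.25:", height)
print("total area:", round(total, 6))
print("P(0.1 <= X <= 0.2):", round(part, 6))
```

The height is 2, yet no interval receives probability greater than 1, because every probability is a height multiplied by a width of at most 0.5.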
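A quantile is an inverse-CDF value. A number $q_\alpha$ is an $\alpha$-quantile if
$$P(X \le q_\alpha) \ge \alpha$$
and
$$P(X \ge q_\alpha) \ge 1 - \alpha.$$
For continuous strictly increasing CDFs, this is simply
$$q_\alpha = F_X^{-1}(\alpha).$$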
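For a discrete distribution the CDF has jumps and flat stretches, so there is no ordinary inverse. One common convention, sketched below, takes the smallest value whose cumulative probability reaches $\alpha$. The PMF (number of heads in two fair coin tosses) and the helper names `quantile` and `is_quantile` are assumptions made for this illustration. The second helper checks the two defining inequalities directly: $P(X \le q) \ge \alpha$ and $P(X \ge q) \ge 1 - \alpha$.

```python
from fractions import Fraction

# Hypothetical PMF: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def quantile(pmf, alpha):
    """Return the smallest value q with P(X <= q) >= alpha."""
    cumulative = Fraction(0)
    for value in sorted(pmf):
        cumulative += pmf[value]
        if cumulative >= alpha:
            return value
    raise ValueError("alpha exceeds the total probability")


def is_quantile(pmf, q, alpha):
    """Check both defining inequalities for an alpha-quantile."""
    at_most = sum(p for v, p in pmf.items() if v <= q)
    at_least = sum(p for v, p in pmf.items() if v >= q)
    return at_most >= alpha and at_least >= 1 - alpha


median = quantile(pmf, Fraction(1, 2))
lower_quartile = quantile(pmf, Fraction(1, 4))
print("median:", median)
print("lower quartile:", lower_quartile)
```

Quantiles of a discrete distribution need not be unique. In this example both 0 and 1 pass `is_quantile` for $\alpha = 1/4$, and the smallest-value convention picks 0.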
Two random variables can have the same distribution even if they are defined on different sample spaces. For instance, the indicator of heads in a coin toss and the indicator of drawing a red card from a balanced red/black deck are different functions on different experiments, but both can be Bernoulli. Distributional statements describe probabilities of values, not the underlying physical mechanism.
Random variables can also be mixed. A delivery time might have a positive probability of being exactly $0$ if an item is already available, plus a continuous density over positive waiting times if it must be shipped. Such variables are neither purely discrete nor purely continuous, but their CDF still works. This is one reason the CDF is more fundamental than either a PMF or a PDF.
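A mixed CDF can be written down directly. The sketch below assumes, purely for illustration, that an item is in stock with probability 0.3 and that shipped items have an exponential waiting time with rate 0.5 per day. Neither number nor the exponential shape comes from the text. The resulting CDF jumps at zero and then rises smoothly.

```python
import math

P_IN_STOCK = 0.3  # hypothetical point mass at a delivery time of exactly 0
RATE = 0.5        # hypothetical exponential rate for shipped items (per day)


def delivery_cdf(t):
    """P(T <= t) for the mixed delivery-time model assumed above."""
    if t < 0:
        return 0.0
    # Point mass at 0 plus the continuous part accumulated up to t.
    return P_IN_STOCK + (1.0 - P_IN_STOCK) * (1.0 - math.exp(-RATE * t))


jump_at_zero = delivery_cdf(0.0) - delivery_cdf(-1e-12)
within_two_days = delivery_cdf(2.0)

print("jump at t = 0:", jump_at_zero)
print("P(T <= 2):", round(within_two_days, 4))
```

The jump at zero equals the point mass, and no PMF or PDF alone could describe this distribution.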
When modeling, the support is as important as the formula. A normal distribution may approximate adult heights well near the center but still gives tiny probability to negative heights. A beta distribution is often better for proportions because its support is exactly $[0, 1]$. A Poisson distribution can count events but cannot model negative values or fixed upper limits. Matching support prevents many silent modeling errors.
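How much probability leaks outside the intended support depends on the parameters. The sketch below uses the normal tail formula written with the complementary error function from the standard library. The parameter values (heights with mean 170 cm and standard deviation 10 cm, and a proportion with mean 0.9 and standard deviation 0.1) are hypothetical choices for illustration only.

```python
import math


def normal_prob_below(x, mu, sigma):
    """P(X < x) for a normal(mu, sigma) variable, via the complementary error function."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2.0)))


def normal_prob_above(x, mu, sigma):
    """P(X > x) for a normal(mu, sigma) variable."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


# Hypothetical height model: the leak below zero is tiny but not zero.
negative_height = normal_prob_below(0.0, mu=170.0, sigma=10.0)

# Hypothetical proportion model: the leak above 1 is large.
proportion_over_one = normal_prob_above(1.0, mu=0.9, sigma=0.1)

print("P(height < 0):", negative_height)
print("P(proportion > 1):", round(proportion_over_one, 4))  # about 0.159
```

In the height model the mismatch is harmless in practice. In the proportion model roughly one sixth of the probability falls on impossible values.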
CDFs also make inequalities precise. The event $\{X \le x\}$ is always meaningful, but phrases such as "around $x$" or "near $x$" require an interval. In continuous models, changing from $<$ to $\le$ makes no difference; in discrete models, it can change the answer by a point mass. When using software, check whether a function returns $P(X \le x)$, $P(X < x)$, $P(X \ge x)$, or $P(X > x)$.
Visual
| Feature | Discrete random variable | Continuous random variable |
|---|---|---|
| Probability object | PMF $p_X(x)$ | PDF $f_X(x)$ |
| Total probability | $\sum_x p_X(x) = 1$ | $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$ |
| Point probability | can be positive | $0$ at every point |
| Interval probability | sum masses | integrate density |
| CDF shape | step function | often smooth curve |
| Example | number of heads | waiting time |
Worked example 1: distribution of the sum of two dice
Problem. Let $X$ be the sum of two fair six-sided dice. Find the PMF and compute $P(5 \le X \le 9)$.
Method.
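- The sample space has $36$ equally likely ordered outcomes $(i, j)$ with $i, j \in \{1, \dots, 6\}$.
- Count outcomes by sum:

| Sum | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 1 | 2 | 3 | 4 | 5 | 6 | 5 | 4 | 3 | 2 | 1 |

- Convert counts to probabilities: $p_X(s) = \text{count}(s)/36$ for $s = 2, \dots, 12$.
- The event includes sums $5, 6, 7, 8, 9$, so $P(5 \le X \le 9) = \frac{4 + 5 + 6 + 5 + 4}{36} = \frac{24}{36} = \frac{2}{3}$.
- Check by complement. The excluded sums are $2, 3, 4, 10, 11, 12$, with counts $1 + 2 + 3 + 3 + 2 + 1 = 12$. Thus the desired probability is $1 - \frac{12}{36} = \frac{2}{3}$.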
Checked answer. $P(5 \le X \le 9) = 2/3$. The random variable compresses $36$ outcomes into $11$ possible numerical values.
Worked example 2: a continuous density
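Problem. Suppose $X$ has density
$$f_X(x) = \begin{cases} 2x, & 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
Find $F_X$ and compute $P(0.25 \le X \le 0.75)$.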
Method.
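- First verify it is a density: $f_X(x) \ge 0$ everywhere and $\int_0^1 2x\,dx = \left[x^2\right]_0^1 = 1$.
- For $x < 0$, no mass has accumulated: $F_X(x) = 0$.
- For $0 \le x \le 1$, $F_X(x) = \int_0^x 2t\,dt = x^2$.
- For $x > 1$, all mass has accumulated: $F_X(x) = 1$.
- Now compute the interval probability: $P(0.25 \le X \le 0.75) = F_X(0.75) - F_X(0.25) = 0.5625 - 0.0625 = 0.5$.
- Check by integration: $\int_{0.25}^{0.75} 2x\,dx = \left[x^2\right]_{0.25}^{0.75} = 0.5625 - 0.0625 = 0.5$.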
Checked answer. $F_X(x) = 0$ for $x < 0$, $F_X(x) = x^2$ for $0 \le x \le 1$, and $F_X(x) = 1$ for $x > 1$. The interval probability is $0.5$.
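The interval probability can also be sanity-checked by simulation. The sketch below assumes the CDF is $F(x) = x^2$ on $[0, 1]$, as in the code comment further down, and uses inverse-transform sampling: if $U$ is uniform on $[0, 1)$, then $\sqrt{U}$ has CDF $x^2$. The seed and sample size are arbitrary choices for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 200_000

# Inverse-transform sampling: F(x) = x^2 on [0, 1] inverts to x = sqrt(u).
samples = [random.random() ** 0.5 for _ in range(n)]

# Fraction of draws landing in [0.25, 0.75]; the exact probability is 0.5.
inside = sum(0.25 <= x <= 0.75 for x in samples)
estimate = inside / n

print("simulated P(0.25 <= X <= 0.75):", round(estimate, 3))
```

The estimate should land close to the exact probability, with simulation error that shrinks like one over the square root of the sample size.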
Code
```python
import numpy as np

# PMF for the sum of two dice.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
sums = np.array([i + j for i, j in outcomes])
values, counts = np.unique(sums, return_counts=True)
pmf = counts / counts.sum()

for value, probability in zip(values, pmf):
    print(f"P(X={value}) = {probability:.4f}")

mask = (values >= 5) & (values <= 9)
print("P(5 <= X <= 9) =", pmf[mask].sum())


# Continuous example: F(x)=x^2 on [0,1].
def cdf(x):
    x = np.asarray(x)
    return np.where(x < 0, 0, np.where(x <= 1, x**2, 1))


print("P(0.25 <= X <= 0.75) =", cdf(0.75) - cdf(0.25))
```
Common pitfalls
- Forgetting that a random variable is a function on outcomes, not the outcome itself.
- Treating a density value $f_X(x)$ as if it were the probability $P(X = x)$.
- Ignoring support. Formulas such as $f_X(x) = 2x$ are densities only on the interval where they are defined.
- Assuming endpoint choices never matter. They do not matter for continuous distributions, but they do matter for discrete distributions.
- Using a PDF when the problem gives a CDF, or differentiating a discrete CDF as if it were smooth.
- Failing to check that probabilities or densities normalize to one.