Joint, Marginal, and Conditional Distributions

Most probability models involve more than one random variable. A student's study time and exam score, two components in a system, or the two coordinates of a random point are not separate stories; their relationship is often the point of the model. Joint distributions describe variables together. Marginal distributions describe one variable after ignoring the others. Conditional distributions describe one variable after another has been observed.

Figure: Probability trees make the conditioning structure in Bayes' theorem explicit. Image: Wikimedia Commons, Gnathan87, CC0 1.0.

This page generalizes conditional probability from events to random variables. It prepares for covariance, independence, transformations, and Markov chains, all of which depend on understanding what information is contained in a joint distribution.

Definitions

For discrete random variables $X$ and $Y$, the joint PMF is

$$p_{X,Y}(x,y)=P(X=x,Y=y).$$

The marginal PMFs are obtained by summing over the other variable:

$$p_X(x)=\sum_y p_{X,Y}(x,y), \qquad p_Y(y)=\sum_x p_{X,Y}(x,y).$$

If $p_Y(y)>0$, the conditional PMF of $X$ given $Y=y$ is

$$p_{X\mid Y}(x\mid y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}.$$

For continuous random variables, the joint PDF $f_{X,Y}(x,y)$ satisfies

$$P((X,Y)\in A)=\iint_A f_{X,Y}(x,y)\,dx\,dy.$$

The marginal PDFs are

$$f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \qquad f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$

If $f_Y(y)>0$, the conditional density is

$$f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

The joint CDF is

$$F_{X,Y}(x,y)=P(X\le x,Y\le y).$$
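
For a discrete pair on a finite grid, the joint CDF can be tabulated by accumulating the joint PMF along both axes. The following is a minimal sketch that reuses the 2x2 table from worked example 1; the names `joint` and `joint_cdf` are illustrative, and the row/column ordering (rows for increasing $x$, columns for increasing $y$) is an assumption of the sketch.

```python
import numpy as np

# Joint PMF with rows x = 0, 1 and columns y = 0, 1 (numbers from worked example 1).
joint = np.array([
    [0.20, 0.10],
    [0.30, 0.40],
])

# joint_cdf[i, j] = P(X <= x_i, Y <= y_j): accumulate down the rows, then across the columns.
joint_cdf = joint.cumsum(axis=0).cumsum(axis=1)

print(joint_cdf)
```

The bottom-right entry is the total probability, so it should equal 1 for any valid joint table.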

Key results

Factorization by conditioning.

For discrete variables,

$$p_{X,Y}(x,y)=p_{X\mid Y}(x\mid y)\,p_Y(y).$$

For continuous variables,

$$f_{X,Y}(x,y)=f_{X\mid Y}(x\mid y)\,f_Y(y).$$

This is the random-variable version of $P(A\cap B)=P(A\mid B)P(B)$.
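
The discrete factorization can be checked numerically: divide the joint table by the marginal of $Y$ to get the conditional PMFs, then multiply back. This sketch uses the table from worked example 1 and relies on NumPy broadcasting; the variable names are illustrative.

```python
import numpy as np

# Joint PMF with rows x = 0, 1 and columns y = 0, 1 (numbers from worked example 1).
joint = np.array([
    [0.20, 0.10],
    [0.30, 0.40],
])

p_y = joint.sum(axis=0)          # marginal of Y, one entry per column
cond_x_given_y = joint / p_y     # column j holds p_{X|Y}(x | y_j)

# Multiplying each conditional column by p_Y(y_j) should recover the joint table.
rebuilt = cond_x_given_y * p_y

print("conditional columns sum to:", cond_x_given_y.sum(axis=0))
print("factorization recovers joint:", np.allclose(rebuilt, joint))
```

Each column of `cond_x_given_y` is itself a PMF over $x$, which is why the column sums are 1.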

Independence. Random variables $X$ and $Y$ are independent if, for all real $x$ and $y$,

$$F_{X,Y}(x,y)=F_X(x)F_Y(y).$$

In the discrete case, this is equivalent to

$$p_{X,Y}(x,y)=p_X(x)p_Y(y)$$

for all $x,y$. In the continuous case, it is equivalent to

$$f_{X,Y}(x,y)=f_X(x)f_Y(y)$$

for almost all $x,y$, provided a joint density exists.

Expectation from a joint distribution. For discrete variables,

$$E[g(X,Y)]=\sum_x\sum_y g(x,y)\,p_{X,Y}(x,y).$$

For continuous variables,

$$E[g(X,Y)]=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\,f_{X,Y}(x,y)\,dx\,dy.$$
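
In the discrete case the double sum is a weighted sum over the cells of the joint table. A minimal sketch, again using the table from worked example 1; the helper `expectation` and the choices of $g$ are hypothetical and exist only for illustration.

```python
import numpy as np

x_vals = np.array([0, 1])
y_vals = np.array([0, 1])

# Joint PMF with rows indexed by x_vals and columns by y_vals.
joint = np.array([
    [0.20, 0.10],
    [0.30, 0.40],
])

def expectation(g):
    """Return E[g(X, Y)] as the sum of g(x, y) * p(x, y) over all table cells."""
    # Broadcasting a column of x values against a row of y values evaluates g on the grid.
    g_grid = g(x_vals[:, None], y_vals[None, :])
    return float((g_grid * joint).sum())

e_xy = expectation(lambda x, y: x * y)
e_sum = expectation(lambda x, y: x + y)

print("E[XY]:", e_xy)
print("E[X+Y]:", e_sum)
```

Only the cell $(x,y)=(1,1)$ contributes to $E[XY]$ here, so the result equals that cell's probability.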

Law of total expectation.

$$E[X]=E[E[X\mid Y]].$$

Law of total variance.

$$\operatorname{Var}(X)=E[\operatorname{Var}(X\mid Y)]+\operatorname{Var}(E[X\mid Y]).$$

The first law says the overall mean is the average of the conditional means. The second says that overall variation splits into within-condition variation, $E[\operatorname{Var}(X\mid Y)]$, and between-condition variation, $\operatorname{Var}(E[X\mid Y])$.
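
Both laws can be verified directly on a small joint table. The sketch below uses the table from worked example 1 and computes each side of the two identities; all variable names are illustrative.

```python
import numpy as np

x_vals = np.array([0.0, 1.0])

# Joint PMF with rows x = 0, 1 and columns y = 0, 1 (numbers from worked example 1).
joint = np.array([
    [0.20, 0.10],
    [0.30, 0.40],
])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
cond = joint / p_y  # column j is the conditional PMF of X given Y = y_j

# Conditional mean and variance of X for each value of Y.
cond_mean = (x_vals[:, None] * cond).sum(axis=0)
cond_var = ((x_vals[:, None] - cond_mean) ** 2 * cond).sum(axis=0)

# Direct mean and variance of X from its marginal.
mean_x = (x_vals * p_x).sum()
var_x = ((x_vals - mean_x) ** 2 * p_x).sum()

# Pieces of the two laws, averaged over the distribution of Y.
mean_of_cond_means = (cond_mean * p_y).sum()
within = (cond_var * p_y).sum()
between = ((cond_mean - mean_x) ** 2 * p_y).sum()

print("E[X]:", mean_x, " E[E[X|Y]]:", mean_of_cond_means)
print("Var(X):", var_x, " within + between:", within + between)
```

For this table the within-condition part is much larger than the between-condition part, which reflects that knowing $Y$ shifts the mean of $X$ only slightly.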

The support of a joint distribution is often the most important part of the problem. If the support is rectangular, such as $0<x<1$ and $0<y<1$, the integration limits for one variable do not depend on the other. If the support is triangular, circular, or constrained by inequalities such as $0<y<x<1$, the limits of the inner integral depend on the value of the outer variable. Many wrong marginal densities come from integrating the right formula over the wrong region.
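
The effect of the region can be seen numerically. This sketch takes the density from worked example 2 (equal to 2 on $0<y<x<1$) and approximates $f_X(0.6)$ with a midpoint Riemann sum, once respecting the support and once ignoring it. The evaluation point `x0`, the grid size `n`, and the function name `density` are arbitrary choices made for illustration.

```python
import numpy as np

def density(x, y):
    """Joint density equal to 2 on the triangle 0 < y < x < 1 and 0 elsewhere."""
    return np.where((0 < y) & (y < x) & (x < 1), 2.0, 0.0)

x0 = 0.6           # point at which to evaluate the marginal of X
n = 200_000        # number of midpoint cells on (0, 1)
dy = 1.0 / n
y_mid = (np.arange(n) + 0.5) * dy

# Correct: the indicator inside density() restricts the sum to 0 < y < x0.
marginal_correct = density(x0, y_mid).sum() * dy

# Wrong: integrating the formula "2" over 0 < y < 1 ignores the support.
marginal_wrong = np.full(n, 2.0).sum() * dy

print("with support:", marginal_correct)    # close to 2 * x0
print("ignoring support:", marginal_wrong)  # close to 2, not a valid marginal value here
```

Writing the density as a function that returns 0 outside the support is one way to make the region part of the formula, so it cannot be forgotten when integrating.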

Conditional distributions are ordinary distributions in their own right. After observing $Y=y$, the function $p_{X\mid Y}(x\mid y)$ or $f_{X\mid Y}(x\mid y)$ must still sum or integrate to $1$ over $x$. This gives a useful check: if the conditional distribution does not normalize, the denominator or the support is wrong.

Conditional expectation compresses a conditional distribution into a single function:

$$E[X\mid Y=y].$$

As $y$ changes, this value traces out a regression curve. In linear regression, the statistical model describes how the conditional mean of a response changes with the predictors.
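
A regression curve can be estimated from samples by averaging the response within narrow bins of the conditioning variable. The sketch below does this for the density of worked example 2, conditioning on $X$ rather than $Y$; since $Y\mid X=x$ is uniform on $(0,x)$, the binned means of $Y$ should sit near half the binned means of $X$. The seed, sample size, and number of bins are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Sample from f(x, y) = 2 on 0 < y < x < 1: X has density 2x, and Y | X = x is Uniform(0, x).
x = np.sqrt(rng.random(n))
y = rng.random(n) * x

# Assign each sample to one of 10 equal-width bins of X on [0, 1].
n_bins = 10
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_index = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Per-bin averages of X and Y.
counts = np.bincount(bin_index, minlength=n_bins)
mean_x_in_bin = np.bincount(bin_index, weights=x, minlength=n_bins) / counts
mean_y_in_bin = np.bincount(bin_index, weights=y, minlength=n_bins) / counts

for mx, my in zip(mean_x_in_bin, mean_y_in_bin):
    print(f"mean X in bin = {mx:.3f}   estimated E[Y | X in bin] = {my:.3f}   mean X / 2 = {mx / 2:.3f}")
```

Narrower bins track the curve more closely but leave fewer samples per bin, so the estimates become noisier.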

Marginalization can also hide structure. Two groups may have different conditional relationships between $X$ and $Y$, while the combined marginal relationship looks weaker, stronger, or even reversed. This is the probability mechanism behind Simpson's paradox. Whenever a joint distribution includes a meaningful grouping variable, compare conditional distributions as well as the aggregate marginal distribution.
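
The reversal can be reproduced with a three-way table. The counts below are invented purely for illustration (they are not real data): a grouping variable $G$, a binary $X$, and a binary outcome $Y$. Within each group, $X=1$ has the higher rate of $Y=1$, yet after summing over $G$ the ordering flips because $X=1$ is concentrated in the group with low rates overall.

```python
import numpy as np

# Hypothetical counts indexed as counts[g, x, y]; y = 1 marks the outcome of interest.
counts = np.array([
    [[90, 810],     # g = 0, x = 0: 810 of 900 have y = 1
     [5, 95]],      # g = 0, x = 1: 95 of 100 have y = 1
    [[60, 40],      # g = 1, x = 0: 40 of 100 have y = 1
     [450, 450]],   # g = 1, x = 1: 450 of 900 have y = 1
])
joint = counts / counts.sum()

# Conditional on the group: P(Y = 1 | X = x, G = g), shape (group, x).
rate_by_group = joint[:, :, 1] / joint.sum(axis=2)

# Marginalize G out, then condition on X only: P(Y = 1 | X = x).
joint_xy = joint.sum(axis=0)
rate_overall = joint_xy[:, 1] / joint_xy.sum(axis=1)

print("P(Y=1 | X, G):\n", rate_by_group)
print("P(Y=1 | X):", rate_overall)
```

Comparing `rate_by_group` with `rate_overall` is exactly the conditional-versus-aggregate comparison recommended above.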

For more than two variables, the same ideas scale by summing or integrating over the variables not currently of interest. A Bayesian network, for example, is a structured factorization of a large joint distribution into smaller conditional pieces.
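
As a small sketch of such a structured factorization, consider a hypothetical chain of three binary variables $A\to B\to C$ with $p(a,b,c)=p(a)\,p(b\mid a)\,p(c\mid b)$. All probability values below are made up for illustration. The joint is assembled with `np.einsum`, and marginals are obtained by summing over the axes not of interest.

```python
import numpy as np

# Hypothetical conditional pieces of a chain A -> B -> C (all variables binary).
p_a = np.array([0.6, 0.4])
p_b_given_a = np.array([
    [0.7, 0.3],   # row a = 0: distribution of B
    [0.2, 0.8],   # row a = 1: distribution of B
])
p_c_given_b = np.array([
    [0.9, 0.1],   # row b = 0: distribution of C
    [0.5, 0.5],   # row b = 1: distribution of C
])

# joint[a, b, c] = p(a) * p(b | a) * p(c | b)
joint = np.einsum("a,ab,bc->abc", p_a, p_b_given_a, p_c_given_b)

# Marginalize by summing over the axes of the variables being ignored.
p_c = joint.sum(axis=(0, 1))   # sum out A and B
p_ac = joint.sum(axis=1)       # sum out B only

print("total probability:", joint.sum())
print("P(C):", p_c)
print("P(A, C):\n", p_ac)
```

The three small tables hold 2 + 4 + 4 numbers, while the full joint has 8 cells; the saving grows quickly as more variables are added.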

Visual

| Operation | Discrete | Continuous |
| --- | --- | --- |
| joint probability/density | $p_{X,Y}(x,y)$ | $f_{X,Y}(x,y)$ |
| marginalize $Y$ | $\sum_y p_{X,Y}(x,y)$ | $\int f_{X,Y}(x,y)\,dy$ |
| condition on $Y=y$ | $p_{X,Y}(x,y)/p_Y(y)$ | $f_{X,Y}(x,y)/f_Y(y)$ |
| compute probability in region | sum over cells | double integral |

Worked example 1: joint table, marginals, and conditionals

Problem. The joint PMF of $X$ and $Y$ is:

| $p_{X,Y}(x,y)$ | $y=0$ | $y=1$ |
| --- | --- | --- |
| $x=0$ | $0.20$ | $0.10$ |
| $x=1$ | $0.30$ | $0.40$ |

Find the marginal distributions, $P(X=1\mid Y=1)$, and determine whether $X$ and $Y$ are independent.

Method.

  1. Check total probability:
$$0.20+0.10+0.30+0.40=1.$$
  2. Marginal distribution of $X$:
$$P(X=0)=0.20+0.10=0.30, \qquad P(X=1)=0.30+0.40=0.70.$$
  3. Marginal distribution of $Y$:
$$P(Y=0)=0.20+0.30=0.50, \qquad P(Y=1)=0.10+0.40=0.50.$$
  4. Conditional probability:
$$P(X=1\mid Y=1)=\frac{P(X=1,Y=1)}{P(Y=1)}=\frac{0.40}{0.50}=0.80.$$
  5. Check independence. If $X$ and $Y$ were independent, then
$$P(X=1,Y=1)=P(X=1)P(Y=1)=0.70(0.50)=0.35.$$

But the table gives $0.40$, so the factorization fails.

Checked answer. The marginals are $P(X=0)=0.30$, $P(X=1)=0.70$, $P(Y=0)=0.50$, $P(Y=1)=0.50$. Also $P(X=1\mid Y=1)=0.80$, and $X,Y$ are not independent.

Worked example 2: a continuous joint density

Problem. Let

$$f_{X,Y}(x,y)=2$$

on the triangular region $0<y<x<1$, and $0$ otherwise. Find $f_X(x)$, $f_Y(y)$, and $f_{Y\mid X}(y\mid x)$.

Method.

  1. Verify normalization. The triangular region has area $1/2$, so the integral of the constant density $2$ over it is $1$.

  2. For fixed $x$, the possible $y$ values satisfy $0<y<x$. Therefore
$$f_X(x)=\int_0^x 2\,dy=2x,\quad 0<x<1.$$
  3. For fixed $y$, the possible $x$ values satisfy $y<x<1$. Therefore
$$f_Y(y)=\int_y^1 2\,dx=2(1-y),\quad 0<y<1.$$
  4. Conditional density of $Y$ given $X=x$:
$$f_{Y\mid X}(y\mid x)=\frac{f_{X,Y}(x,y)}{f_X(x)}=\frac{2}{2x}=\frac{1}{x},$$

for $0<y<x$.

  5. Interpret. Given $X=x$, the conditional density of $Y$ is uniform on $(0,x)$.

  6. Check normalization:
$$\int_0^x \frac{1}{x}\,dy=1.$$

Checked answer. $f_X(x)=2x$ for $0<x<1$, $f_Y(y)=2(1-y)$ for $0<y<1$, and $Y\mid X=x\sim \operatorname{Uniform}(0,x)$.

Code

import numpy as np

# Discrete joint table example: rows are x = 0, 1 and columns are y = 0, 1.
joint = np.array([
    [0.20, 0.10],
    [0.30, 0.40],
])

px = joint.sum(axis=1)  # marginal of X: sum each row over y
py = joint.sum(axis=0)  # marginal of Y: sum each column over x
conditional_x1_given_y1 = joint[1, 1] / py[1]
# Independence requires every cell to equal the product of its marginals.
independent = np.allclose(joint, np.outer(px, py))

print("P_X:", px)
print("P_Y:", py)
print("P(X=1 | Y=1):", conditional_x1_given_y1)
print("independent:", independent)

# Monte Carlo check for triangular density f=2 on 0<y<x<1.
rng = np.random.default_rng(0)
n = 100_000
# Sample X from density 2x using inverse CDF X=sqrt(U).
x = np.sqrt(rng.random(n))
# Given X=x, Y is Uniform(0, x), so scale a fresh uniform draw by x.
y = rng.random(n) * x
print("sample mean X:", x.mean())  # theory: E[X] = 2/3
print("sample mean Y:", y.mean())  # theory: E[Y] = 1/3

Common pitfalls

  • Confusing joint and marginal probabilities. A joint table cell is not the same as a row total.
  • Dividing by the wrong marginal when computing conditionals.
  • Assuming a joint density value is a probability. Probabilities require integrating over a region.
  • Checking independence at only one cell and declaring success. Factorization must hold over the whole support.
  • Forgetting support constraints when integrating. In triangular or curved regions, limits depend on the other variable.
  • Treating zero covariance as independence. Independence implies zero covariance, but the converse holds only in special cases, such as jointly normal variables.
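
The last pitfall has a standard small counterexample: let $X$ be uniform on $\{-1,0,1\}$ and $Y=X^2$. Then $Y$ is completely determined by $X$, yet the covariance is zero. The sketch below builds the joint table for this pair and checks both facts; the variable names are illustrative.

```python
import numpy as np

x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])

# Joint PMF of (X, Y) with Y = X**2: rows are x = -1, 0, 1 and columns are y = 0, 1.
joint = np.array([
    [0.0, 1 / 3],
    [1 / 3, 0.0],
    [0.0, 1 / 3],
])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

e_x = (x_vals * p_x).sum()
e_y = (y_vals * p_y).sum()
e_xy = (x_vals[:, None] * y_vals[None, :] * joint).sum()
covariance = e_xy - e_x * e_y

# Independence would require the joint table to equal the outer product of the marginals.
independent = np.allclose(joint, np.outer(p_x, p_y))

print("covariance:", covariance)
print("independent:", independent)
```

Covariance only measures linear association, so a symmetric nonlinear dependence such as $Y=X^2$ can leave it at zero.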

Connections