
Joint Distributions, Transformations, and Independence

Many probability questions involve several random variables defined on the same sample space. A joint distribution records how they behave together, not merely how each behaves separately. This distinction matters: the marginal laws of $X$ and $Y$ do not determine whether they are independent, positively related, negatively related, or constrained by an equation such as $Y=X^2$.


Figure: Probability trees make the conditioning structure in Bayes' theorem explicit. Image: Wikimedia Commons, Gnathan87, CC0 1.0.

MIT 18.440 introduces joint mass functions, joint densities, marginal distributions, independent random variables, and distributions of functions of random variables. These ideas are the technical foundation for convolutions, conditional densities, covariance, order statistics, and the transform methods used later in the course.

Definitions

For discrete random variables $X$ and $Y$, the joint probability mass function is

$$p_{X,Y}(x,y)=P(X=x,\ Y=y).$$

The marginal mass functions are obtained by summing:

$$p_X(x)=\sum_y p_{X,Y}(x,y), \qquad p_Y(y)=\sum_x p_{X,Y}(x,y).$$

For continuous random variables, a joint density $f_{X,Y}$ satisfies

$$P((X,Y)\in A)=\iint_A f_{X,Y}(x,y)\,dx\,dy.$$

The marginal densities are

$$f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \qquad f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$

Random variables $X$ and $Y$ are independent if for all suitable sets $A,B$,

$$P(X\in A,\ Y\in B)=P(X\in A)P(Y\in B).$$

In the discrete case, this is equivalent to

$$p_{X,Y}(x,y)=p_X(x)p_Y(y)$$

for all $x,y$. In the continuous case, it is equivalent to

$$f_{X,Y}(x,y)=f_X(x)f_Y(y)$$

where densities are defined.

Key results

If $Y=g(X)$ and $g$ is strictly increasing, then for $a$ in the range of $g$,

$$F_Y(a)=P(Y\le a)=P(g(X)\le a)=P(X\le g^{-1}(a))=F_X(g^{-1}(a)).$$

If $g$ is differentiable with differentiable inverse, the density form is

$$f_Y(y)=f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|.$$
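A small worked instance may make the density form concrete. Assume, purely for illustration, that $X$ is uniform on $[0,1]$ and $g(x)=e^x$; neither choice comes from the course material. The last line checks that the result integrates to 1.

```latex
\[
g^{-1}(y)=\ln y, \qquad \frac{d}{dy}g^{-1}(y)=\frac{1}{y},
\]
\[
f_Y(y)=f_X(\ln y)\left|\frac{1}{y}\right|=\frac{1}{y}, \qquad 1<y<e,
\]
\[
\int_1^e \frac{1}{y}\,dy=\ln e-\ln 1=1.
\]
```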

For a non-one-to-one transformation, sum the contributions from each preimage, where $x(y)$ denotes the local inverse branch passing through $x$:

$$f_Y(y)=\sum_{x:g(x)=y} f_X(x)\left|\frac{d}{dy}x(y)\right|.$$

For a two-dimensional differentiable one-to-one transformation $(U,V)=T(X,Y)$ with inverse $(x,y)=T^{-1}(u,v)$,

$$f_{U,V}(u,v) = f_{X,Y}(x(u,v),y(u,v)) \left| \det \frac{\partial(x,y)}{\partial(u,v)} \right|.$$
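As a sketch of how the two-dimensional formula is applied, take the linear map $U=X+Y$, $V=X-Y$. This particular map is an assumed illustration, not one taken from the text above.

```latex
\[
x(u,v)=\frac{u+v}{2}, \qquad y(u,v)=\frac{u-v}{2},
\]
\[
\det\frac{\partial(x,y)}{\partial(u,v)}
=\det\begin{pmatrix} \tfrac12 & \tfrac12 \\ \tfrac12 & -\tfrac12 \end{pmatrix}
=-\frac14-\frac14=-\frac12,
\]
\[
f_{U,V}(u,v)=\frac12\, f_{X,Y}\!\left(\frac{u+v}{2},\ \frac{u-v}{2}\right).
\]
```

The factor $1/2$ reflects that the map doubles areas, so the density must halve to keep total probability equal to 1.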

Even when two random variables have the same marginal distribution, their joint laws can be completely different. For example, if $X$ is uniform on $[0,1]$, then $Y=X$ has the same marginal as $X$ but is not independent of $X$. If instead $Y$ is sampled independently of $X$ and is also uniform on $[0,1]$, then the marginals are unchanged but $X$ and $Y$ are independent.

The marginalization formulas are probability versions of "ignore one coordinate". In a joint table, summing a row forgets which value of $Y$ occurred and keeps only the value of $X$. In a joint density, integrating over $y$ performs the same operation continuously. This operation always loses information: many different joint laws can have the same marginals.
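The loss of information can be seen numerically. The sketch below uses two hypothetical $2\times 2$ joint tables (made up for illustration): one for independent fair coins, one where the two variables are always equal. The helper name `marginals` is likewise an assumption.

```python
import numpy as np

# Two hypothetical joint tables on {0,1} x {0,1}; rows index X, columns index Y.
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])   # X and Y are independent fair coins
identical = np.array([[0.50, 0.00],
                      [0.00, 0.50]])     # Y equals X with probability 1


def marginals(joint):
    """Return (pX, pY) by summing out the other coordinate."""
    return joint.sum(axis=1), joint.sum(axis=0)


px_a, py_a = marginals(independent)
px_b, py_b = marginals(identical)

same_marginals = np.allclose(px_a, px_b) and np.allclose(py_a, py_b)
same_joint = np.allclose(independent, identical)
print("same marginals:", same_marginals)
print("same joint law:", same_joint)
```

Both tables have marginals $(0.5, 0.5)$, yet the joint laws differ, so the marginals cannot be inverted to recover the joint table.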

Independence is a factorization statement. In a finite table, every entry must equal row sum times column sum. Geometrically, the joint distribution has no interaction term; the probability assigned to a rectangle $A\times B$ is the product of the two side probabilities. For continuous variables, this means the density surface separates into an $x$-part and a $y$-part.

Transformations require special care because density is tied to scale. If $Y=2X$, intervals in $Y$ correspond to intervals half as long in $X$, so the density height changes by a factor of $1/2$. The derivative factor in the change-of-variables formula is exactly this scale correction. In multiple dimensions, the absolute Jacobian determinant measures how areas or volumes are distorted.
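A quick simulation can illustrate the scale correction. The sketch assumes $X$ is uniform on $[0,1]$ (a choice made here for illustration), so the change-of-variables formula predicts a flat density of height $1/2$ for $Y=2X$ on $[0,2]$. The sample size, seed, and bin count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200_000)   # X uniform on [0, 1], density 1
y = 2.0 * x                               # Y = 2X takes values in [0, 2]

# Histogram-based density estimate of Y on 20 equal bins over [0, 2].
heights, edges = np.histogram(y, bins=20, range=(0.0, 2.0), density=True)

# Change of variables predicts f_Y(y) = f_X(y/2) * |d(y/2)/dy| = 1 * 1/2.
predicted = 0.5
max_error = np.max(np.abs(heights - predicted))
print("estimated heights:", np.round(heights, 3))
print("largest deviation from 1/2:", max_error)
```

The estimated heights should cluster near 0.5 rather than 1, up to sampling noise.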

The transformation method has two complementary approaches. The CDF method asks for $P(g(X)\le a)$ and is often best when the event has a simple inequality description. The density method uses inverse branches and derivatives and is often faster when the transformation is monotone or piecewise monotone. Both methods should give the same answer when applied correctly.
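To see the two approaches agree on one case, assume for illustration that $X$ has density $f_X(x)=e^{-x}$ for $x>0$ and that $Y=\sqrt{X}$. This example is not taken from the course notes.

```latex
% CDF method: rewrite the event, then differentiate.
\[
F_Y(a)=P(\sqrt{X}\le a)=P(X\le a^2)=1-e^{-a^2}, \qquad a>0,
\]
\[
f_Y(a)=\frac{d}{da}F_Y(a)=2a\,e^{-a^2}.
\]
% Density method: inverse branch g^{-1}(y) = y^2 and its derivative.
\[
f_Y(y)=f_X(y^2)\left|\frac{d}{dy}y^2\right|=e^{-y^2}\cdot 2y, \qquad y>0.
\]
```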

Joint distributions are also the setting for conditional densities. If $X$ and $Y$ have joint density $f_{X,Y}$, then the conditional density of $X$ given $Y=y$ is formally

$$f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}$$

when $f_Y(y)\gt 0$. This formula is the continuous analogue of dividing a joint probability table entry by a column total. It prepares for conditional expectation and total variance.
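The discrete operation that this formula generalizes can be sketched in a few lines. The joint table below is hypothetical, and so is the variable name `cond_x_given_y`.

```python
import numpy as np

# Hypothetical joint table; rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

py = joint.sum(axis=0)            # column totals: the marginal of Y
cond_x_given_y = joint / py       # broadcasting divides each column by its total

column_sums = cond_x_given_y.sum(axis=0)
print("pY:", py)
print("P(X=x | Y=y), one column per value of y:")
print(cond_x_given_y)
print("column sums:", column_sums)
```

Each column of the result is a probability distribution over the values of $X$, which is why the column sums come out as 1.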

Visual

Discrete joint law as a matrix

| | Y=1 | Y=2 | row sum (pX) |
| --- | --- | --- | --- |
| X=1 | p11 | p12 | p1. |
| X=2 | p21 | p22 | p2. |
| column sum (pY) | p.1 | p.2 | total 1 |

| Operation | Discrete version | Continuous version |
| --- | --- | --- |
| Joint law | $p_{X,Y}(x,y)$ | $f_{X,Y}(x,y)$ |
| Marginal of $X$ | $\sum_y p_{X,Y}(x,y)$ | $\int f_{X,Y}(x,y)\,dy$ |
| Independence | $p_{X,Y}=p_Xp_Y$ | $f_{X,Y}=f_Xf_Y$ |
| Probability of region | sum over grid points | double integral over region |
| Transformation | collect masses with same value | density times Jacobian |

The matrix picture is a useful diagnostic for independence. Once row and column sums are known, an independent joint table is forced: it must be the outer product of the marginal vectors. If the actual table differs from that outer product, the variables are dependent. For continuous variables, the same idea is harder to see visually, but a product density has rectangular probabilities that factor exactly.

For transformations, the table reminds us that discrete and continuous cases use different bookkeeping. A discrete transformation moves point masses and combines masses that land on the same value. A continuous transformation moves density through a change of scale. Forgetting this distinction leads to common errors such as assigning positive probability to a point in a continuous model or omitting a Jacobian factor.
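The discrete bookkeeping can be sketched as follows. The pmf values and the helper name `pushforward` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical pmf for X on {-1, 0, 1, 2}; the values are illustrative only.
pmf_x = {-1: 0.2, 0: 0.3, 1: 0.4, 2: 0.1}


def pushforward(pmf, g):
    """Return the pmf of g(X): add up the masses of all x sharing the same g(x)."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)


pmf_y = pushforward(pmf_x, lambda x: x * x)
print(pmf_y)
```

Here the masses at $x=-1$ and $x=1$ both land on $y=1$ and are added together. No derivative or Jacobian factor appears, because point masses are moved rather than rescaled.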

Worked example 1: checking independence from a joint table

Problem: Let $X,Y$ take values in $\{0,1\}$ with joint probabilities

| | $Y=0$ | $Y=1$ |
| --- | --- | --- |
| $X=0$ | 0.30 | 0.20 |
| $X=1$ | 0.15 | 0.35 |

Find the marginal distributions and decide whether $X$ and $Y$ are independent.

Method:

  1. Sum rows for $X$:
     $$P(X=0)=0.30+0.20=0.50, \qquad P(X=1)=0.15+0.35=0.50.$$
  2. Sum columns for $Y$:
     $$P(Y=0)=0.30+0.15=0.45, \qquad P(Y=1)=0.20+0.35=0.55.$$
  3. If independent, we would need
     $$P(X=0,Y=0)=P(X=0)P(Y=0)=0.50\cdot0.45=0.225.$$
  4. But the table gives
     $$P(X=0,Y=0)=0.30.$$

Checked answer: $X$ and $Y$ are not independent. The marginals alone do not reveal this; the joint table does.

Worked example 2: transforming $X$ to $Y=X^2$

Problem: Let $X$ be uniform on $[-1,1]$ and define $Y=X^2$. Find the density of $Y$.

Method:

  1. The support of $Y$ is $[0,1]$.
  2. For $0\lt y\lt 1$, the equation $x^2=y$ has two solutions:
     $$x=\sqrt y,\qquad x=-\sqrt y.$$
  3. The density of $X$ is $f_X(x)=1/2$ on $[-1,1]$.
  4. For the positive branch, $x=\sqrt y$, so
     $$\left|\frac{dx}{dy}\right|=\frac{1}{2\sqrt y}.$$
  5. For the negative branch, $x=-\sqrt y$, so
     $$\left|\frac{dx}{dy}\right|=\frac{1}{2\sqrt y}.$$
  6. Add both contributions:
     $$f_Y(y)=\frac12\cdot\frac{1}{2\sqrt y} +\frac12\cdot\frac{1}{2\sqrt y} =\frac{1}{2\sqrt y}, \qquad 0<y<1.$$

Checked answer: integrate the density:

$$\int_0^1 \frac{1}{2\sqrt y}\,dy =\left[\sqrt y\right]_0^1 =1.$$

So it is a valid density.
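A simulation offers an independent cross-check. Integrating the density from 0 to $a$ gives $F_Y(a)=\sqrt a$, so the empirical CDF of simulated values of $X^2$ should track $\sqrt a$. The sample size, seed, and test points below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200_000)   # X uniform on [-1, 1]
y = x**2                                   # Y = X^2

# Integrating f_Y(y) = 1/(2*sqrt(y)) from 0 to a gives F_Y(a) = sqrt(a).
a = np.array([0.01, 0.1, 0.25, 0.5, 0.9])
empirical_cdf = np.array([(y <= t).mean() for t in a])
exact_cdf = np.sqrt(a)

max_gap = np.max(np.abs(empirical_cdf - exact_cdf))
print("empirical:", np.round(empirical_cdf, 3))
print("exact:    ", np.round(exact_cdf, 3))
print("largest gap:", max_gap)
```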

Code

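import numpy as np

# Joint pmf from worked example 1: rows index X, columns index Y.
joint = np.array([[0.30, 0.20],
                  [0.15, 0.35]])

px = joint.sum(axis=1)  # marginal of X: sum each row
py = joint.sum(axis=0)  # marginal of Y: sum each column
independent_table = np.outer(px, py)  # the table that independence would force

print("pX:", px)
print("pY:", py)
print("independent table would be:")
print(independent_table)
print("is independent?", np.allclose(joint, independent_table))

def density_y_square(y):
    """Density of Y = X^2 for X uniform on [-1, 1] (worked example 2)."""
    y = np.asarray(y)
    out = np.zeros_like(y, dtype=float)
    mask = (y > 0) & (y < 1)
    out[mask] = 1 / (2 * np.sqrt(y[mask]))
    return out

# The grid stops short of 0 and 1, so the result is roughly 0.97, not 1:
# the density is unbounded near 0 and the mass on [0, 0.001] is left out.
grid = np.linspace(0.001, 0.999, 1000)
# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available.
trapezoid = getattr(np, "trapezoid", None) or np.trapz
approx_integral = trapezoid(density_y_square(grid), grid)
print("approx integral:", approx_integral)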

Common pitfalls

  • Assuming marginals determine the joint distribution. They do not.
  • Checking independence at only one point in a joint table. Independence requires factorization everywhere.
  • Forgetting the absolute derivative factor when transforming a density.
  • Using a one-to-one change-of-variables formula for a many-to-one map like $Y=X^2$.
  • Treating zero-probability conditioning events in continuous models as if the elementary discrete formula applies unchanged.

Connections