Skip to main content

Entropy and Coding

Entropy measures the average information or uncertainty in a random variable. A fair coin has one bit of entropy because one yes-or-no question is needed on average to learn its outcome. A biased coin has less entropy because one outcome is easier to guess. A uniform choice among many possibilities has more entropy because the outcome is harder to identify.

A PGP workflow diagram shows encryption, signing, and message transfer.

Figure: PGP combines public-key and symmetric-key ideas in an applied cryptographic workflow. Image: Wikimedia Commons, xaedes, jfreax, and Acdx, CC BY-SA 3.0.

MIT 18.440 introduces Shannon entropy, noiseless coding, entropy of pairs, and conditional entropy. The main conceptual result is that entropy lower-bounds the expected number of binary questions needed to identify a random outcome. Conditional entropy then formalizes how much uncertainty remains about one random variable after observing another.

Definitions

If a discrete random variable XX takes values x1,,xnx_1,\ldots,x_n with probabilities p1,,pnp_1,\ldots,p_n, its Shannon entropy in bits is

H(X)=i=1npilog2pi.H(X)=-\sum_{i=1}^{n}p_i\log_2 p_i.

Terms with pi=0p_i=0 are omitted, using the convention 0log0=00\log 0=0.

The entropy of a pair is the entropy of the combined random variable:

H(X,Y)=x,ypX,Y(x,y)log2pX,Y(x,y).H(X,Y)=-\sum_{x,y}p_{X,Y}(x,y)\log_2 p_{X,Y}(x,y).

The conditional entropy of XX given Y=yY=y is

H(XY=y)=xp(xy)log2p(xy).H(X\mid Y=y) = -\sum_x p(x\mid y)\log_2 p(x\mid y).

The average conditional entropy is

H(XY)=yP(Y=y)H(XY=y).H(X\mid Y)=\sum_yP(Y=y)H(X\mid Y=y).

A binary code assigns bit strings to possible values of XX. A valid prefix-free code has no codeword that is the prefix of another; such codes can be decoded unambiguously as bits are read.

Key results

Entropy is maximized by the uniform distribution. If XX is uniform on nn values, then

H(X)=n1nlog21n=log2n.H(X)=-n\cdot\frac1n\log_2\frac1n=\log_2 n.

If XX is deterministic, then H(X)=0H(X)=0.

For independent random variables,

H(X,Y)=H(X)+H(Y).H(X,Y)=H(X)+H(Y).

Proof sketch:

pX,Y(x,y)=pX(x)pY(y),p_{X,Y}(x,y)=p_X(x)p_Y(y),

so

H(X,Y)=x,ypX(x)pY(y)log2(pX(x)pY(y))=x,ypX(x)pY(y)(log2pX(x)+log2pY(y))=H(X)+H(Y).\begin{aligned} H(X,Y) &=-\sum_{x,y}p_X(x)p_Y(y)\log_2(p_X(x)p_Y(y))\\ &=-\sum_{x,y}p_X(x)p_Y(y)(\log_2p_X(x)+\log_2p_Y(y))\\ &=H(X)+H(Y). \end{aligned}

The chain rule for entropy is

H(X,Y)=H(Y)+H(XY)=H(X)+H(YX).H(X,Y)=H(Y)+H(X\mid Y) =H(X)+H(Y\mid X).

Conditioning reduces entropy:

H(XY)H(X),H(X\mid Y)\le H(X),

with equality when XX and YY are independent. Intuitively, observing YY cannot increase the average amount of uncertainty left about XX. The proof uses concavity of the entropy function, a vector version of Jensen's inequality.

The noiseless coding theorem in the lecture says that the expected number of binary questions needed to identify XX is at least H(X)H(X). Well-designed codes can approach this limit for long independent sequences.

Entropy is measured relative to the logarithm base. Base 22 gives bits, which match yes-or-no questions. Natural logarithms give nats, which are common in mathematical physics and some information theory formulas. Changing the base only multiplies entropy by a constant, so the qualitative comparisons are unchanged.

The formula plogp-p\log p has two important features. First, it is zero when p=1p=1, because a certain event carries no surprise. Second, it approaches zero as p0p\to0, because an event that almost never occurs contributes little to average uncertainty. Entropy is largest when probability mass is spread evenly, not when it is concentrated.

Coding gives entropy an operational meaning. If one outcome has probability 1/21/2, another has probability 1/41/4, and two more have probability 1/81/8 each, a good binary code can assign shorter strings to more likely outcomes and longer strings to less likely outcomes. The expected code length is then close to the entropy. A fixed-length code ignores probabilities and may waste bits.

Conditional entropy measures remaining uncertainty, not the information in the observed variable itself. If YY determines XX, then H(XY)=0H(X\mid Y)=0. If YY is independent of XX, then H(XY)=H(X)H(X\mid Y)=H(X). Between these extremes, observing YY partially reduces uncertainty about XX.

The chain rule

H(X,Y)=H(Y)+H(XY)H(X,Y)=H(Y)+H(X\mid Y)

matches a two-stage questioning strategy: first learn YY, then learn whatever remains uncertain about XX. The total expected information is the cost of the first stage plus the expected cost of the second stage. This parallels the law of total expectation, but with uncertainty rather than numerical value.

Conditioning reduces entropy on average because the original distribution of XX is a mixture of the conditional distributions given Y=yY=y. Entropy is concave as a function of the probability vector, so the entropy of the mixture is at least the mixture of the entropies. This is Jensen's inequality appearing in information-theoretic form.

Visual

Distribution of XXEntropyInterpretation
deterministic00no question needed
fair bit11 bitone yes-or-no question
uniform on nn valueslog2n\log_2 nbalanced uncertainty
biased bit with P(1)=pP(1)=pplog2p(1p)log2(1p)-p\log_2p-(1-p)\log_2(1-p)less than or equal to 11

The visual separates two meanings of information. Before observing XX, the probability vector determines how uncertain the outcome is. After designing a code, entropy becomes a lower bound on the expected number of bits needed to communicate the value. These are the same quantity because an efficient questioning strategy spends short questions on likely outcomes and longer question sequences on rare outcomes.

Conditional entropy fits the same coding picture. If the decoder already knows YY, the sender only needs enough bits to distinguish the remaining possibilities for XX within the conditional distribution determined by YY. Averaging over the possible YY values gives H(XY)H(X\mid Y). If YY is very informative, the remaining code can be short; if YY is irrelevant, no compression is gained.

Entropy is therefore not a property of the labels xix_i themselves. Renaming outcomes does not change entropy; only the probability vector matters. Numerical distances between labels matter for variance, but not for Shannon entropy.

Worked example 1: entropy of a biased coin

Problem: A coin lands heads with probability p=0.9p=0.9 and tails with probability 0.10.1. Compute its entropy.

Method:

  1. The probabilities are
pH=0.9,pT=0.1.p_H=0.9,\qquad p_T=0.1.
  1. Entropy is
H(X)=0.9log2(0.9)0.1log2(0.1).H(X)=-0.9\log_2(0.9)-0.1\log_2(0.1).
  1. Use approximate logarithms:
log2(0.9)0.1520,log2(0.1)3.3219.\log_2(0.9)\approx -0.1520, \qquad \log_2(0.1)\approx -3.3219.
  1. Substitute:
H(X)=0.9(0.1520)0.1(3.3219)=0.1368+0.3322=0.4690.\begin{aligned} H(X) &=-0.9(-0.1520)-0.1(-3.3219)\\ &=0.1368+0.3322\\ &=0.4690. \end{aligned}

Checked answer: the entropy is about 0.4690.469 bits, less than the 11 bit of a fair coin because the outcome is easier to guess.

Worked example 2: conditional entropy with a noisy copy

Problem: Let XX be a fair bit. Let Y=XY=X with probability 0.80.8, and Y=1XY=1-X with probability 0.20.2. Compute H(X)H(X) and H(XY)H(X\mid Y).

Method:

  1. Since XX is fair,
H(X)=1 bit.H(X)=1\text{ bit}.
  1. By symmetry, YY is also fair.
  2. Given Y=yY=y, the probability that X=yX=y is 0.80.8, and the probability that X=1yX=1-y is 0.20.2.
  3. Therefore the conditional entropy for each observed yy is the entropy of (0.8,0.2)(0.8,0.2):
H(XY=y)=0.8log2(0.8)0.2log2(0.2).H(X\mid Y=y) =-0.8\log_2(0.8)-0.2\log_2(0.2).
  1. Approximate:
log2(0.8)0.3219,log2(0.2)2.3219.\log_2(0.8)\approx -0.3219, \qquad \log_2(0.2)\approx -2.3219.
  1. Then
H(XY)=0.8(0.3219)+0.2(2.3219)=0.2575+0.4644=0.7219.H(X\mid Y) =0.8(0.3219)+0.2(2.3219) =0.2575+0.4644 =0.7219.

Checked answer: observing YY reduces uncertainty from 11 bit to about 0.7220.722 bits, but does not eliminate it because the copy is noisy.

Code

from math import log2

def entropy(probs):
return -sum(p * log2(p) for p in probs if p > 0)

print("biased coin H(0.9,0.1):", entropy([0.9, 0.1]))
print("fair bit:", entropy([0.5, 0.5]))
print("uniform four values:", entropy([0.25] * 4))

conditional = entropy([0.8, 0.2])
print("H(X|Y) for noisy copy:", conditional)
print("information learned:", 1 - conditional)

Common pitfalls

  • Using natural logarithms without tracking units. Natural logs give nats; base-2 logs give bits.
  • Thinking entropy is the same as variance. Entropy measures uncertainty in a distribution, not numerical spread around a mean.
  • Forgetting the convention 0log0=00\log 0=0.
  • Assuming conditioning always reduces entropy for every individual observed value. The average conditional entropy is at most the original entropy.
  • Confusing expected code length lower bounds with exact code lengths for one symbol. Entropy limits are approached most cleanly for long sequences.

Connections