Entropy and Coding
Entropy measures the average information or uncertainty in a random variable. A fair coin has one bit of entropy because one yes-or-no question is needed on average to learn its outcome. A biased coin has less entropy because one outcome is easier to guess. A uniform choice among many possibilities has more entropy because the outcome is harder to identify.
Figure: PGP combines public-key and symmetric-key ideas in an applied cryptographic workflow. Image: Wikimedia Commons, xaedes, jfreax, and Acdx, CC BY-SA 3.0.
MIT 18.440 introduces Shannon entropy, noiseless coding, entropy of pairs, and conditional entropy. The main conceptual result is that entropy lower-bounds the expected number of binary questions needed to identify a random outcome. Conditional entropy then formalizes how much uncertainty remains about one random variable after observing another.
Definitions
If a discrete random variable takes values with probabilities , its Shannon entropy in bits is
Terms with are omitted, using the convention .
The entropy of a pair is the entropy of the combined random variable:
The conditional entropy of given is
The average conditional entropy is
A binary code assigns bit strings to possible values of . A valid prefix-free code has no codeword that is the prefix of another; such codes can be decoded unambiguously as bits are read.
Key results
Entropy is maximized by the uniform distribution. If is uniform on values, then
If is deterministic, then .
For independent random variables,
Proof sketch:
so
The chain rule for entropy is
Conditioning reduces entropy:
with equality when and are independent. Intuitively, observing cannot increase the average amount of uncertainty left about . The proof uses concavity of the entropy function, a vector version of Jensen's inequality.
The noiseless coding theorem in the lecture says that the expected number of binary questions needed to identify is at least . Well-designed codes can approach this limit for long independent sequences.
Entropy is measured relative to the logarithm base. Base gives bits, which match yes-or-no questions. Natural logarithms give nats, which are common in mathematical physics and some information theory formulas. Changing the base only multiplies entropy by a constant, so the qualitative comparisons are unchanged.
The formula has two important features. First, it is zero when , because a certain event carries no surprise. Second, it approaches zero as , because an event that almost never occurs contributes little to average uncertainty. Entropy is largest when probability mass is spread evenly, not when it is concentrated.
Coding gives entropy an operational meaning. If one outcome has probability , another has probability , and two more have probability each, a good binary code can assign shorter strings to more likely outcomes and longer strings to less likely outcomes. The expected code length is then close to the entropy. A fixed-length code ignores probabilities and may waste bits.
Conditional entropy measures remaining uncertainty, not the information in the observed variable itself. If determines , then . If is independent of , then . Between these extremes, observing partially reduces uncertainty about .
The chain rule
matches a two-stage questioning strategy: first learn , then learn whatever remains uncertain about . The total expected information is the cost of the first stage plus the expected cost of the second stage. This parallels the law of total expectation, but with uncertainty rather than numerical value.
Conditioning reduces entropy on average because the original distribution of is a mixture of the conditional distributions given . Entropy is concave as a function of the probability vector, so the entropy of the mixture is at least the mixture of the entropies. This is Jensen's inequality appearing in information-theoretic form.
Visual
| Distribution of | Entropy | Interpretation |
|---|---|---|
| deterministic | no question needed | |
| fair bit | bit | one yes-or-no question |
| uniform on values | balanced uncertainty | |
| biased bit with | less than or equal to |
The visual separates two meanings of information. Before observing , the probability vector determines how uncertain the outcome is. After designing a code, entropy becomes a lower bound on the expected number of bits needed to communicate the value. These are the same quantity because an efficient questioning strategy spends short questions on likely outcomes and longer question sequences on rare outcomes.
Conditional entropy fits the same coding picture. If the decoder already knows , the sender only needs enough bits to distinguish the remaining possibilities for within the conditional distribution determined by . Averaging over the possible values gives . If is very informative, the remaining code can be short; if is irrelevant, no compression is gained.
Entropy is therefore not a property of the labels themselves. Renaming outcomes does not change entropy; only the probability vector matters. Numerical distances between labels matter for variance, but not for Shannon entropy.
Worked example 1: entropy of a biased coin
Problem: A coin lands heads with probability and tails with probability . Compute its entropy.
Method:
- The probabilities are
- Entropy is
- Use approximate logarithms:
- Substitute:
Checked answer: the entropy is about bits, less than the bit of a fair coin because the outcome is easier to guess.
Worked example 2: conditional entropy with a noisy copy
Problem: Let be a fair bit. Let with probability , and with probability . Compute and .
Method:
- Since is fair,
- By symmetry, is also fair.
- Given , the probability that is , and the probability that is .
- Therefore the conditional entropy for each observed is the entropy of :
- Approximate:
- Then
Checked answer: observing reduces uncertainty from bit to about bits, but does not eliminate it because the copy is noisy.
Code
from math import log2
def entropy(probs):
return -sum(p * log2(p) for p in probs if p > 0)
print("biased coin H(0.9,0.1):", entropy([0.9, 0.1]))
print("fair bit:", entropy([0.5, 0.5]))
print("uniform four values:", entropy([0.25] * 4))
conditional = entropy([0.8, 0.2])
print("H(X|Y) for noisy copy:", conditional)
print("information learned:", 1 - conditional)
Common pitfalls
- Using natural logarithms without tracking units. Natural logs give nats; base-2 logs give bits.
- Thinking entropy is the same as variance. Entropy measures uncertainty in a distribution, not numerical spread around a mean.
- Forgetting the convention .
- Assuming conditioning always reduces entropy for every individual observed value. The average conditional entropy is at most the original entropy.
- Confusing expected code length lower bounds with exact code lengths for one symbol. Entropy limits are approached most cleanly for long sequences.