Hash Functions and Random Oracles
Cryptographic hash functions compress arbitrary-length input into fixed-length digests while trying to preserve adversarial unpredictability properties. They are used for integrity checks, commitment schemes, password hashing, key derivation, signatures, Merkle trees, and MAC constructions such as HMAC. But a hash is not "just a checksum." It needs security properties defined against attackers.
Katz and Lindell separate collision resistance, weaker hash notions, Merkle-Damgard domain extension, birthday attacks, HMAC, applications, and the random-oracle model. Smart gives a more applied presentation of hash design, the MD4 family, SHA-style constructions, and MAC use. Together they show both sides: hash functions are concrete algorithms such as SHA-2 and SHA-3, and they are also idealized objects in proofs.
Definitions
A hash function is an efficient deterministic function
The output is a digest of fixed length .
Collision resistance means it is infeasible to find any two distinct inputs such that:
Since the domain is larger than the range, collisions exist. Security says they are hard to find.
Second-preimage resistance means that given , it is infeasible to find with .
Preimage resistance means that given a digest , it is infeasible to find any such that .
The Merkle-Damgard transform builds a variable-length hash from a fixed-length compression function. It iterates:
then outputs the final chaining value, usually after length padding. MD5, SHA-1, and SHA-2 are in this broad family, though their compression functions differ.
SHA-3, based on Keccak, uses a sponge construction instead of Merkle-Damgard. A sponge absorbs input into a larger internal state and squeezes output from it.
A random oracle is an ideal public random function. On each new input it returns an independent uniform output and remembers that answer for consistency. The random-oracle model proves security in a world where hash calls are replaced by this ideal oracle.
Key results
The birthday bound controls collision security. For an -bit hash, generic collision search takes about trials, not . After random hash outputs, the collision probability is approximately:
This is why 128-bit digests are not enough for 128-bit collision security; they give about 64-bit collision resistance. Modern collision-resistant uses often target 256-bit outputs or larger.
Merkle-Damgard needs strengthening, especially length padding, to preserve collision resistance from the compression function. Without unambiguous padding, different message decompositions can collide structurally. Even with proper padding, plain Merkle-Damgard exposes an internal-state style that can enable length-extension attacks on naive MACs such as .
HMAC avoids the naive hash-MAC problem by using nested keyed hashing. Its design separates the inner digest from the final output and includes distinct pads. This is why HMAC-SHA-256 remains widely used even though plain SHA-256 is not a MAC by itself.
The random-oracle model is useful but idealized. A proof in the random-oracle model says the scheme is secure if the hash behaves like a truly random function accessible to all parties. Real hash functions are deterministic public algorithms, not random oracles. Therefore such proofs are evidence, not the same kind of theorem as a standard-model reduction. Katz and Lindell emphasize this distinction: random-oracle proofs can guide design, but the methodology has known limitations.
Hash functions also support Merkle trees. A Merkle tree commits to many leaves with one root digest. A proof for one leaf contains sibling hashes along the path to the root, so verification takes hashes rather than downloading all leaves.
Domain separation is one of the simplest ways to avoid cross-protocol surprises. If the same hash function is used for commitments, Merkle leaves, Merkle internal nodes, password hashing, and transcript hashing, the inputs should be tagged so a value from one domain cannot be reinterpreted in another. For example, a tree may hash leaves as and internal nodes as . This prevents an internal-node encoding from also being a valid leaf encoding.
SHA-2 and SHA-3 have different internal designs, which is useful for algorithm diversity. SHA-256 and SHA-512 are Merkle-Damgard-style hashes built from compression functions. SHA-3 uses the Keccak sponge, with a capacity part of the state controlling security and a rate part controlling throughput. A sponge can naturally support extendable-output functions such as SHAKE, where the caller asks for as many output bytes as needed. The security property still depends on using the function with the right domain and output length.
Password hashing is a separate application with different cost goals. Fast hashes are good for signatures and Merkle trees, but bad for password storage because attackers can try guesses quickly. Password hashing uses salts and deliberately expensive functions to slow offline guessing. That is not a contradiction; it is a reminder that "hash function" names a family of tools, and the security goal determines the right construction.
Random oracles appear often in public-key proofs because they let a proof program oracle answers and extract information from adversary queries. This is powerful, but it creates a gap between the model and real hash functions. A scheme proven only in the random-oracle model should be implemented with conservative hash choices, clear encodings, and awareness that the proof is heuristic evidence about the concrete instantiation.
Collision resistance is not always the right property to ask for. A password reset token needs unpredictability and sufficient entropy, not merely collision resistance. A file identifier in a content-addressed store may need collision resistance because an attacker could try to substitute a different file with the same digest. A KDF needs pseudorandom-looking derived keys under assumptions about the input secret. Naming the exact property prevents both overclaiming and underprotecting.
Hash outputs also need enough bits for the intended threat. A 256-bit digest is common because it gives a comfortable collision margin of about generic work. If a digest is truncated, the security level changes immediately. Truncation can be safe when specified, but it must be included in the proof or concrete analysis.
Length encoding is part of safe hashing. If a construction hashes concatenated fields without delimiters, different structured inputs can collide before the hash function is even applied, such as (A,BC) and (AB,C). Cryptographic hashing cannot repair an ambiguous serialization. Canonical encodings and length prefixes are therefore part of the security boundary.
A good hash design is therefore paired with a good input format.
Both matter.
Visual
| Property | Given to attacker | Goal | Generic work for -bit ideal hash |
|---|---|---|---|
| Preimage resistance | digest | find with | about |
| Second preimage | input | find with same digest | about |
| Collision resistance | no target input | find any collision | about |
| Random oracle behavior | query access | outputs look random and consistent | ideal model, not a real algorithm |
Worked example 1: birthday collision estimate
Problem: estimate the probability of at least one collision after hashing random inputs with a 64-bit hash.
Method:
- Use:
- Substitute and :
- Approximate the numerator as :
- For small , , so:
Checked answer: about . A 64-bit hash may look safe for a million items, but not for adversarial collision resistance at modern security levels.
Worked example 2: Merkle proof verification
Problem: a four-leaf Merkle tree has leaves . The root is:
Verify leaf using proof sibling and upper sibling .
Method:
- Hash the target leaf with its sibling in the correct order:
- Combine with the upper sibling:
- Accept if:
- Direction matters. If is on the left and on the right, use , not .
Checked answer: the proof is valid exactly when the recomputed root equals the trusted root .
Code
import hashlib
def H(data: bytes) -> bytes:
return hashlib.sha256(data).digest()
def merkle_parent(left: bytes, right: bytes) -> bytes:
return H(left + right)
leaves = [H(f"leaf {i}".encode()) for i in range(4)]
a = merkle_parent(leaves[0], leaves[1])
b = merkle_parent(leaves[2], leaves[3])
root = merkle_parent(a, b)
proof_for_leaf2 = [("right", leaves[3]), ("left", a)]
current = leaves[2]
for side, sibling in proof_for_leaf2:
current = merkle_parent(current, sibling) if side == "right" else merkle_parent(sibling, current)
print(current == root)
Common pitfalls
- Using a non-cryptographic checksum where collision resistance is required.
- Assuming an -bit hash gives -bit collision security. Birthday attacks halve the exponent.
- Building a MAC as
hash(key || message)instead of HMAC. - Forgetting domain separation between different hash uses.
- Treating a random-oracle proof as if it directly proves security for SHA-256.
- Omitting leaf positions or directions in Merkle proofs.