Randomized Algorithms
Randomized algorithms use random choices as part of the computation. Randomness can simplify design, improve expected performance, avoid adversarial inputs, reduce memory, or allow approximate answers in streaming settings. The right question is not merely "can it fail?" but "what is random: the running time, the answer, or both, and with what probability?" [1], [5].
This page covers Las Vegas versus Monte Carlo algorithms, randomized quicksort and quickselect, skip lists, randomized hashing, Bloom filters, count-min sketches, Karger's min-cut algorithm, Johnson-Lindenstrauss projections, reservoir sampling, Misra-Gries heavy hitters, HyperLogLog, and Freivalds' matrix-product check. Motwani and Raghavan emphasize probability tools and classification; CLRS and Sedgewick connect the ideas to sorting, selection, hashing, and graph algorithms [1], [2], [5].
Figure: A skip list uses random levels to create express lanes over an ordered linked list. Image: Wikimedia Commons, public domain or CC-BY-SA via Wikimedia Commons.
Definitions
A Las Vegas algorithm always returns a correct answer, but its running time is random. Randomized quicksort is Las Vegas when implemented correctly: it always sorts, while the pivot choices determine the recursion shape. A Monte Carlo algorithm has a bounded probability of returning an incorrect answer, but usually has a deterministic or bounded running time. Miller-Rabin primality testing is Monte Carlo when bases are chosen randomly and no deterministic base theorem is invoked.
The expected running time is the average over the algorithm's own random choices, often independent of any input distribution. A bound that holds with high probability usually means probability at least for a chosen constant . Error probabilities can often be amplified by independent repetition: if one run fails with probability , then independent runs fail with probability .
A family of hash functions is universal if for distinct keys ,
where is the number of buckets. A Bloom filter is a probabilistic set-membership structure with false positives but no false negatives, assuming no deletions. A streaming algorithm processes data in one pass using memory sublinear in the stream length.
Key results
Randomized quicksort chooses a random pivot, partitions, and recursively sorts both sides. For any pair of elements, they are compared only if one of them is the first pivot chosen among the elements whose ranks lie between them. Summing that probability over all pairs gives expected comparisons [1]. Randomized quickselect recurses into only one side, yielding expected time.
Skip lists store each key in level 0 and promote it to higher levels by repeated coin flips [10]. Search starts at the highest level, moves right while possible, then drops down. Expected height is and expected search, insert, and delete time are . Skip lists provide balanced-tree performance with randomized structural simplicity.
Hashing benefits from randomization at several levels. Universal hashing prevents an adversary from knowing which keys collide before the hash function is selected. Perfect hashing for static sets can use a two-level scheme: first hash into buckets, then choose a collision-free secondary table for each bucket. Expected total space is linear when the first-level hash has controlled sum of squared bucket sizes. Cuckoo hashing gives constant worst-case lookup by storing each key in one of two or more possible cells and relocating keys on insertion, with occasional rebuilds.
Bloom filters use bits and hash functions. Inserting an item sets positions. Querying checks whether all positions are set. After insertions, the approximate false-positive probability is
Count-min sketches estimate frequencies using several hash tables. They overestimate counts because collisions add mass, and the minimum across rows controls the error. Misra-Gries maintains at most counters and finds all items with frequency more than in a stream. Reservoir sampling keeps a uniform random sample from a stream of unknown length.
Karger's randomized min-cut algorithm repeatedly contracts a uniformly random edge until two supernodes remain [9]. A fixed global min cut survives one contraction step with probability at least , then at least , and so on, yielding success probability at least for one run. Repeating times boosts success probability.
The Johnson-Lindenstrauss lemma states that points in high-dimensional Euclidean space can be embedded into dimensions while preserving all pairwise distances within a factor [11]. Random projections make this constructive and useful for dimensionality reduction.
Freivalds' algorithm verifies whether by choosing a random vector and checking whether [12]. If , the probability of a false acceptance is at most over suitable random binary vectors. This is much faster than multiplying matrices when only verification is needed.
HyperLogLog estimates the number of distinct elements by hashing stream elements and tracking leading-zero patterns across registers. It is probabilistic, mergeable, and widely used in analytics systems. Its lesson is broader: randomized algorithms often trade exactness for compact summaries with quantifiable error.
Visual
| Algorithm or structure | Type | Guarantee | Main parameter |
|---|---|---|---|
| Randomized quicksort | Las Vegas | expected | pivot distribution |
| Randomized quickselect | Las Vegas | expected | pivot distribution |
| Skip list | Las Vegas structure | expected operations | promotion probability |
| Universal hashing | randomized hashing | expected short chains | hash family |
| Bloom filter | Monte Carlo data structure | no false negatives, tunable false positives | |
| Count-min sketch | Monte Carlo estimate | overestimates within probabilistic error | width and depth |
| Karger min-cut | Monte Carlo | success boosted by repetition | number of trials |
| Freivalds | Monte Carlo verifier | false accept at most per run | random vector |
Worked example 1: reservoir sampling correctness for k = 1
Problem. A stream has unknown length . Algorithm: keep the first item; when item arrives for , replace the current sample with probability . Prove each item is retained with probability after processing items.
Method.
- Consider item . It is selected when it arrives with probability .
- For each later item , item survives if it is not replaced. The probability of not being replaced at step is .
- Multiply selection and survival probabilities:
- The product telescopes:
- Therefore
Checked answer. Every item has equal probability of being the final sample, even though the algorithm never knows in advance.
Worked example 2: Bloom filter false-positive computation
Problem. A Bloom filter has bits, hash functions, and stores inserted keys. Estimate the false-positive probability.
Method.
- Probability a fixed bit remains 0 after one hash placement is .
- There are hash placements, so probability a fixed bit is still 0 is approximately
- Probability the bit is 1 is about .
- A query for an absent key checks positions. Approximate them as independent:
Checked answer. The false-positive rate is about . There are no false negatives for inserted keys unless deletion or implementation error is introduced.
Code
import random
class SkipNode:
def __init__(self, key, level):
self.key = key
self.forward = [None] * level
class SkipListBasic:
def __init__(self, p=0.5, max_level=16):
self.p = p
self.max_level = max_level
self.level = 1
self.head = SkipNode(None, max_level)
def _random_level(self):
level = 1
while level < self.max_level and random.random() < self.p:
level += 1
return level
def contains(self, key):
node = self.head
for i in reversed(range(self.level)):
while node.forward[i] and node.forward[i].key < key:
node = node.forward[i]
node = node.forward[0]
return node is not None and node.key == key
def insert(self, key):
update = [None] * self.max_level
node = self.head
for i in reversed(range(self.level)):
while node.forward[i] and node.forward[i].key < key:
node = node.forward[i]
update[i] = node
if node.forward[0] and node.forward[0].key == key:
return
new_level = self._random_level()
if new_level > self.level:
for i in range(self.level, new_level):
update[i] = self.head
self.level = new_level
new_node = SkipNode(key, new_level)
for i in range(new_level):
new_node.forward[i] = update[i].forward[i]
update[i].forward[i] = new_node
def reservoir_sample(stream, k):
sample = []
for i, item in enumerate(stream, start=1):
if i <= k:
sample.append(item)
else:
j = random.randrange(i)
if j < k:
sample[j] = item
return sample
class BloomFilter:
def __init__(self, bits=1024, hashes=4):
self.bits = bits
self.hashes = hashes
self.table = [0] * bits
def _positions(self, item):
for seed in range(self.hashes):
yield hash((seed, item)) % self.bits
def add(self, item):
for pos in self._positions(item):
self.table[pos] = 1
def might_contain(self, item):
return all(self.table[pos] for pos in self._positions(item))
Common pitfalls
- Saying "average case" when the expectation is over algorithmic randomness, not input distribution.
- Calling a Monte Carlo algorithm Las Vegas because it usually works.
- Forgetting to amplify error probabilities through independent trials.
- Using a predictable hash seed where adversaries can force collisions.
- Treating Bloom filter positives as proof of membership.
- Adding deletion to a standard Bloom filter without counters and then creating false negatives.
- Assuming randomized quicksort has worst-case time.
- Implementing reservoir sampling with probability after the stream length is already known, which misses the online point.
- Reporting count-min sketch estimates as exact counts.
- Running Karger's min-cut only once and expecting high success probability.
- Using Johnson-Lindenstrauss projections without checking the target dimension for the desired and failure probability.
- Forgetting that Python's built-in hash may be salted and process-specific.
Connections
- Sorting Algorithms for randomized quicksort.
- Searching Algorithms for randomized quickselect, hashing, and skip lists.
- Graph Algorithms for Karger's min-cut and randomized graph procedures.
- Number-Theoretic and Algebraic Algorithms for Miller-Rabin, Pollard rho, and Freivalds-style algebraic checking.
- Approximation Algorithms for randomized approximation and probabilistic guarantees.
- Discrete Math for probability, expectation, and concentration bounds.
References
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. MIT Press, 2022.
[2] R. Sedgewick and K. Wayne, Algorithms, 4th ed. Addison-Wesley, 2011.
[3] J. Kleinberg and E. Tardos, Algorithm Design. Pearson, 2005.
[4] K. Mehlhorn and P. Sanders, Algorithms and Data Structures: The Basic Toolbox. Springer, 2008.
[5] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1995.
[6] M. Mitzenmacher and E. Upfal, Probability and Computing, 2nd ed. Cambridge University Press, 2017.
[7] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
[8] M. O. Rabin, "Probabilistic algorithms," in Algorithms and Complexity: New Directions and Recent Results, Academic Press, 1976.
[9] D. R. Karger, "Global min-cuts in RNC, and other ramifications of a simple min-out algorithm," SODA, pp. 21-30, 1993.
[10] W. Pugh, "Skip lists: A probabilistic alternative to balanced trees," Communications of the ACM, vol. 33, no. 6, pp. 668-676, 1990.
[11] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189-206, 1984.
[12] R. Freivalds, "Fast probabilistic algorithms," Mathematical Foundations of Computer Science, pp. 57-69, 1979.
[13] G. Cormode and S. Muthukrishnan, "An improved data stream summary: The count-min sketch and its applications," Journal of Algorithms, vol. 55, no. 1, pp. 58-75, 2005.
[14] P. Flajolet et al., "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm," AofA, 2007.