Searching Algorithms
Searching asks where an item is, whether it exists, or which item has a specified rank. The problem looks elementary, but its variants cover a large part of algorithms: sequential scans, binary decisions over sorted order, selection by partitioning, hash-table lookup, balanced search trees, external-memory indexes, and probabilistic skip structures [1], [2].
The right search method depends on what structure is available before the query arrives. If the data is unsorted and not indexed, a linear scan is unavoidable in the worst case. If the data is sorted, comparison search can discard half or more of the remaining candidates per step. If many queries arrive over a changing set, hash tables and balanced trees shift work into an updateable index. This page connects these options and emphasizes the invariants that make search code correct.
Figure: Binary search repeatedly halves a sorted array. Image: Wikimedia Commons, public domain or CC-BY-SA via Wikimedia Commons.
Figure: Logarithmic search grows slowly as the array grows. Image: Wikimedia Commons, public domain or CC-BY-SA via Wikimedia Commons.
Definitions
The basic dictionary search problem maintains a set of keys and supports find(x), which returns whether and often returns the associated record. A static search structure is built once and queried many times. A dynamic search structure also supports insertions and deletions.
For arrays, a common contract is to return an index with , or -1 if no such index exists. For duplicate keys, the contract must be explicit: any occurrence, first occurrence, last occurrence, lower bound, and upper bound are different problems. The lower bound of is the first index such that ; the upper bound is the first index such that .
An order statistic is the element of rank in sorted order. The minimum is rank , the maximum is rank , and the median is rank or one of the two central ranks, depending on convention. Quickselect solves this by partitioning rather than fully sorting. Median-of-medians gives deterministic worst-case selection [8].
A hash table maps keys to array positions through a hash function . With chaining, each table slot stores a list or small bucket of keys. With open addressing, all keys live directly in the table and collisions are resolved by probing. The load factor is the ratio between stored keys and table slots; it controls expected lookup cost under ordinary hashing assumptions. Cuckoo hashing uses two or more possible positions per key and moves existing keys when necessary [9].
Key results
Linear search scans each element until it finds the target or exhausts the input. It is optimal when nothing is known about the array: an adversary can place the target in the final unchecked position. Its cost is time and space, and it is the fallback inside many more structured algorithms for tiny subproblems.
Binary search assumes sorted order and keeps an interval of possible positions. The invariant is usually half-open: the answer, if it exists, lies in . At each step compute mid = lo + (hi - lo) // 2, compare with the target, and discard one side. It needs comparisons. Lower-bound binary search is often more useful than equality search because it handles duplicates and insertion positions cleanly.
Ternary search is different in discrete arrays and continuous unimodal optimization. On a sorted array, ternary search is worse than binary search because it uses more comparisons to reduce the interval. On a unimodal function, it compares two interior points and discards the side that cannot contain the maximum or minimum. Interpolation search estimates the target's position from key values:
It can have expected time for uniformly distributed keys, but it degrades to on clustered or adversarial data [4]. Exponential search first finds a range of size containing the target, then binary searches inside it; this is useful for unbounded arrays and for targets near the front.
Several interview-famous array searches are binary-search variants over a predicate. In a rotated sorted array, compare the target with the sorted half to decide which side can contain it. For the first occurrence of a duplicate, keep searching left after equality. For a peak in a mountain array, compare adjacent elements: if , the peak lies to the right; otherwise it lies at or left of mid.
Selection asks for rank rather than membership. Quickselect uses the same partition idea as quicksort [7]. If the pivot lands at rank , recurse only into the side containing rank . Random pivots give expected time because the total expected size of recursive subproblems decreases geometrically. Median-of-medians divides the input into groups of five, recursively selects the median of group medians, and partitions around it. At least elements are guaranteed on each side, which yields the recurrence [8].
Hash-based search gives expected lookup when the hash function spreads keys well and the load factor is controlled. Chaining is simple and robust under deletion. Open addressing improves locality but requires careful handling of tombstones and resizing. Linear probing has cache-friendly runs but suffers from primary clustering; quadratic probing and double hashing reduce some clustering effects. Cuckoo hashing offers worst-case constant lookup after successful insertion because each key has a small fixed set of candidate cells, but insertion can trigger cycles and table rebuilds [9].
Tree-based search preserves sorted order. A plain binary search tree can degrade to a linked list, but AVL trees and red-black trees maintain height through rotations. B-trees and B+ trees store many keys per node so a search visits few disk pages or cache lines; they are the standard structure behind database and filesystem indexes. Skip lists use random levels to simulate balanced search paths with simpler code and expected time [10].
Visual
| Method | Precondition | Time | Space | Best use |
|---|---|---|---|---|
| Linear search | none | Tiny arrays, unsorted one-shot data | ||
| Binary search | sorted random access | Membership, lower bound, duplicates | ||
| Exponential search | sorted, unknown size or near-front targets | for position | Streams with random access blocks | |
| Interpolation search | roughly uniform numeric keys | expected , worst | Dense numeric tables | |
| Quickselect | random access | expected | expected stack | k-th smallest, median |
| Median-of-medians | random access | worst | recursion stack | Adversarial selection |
| Hash table | hashable keys | expected | Dynamic dictionaries | |
| Balanced tree | ordered keys | Range queries and ordered iteration |
Worked example 1: binary search for first occurrence
Problem. Find the first occurrence of in
Method. Use a lower-bound search for the first index with . Keep the half-open invariant that the answer is in .
- Start with , .
mid = 3, . Since , the first valid position is at or left of 3, so set .- Now , .
mid = 1, . Since , discard positions through 1 and set . - Now , .
mid = 2, . Set . - Now , , so the loop stops.
Checked answer. The lower bound is index . Because and , the first occurrence exists and is at index . If the target had been , the same search would return insertion position , and the equality check would reject it.
Worked example 2: Quickselect for the k-th smallest
Problem. Find the 4th smallest element of
Use 1-based rank .
Method.
- Choose pivot . Partition into , , .
- The pivot's rank range is positions through , namely rank .
- Since , recurse into with the same rank .
- In , choose pivot . Partition into , , .
- Pivot has rank inside this subarray.
- Since the desired rank is , recurse into with adjusted rank .
- A one-element subarray returns .
Checked answer. Sorting the original array gives , whose 4th element is . Quickselect found it without sorting the right side of the first partition.
Code
from random import randrange
def lower_bound(a, x):
lo, hi = 0, len(a)
while lo < hi:
mid = lo + (hi - lo) // 2
if a[mid] < x:
lo = mid + 1
else:
hi = mid
return lo
def binary_search_first(a, x):
i = lower_bound(a, x)
return i if i < len(a) and a[i] == x else -1
def quickselect(a, k):
"""Return the k-th smallest element for zero-based k."""
if not 0 <= k < len(a):
raise IndexError("rank out of range")
values = list(a)
while True:
pivot = values[randrange(len(values))]
lows = [x for x in values if x < pivot]
highs = [x for x in values if x > pivot]
pivots = [x for x in values if x == pivot]
if k < len(lows):
values = lows
elif k < len(lows) + len(pivots):
return pivot
else:
k -= len(lows) + len(pivots)
values = highs
if __name__ == "__main__":
a = [1, 2, 4, 4, 4, 7, 9]
print(binary_search_first(a, 4))
print(quickselect([8, 3, 2, 7, 4, 1, 6, 5], 3))
Common pitfalls
- Writing binary search without a clear invariant, then patching off-by-one errors by trial.
- Returning the first equality seen when the required contract is first occurrence.
- Using
(lo + hi) // 2in languages where integer overflow is possible. - Applying binary search to a list that is only "usually sorted" after recent updates.
- Confusing ternary search on sorted arrays with ternary search on unimodal functions.
- Assuming interpolation search has logarithmic worst-case time.
- Implementing quickselect but recursing into both sides, which turns it back into quicksort-like work.
- Forgetting to adjust after discarding the lower partition in selection.
- Letting hash-table load factor grow without resizing.
- Deleting from open-addressed hash tables without tombstones or cluster repair.
- Using a plain BST when adversarial insertion order can make height .
- Choosing a B-tree minimum degree without considering cache-line and page size.
Connections
- Sorting Algorithms because sorted order enables binary, interpolation, and exponential search.
- Randomized Algorithms for randomized quickselect, skip lists, and hashing assumptions.
- Data Structures for hash tables, balanced trees, B-trees, and skip lists.
- Divide and Conquer for binary search and selection recurrences.
- Approximation Algorithms where binary search over an answer is often paired with feasibility tests.
- Discrete Math for order, predicates, and proof by invariant.
References
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. MIT Press, 2022.
[2] R. Sedgewick and K. Wayne, Algorithms, 4th ed. Addison-Wesley, 2011.
[3] J. Kleinberg and E. Tardos, Algorithm Design. Pearson, 2005.
[4] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998.
[5] S. S. Skiena, The Algorithm Design Manual, 3rd ed. Springer, 2020.
[6] K. Mehlhorn and P. Sanders, Algorithms and Data Structures: The Basic Toolbox. Springer, 2008.
[7] C. A. R. Hoare, "Algorithm 65: Find," Communications of the ACM, vol. 4, no. 7, pp. 321-322, 1961. https://doi.org/10.1145/366622.366647
[8] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, "Time bounds for selection," Journal of Computer and System Sciences, vol. 7, no. 4, pp. 448-461, 1973. https://doi.org/10.1016/S0022-0000(73)80033-9
[9] R. Pagh and F. F. Rodler, "Cuckoo hashing," Journal of Algorithms, vol. 51, no. 2, pp. 122-144, 2004. https://doi.org/10.1016/j.jalgor.2003.12.002
[10] W. Pugh, "Skip lists: A probabilistic alternative to balanced trees," Communications of the ACM, vol. 33, no. 6, pp. 668-676, 1990. https://doi.org/10.1145/78973.78977
[11] G. M. Adelson-Velsky and E. M. Landis, "An algorithm for the organization of information," Soviet Mathematics Doklady, vol. 3, pp. 1259-1263, 1962.
[12] R. Bayer and E. M. McCreight, "Organization and maintenance of large ordered indexes," Acta Informatica, vol. 1, pp. 173-189, 1972.