String Algorithms
String algorithms search, compare, index, compress, and approximately match sequences of symbols. The symbols may be characters in text, bytes in files, tokens in natural language, bases in DNA, or events in a log. The core challenge is avoiding repeated work: naive matching compares the same prefixes again and again, while KMP, Z-values, suffix arrays, suffix trees, tries, and automata reuse information about previous comparisons [1], [2].
This page covers exact single-pattern matching, rolling hashes, multi-pattern matching, suffix-based indexes, and a short bridge to edit distance and compressed search. The theme is preprocessing. Sometimes preprocess the pattern, as KMP does; sometimes preprocess a set of patterns, as Aho-Corasick does; sometimes preprocess the text, as suffix arrays and FM-indexes do [7], [9], [12].
Figure: A trie stores shared prefixes once, which is the structural basis for multi-pattern string search. Image: Wikimedia Commons, public domain or CC-BY-SA via Wikimedia Commons.
Definitions
Let be a text of length and be a pattern of length . Exact pattern matching asks for all positions such that
The naive algorithm checks every alignment and compares up to characters, taking time in the worst case.
A prefix of a string is a substring starting at position . A suffix ends at the final position. A border is a nonempty proper prefix that is also a suffix. KMP's prefix function is the length of the longest border of . The Z-array is the length of the longest substring starting at that matches the prefix of the string.
A trie stores strings by edges labeled with characters, sharing common prefixes. A failure link points from a trie node to the longest proper suffix of that node's string that is also a trie prefix. A suffix array is an array of starting positions of all suffixes of in lexicographic order. The LCP array stores longest common prefix lengths between adjacent suffixes in suffix-array order.
A rolling hash assigns a numeric fingerprint to a substring and updates it in constant time when the window shifts. Hash equality is a candidate match, not necessarily a proof, unless the algorithm verifies characters or uses a collision-free setting.
Key results
Naive matching is simple and sometimes fast enough, especially for very short patterns or when library routines use vectorized comparisons. Its worst case is poor: searching AAAAAB in AAAAAAAAAAAAAAAA repeats almost the same comparisons at many shifts.
Knuth-Morris-Pratt preprocesses the pattern into the prefix function, then scans the text without moving the text index backward [7]. If a mismatch occurs after matching characters, the next possible matched prefix length is . The reason is border structure: any future match ending at the current text position must be a border of the matched pattern prefix. KMP runs in time and space.
Rabin-Karp computes a rolling hash of each length- text window and compares it with the pattern hash [10]. A typical polynomial hash is
When the window shifts, subtract the outgoing character, multiply by , add the incoming character, and reduce modulo . The algorithm has expected time with a good modulus and randomization, but adversarial collisions can force verification work. Its advantage is multi-pattern matching when many patterns of the same length share a hash table.
Boyer-Moore compares the pattern from right to left and uses two shift rules [8]. The bad-character rule shifts the pattern so the mismatched text character aligns with its previous occurrence in the pattern, if any. The good-suffix rule shifts based on a suffix that matched before the mismatch. Full Boyer-Moore has excellent practical performance and sublinear behavior on many texts because it can skip alignments entirely.
Aho-Corasick builds a trie of all patterns and adds failure links [9]. Scanning the text follows trie edges when possible and failure links otherwise. Whenever the automaton reaches a node with output patterns, those patterns occur ending at the current text position. The runtime is .
The Z-algorithm computes all values in linear time by maintaining a rightmost interval known to match the prefix. Prefix-function and Z-array views are interchangeable for many tasks. For pattern matching, compute Z on P + sentinel + T; positions where are matches.
Suffix arrays and suffix trees index a fixed text for many queries. A suffix array can be built by prefix doubling in time or by linear-time algorithms such as SA-IS [12]. Pattern search uses binary search over suffixes in time, or faster with LCP acceleration. A suffix tree compactly represents all suffixes and supports exact pattern queries after construction. Ukkonen's algorithm builds a suffix tree online in linear time, but it is implementation-heavy [11].
Longest common substring between two strings can be found by building a suffix array for A + sentinel + B, computing LCPs, and taking the maximum LCP between adjacent suffixes from different original strings. Edit distance and approximate matching use dynamic programming rather than exact border reuse; they connect string algorithms back to the DP page. Burrows-Wheeler transform and FM-indexes compress the text while retaining search capability. They support backward search over character ranges and are fundamental in read alignment and compressed text indexes.
Visual
| Structure | Preprocesses | Query style | Time highlight |
|---|---|---|---|
| KMP prefix function | pattern | exact one-pattern scan | |
| Rabin-Karp hash | pattern and rolling text windows | exact candidates with verification | expected |
| Boyer-Moore tables | pattern | right-to-left comparisons | often sublinear practical scans |
| Aho-Corasick trie | pattern set | multi-pattern stream | linear plus outputs |
| Z-array | combined string | prefix-match lengths | |
| Suffix array | text | many pattern queries | build or better |
| Suffix tree | text | substring index | query |
| FM-index | compressed text | backward search | depends on rank/select support |
Worked example 1: KMP failure function for ABABCABAB
Problem. Compute the prefix function for pattern ABABCABAB.
Method. Let be the longest proper border length for the prefix ending at .
| prefix | current char | longest border | ||
|---|---|---|---|---|
| 0 | A | A | none | 0 |
| 1 | AB | B | none | 0 |
| 2 | ABA | A | A | 1 |
| 3 | ABAB | B | AB | 2 |
| 4 | ABABC | C | none after fallback | 0 |
| 5 | ABABCA | A | A | 1 |
| 6 | ABABCAB | B | AB | 2 |
| 7 | ABABCABA | A | ABA | 3 |
| 8 | ABABCABAB | B | ABAB | 4 |
Check one nontrivial fallback. At , the matched border length before reading C is , expecting pattern character A, but the character is C. Fall back from length to . At length , C still does not match A, so .
Checked answer. The prefix function is
This means that after matching the full pattern and then seeing a mismatch, KMP can resume as if the prefix ABAB had already matched.
Worked example 2: Rabin-Karp matching ABAB in ABCABABAB
Problem. Find pattern ABAB in text ABCABABAB using a simple rolling hash. Use character values , , , base , and modulus .
Method.
- Pattern hash:
- First text window
ABCA:
No match.
- Roll to
BCAB. Remove , multiply by base, add :
No match.
- Roll to
CABA:
No match.
- Roll to
ABAB:
Hash matches. Verify characters: text positions 3 through 6 are exactly ABAB.
- Roll to the final
BABA, thenABABat positions 5 through 8. The final window also hashes to and verifies.
Checked answer. Matches start at indices and using zero-based indexing.
Code
def prefix_function(pattern):
pi = [0] * len(pattern)
j = 0
for i in range(1, len(pattern)):
while j > 0 and pattern[i] != pattern[j]:
j = pi[j - 1]
if pattern[i] == pattern[j]:
j += 1
pi[i] = j
return pi
def kmp_search(text, pattern):
if pattern == "":
return list(range(len(text) + 1))
pi = prefix_function(pattern)
matches = []
j = 0
for i, ch in enumerate(text):
while j > 0 and ch != pattern[j]:
j = pi[j - 1]
if ch == pattern[j]:
j += 1
if j == len(pattern):
matches.append(i - len(pattern) + 1)
j = pi[j - 1]
return matches
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
m = len(pattern)
if m == 0:
return list(range(len(text) + 1))
if m > len(text):
return []
high = pow(base, m - 1, mod)
ph = th = 0
for i in range(m):
ph = (ph * base + ord(pattern[i])) % mod
th = (th * base + ord(text[i])) % mod
matches = []
for i in range(len(text) - m + 1):
if ph == th and text[i:i + m] == pattern:
matches.append(i)
if i + m < len(text):
th = (th - ord(text[i]) * high) % mod
th = (th * base + ord(text[i + m])) % mod
return matches
def z_function(s):
z = [0] * len(s)
left = right = 0
for i in range(1, len(s)):
if i <= right:
z[i] = min(right - i + 1, z[i - left])
while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] - 1 > right:
left, right = i, i + z[i] - 1
return z
Common pitfalls
- Treating hash equality as proof of equality without verification or a collision argument.
- Building the KMP prefix table over the text instead of the pattern for standard scanning.
- Forgetting to fall back repeatedly in KMP; one fallback may not be enough.
- Using a sentinel in
P + sentinel + Tthat can also appear in the input. - Claiming Boyer-Moore is worst-case linear without the full set of rules and implementation details.
- Reporting overlapping matches incorrectly after a full KMP match.
- Storing every suffix string explicitly when building a suffix array, causing quadratic memory use.
- Confusing longest common subsequence with longest common substring.
- Implementing Aho-Corasick outputs only at terminal nodes and missing patterns inherited through failure links.
- Using a tiny modulus in Rabin-Karp on adversarial input.
- Ignoring Unicode normalization when matching human-language text.
- Applying byte-oriented algorithms directly to variable-width encoded text without defining the symbol alphabet.
Connections
- Dynamic Programming for edit distance, LCS, and approximate matching.
- Searching Algorithms for binary search over suffix arrays.
- Data Structures for tries, automata, arrays, and hash tables.
- Randomized Algorithms for rolling-hash collision analysis and randomized fingerprints.
- Number-Theoretic and Algebraic Algorithms for modular hashing and FFT-based convolution matching.
- Theory of Computation for finite automata and regular languages.
References
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. MIT Press, 2022.
[2] R. Sedgewick and K. Wayne, Algorithms, 4th ed. Addison-Wesley, 2011.
[3] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998.
[4] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings. Cambridge University Press, 2002.
[5] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[6] K. Mehlhorn and P. Sanders, Algorithms and Data Structures: The Basic Toolbox. Springer, 2008.
[7] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, "Fast pattern matching in strings," SIAM Journal on Computing, vol. 6, no. 2, pp. 323-350, 1977.
[8] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, vol. 20, no. 10, pp. 762-772, 1977. https://doi.org/10.1145/359842.359859
[9] A. V. Aho and M. J. Corasick, "Efficient string matching: An aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, pp. 333-340, 1975.
[10] R. M. Karp and M. O. Rabin, "Efficient randomized pattern-matching algorithms," IBM Journal of Research and Development, vol. 31, no. 2, pp. 249-260, 1987.
[11] E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, pp. 249-260, 1995.
[12] G. Nong, S. Zhang, and W. H. Chan, "Linear suffix array construction by almost pure induced-sorting," Data Compression Conference, pp. 193-202, 2009.
[13] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital SRC Research Report 124, 1994.