Skip to main content

String Algorithms

String algorithms search, compare, index, compress, and approximately match sequences of symbols. The symbols may be characters in text, bytes in files, tokens in natural language, bases in DNA, or events in a log. The core challenge is avoiding repeated work: naive matching compares the same prefixes again and again, while KMP, Z-values, suffix arrays, suffix trees, tries, and automata reuse information about previous comparisons [1], [2].

This page covers exact single-pattern matching, rolling hashes, multi-pattern matching, suffix-based indexes, and a short bridge to edit distance and compressed search. The theme is preprocessing. Sometimes preprocess the pattern, as KMP does; sometimes preprocess a set of patterns, as Aho-Corasick does; sometimes preprocess the text, as suffix arrays and FM-indexes do [7], [9], [12].

Trie example.

Figure: A trie stores shared prefixes once, which is the structural basis for multi-pattern string search. Image: Wikimedia Commons, public domain or CC-BY-SA via Wikimedia Commons.

Definitions

Let TT be a text of length nn and PP be a pattern of length mm. Exact pattern matching asks for all positions ii such that

T[i:i+m]=P.T[i:i+m]=P.

The naive algorithm checks every alignment and compares up to mm characters, taking O(nm)O(nm) time in the worst case.

A prefix of a string is a substring starting at position 00. A suffix ends at the final position. A border is a nonempty proper prefix that is also a suffix. KMP's prefix function π[i]\pi[i] is the length of the longest border of P[0:i+1]P[0:i+1]. The Z-array Z[i]Z[i] is the length of the longest substring starting at ii that matches the prefix of the string.

A trie stores strings by edges labeled with characters, sharing common prefixes. A failure link points from a trie node to the longest proper suffix of that node's string that is also a trie prefix. A suffix array is an array of starting positions of all suffixes of TT in lexicographic order. The LCP array stores longest common prefix lengths between adjacent suffixes in suffix-array order.

A rolling hash assigns a numeric fingerprint to a substring and updates it in constant time when the window shifts. Hash equality is a candidate match, not necessarily a proof, unless the algorithm verifies characters or uses a collision-free setting.

Key results

Naive matching is simple and sometimes fast enough, especially for very short patterns or when library routines use vectorized comparisons. Its worst case is poor: searching AAAAAB in AAAAAAAAAAAAAAAA repeats almost the same comparisons at many shifts.

Knuth-Morris-Pratt preprocesses the pattern into the prefix function, then scans the text without moving the text index backward [7]. If a mismatch occurs after matching qq characters, the next possible matched prefix length is π[q1]\pi[q-1]. The reason is border structure: any future match ending at the current text position must be a border of the matched pattern prefix. KMP runs in O(n+m)O(n+m) time and O(m)O(m) space.

Rabin-Karp computes a rolling hash of each length-mm text window and compares it with the pattern hash [10]. A typical polynomial hash is

H(S)=i=0m1S[i]bm1imodq.H(S)=\sum_{i=0}^{m-1}S[i]b^{m-1-i}\bmod q.

When the window shifts, subtract the outgoing character, multiply by bb, add the incoming character, and reduce modulo qq. The algorithm has expected O(n+m)O(n+m) time with a good modulus and randomization, but adversarial collisions can force verification work. Its advantage is multi-pattern matching when many patterns of the same length share a hash table.

Boyer-Moore compares the pattern from right to left and uses two shift rules [8]. The bad-character rule shifts the pattern so the mismatched text character aligns with its previous occurrence in the pattern, if any. The good-suffix rule shifts based on a suffix that matched before the mismatch. Full Boyer-Moore has excellent practical performance and sublinear behavior on many texts because it can skip alignments entirely.

Aho-Corasick builds a trie of all patterns and adds failure links [9]. Scanning the text follows trie edges when possible and failure links otherwise. Whenever the automaton reaches a node with output patterns, those patterns occur ending at the current text position. The runtime is O(n+total pattern length+number of matches)O(n+\text{total pattern length}+\text{number of matches}).

The Z-algorithm computes all Z[i]Z[i] values in linear time by maintaining a rightmost interval [L,R][L,R] known to match the prefix. Prefix-function and Z-array views are interchangeable for many tasks. For pattern matching, compute Z on P + sentinel + T; positions where Z[i]mZ[i]\ge m are matches.

Suffix arrays and suffix trees index a fixed text for many queries. A suffix array can be built by prefix doubling in O(nlogn)O(n\log n) time or by linear-time algorithms such as SA-IS [12]. Pattern search uses binary search over suffixes in O(mlogn)O(m\log n) time, or faster with LCP acceleration. A suffix tree compactly represents all suffixes and supports O(m+occ)O(m+\text{occ}) exact pattern queries after construction. Ukkonen's algorithm builds a suffix tree online in linear time, but it is implementation-heavy [11].

Longest common substring between two strings can be found by building a suffix array for A + sentinel + B, computing LCPs, and taking the maximum LCP between adjacent suffixes from different original strings. Edit distance and approximate matching use dynamic programming rather than exact border reuse; they connect string algorithms back to the DP page. Burrows-Wheeler transform and FM-indexes compress the text while retaining search capability. They support backward search over character ranges and are fundamental in read alignment and compressed text indexes.

Visual

StructurePreprocessesQuery styleTime highlight
KMP prefix functionpatternexact one-pattern scanO(n+m)O(n+m)
Rabin-Karp hashpattern and rolling text windowsexact candidates with verificationexpected O(n+m)O(n+m)
Boyer-Moore tablespatternright-to-left comparisonsoften sublinear practical scans
Aho-Corasick triepattern setmulti-pattern streamlinear plus outputs
Z-arraycombined stringprefix-match lengthsO(n+m)O(n+m)
Suffix arraytextmany pattern queriesbuild O(nlogn)O(n\log n) or better
Suffix treetextsubstring indexquery O(m+occ)O(m+\text{occ})
FM-indexcompressed textbackward searchdepends on rank/select support

Worked example 1: KMP failure function for ABABCABAB

Problem. Compute the prefix function π\pi for pattern ABABCABAB.

Method. Let π[i]\pi[i] be the longest proper border length for the prefix ending at ii.

iiprefixcurrent charlongest borderπ[i]\pi[i]
0AAnone0
1ABBnone0
2ABAAA1
3ABABBAB2
4ABABCCnone after fallback0
5ABABCAAA1
6ABABCABBAB2
7ABABCABAAABA3
8ABABCABABBABAB4

Check one nontrivial fallback. At i=4i=4, the matched border length before reading C is 22, expecting pattern character A, but the character is C. Fall back from length 22 to π[1]=0\pi[1]=0. At length 00, C still does not match A, so π[4]=0\pi[4]=0.

Checked answer. The prefix function is

[0,0,1,2,0,1,2,3,4].[0,0,1,2,0,1,2,3,4].

This means that after matching the full pattern and then seeing a mismatch, KMP can resume as if the prefix ABAB had already matched.

Worked example 2: Rabin-Karp matching ABAB in ABCABABAB

Problem. Find pattern ABAB in text ABCABABAB using a simple rolling hash. Use character values A=1A=1, B=2B=2, C=3C=3, base 1010, and modulus 101101.

Method.

  1. Pattern hash:

H(P)=1103+2102+110+2=12120(mod101).H(P)=1\cdot10^3+2\cdot10^2+1\cdot10+2=1212\equiv0\pmod{101}.

  1. First text window ABCA:

H=11000+2100+310+1=123119.H=1\cdot1000+2\cdot100+3\cdot10+1=1231\equiv19.

No match.

  1. Roll to BCAB. Remove A=1A=1, multiply by base, add B=2B=2:

((123111000)10+2)=231290.((1231-1\cdot1000)\cdot10+2)=2312\equiv90.

No match.

  1. Roll to CABA:

((231221000)10+1)=312191.((2312-2\cdot1000)\cdot10+1)=3121\equiv91.

No match.

  1. Roll to ABAB:

((312131000)10+2)=12120.((3121-3\cdot1000)\cdot10+2)=1212\equiv0.

Hash matches. Verify characters: text positions 3 through 6 are exactly ABAB.

  1. Roll to the final BABA, then ABAB at positions 5 through 8. The final window also hashes to 00 and verifies.

Checked answer. Matches start at indices 33 and 55 using zero-based indexing.

Code

def prefix_function(pattern):
pi = [0] * len(pattern)
j = 0
for i in range(1, len(pattern)):
while j > 0 and pattern[i] != pattern[j]:
j = pi[j - 1]
if pattern[i] == pattern[j]:
j += 1
pi[i] = j
return pi

def kmp_search(text, pattern):
if pattern == "":
return list(range(len(text) + 1))
pi = prefix_function(pattern)
matches = []
j = 0
for i, ch in enumerate(text):
while j > 0 and ch != pattern[j]:
j = pi[j - 1]
if ch == pattern[j]:
j += 1
if j == len(pattern):
matches.append(i - len(pattern) + 1)
j = pi[j - 1]
return matches

def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
m = len(pattern)
if m == 0:
return list(range(len(text) + 1))
if m > len(text):
return []
high = pow(base, m - 1, mod)
ph = th = 0
for i in range(m):
ph = (ph * base + ord(pattern[i])) % mod
th = (th * base + ord(text[i])) % mod
matches = []
for i in range(len(text) - m + 1):
if ph == th and text[i:i + m] == pattern:
matches.append(i)
if i + m < len(text):
th = (th - ord(text[i]) * high) % mod
th = (th * base + ord(text[i + m])) % mod
return matches

def z_function(s):
z = [0] * len(s)
left = right = 0
for i in range(1, len(s)):
if i <= right:
z[i] = min(right - i + 1, z[i - left])
while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] - 1 > right:
left, right = i, i + z[i] - 1
return z

Common pitfalls

  • Treating hash equality as proof of equality without verification or a collision argument.
  • Building the KMP prefix table over the text instead of the pattern for standard scanning.
  • Forgetting to fall back repeatedly in KMP; one fallback may not be enough.
  • Using a sentinel in P + sentinel + T that can also appear in the input.
  • Claiming Boyer-Moore is worst-case linear without the full set of rules and implementation details.
  • Reporting overlapping matches incorrectly after a full KMP match.
  • Storing every suffix string explicitly when building a suffix array, causing quadratic memory use.
  • Confusing longest common subsequence with longest common substring.
  • Implementing Aho-Corasick outputs only at terminal nodes and missing patterns inherited through failure links.
  • Using a tiny modulus in Rabin-Karp on adversarial input.
  • Ignoring Unicode normalization when matching human-language text.
  • Applying byte-oriented algorithms directly to variable-width encoded text without defining the symbol alphabet.

Connections

References

[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. MIT Press, 2022.

[2] R. Sedgewick and K. Wayne, Algorithms, 4th ed. Addison-Wesley, 2011.

[3] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998.

[4] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings. Cambridge University Press, 2002.

[5] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

[6] K. Mehlhorn and P. Sanders, Algorithms and Data Structures: The Basic Toolbox. Springer, 2008.

[7] D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, "Fast pattern matching in strings," SIAM Journal on Computing, vol. 6, no. 2, pp. 323-350, 1977.

[8] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, vol. 20, no. 10, pp. 762-772, 1977. https://doi.org/10.1145/359842.359859

[9] A. V. Aho and M. J. Corasick, "Efficient string matching: An aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, pp. 333-340, 1975.

[10] R. M. Karp and M. O. Rabin, "Efficient randomized pattern-matching algorithms," IBM Journal of Research and Development, vol. 31, no. 2, pp. 249-260, 1987.

[11] E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, pp. 249-260, 1995.

[12] G. Nong, S. Zhang, and W. H. Chan, "Linear suffix array construction by almost pure induced-sorting," Data Compression Conference, pp. 193-202, 2009.

[13] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital SRC Research Report 124, 1994.