Skip to main content

Regular Expressions and Nonregularity

Regular expressions give an algebraic notation for regular languages. They describe languages from atomic pieces using union, concatenation, and star. In practice, regex syntax in programming languages includes many extensions, but the mathematical core is deliberately small. That small core is exactly equivalent to finite automata.

The same chapter of the theory also introduces nonregularity. Once we know many languages are regular, we need a principled way to show that some are not. The pumping lemma is the standard first tool. It proves that every regular language has a finite-memory repetition property, then shows that certain languages violate it.

Definitions

A regular expression over alphabet Σ\Sigma is defined recursively. The expressions \emptyset, ϵ\epsilon, and each symbol aΣa\in\Sigma are regular expressions. If RR and SS are regular expressions, then (RS)(R\cup S), (RS)(RS), and (R)(R^*) are regular expressions.

The language of a regular expression is defined by the same recursion. \emptyset denotes no strings, ϵ\epsilon denotes the language containing only the empty string, and aa denotes the language containing the one-character string aa. Union, concatenation, and star have their language-operation meanings.

A generalized nondeterministic finite automaton, or GNFA, labels transitions by regular expressions rather than single symbols. State elimination converts a GNFA to one regular expression by removing states while preserving the language of paths from the start to the accept state.

A language is nonregular if no DFA, NFA, or regular expression recognizes it. Because these three formalisms are equivalent, ruling out one rules out all.

The pumping lemma for regular languages says that if AA is regular, then there is a pumping length pp such that any string sAs\in A with sp\vert s\vert \ge p can be split as s=xyzs=xyz satisfying xyp\vert xy\vert \le p, y>0\vert y\vert \gt 0, and xyizAxy^iz\in A for every i0i\ge0.

Key results

Regular expressions and finite automata are equivalent. To convert a regular expression to an NFA, use structural induction: atomic expressions have tiny NFAs, union uses epsilon branching, concatenation links final states to the next start, and star adds loops and an accepting empty path. To convert an automaton to a regular expression, use state elimination or an equivalent dynamic-programming construction.

The pumping lemma is a necessary condition for regularity, not a characterization. Every regular language pumps, but some nonregular languages may still have pumpable strings or even satisfy variants of pumping behavior. Therefore the lemma is mainly a proof-by-contradiction tool for showing nonregularity.

The structure of a pumping proof is rigid. Assume the target language is regular and let pp be its pumping length. Choose a string ss in the language with length at least pp. Consider an arbitrary split s=xyzs=xyz satisfying the pumping constraints. Show that some pump value, often i=0i=0 or i=2i=2, produces a string outside the language. Because the split was arbitrary, no valid pumping split exists.

Closure properties can also prove nonregularity. If a target language were regular, intersecting it with a carefully chosen regular language or applying a homomorphism might produce a known nonregular language. This strategy becomes more powerful after reductions are introduced.

Regular expressions are syntax trees. The expression (01)01(0\cup1)^*01 is not merely a string of symbols; it has a root concatenation, a starred subexpression, and two literal leaves. Structural induction on that tree is the natural way to prove things about all regular expressions. For example, the regex-to-NFA conversion proves one case for \emptyset, one for ϵ\epsilon, one for a literal, and then one case each for union, concatenation, and star.

The GNFA state-elimination formula is a compact way to preserve all paths through a removed state. If a state rr is removed, and there are remaining states ii and jj, the new label from ii to jj becomes the old direct label plus the paths that go from ii to rr, loop around rr any number of times, and then go from rr to jj. Algebraically this is RijRir(Rrr)RrjR_{ij}\cup R_{ir}(R_{rr})^*R_{rj}. The formula is less important than the path idea: no accepted string should be lost or added when the state disappears.

Pumping proofs are adversarial. You choose the pumping string after seeing only the pumping length, but an adversary chooses the split subject to the constraints. Therefore the chosen string should force every legal yy into a region whose repetition breaks the language. For 0p1p0^p1^p, the constraint xyp\vert xy\vert \le p traps yy inside the zeros. For more complicated languages, the hardest part is designing a string that traps the pumpable pieces.

The pumping lemma also explains why finite memory implies repetition. A DFA reading a long enough string must visit some state twice while processing an early segment. The substring read between the two visits can be repeated or removed because the machine returns to the same state. This is the machine-level reason behind the algebraic split s=xyzs=xyz. Understanding that reason makes the lemma easier to adapt and prevents treating it as a black-box ritual.

When a pumping proof becomes messy, closure may simplify it. Intersect the suspected language with a regular language that isolates a clean pattern, or apply a homomorphism that erases irrelevant symbols. Because regular languages are closed under these operations, a contradiction against a known nonregular language transfers back to the original target.

Algebraic identities for regular expressions are helpful but should be used semantically. For instance, R=R\emptyset=\emptyset, Rϵ=RR\epsilon=R, and RR=RR\cup R=R are true because the denoted languages are equal. Operator precedence is normally star before concatenation before union, but adding parentheses avoids ambiguity in handwritten solutions. When converting to automata, the parse tree induced by those parentheses determines the construction order.

Some languages look nonregular only because they are described awkwardly. "Binary strings whose numeric value is divisible by 3" is regular because a DFA can store the value modulo 3 while scanning bits. "Strings with the same number of zeros and ones" is not regular because the difference can grow without bound. The key question is whether all future-relevant information fits into finitely many summaries.

For pumping, the chosen string should be simple enough that the constraints force the pumpable part into a known region. If a language has several cases, choose a string lying deep inside one case so that pumping cannot escape into another valid case. If the language includes short exceptions, choose a string much longer than the pumping length. The proof should not depend on knowing the exact value of pp, only on using it to construct a sufficiently long witness.

Visual

RegexLanguageNotes
0*any number of zerosincludes ϵ\epsilon
(0 union 1)*all binary stringswritten algebraically as (01)(0\cup1)^*
(0 union 1)*01strings ending in 01arbitrary prefix, fixed suffix
0*1*zeros followed by onescounts need not match
emptyset*{ϵ}\{\epsilon\}zero repetitions are allowed

Worked example 1: Designing a regular expression

Problem. Find a regular expression for binary strings that contain at least one 1 and end in 00.

Method. Separate the required final suffix from the earlier requirement.

  1. Ending in 00 means every accepted string has form x00x00 for some binary string xx.
  2. The string must contain at least one 1. That 1 could be in xx, since the final suffix has only zeros.
  3. A binary string with at least one 1 can be written (01)1(01)(0\cup1)^*1(0\cup1)^*.
  4. Appending the fixed suffix gives (01)1(01)00(0\cup1)^*1(0\cup1)^*00.
  5. Test 100: choose the first 1, then empty middle, then 00; accepted.
  6. Test 000: no 1 can be matched; rejected.
  7. Test 10100: choose the first or second 1; accepted.

Checked answer. A correct expression is (01)1(01)00(0\cup1)^*1(0\cup1)^*00. It enforces both the existence of a 1 and the final suffix.

Worked example 2: Pumping {0n1n:n0}\{0^n1^n:n\ge0\}

Problem. Prove that L={0n1n:n0}L=\{0^n1^n:n\ge0\} is not regular.

Method. Use the pumping lemma by contradiction.

  1. Assume LL is regular. Let pp be the pumping length.
  2. Choose s=0p1ps=0^p1^p, which is in LL and has length at least pp.
  3. Let s=xyzs=xyz be any split with xyp\vert xy\vert \le p and y>0\vert y\vert \gt 0.
  4. Since the first pp symbols of ss are all zeros, both xx and yy lie within the block of zeros. Thus y=0ky=0^k for some k>0k\gt 0.
  5. Pump down with i=0i=0. The string becomes xz=0pk1pxz=0^{p-k}1^p.
  6. This string has fewer zeros than ones, so it is not in LL.

Checked answer. Every permitted split fails when pumped down. This contradicts the pumping lemma, so LL is nonregular.

Code

def regex_like_ends_with_00_and_has_1(w):
return w.endswith("00") and "1" in w

def in_0n1n(w):
zeros = 0
while zeros < len(w) and w[zeros] == "0":
zeros += 1
ones = len(w) - zeros
return w == "0" * zeros + "1" * ones and zeros == ones

for sample in ["100", "000", "0011", "000111", "00111"]:
print(sample, regex_like_ends_with_00_and_has_1(sample), in_0n1n(sample))

Common pitfalls

  • Confusing mathematical regular expressions with full programming-language regex features such as backreferences. Backreferences can exceed regular power.
  • Forgetting that star allows zero repetitions, so RR^* always contains ϵ\epsilon.
  • Using the pumping lemma backward. Showing one string pumps does not prove regularity.
  • Choosing a pumping string not in the language. The lemma applies only to strings in the assumed regular language.
  • Picking one convenient split in a nonregularity proof. You must handle every split satisfying the pumping constraints.

Connections