
Performance

Regex performance optimization, catastrophic backtracking prevention, and engine internals for writing efficient patterns


Performance — Regular Expressions

You are an expert in regex performance, engine internals, and avoiding catastrophic backtracking.

Core Philosophy

Overview

Regex patterns that work correctly on small inputs can become dangerously slow on larger or adversarial inputs. Understanding how regex engines execute patterns is essential for writing expressions that perform reliably in production. The difference between a well-written and poorly-written regex can be the difference between microseconds and minutes on the same input.

Core Concepts

How Regex Engines Work

There are two main engine families:

NFA-style backtracking engines: used by Python, JavaScript, Java, .NET, PCRE, and Ruby. They support backtracking, lookarounds, and back-references, but performance depends heavily on pattern structure.

Automaton-based (DFA-style) engines: used by awk, some grep implementations, and RE2 (Go's standard regexp package implements the same approach). They guarantee linear-time matching but do not support back-references or lookarounds.

Most mainstream languages default to backtracking engines, which means backtracking is a real concern.

What Is Backtracking?

When an NFA engine encounters a quantifier, it tries one option (typically the greedy choice). If the rest of the pattern fails, the engine "backtracks" to try the next option. With nested quantifiers or overlapping alternatives, the number of paths can grow exponentially.

Catastrophic Backtracking

A pattern exhibits catastrophic backtracking when the number of backtracking steps grows exponentially with input length. This is effectively a denial-of-service vulnerability (ReDoS).

Classic example:

(a+)+b

On the input aaaaaaaaaaaaaaaaac (no final b), the engine tries on the order of 2^n ways to distribute the a's between the inner and outer quantifiers before concluding there is no match.
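The blow-up is easy to observe directly. A small timing sketch using only the Python standard library (exact times will vary by machine and Python version):

```python
import re
import time

# The engine explores roughly 2^n ways to split a run of n a's
# between the inner and outer quantifiers before failing.
pattern = re.compile(r'(a+)+b')

for n in (16, 18, 20):
    text = 'a' * n + 'c'  # almost matches, but there is no final 'b'
    start = time.perf_counter()
    result = pattern.search(text)
    elapsed = time.perf_counter() - start
    print(f'n={n}: {elapsed:.4f}s (match={result})')
```

Each additional `a` roughly doubles the time: the failure cost is exponential even though the pattern and input look trivial.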

Implementation Patterns

Identifying Dangerous Patterns

These structural patterns are red flags for catastrophic backtracking:

| Pattern structure | Risk | Example |
| --- | --- | --- |
| `(a+)+` | Nested quantifiers | `(x+)+y` |
| `(a\|a)+` | Overlapping alternation inside a quantifier | |
| `(a+b?)+` | Quantified group with an optional element | `(\w+\s?)+$` |
| `(a\|b)*` where `a` and `b` overlap | Ambiguous alternation | `(\s+\|\s*)*` |

Safe version of a dangerous pattern

Dangerous:

(\w+\s+)+$

On input with no trailing match, this backtracks exponentially. Safe alternatives:

Atomic group (if supported):

(?>\w+\s+)+$

Possessive quantifier (Java, PCRE):

(\w++\s++)+$

Restructured pattern:

(?:\w+\s+)*\w+$
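In Python's standard `re` module, which only gained atomic groups and possessive quantifiers in 3.11, restructuring is the most portable fix. A minimal sketch of the restructured pattern in use:

```python
import re

# Dangerous original: (\w+\s+)+$ backtracks exponentially on failing input.
# Restructured: each word/space boundary is unambiguous, so failure is fast.
safe = re.compile(r'(?:\w+\s+)*\w+$')

assert safe.search('several words here') is not None   # still matches valid input
assert safe.search('ends with punctuation!') is None   # fails quickly, no blow-up
```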

Atomic Groups and Possessive Quantifiers

These disable backtracking for the sub-expression they enclose.

Possessive quantifiers (++, *+, ?+): available in Java, PCRE, and newer regex engines, including Python's re module from 3.11 onward.

\w++        # match word characters, never give back
[^"]*+"     # match non-quote chars, never backtrack

Atomic groups ((?>...)): available in .NET, PCRE, Java, Ruby, and Python 3.11+. A possessive quantifier is shorthand for an atomic group around a single quantified token: X*+ behaves like (?>X*).

(?>[^"]*)"  # same as [^"]*+"

Benchmarking Pattern Performance

Python:

import re
import timeit

pattern = re.compile(r'your_pattern_here')
text = 'your test input' * 1000

elapsed = timeit.timeit(lambda: pattern.search(text), number=1000)
print(f'{elapsed:.4f} seconds for 1000 iterations')

Optimizing Character Classes

Use negated character classes instead of lazy dot-star when matching up to a delimiter:

# Slow: lazy dot-star expands one character at a time,
# re-testing the closing quote at every step
".*?"

# Fast: negated class advances directly to the delimiter
"[^"]*"

Benchmark difference on a 10,000-character string: often 5-10x faster.
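A quick way to check this yourself in Python (actual numbers vary by engine and input; both patterns find the same match here, they just take different routes):

```python
import re
import timeit

# One long quoted string with no inner quotes.
text = '"' + 'x' * 10_000 + '"'

lazy = re.compile(r'".*?"')
negated = re.compile(r'"[^"]*"')

# Same result either way...
assert lazy.search(text).group() == negated.search(text).group()

# ...but very different step counts.
t_lazy = timeit.timeit(lambda: lazy.search(text), number=200)
t_negated = timeit.timeit(lambda: negated.search(text), number=200)
print(f'lazy: {t_lazy:.4f}s  negated: {t_negated:.4f}s')
```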

Optimizing Alternation Order

NFA engines try alternatives left to right. Put the most likely match first:

# If INFO is most common, put it first
(?:INFO|WARN|ERROR|DEBUG)

# Not
(?:DEBUG|ERROR|WARN|INFO)
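A sketch of the effect on a synthetic log (the log levels and 90/10 proportions here are made up; the per-line difference is small but compounds over millions of lines):

```python
import re
import timeit

# 90% INFO lines: putting INFO first means one comparison usually suffices.
lines = ['INFO request handled'] * 900 + ['ERROR upstream timeout'] * 100

common_first = re.compile(r'^(?:INFO|WARN|ERROR|DEBUG)\b')
common_last = re.compile(r'^(?:DEBUG|ERROR|WARN|INFO)\b')

t1 = timeit.timeit(lambda: [common_first.match(ln) for ln in lines], number=50)
t2 = timeit.timeit(lambda: [common_last.match(ln) for ln in lines], number=50)
print(f'common-first: {t1:.4f}s  common-last: {t2:.4f}s')
```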

Avoiding Unnecessary Captures

Non-capturing groups are slightly faster than capturing groups:

# Slower (captures groups that are never used)
(https?):\/\/(www\.)?(.+)

# Faster (only the needed capture remains)
https?:\/\/(?:www\.)?(.+)

Pre-compiling Patterns

Always compile patterns that are used repeatedly:

# Bad: recompiles on every call
for line in million_lines:
    re.search(r'\d{4}-\d{2}-\d{2}', line)

# Good: compile once
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
for line in million_lines:
    date_pattern.search(line)

Use Anchors to Fail Fast

Anchors let the engine skip impossible positions immediately:

# Without a start anchor: the engine retries the match at every position
\d{4}-\d{2}-\d{2}$

# With start anchor: engine only tries at position 0
^\d{4}-\d{2}-\d{2}$
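In Python, `re.fullmatch` applies both anchors implicitly (and `re.match` anchors at the start), which avoids forgetting the `^`:

```python
import re

date = re.compile(r'\d{4}-\d{2}-\d{2}')

# fullmatch anchors at both ends; match anchors at the start only.
assert date.fullmatch('2024-01-15') is not None
assert date.fullmatch('x2024-01-15') is None
assert date.match('2024-01-15 extra') is not None
```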

The RE2 Alternative

When either the pattern or the input is untrusted, use Google's RE2 engine, which guarantees linear-time matching:

# Python
import re2  # pip install google-re2
re2.search(pattern, text)
// Go (the standard regexp package implements the RE2 algorithm)
regexp.MustCompile(pattern)

Trade-off: RE2 does not support back-references, lookarounds, or atomic groups.

Best Practices

  • Profile regex patterns with representative production data, not just small test strings. O(n^2) or worse behavior only becomes visible at scale.
  • Use possessive quantifiers or atomic groups to prevent backtracking wherever the engine should not give back characters.
  • Prefer specific character classes ([^"]* instead of .*?) to reduce backtracking opportunities.
  • Limit input length before applying regex when processing user-provided text. A simple length check prevents ReDoS attacks.
  • Set timeouts on regex execution in production systems. Python's regex module supports timeouts. Java has Matcher interruption.
  • Use static analysis tools to detect vulnerable patterns: rxxr2, regexploit, safe-regex.
  • Consider whether regex is the right tool. For simple contains/startswith/endswith checks, string methods are faster. For complex parsing, a proper parser is more reliable.
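The length-limit advice above can be sketched as a simple guard (the function name and cap value are illustrative; the third-party `regex` module additionally accepts a `timeout=` argument if you need a hard stop, which stdlib `re` lacks):

```python
import re

MAX_INPUT_LEN = 10_000  # illustrative cap; tune for your workload

pattern = re.compile(r'(?:\w+\s+)*\w+$')

def guarded_search(text: str):
    # Bounding input length bounds worst-case regex runtime,
    # which blunts ReDoS even if the pattern turns out to be vulnerable.
    if len(text) > MAX_INPUT_LEN:
        raise ValueError('input too long for regex matching')
    return pattern.search(text)

assert guarded_search('hello world') is not None
```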

Common Pitfalls

  • Nested quantifiers: (a+)+, (a*)*, (a+)* are almost always catastrophic. Flatten them to a single quantifier.
  • Assuming regex is always fast because it is "compiled." Compilation does not prevent exponential runtime behavior.
  • Testing only matching inputs. Catastrophic backtracking typically occurs on inputs that almost match but ultimately fail, because the engine exhausts all possibilities before giving up.
  • Ignoring ReDoS in web applications. User-supplied input matched against a vulnerable pattern is a denial-of-service vector.
  • Using .* as a "skip anything" pattern in the middle of a complex expression. Each .* is a backtracking point. Use [^X]* where X is the next expected delimiter.
  • Believing that "it works in my editor" means it will perform well on megabytes of text. Editor regex tools typically process small selections; production processes large files.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
