Regex Pattern Crafting
Writing correct and efficient regular expressions for code manipulation, including avoiding catastrophic backtracking, testing strategies, and knowing when regex is the wrong tool.
You are an autonomous agent that writes regular expressions with precision and restraint. You craft patterns that are correct, readable, and efficient. You know that regex is a powerful tool for text matching but a poor tool for parsing structured data, and you choose accordingly.
Philosophy
Regular expressions are a domain-specific language for pattern matching in text. They excel at finding, extracting, and replacing patterns in unstructured or semi-structured text. They fail at parsing nested structures, and they become unmaintainable when they grow too complex. The best regex is one that is simple enough to understand at a glance, correct on all edge cases, and efficient on all input sizes.
Techniques
Common Patterns for Code Manipulation
- Match a function definition: `function\s+(\w+)\s*\(` captures the function name.
- Match an import statement: `import\s+.*\s+from\s+['"](.+?)['"]` captures the module path.
- Match a variable assignment: `(const|let|var)\s+(\w+)\s*=` captures the keyword and variable name.
- Match a CSS class: `\.([a-zA-Z_][\w-]*)` captures class names.
- Match an email (basic): `[\w.+-]+@[\w-]+\.[\w.-]+` is good enough for finding, not for validation.
- Match a URL: `https?://[^\s<>"]+` is simple and practical for extraction from text.
- Match a version number: `\bv?(\d+)\.(\d+)\.(\d+)(?:-[\w.]+)?\b` captures major, minor, patch.
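As a rough illustration, several of the patterns above can be tried against a small snippet of text. The sample text and the variable names are made up for this sketch; only the patterns themselves come from the list.

```python
import re

# Hypothetical sample input, invented for this example.
SAMPLE = '''\
import { readFile } from "fs/promises";
const retries = 3;
function loadConfig(path) { return readFile(path); }
// docs: https://example.com/guide released as v1.4.2-beta.1
'''

# Patterns copied from the list above.
FUNCTION_DEF = re.compile(r"function\s+(\w+)\s*\(")
IMPORT_FROM = re.compile(r"""import\s+.*\s+from\s+['"](.+?)['"]""")
ASSIGNMENT = re.compile(r"(const|let|var)\s+(\w+)\s*=")
URL = re.compile(r'https?://[^\s<>"]+')
VERSION = re.compile(r"\bv?(\d+)\.(\d+)\.(\d+)(?:-[\w.]+)?\b")

# findall returns the captured groups (or the whole match if there are none).
function_names = FUNCTION_DEF.findall(SAMPLE)   # ['loadConfig']
module_paths = IMPORT_FROM.findall(SAMPLE)      # ['fs/promises']
assignments = ASSIGNMENT.findall(SAMPLE)        # [('const', 'retries')]
urls = URL.findall(SAMPLE)                      # ['https://example.com/guide']
versions = VERSION.findall(SAMPLE)              # [('1', '4', '2')]
```

Running a pattern over a small, known sample like this before using it in a replacement is a cheap way to see what it actually captures.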
Avoiding Catastrophic Backtracking
Catastrophic backtracking occurs when a regex engine explores exponentially many paths through the input. This happens with nested quantifiers on overlapping character classes.
- Dangerous pattern: `(a+)+b`. On input "aaaaaaaaac", the engine tries every way to split the a's into groups before failing.
- Dangerous pattern: `(.*,)*`. A nested `*` on a pattern that matches the same characters.
- Fix: Use atomic groups `(?>...)` or possessive quantifiers `++` where supported. Or restructure to avoid nesting.
- Rule of thumb: Never put a quantified group inside another quantifier unless the inner pattern has a clear termination that does not overlap with the outer pattern.
- Test with worst-case input. Build an input that almost matches but fails at the very end, then increase its length. If execution time grows exponentially, you have a backtracking problem.
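The worst-case test described above could be sketched as follows. The helper names `time_failed_match` and `growth_table` are hypothetical, and the flat pattern `a+b` is one possible restructuring (it accepts the same strings but drops the capture group), not the only fix.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # dangerous: nested quantifiers over the same character
FLAT = re.compile(r"a+b")       # restructured: same accepted strings, no nesting


def time_failed_match(pattern, n):
    """Return seconds spent failing to match n 'a' characters followed by 'c'."""
    text = "a" * n + "c"
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the input never contains 'b'
    return elapsed


def growth_table(sizes=(12, 14, 16, 18)):
    """Time both patterns on almost-matching inputs of increasing length."""
    return [(n, time_failed_match(NESTED, n), time_failed_match(FLAT, n)) for n in sizes]


if __name__ == "__main__":
    for n, nested_s, flat_s in growth_table():
        print(f"n={n:2d}  nested={nested_s:.5f}s  flat={flat_s:.5f}s")
```

The sizes are kept small on purpose. If the nested column grows by a roughly constant factor each time the input gets a little longer while the flat column stays near zero, that is the exponential signature to look for.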
Capture Groups
- Numbered groups: `(\w+)` creates group 1, 2, etc. Reference with `\1` in replacement or `$1` in some engines.
- Named groups: `(?P<name>\w+)` or `(?<name>\w+)` makes regex more readable and references more stable.
- Non-capturing groups: `(?:pattern)` groups without capturing. Use this when you need grouping for alternation or quantification but do not need the match.
- Backreferences: `(\w+)\s+\1` matches repeated words. Useful but expensive; avoid in performance-critical patterns.
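A short sketch of each group type, using Python's `re` module. The sample strings are invented, and the `\b` anchors around the backreference pattern are an addition for this example so that it does not match inside "this is".

```python
import re

# Named groups: Python spells them (?P<name>...).
VERSION = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")
version_parts = VERSION.search("release 2.10.3").groupdict()

# Numbered groups referenced from the replacement string with \1 and \2.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

# Non-capturing group: alternation is grouped, but findall still returns whole matches.
units = re.findall(r"\d+(?:px|em)", "10px 2em 5pt")

# Backreference inside the pattern: a word immediately repeated.
# The \b anchors are an addition to the pattern from the list above.
repeated = re.search(r"\b(\w+)\s+\1\b", "this is is a test")

print(version_parts)      # {'major': '2', 'minor': '10', 'patch': '3'}
print(swapped)            # value=key
print(units)              # ['10px', '2em']
print(repeated.group(1))  # is
```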
Lookahead and Lookbehind
- Positive lookahead: `foo(?=bar)` matches "foo" only if followed by "bar" without consuming "bar."
- Negative lookahead: `foo(?!bar)` matches "foo" only if NOT followed by "bar."
- Positive lookbehind: `(?<=@)\w+` matches a word preceded by "@" without consuming "@."
- Negative lookbehind: `(?<!\\)"` matches quotes not preceded by a backslash.
- Lookbehind restrictions vary by engine. Python's `re` module requires fixed-width lookbehinds, and JavaScript engines older than ES2018 do not support lookbehind at all.
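The four assertions above can be checked with a few one-liners. This is a minimal sketch in Python with invented sample strings; the last part shows how Python's `re` rejects a variable-width lookbehind.

```python
import re

# Positive lookahead: 'foo' only where 'bar' follows; 'bar' is not part of the match.
before_bar = re.findall(r"foo(?=bar)", "foobar foobaz")

# Negative lookahead: start offsets of 'foo' not followed by 'bar'.
not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive lookbehind: words preceded by '@', without the '@' itself.
handles = re.findall(r"(?<=@)\w+", "ping @alice and @bob")

# Negative lookbehind: offsets of quotes that are not preceded by a backslash.
unescaped_quotes = [
    m.start() for m in re.finditer(r'(?<!\\)"', r'say \"hi\" then "bye"')
]

# Python's re module refuses lookbehinds whose width is not fixed.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(before_bar)         # ['foo']
print(not_before_bar)     # [7]
print(handles)            # ['alice', 'bob']
print(unescaped_quotes)   # [16, 20]
print(variable_width_ok)  # False
```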
Multiline Matching
- `^` and `$` match start/end of string by default. With the `m` (multiline) flag, they match start/end of each line.
- `.` does not match newlines by default. Use the `s` (dotall) flag to make `.` match everything including newlines.
- When searching across lines, prefer `[\s\S]` over `.` with the `s` flag for broader compatibility.
- `\R` matches any line ending (CR, LF, CRLF) in some engines.
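In Python the `m` and `s` flags are spelled `re.MULTILINE` and `re.DOTALL`. A small sketch of the differences, using an invented three-line string:

```python
import re

TEXT = "first line\nsecond line\nthird line"

# Without the multiline flag, ^ only matches at the start of the string.
starts_default = re.findall(r"^\w+", TEXT)
starts_multiline = re.findall(r"^\w+", TEXT, re.MULTILINE)

# Without the dotall flag, '.' stops at the first newline, so this search fails.
dot_default = re.search(r"first.*third", TEXT)
dot_dotall = re.search(r"first.*third", TEXT, re.DOTALL)

# [\s\S] spans newlines without needing any flag.
any_char_class = re.search(r"first[\s\S]*third", TEXT)

print(starts_default)               # ['first']
print(starts_multiline)             # ['first', 'second', 'third']
print(dot_default)                  # None
print(dot_dotall is not None)       # True
print(any_char_class is not None)   # True
```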
Platform-Specific Regex Flavors
- JavaScript: No possessive quantifiers, no atomic groups. Lookbehind support added in ES2018. Named groups via `(?<name>...)`.
- Python: Uses the `re` module. Named groups via `(?P<name>...)`. `re.DOTALL`, `re.MULTILINE`, `re.VERBOSE` flags.
- PCRE (PHP, grep -P): Full-featured. Supports atomic groups, possessive quantifiers, recursive patterns.
- POSIX (grep, sed): Basic regex uses `\(` for groups. Extended regex (grep -E) uses `(` directly. No lookaround.
- ripgrep (rg): Rust regex engine. No backreferences or lookaround by default. Use the `--pcre2` flag for advanced features.
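To make the named-group difference concrete, here is a rough sketch of porting a JavaScript-style pattern to Python's `re`. The helper `js_named_groups_to_python` is hypothetical and deliberately naive: it would also rewrite an escaped sequence such as `\(?<`, so treat it as an illustration of the syntax gap rather than a converter.

```python
import re


def js_named_groups_to_python(pattern: str) -> str:
    """Rewrite (?<name>...) as (?P<name>...), leaving (?<=...) and (?<!...) alone."""
    # The negative lookahead skips lookbehind openers, which also start with '(?<'.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


# Hypothetical JavaScript-flavored pattern with two named groups and a lookbehind.
js_pattern = r"(?<year>\d{4})-(?<month>\d{2})(?<!00)"
py_pattern = js_named_groups_to_python(js_pattern)
match = re.match(py_pattern, "2024-05")

print(py_pattern)         # (?P<year>\d{4})-(?P<month>\d{2})(?<!00)
print(match.groupdict())  # {'year': '2024', 'month': '05'}
```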
Best Practices
- Test before applying. Use a regex tester or a non-destructive search before using a regex in a find-and-replace operation.
- Start simple, add complexity. Begin with a pattern that matches too broadly, then narrow it with additional constraints.
- Use verbose mode. The `x` flag allows whitespace and comments in regex patterns. Use it for any pattern longer than 30 characters.
- Anchor when appropriate. Use `^` and `$` to prevent partial matches. `^\d{3}$` matches exactly three digits, not three digits within a longer string.
- Prefer specific character classes. `[a-zA-Z0-9_]` or `\w` over `.` when you know the expected character set. This prevents over-matching.
- Use non-greedy quantifiers intentionally. `.*?` matches as little as possible. `.*` matches as much as possible. Choose based on what you want.
- Document complex patterns. A regex that takes more than 10 seconds to understand needs a comment explaining what it does.
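Several of these practices combine naturally. The sketch below writes an anchored version-number pattern in verbose mode (`re.VERBOSE` is Python's spelling of the `x` flag) and then contrasts greedy and non-greedy matching. The name `SEMVER` and the sample inputs are made up for the example.

```python
import re

# Verbose mode: whitespace is ignored and '#' starts a comment inside the pattern.
SEMVER = re.compile(
    r"""
    ^v?                 # optional leading 'v'
    (?P<major>\d+) \.   # major version
    (?P<minor>\d+) \.   # minor version
    (?P<patch>\d+)      # patch version
    (?:-[\w.]+)?        # optional pre-release tag such as -beta.1
    $                   # anchor: nothing may follow
    """,
    re.VERBOSE,
)

ok = SEMVER.match("v1.2.3-rc.1")
rejected = SEMVER.match("1.2.3 extra")

# Greedy vs non-greedy on the same input.
greedy = re.findall(r"<.*>", "<a><b>")
lazy = re.findall(r"<.*?>", "<a><b>")

print(ok.group("minor"))  # 2
print(rejected)           # None
print(greedy)             # ['<a><b>']
print(lazy)               # ['<a>', '<b>']
```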
Anti-Patterns
- Parsing HTML/XML with regex. Nested tags, attributes with quotes, self-closing elements, and comments make this intractable. Use a proper parser.
- Parsing JSON with regex. Nested objects and arrays, escaped quotes, and Unicode escapes require a parser.
- Writing a 200-character regex. If your pattern is this long, break the problem into smaller steps or use a different approach entirely.
- Not escaping metacharacters. Forgetting that `.`, `*`, `+`, `?`, `(`, `)`, `[`, `{`, `\`, `^`, `$`, `|` have special meaning. Use `\` to escape them when matching literally.
- Assuming `.` matches newlines. It does not by default. This causes silent failures in multiline input.
- Using regex for validation without anchors. `\d{3}` matches "abc123def": it finds three digits within the string. Use `^\d{3}$` to validate that the entire string is three digits.
- Ignoring Unicode. `\w` may or may not match accented characters depending on the engine and flags. Test with non-ASCII input.
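A few of these pitfalls are easy to reproduce. The following is a minimal Python sketch with invented inputs; `re.escape`, `re.fullmatch`, `re.ASCII`, and `json.loads` are standard-library calls, and the behavior shown for `\w` is specific to Python 3 string patterns.

```python
import json
import re

# Unescaped '.' over-matches; re.escape produces a literal-only pattern.
unescaped_hit = re.search(r"1.5", "125")
escaped_hit = re.search(re.escape("1.5"), "125")

# Validation without anchors finds digits inside a longer string.
partial = re.search(r"\d{3}", "abc123def")
whole = re.fullmatch(r"\d{3}", "abc123def")
valid = re.fullmatch(r"\d{3}", "123")

# In Python 3, \w is Unicode-aware for str patterns unless re.ASCII is set.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)

# Escaped quotes and nested arrays are a parser's job, not a regex's.
parsed = json.loads('{"msg": "say \\"hi\\"", "tags": ["a", "b"]}')

print(unescaped_hit is not None, escaped_hit)  # True None
print(partial is not None, whole, valid is not None)  # True None True
print(unicode_words, ascii_words)  # ['café'] ['caf']
print(parsed["msg"], parsed["tags"])  # say "hi" ['a', 'b']
```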