# Basics & Syntax
Core regular expression syntax, including character classes, quantifiers, anchors, and alternation.
You are an expert in foundational regex syntax for pattern matching across programming languages.
## Key Points
- Start with the simplest pattern that works and refine incrementally.
- Use raw strings in your language (Python `r"..."`, C# `@"..."`) to avoid double-escaping backslashes.
- Always anchor patterns with `^` and `$` when validating an entire string.
- Prefer character classes over alternation when matching single characters: `[aeiou]` not `(a|e|i|o|u)`.
- Use non-capturing groups `(?:...)` when you do not need the captured value.
- Comment complex patterns using the verbose/extended flag (`x` or `re.VERBOSE`).
- Forgetting that `.` does not match `\n` by default. Use the `s` (dotall) flag if you need it to match newlines.
- Using `.*` when `.*?` is intended, leading to over-matching due to greedy behavior.
- Omitting anchors during validation, allowing partial matches to pass (e.g., `\d+` matching the "123" inside "abc123xyz").
- Confusing `^` inside a character class (`[^...]` = negation) with `^` outside (anchor for start of string).
- Assuming `\d` matches only ASCII digits in all engines. In Python 3 (for `str` patterns) and .NET, `\d` matches any Unicode decimal digit by default; in JavaScript, `\d` is always `[0-9]`, even with the `u` flag. Use `[0-9]` for ASCII-only matching.
## Quick Example
```
. ^ $ * + ? { } [ ] \ | ( )
```
```regex
.*? # lazy zero or more
.+? # lazy one or more
.{2,5}? # lazy between 2 and 5
```
## Overview
Regular expressions are a concise language for describing text patterns. Most modern regex engines share a common core of metacharacters, character classes, quantifiers, and anchors, although details such as Unicode handling and flag names vary between engines. Mastering these building blocks is essential before tackling advanced features.
## Core Concepts
### Literal Characters
Any character that is not a metacharacter matches itself. The metacharacters that must be escaped with a backslash to match literally are:
`. ^ $ * + ? { } [ ] \ | ( )`
- To match a literal dot: `\.`
- To match a literal backslash: `\\`
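A minimal Python sketch of the escaping rules above, using the standard `re` module. The sample strings are made up for illustration; `re.escape` is one convenient way to escape every metacharacter in a literal without doing it by hand.

```python
import re

# Hypothetical input: text we want to find literally, metacharacters included.
literal = "1+1=2 (really?)"
haystack = "Everyone knows that 1+1=2 (really?) is true."

# re.escape backslash-escapes the characters that would otherwise be
# interpreted as regex syntax.
escaped = re.escape(literal)
found = re.search(escaped, haystack)

# Unescaped, "+" and "?" act as quantifiers and "(...)" as a group,
# so the text no longer matches itself.
unescaped_found = re.search(literal, haystack)

# Manual escaping of a dot and a backslash, written as raw strings.
dot_match = re.search(r"\.", "file.txt")
backslash_match = re.search(r"\\", "C:\\temp")

print(found.group())       # the literal text, matched verbatim
print(unescaped_found)     # None
print(dot_match.start())   # position of the dot
```

Escaping by hand is fine for one or two characters; for user-supplied text, escaping the whole string is usually the safer habit.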
### Character Classes
Character classes match one character from a defined set.

| Pattern | Meaning |
|---|---|
| `[abc]` | Matches a, b, or c |
| `[a-z]` | Matches any lowercase letter |
| `[^abc]` | Matches any character except a, b, or c |
| `[a-zA-Z0-9]` | Matches any alphanumeric character |
### Shorthand Character Classes

| Shorthand | ASCII equivalent | Meaning |
|---|---|---|
| `\d` | `[0-9]` | Digit |
| `\D` | `[^0-9]` | Non-digit |
| `\w` | `[a-zA-Z0-9_]` | Word character |
| `\W` | `[^a-zA-Z0-9_]` | Non-word character |
| `\s` | `[ \t\n\r\f\v]` | Whitespace |
| `\S` | `[^ \t\n\r\f\v]` | Non-whitespace |
| `.` | `[^\n]` (default) | Any character except newline |
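The equivalents in the table are the ASCII forms. As a hedged illustration in Python 3, where `str` patterns are Unicode-aware by default, the sketch below compares `\d` with `[0-9]` and with the `re.ASCII` flag. The sample string is invented and contains one non-ASCII digit.

```python
import re

# Sample text with an ASCII number, a tab, and U+0663 (ARABIC-INDIC DIGIT THREE).
sample = "id_42\t\u0663 rows"

unicode_digits = re.findall(r"\d", sample)            # Unicode-aware by default
ascii_digits = re.findall(r"\d", sample, re.ASCII)    # restrict \d to [0-9]
class_digits = re.findall(r"[0-9]", sample)           # explicit ASCII class

words = re.findall(r"\w+", sample)    # \w also covers the non-ASCII digit
spaces = re.findall(r"\s", sample)    # tab and space

print(len(unicode_digits), len(ascii_digits), len(class_digits))  # 3 2 2
print(words)
```

Other engines make different choices here, so it is worth checking the documentation of the engine you target before relying on `\d` or `\w` for validation.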
### Quantifiers
Quantifiers control how many times the preceding element must occur.

| Quantifier | Meaning |
|---|---|
| `*` | Zero or more (greedy) |
| `+` | One or more (greedy) |
| `?` | Zero or one (greedy) |
| `{n}` | Exactly n times |
| `{n,}` | At least n times |
| `{n,m}` | Between n and m times (inclusive) |

Append `?` to make any quantifier lazy (match as few as possible):
- `.*?` lazy zero or more
- `.+?` lazy one or more
- `.{2,5}?` lazy between 2 and 5
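The difference between greedy and lazy quantifiers is easiest to see side by side. This is a small Python sketch with made-up input strings; it is an illustration of the matching behavior, not a recommendation to parse HTML with regex.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? expands one character at a time and stops at the FIRST </b>.
lazy = re.findall(r"<b>.*?</b>", html)

# The same applies to bounded quantifiers.
greedy_bounded = re.match(r".{2,5}", "abcdefg").group()
lazy_bounded = re.match(r".{2,5}?", "abcdefg").group()

print(greedy)          # one long match spanning both tags
print(lazy)            # two separate matches
print(greedy_bounded)  # abcde
print(lazy_bounded)    # ab
```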
### Anchors
Anchors match positions, not characters.

| Anchor | Meaning |
|---|---|
| `^` | Start of string (or line in multiline mode) |
| `$` | End of string (or line in multiline mode) |
| `\b` | Word boundary |
| `\B` | Non-word boundary |

Example, matching a whole word: `\bcat\b` matches `cat` in "the cat sat" but not in "concatenate".
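A short Python sketch of the anchors described above. The input strings are invented for the example; `re.MULTILINE` is Python's name for multiline mode.

```python
import re

text = "the cat sat; concatenate"

# \b requires a word boundary on both sides, so only the standalone word matches.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B requires the opposite: "cat" must be surrounded by word characters.
inside_word_starts = [m.start() for m in re.finditer(r"\Bcat\B", text)]

lines = "first line\nsecond line"

# Without multiline mode, ^ only matches at the start of the whole string.
first_only = re.findall(r"^\w+", lines)

# With multiline mode, ^ and $ also match at the start and end of each line.
each_line_start = re.findall(r"^\w+", lines, re.MULTILINE)
each_line_end = re.findall(r"\w+$", lines, re.MULTILINE)

print(whole_word_starts, inside_word_starts)  # [4] [16]
print(first_only, each_line_start)
```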
### Alternation and Grouping
The pipe `|` acts as OR. Parentheses group sub-expressions:
- `(cat|dog)` matches "cat" or "dog"
- `gr(a|e)y` matches "gray" or "grey"
- `(ab)+` matches "ab", "abab", "ababab", ...

Non-capturing groups avoid creating a capture: `(?:cat|dog)` groups the alternatives without capturing.
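Whether a group captures affects what the matching API returns. The Python sketch below uses invented sample strings to show one place where this matters: `re.findall` returns the captured group when a pattern has one, and the whole match otherwise.

```python
import re

sample = "gray grey groy"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"gr(a|e)y", sample)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"gr(?:a|e)y", sample)

# A quantified group repeats the whole sub-expression.
repeated = re.fullmatch(r"(ab)+", "ababab")

# The compiled pattern reports how many capturing groups it has.
capturing_count = re.compile(r"(cat|dog)s").groups
non_capturing_count = re.compile(r"(?:cat|dog)s").groups

print(captured)  # ['a', 'e']
print(whole)     # ['gray', 'grey']
print(capturing_count, non_capturing_count)  # 1 0
```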
## Implementation Patterns
- Match a simple integer (positive or negative): `^-?\d+$`
- Match a hex color code: `^#([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$`
- Match a date in YYYY-MM-DD format (loose; does not check that the month or day is valid): `^\d{4}-\d{2}-\d{2}$`
- Extract all words from a string: `\b\w+\b`
- Match a quoted string (handles escaped quotes): `"([^"\\]|\\.)*"`
## Best Practices
- Start with the simplest pattern that works and refine incrementally.
- Use raw strings in your language (Python `r"..."`, C# `@"..."`) to avoid double-escaping backslashes.
- Always anchor patterns with `^` and `$` when validating an entire string.
- Prefer character classes over alternation when matching single characters: `[aeiou]` not `(a|e|i|o|u)`.
- Use non-capturing groups `(?:...)` when you do not need the captured value.
- Comment complex patterns using the verbose/extended flag (`x` or `re.VERBOSE`).
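Several of these practices can be combined in one place. The following is a sketch in Python, assuming a hypothetical helper `is_hex_color`; it uses a raw string, a commented verbose pattern, a non-capturing group, and whole-string validation. One Python-specific detail worth knowing: `$` also matches just before a trailing newline, so `re.fullmatch` is used here as a stricter alternative to `^...$`.

```python
import re

# A raw string passes backslashes through unchanged; the plain string
# needs them doubled to produce the same pattern.
same_pattern = r"\d+" == "\\d+"

HEX_COLOR = re.compile(
    r"""
    \#                    # leading hash (escaped: "#" starts a comment in verbose mode)
    (?:                   # non-capturing group: only the overall match is needed
        [0-9a-fA-F]{6}    # six hex digits
      | [0-9a-fA-F]{3}    # or the three-digit shorthand
    )
    """,
    re.VERBOSE,
)


def is_hex_color(value: str) -> bool:
    """Hypothetical validator: True only if the WHOLE string is a hex color."""
    return HEX_COLOR.fullmatch(value) is not None


# A character class is the simpler choice for single characters.
vowels = re.findall(r"[aeiou]", "regex")

# In Python, "$" tolerates one trailing newline; fullmatch does not.
dollar_allows_newline = re.search(r"^\d+$", "123\n") is not None
fullmatch_rejects_newline = re.fullmatch(r"\d+", "123\n") is None

print(is_hex_color("#1A2b3C"), is_hex_color("#ffff"))  # True False
```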
## Common Pitfalls
- Forgetting that `.` does not match `\n` by default. Use the `s` (dotall) flag if you need it to match newlines.
- Using `.*` when `.*?` is intended, leading to over-matching due to greedy behavior.
- Omitting anchors during validation, allowing partial matches to pass (e.g., `\d+` matching the "123" inside "abc123xyz").
- Confusing `^` inside a character class (`[^...]` = negation) with `^` outside (anchor for start of string).
- Assuming `\d` matches only ASCII digits in all engines. In Python 3 (for `str` patterns) and .NET, `\d` matches any Unicode decimal digit by default; in JavaScript, `\d` is always `[0-9]`, even with the `u` flag. Use `[0-9]` for ASCII-only matching.
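Three of these pitfalls can be reproduced in a few lines. This Python sketch uses invented input; `re.DOTALL` is Python's name for the dotall (`s`) flag.

```python
import re

text = "BEGIN\nline one\nline two\nEND"

# "." stops at newlines by default, so the pattern cannot span lines.
no_dotall = re.search(r"BEGIN.*END", text)

# With the dotall flag, "." also matches "\n".
with_dotall = re.search(r"BEGIN.*END", text, re.DOTALL)

# Unanchored validation accepts a partial match.
partial = re.search(r"\d+", "abc123xyz")

# Matching the whole string rejects the same input.
whole = re.fullmatch(r"\d+", "abc123xyz")

# "^" negates inside a character class and anchors outside one.
negated = re.findall(r"[^abc]", "abcd")
anchored = re.findall(r"^abc", "abcabc")

print(no_dotall)            # None
print(partial.group())      # 123
print(whole)                # None
print(negated, anchored)    # ['d'] ['abc']
```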
## Anti-Patterns
- Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.