Text Extraction
Regex patterns for extracting structured data from unstructured or semi-structured text sources
You are an expert in using regular expressions to extract meaningful data from unstructured and semi-structured text.
## Key Points
- **Python:** `re.findall()` or `re.finditer()`
- **JavaScript:** `string.matchAll(regex)` with the `g` flag
- **Java:** loop on `matcher.find()`
- Use `findall` or `finditer` (global matching) for extraction tasks. A single `match` or `search` only returns the first occurrence.
- Prefer specific character classes over `.` to avoid capturing unintended content. `[^\s]+` is better than `.+?` for a "word."
- When extracting from HTML or XML, prefer a proper parser (BeautifulSoup, lxml). Regex is acceptable for quick extractions from fragments but unreliable for complex nested structures.
- Test extraction patterns against real-world samples with varied formatting. Edge cases are common in unstructured text.
- Post-process extracted data: strip whitespace, normalize case, convert types. Regex gets the raw text; your code cleans it.
- Use named groups to create structured output directly, making downstream processing clearer.
- Watch for over-matching with greedy quantifiers. Always consider whether `.*` should be `.*?` when content is between delimiters.
- Avoid extracting nested HTML with regex. Patterns like `<div>(.*?)</div>` break with nested `<div>` tags. Use a parser for nested structures.
- Do not assume whitespace is a reliable delimiter. Text may contain tabs, non-breaking spaces, or multiple consecutive spaces.
## Quick Example
```regex
"(.*?)" # lazy: captures content of each quoted string individually
"(.*)" # greedy: captures from first quote to last quote on the line
```
```regex
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
```
Core Philosophy
Overview
Text extraction uses regex to pull specific pieces of information out of larger bodies of text. Unlike validation (which checks an entire string against a format), extraction scans through content to find and capture all occurrences of a pattern. This is fundamental to data scraping, ETL pipelines, document processing, and content analysis.
Core Concepts
Global Matching
Extraction typically requires finding all matches in a body of text, not just the first.
- Python: `re.findall()` or `re.finditer()`
- JavaScript: `string.matchAll(regex)` with the `g` flag
- Java: loop on `matcher.find()`
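A minimal Python sketch of the difference between a single search and global matching. The sample sentence and variable names are illustrative assumptions; the email pattern is the one used elsewhere in this skill.

```python
import re

text = "Contact alice@example.com or bob@example.org for details."
pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# search() stops at the first occurrence.
first = pattern.search(text)

# findall() returns every occurrence as a list of strings.
all_matches = pattern.findall(text)

# finditer() yields match objects, so positions are available too.
spans = [(m.group(), m.start()) for m in pattern.finditer(text)]

print(first.group())   # alice@example.com
print(all_matches)     # ['alice@example.com', 'bob@example.org']
print(spans)
```

`finditer` is usually the better fit when offsets or named groups are needed; `findall` is enough when only the matched text matters.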
Anchored vs. Unanchored Patterns
Extraction patterns are usually unanchored (no ^ or $) because the target data is embedded within surrounding text.
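As a small illustration (the sample sentence is an assumption), the same date pattern finds nothing when anchored to the whole string but succeeds when left unanchored:

```python
import re

text = "Released on 2026-03-17 after review."

# Anchored: the entire string would have to be a date, so this fails.
anchored = re.search(r"^\d{4}-\d{2}-\d{2}$", text)

# Unanchored: the date is found wherever it sits in the text.
unanchored = re.search(r"\d{4}-\d{2}-\d{2}", text)

print(anchored)            # None
print(unanchored.group())  # 2026-03-17
```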
Greedy vs. Lazy for Delimited Content
When extracting content between delimiters, lazy quantifiers prevent over-matching:
"(.*?)" # lazy: captures content of each quoted string individually
"(.*)" # greedy: captures from first quote to last quote on the line
Implementation Patterns
Extract all email addresses from text
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Extract all URLs from text
https?://[^\s<>"']+
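A hedged sketch of the URL pattern in use. The sample strings are assumptions, and the `rstrip` cleanup is one possible post-processing step, not part of the pattern itself: because the character class only excludes whitespace, angle brackets, and quotes, sentence punctuation that directly follows a URL is captured with it.

```python
import re

URL_RE = re.compile(r"""https?://[^\s<>"']+""")

html_text = 'See https://example.com/docs and <a href="http://example.org/a?b=1">link</a>.'
urls = URL_RE.findall(html_text)

# A trailing comma or period is not excluded by the class, so it is captured.
raw = URL_RE.findall("Visit https://example.com/page, then reply.")

# Assumed cleanup: strip common trailing punctuation after extraction.
cleaned = [u.rstrip(".,;:!?)") for u in raw]

print(urls)     # ['https://example.com/docs', 'http://example.org/a?b=1']
print(raw)      # ['https://example.com/page,']
print(cleaned)  # ['https://example.com/page']
```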
Extract hashtags
#([a-zA-Z]\w{0,138})
Captures the tag name without the # symbol.
Extract @mentions
@([a-zA-Z_]\w{0,38})
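A short sketch applying the hashtag and mention patterns to one sample post (the post text is an assumption). Because each pattern has a single capture group, `findall` returns the names without the leading `#` or `@`.

```python
import re

post = "Loving #Python3 and #regex tips from @ada_lovelace! #100days"

hashtags = re.findall(r"#([a-zA-Z]\w{0,138})", post)
mentions = re.findall(r"@([a-zA-Z_]\w{0,38})", post)

print(hashtags)  # ['Python3', 'regex']
print(mentions)  # ['ada_lovelace']
```

`#100days` is skipped because the hashtag pattern requires a letter as the first character. Note that the mention pattern would also fire on the domain part of an email address, so it may be worth removing emails first if the text can contain them.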
Extract monetary amounts
\$[\d,]+(?:\.\d{2})?
Matches: $1,234.56, $99, $1,000,000.00
For multiple currencies:
(?P<currency>[$\u20AC\u00A3\u00A5])(?P<amount>[\d,]+(?:\.\d{2})?)
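A sketch of the multi-currency pattern feeding structured output. The sample text and the choice of `Decimal` for amounts are assumptions; the character class here lists `$` once, which is equivalent for matching purposes.

```python
import re
from decimal import Decimal

MONEY_RE = re.compile(
    r"(?P<currency>[$\u20AC\u00A3\u00A5])(?P<amount>[\d,]+(?:\.\d{2})?)"
)

# \u20ac is the euro sign, \u00a3 the pound sign.
text = "Paid $1,234.56 upfront, then \u20ac99 and \u00a31,000,000.00 later."

# Named groups give direct access to each part; commas are removed
# in post-processing before converting to a number.
amounts = [
    (m["currency"], Decimal(m["amount"].replace(",", "")))
    for m in MONEY_RE.finditer(text)
]

for currency, value in amounts:
    print(currency, value)
```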
Extract dates in multiple formats
(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{2}[/-]\d{2}|\w+ \d{1,2},?\s*\d{4})
Matches: 03/17/2026, 2026-03-17, March 17, 2026
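A sketch that extracts dates with the pattern above and then normalizes them. The format list is an assumption: in particular, slash-separated dates are treated as month-first, which the regex itself cannot decide.

```python
import re
from datetime import datetime

DATE_RE = re.compile(
    r"(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|\d{4}[/-]\d{2}[/-]\d{2}"
    r"|\w+ \d{1,2},?\s*\d{4})"
)

# Assumed candidate formats, tried in order.
FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")

def to_iso(raw):
    """Return an ISO date string, or None if no assumed format fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

text = "Filed 03/17/2026, amended 2026-03-17, and finalized March 17, 2026."
found = DATE_RE.findall(text)
normalized = [to_iso(d) for d in found]

print(found)       # ['03/17/2026', '2026-03-17', 'March 17, 2026']
print(normalized)  # ['2026-03-17', '2026-03-17', '2026-03-17']
```

The regex only locates date-shaped text; validation (for example rejecting `13/45/2026`) happens in the parsing step.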
Extract key-value pairs from configuration text
^(?P<key>[A-Z_][A-Z0-9_]*)\s*=\s*(?P<value>.+)$
With multiline flag, extracts all KEY=value pairs from a config file.
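A minimal sketch of the key-value pattern with and without the multiline flag. The config text is an assumed sample; values are stripped afterwards because `.+` keeps trailing spaces.

```python
import re

KV_PATTERN = r"^(?P<key>[A-Z_][A-Z0-9_]*)\s*=\s*(?P<value>.+)$"

config = (
    "# database settings\n"
    "DB_HOST = localhost\n"
    "DB_PORT=5432\n"
    "debug=true\n"
    "API_KEY = abc123  \n"
)

# With re.MULTILINE, ^ and $ match at every line boundary.
settings = {
    m["key"]: m["value"].strip()
    for m in re.finditer(KV_PATTERN, config, re.MULTILINE)
}

# Without the flag, ^ only matches at the very start of the string,
# which here is a comment line, so nothing is found.
without_flag = re.findall(KV_PATTERN, config)

print(settings)      # {'DB_HOST': 'localhost', 'DB_PORT': '5432', 'API_KEY': 'abc123'}
print(without_flag)  # []
```

The lowercase `debug=true` line is skipped because the key class only allows uppercase letters, digits, and underscores.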
Extract HTML tag attributes
(?P<attr>\w+)\s*=\s*"(?P<value>[^"]*)"
From <a href="https://example.com" class="link">, captures href/https://example.com and class/link.
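A quick sketch on a single tag fragment. With two capture groups, `findall` returns tuples that convert directly into a dictionary. This assumes double-quoted attribute values, as the pattern does; single-quoted or unquoted attributes would be missed.

```python
import re

ATTR_RE = re.compile(r'(?P<attr>\w+)\s*=\s*"(?P<value>[^"]*)"')

tag = '<a href="https://example.com" class="link">'
attrs = dict(ATTR_RE.findall(tag))

print(attrs)  # {'href': 'https://example.com', 'class': 'link'}
```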
Extract CSV fields (handling quoted fields)
(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))
Handles fields with commas inside quotes and escaped double quotes.
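A sketch of the CSV pattern applied to one line. The helper name `split_csv_line` is hypothetical. Group 1 holds quoted fields and group 2 holds bare ones, and doubled quotes are unescaped after extraction.

```python
import re

FIELD_RE = re.compile(r'(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))')

def split_csv_line(line):
    """Split a single CSV line into fields (hypothetical helper)."""
    fields = []
    for m in FIELD_RE.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            # Quoted field: turn escaped "" back into a single quote.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(bare)
    return fields

line = 'name,"Smith, Jane","said ""hi""",42'
print(split_csv_line(line))  # ['name', 'Smith, Jane', 'said "hi"', '42']
```

This works line by line only; quoted fields that contain newlines are out of scope here, and for whole files Python's standard `csv` module is likely the safer choice.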
Extract code blocks from Markdown
```(\w*)\n([\s\S]*?)```
Captures the language identifier and code content.
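A sketch of the code-block pattern in Python. The fence string is built programmatically purely so that this example can itself sit inside a fenced block; the sample Markdown is an assumption.

```python
import re

FENCE = "`" * 3  # three backticks, built in code to avoid a literal fence here
BLOCK_RE = re.compile(FENCE + r"(\w*)\n([\s\S]*?)" + FENCE)

markdown = (
    "Intro text\n"
    + FENCE + "python\n"
    + "print('hi')\n"
    + FENCE + "\n"
    + "More text\n"
    + FENCE + "\n"
    + "plain block\n"
    + FENCE + "\n"
)

# An empty language identifier is normalized to None.
blocks = [(lang or None, code) for lang, code in BLOCK_RE.findall(markdown)]

print(blocks)  # [('python', "print('hi')\n"), (None, 'plain block\n')]
```

The lazy `[\s\S]*?` is what stops each match at the nearest closing fence instead of the last one in the document.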
Extract table data from plain text
Pipe-delimited table rows:
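^\|\s*(?P<cells>[^|\n]+(?:\|[^|\n]+)*)\s*\|$
The cell class excludes newlines so that, with the multiline flag, a match cannot run across several rows.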
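A sketch of row extraction from a small pipe-delimited table (the table content is an assumption). The cell class in this sketch is `[^|\n]`: a plain `[^|]` also matches newlines, which would let a single match swallow several rows when the whole table is scanned at once. Splitting cells and skipping the separator row are post-processing choices, not part of the pattern.

```python
import re

ROW_RE = re.compile(
    r"^\|\s*(?P<cells>[^|\n]+(?:\|[^|\n]+)*)\s*\|$",
    re.MULTILINE,
)

table = (
    "| Name | Age |\n"
    "|------|-----|\n"
    "| Ada | 36 |\n"
    "| Alan | 41 |\n"
)

rows = []
for m in ROW_RE.finditer(table):
    cells = [c.strip() for c in m["cells"].split("|")]
    # Skip the header separator row, which contains only dashes or colons.
    if all(set(c) <= set("-:") for c in cells):
        continue
    rows.append(cells)

print(rows)  # [['Name', 'Age'], ['Ada', '36'], ['Alan', '41']]
```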
Extract structured data from natural language
Extract ages from biographical text:
(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)(?:,?\s+(?:age|aged)\s+)(?P<age>\d{1,3})
Matches "Jane Smith, age 34" and "John Doe aged 28".
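A short sketch showing the named groups turned into typed records. The sample sentence is an assumption.

```python
import re

AGE_RE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)(?:,?\s+(?:age|aged)\s+)(?P<age>\d{1,3})"
)

text = "Jane Smith, age 34, met John Doe aged 28 in Paris."

# Convert the captured age to int during post-processing.
people = [(m["name"], int(m["age"])) for m in AGE_RE.finditer(text)]

print(people)  # [('Jane Smith', 34), ('John Doe', 28)]
```

The name part assumes two capitalized ASCII words, so names such as "Mary-Jane O'Neil" or names with accented letters would need a broader class.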
Extract version numbers
v?(?P<major>\d+)\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?(?:-(?P<label>[a-zA-Z0-9.]+))?
Matches: v2.1.0, 3.14, 1.0.0-beta.2
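A sketch converting version matches into comparable tuples. The sample text is an assumption, and treating a missing patch as `0` is a choice made here for illustration.

```python
import re

VERSION_RE = re.compile(
    r"v?(?P<major>\d+)\.(?P<minor>\d+)"
    r"(?:\.(?P<patch>\d+))?"
    r"(?:-(?P<label>[a-zA-Z0-9.]+))?"
)

text = "Upgrade from v2.1.0 to 3.14, or try 1.0.0-beta.2 instead."

versions = [
    (int(m["major"]), int(m["minor"]), int(m["patch"] or 0), m["label"])
    for m in VERSION_RE.finditer(text)
]

print(versions)  # [(2, 1, 0, None), (3, 14, 0, None), (1, 0, 0, 'beta.2')]
```

Because the label class allows `.`, a version that ends a sentence (for example `1.0.0-beta.2.`) would capture the final period as part of the label, so trailing dots may need to be stripped afterwards.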
Best Practices
- Use `findall` or `finditer` (global matching) for extraction tasks. A single `match` or `search` only returns the first occurrence.
- Prefer specific character classes over `.` to avoid capturing unintended content. `[^\s]+` is better than `.+?` for a "word."
- When extracting from HTML or XML, prefer a proper parser (BeautifulSoup, lxml). Regex is acceptable for quick extractions from fragments but unreliable for complex nested structures.
- Test extraction patterns against real-world samples with varied formatting. Edge cases are common in unstructured text.
- Post-process extracted data: strip whitespace, normalize case, convert types. Regex gets the raw text; your code cleans it.
- Use named groups to create structured output directly, making downstream processing clearer.
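The advice on quantifiers and specific character classes can be checked side by side. This is a small sketch on an assumed sample line, comparing a lazy quantifier, a greedy one, and a negated character class for quoted strings.

```python
import re

line = 'say "hello" and "goodbye" now'

lazy = re.findall(r'"(.*?)"', line)       # stops at the nearest closing quote
greedy = re.findall(r'"(.*)"', line)      # runs to the last quote on the line
negated = re.findall(r'"([^"]*)"', line)  # cannot cross a quote at all

print(lazy)     # ['hello', 'goodbye']
print(greedy)   # ['hello" and "goodbye']
print(negated)  # ['hello', 'goodbye']
```

The negated class gives the same result as the lazy version here, and it states the intent ("anything except a quote") directly in the pattern.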
Common Pitfalls
- Over-matching with greedy quantifiers. Always consider whether `.*` should be `.*?` when content is between delimiters.
- Extracting from HTML with regex. Patterns like `<div>(.*?)</div>` break with nested `<div>` tags. Use a parser for nested structures.
- Assuming whitespace is a reliable delimiter. Text may contain tabs, non-breaking spaces, or multiple consecutive spaces.
- Forgetting multiline mode when extracting from multi-line documents. Without `re.MULTILINE`, `^` and `$` only match start and end of the entire string.
- Not handling Unicode text. Names, addresses, and content in many languages contain characters outside ASCII. Use the Unicode flag or Unicode-aware character classes.
- Capturing too much or too little context. Always verify extraction results against the source text.
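The whitespace and Unicode pitfalls can be seen in a few lines of Python. The sample string is an assumption containing accented letters, a non-breaking space (`\u00a0`), and a tab.

```python
import re

text = "Zo\u00eb M\u00fcller,\u00a0age\t41"

# Splitting on a literal space misses the non-breaking space and the tab.
naive_split = text.split(" ")

# In Python str patterns, \s covers tabs and non-breaking spaces as well.
regex_split = re.split(r"\s+", text)

# \w is Unicode-aware by default for str patterns; re.ASCII narrows it.
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(len(naive_split))  # 2
print(len(regex_split))  # 4
print(unicode_words)     # ['Zoë', 'Müller', 'age', '41']
print(ascii_words)       # ['Zo', 'M', 'ller', 'age', '41']
```

Other regex engines differ in their defaults, so the flag or class needed for Unicode-aware matching should be checked per language.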
Anti-Patterns
Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.