
Text Extraction

Regex patterns for extracting structured data from unstructured or semi-structured text sources

Quick Summary
You are an expert in using regular expressions to extract meaningful data from unstructured and semi-structured text.

## Key Points

- **Python:** `re.findall()` or `re.finditer()`
- **JavaScript:** `string.matchAll(regex)` with the `g` flag
- **Java:** loop on `matcher.find()`
- Use `findall` or `finditer` (global matching) for extraction tasks. A single `match` or `search` only returns the first occurrence.
- Prefer specific character classes over `.` to avoid capturing unintended content. `[^\s]+` is better than `.+?` for a "word."
- When extracting from HTML or XML, prefer a proper parser (BeautifulSoup, lxml). Regex is acceptable for quick extractions from fragments but unreliable for complex nested structures.
- Test extraction patterns against real-world samples with varied formatting. Edge cases are common in unstructured text.
- Post-process extracted data: strip whitespace, normalize case, convert types. Regex gets the raw text; your code cleans it.
- Use named groups to create structured output directly, making downstream processing clearer.
- Watch for over-matching with greedy quantifiers. Always consider whether `.*` should be `.*?` when content is between delimiters.
- Avoid extracting nested HTML with regex. Patterns like `<div>(.*?)</div>` break with nested `<div>` tags. Use a parser for nested structures.
- Do not assume whitespace is a reliable delimiter. Text may contain tabs, non-breaking spaces, or multiple consecutive spaces.

## Quick Example

```regex
"(.*?)"     # lazy: captures content of each quoted string individually
"(.*)"      # greedy: captures from first quote to last quote on the line
```

```regex
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
```

Text Extraction — Regular Expressions

Overview

Text extraction uses regex to pull specific pieces of information out of larger bodies of text. Unlike validation (which checks an entire string against a format), extraction scans through content to find and capture all occurrences of a pattern. This is fundamental to data scraping, ETL pipelines, document processing, and content analysis.

Core Concepts

Global Matching

Extraction typically requires finding all matches in a body of text, not just the first.

  • Python: re.findall() or re.finditer()
  • JavaScript: string.matchAll(regex) with the g flag
  • Java: loop on matcher.find()
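
As a minimal sketch in Python, the difference between a first-match call and global matching might look like this. The sample text and the `\d+` pattern are illustrative assumptions, not part of the original skill.

```python
import re

# Hypothetical sample text used only for illustration.
text = "Order 12 shipped, order 345 pending, order 6 cancelled."
pattern = re.compile(r"\d+")

# search() stops at the first occurrence.
first = pattern.search(text).group()

# findall() returns every match as a list of strings.
all_numbers = pattern.findall(text)

# finditer() yields match objects, which also expose positions.
positions = [(m.group(), m.start()) for m in pattern.finditer(text)]
```

`finditer` is usually the better fit when the match position or several groups are needed; `findall` is enough when only the matched strings matter.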

Anchored vs. Unanchored Patterns

Extraction patterns are usually unanchored (no ^ or $) because the target data is embedded within surrounding text.

Greedy vs. Lazy for Delimited Content

When extracting content between delimiters, lazy quantifiers prevent over-matching:

"(.*?)"     # lazy: captures content of each quoted string individually
"(.*)"      # greedy: captures from first quote to last quote on the line

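The two patterns above can be compared on a single line of text. This is a small sketch with a made-up sample string; the third pattern, a negated character class, is an added alternative rather than something the original prescribes.

```python
import re

# Hypothetical sample line containing two quoted strings.
line = 'say "hello" and then "goodbye" twice'

# Lazy: stops at the nearest closing quote, so each string is captured separately.
lazy = re.findall(r'"(.*?)"', line)

# Greedy: runs to the last quote on the line and captures everything in between.
greedy = re.findall(r'"(.*)"', line)

# Negated class: cannot cross a quote at all, so it behaves like the lazy form here.
negated = re.findall(r'"([^"]*)"', line)
```

The negated class often states the intent more directly ("anything except the delimiter") and does not depend on backtracking behavior.
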
Implementation Patterns

Extract all email addresses from text

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Extract all URLs from text

https?://[^\s<>"']+
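
A short Python sketch of this pattern in use. Because `[^\s<>"']+` also accepts sentence punctuation, the example strips a few trailing characters afterwards; that clean-up step and the sample text are assumptions of this sketch.

```python
import re

URL_RE = re.compile(r"""https?://[^\s<>"']+""")

# Hypothetical sample text with URLs in prose, angle brackets, and quotes.
text = (
    'Docs: https://example.com/guide. '
    'Mirror: <http://example.org/a?b=1>, see also "https://example.net".'
)

# The character class happily matches a trailing "." or ",", so strip
# common sentence punctuation from the end of each match.
urls = [m.group().rstrip(".,;:!?)") for m in URL_RE.finditer(text)]
```

Stripping `)` can damage URLs that legitimately end in a parenthesis, so the set of stripped characters is worth adjusting to the data at hand.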

Extract hashtags

#([a-zA-Z]\w{0,138})

Captures the tag name without the # symbol.

Extract @mentions

@([a-zA-Z_]\w{0,38})
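
One thing to watch with the mention pattern is that it also fires inside email addresses. The sketch below shows that effect and one possible guard, a `(?<!\w)` lookbehind; the lookbehind and the sample post are assumptions added for illustration.

```python
import re

# Patterns from the text, each prefixed with a lookbehind so the symbol
# must not directly follow a word character (an added assumption).
HASHTAG_RE = re.compile(r"(?<!\w)#([a-zA-Z]\w{0,138})")
MENTION_RE = re.compile(r"(?<!\w)@([a-zA-Z_]\w{0,38})")

# Hypothetical sample post.
post = "Thanks @ada_l and @grace! Mail me at bob@example.com #regex #Tips2026"

hashtags = HASHTAG_RE.findall(post)
mentions = MENTION_RE.findall(post)

# Without the lookbehind, the domain of the email address is picked up too.
unguarded = re.findall(r"@([a-zA-Z_]\w{0,38})", post)
```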

Extract monetary amounts

\$\d+(?:,\d{3})*(?:\.\d{2})?

Matches: $1,234.56, $99, $1,000,000.00

For multiple currencies:

(?P<currency>[$\u20AC\u00A3\u00A5])(?P<amount>\d+(?:,\d{3})*(?:\.\d{2})?)
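
A hedged Python sketch of the multi-currency pattern with named groups, followed by conversion to `Decimal`. The amount sub-pattern here requires each comma to be followed by exactly three digits, so a list separator such as the comma after `$99,` is not captured; that tightening and the sample text are assumptions of this sketch.

```python
import re
from decimal import Decimal

MONEY_RE = re.compile(
    r"(?P<currency>[$\u20AC\u00A3\u00A5])"          # $, euro, pound, yen
    r"(?P<amount>\d+(?:,\d{3})*(?:\.\d{2})?)"       # digits, optional groups and cents
)

# Hypothetical sample text; \u20ac is the euro sign, \u00a3 the pound sign.
text = "Paid $1,234.56, then \u20ac99, then \u00a31,000,000.00 in total."

# Regex yields raw strings; remove thousands separators before converting.
amounts = [
    (m["currency"], Decimal(m["amount"].replace(",", "")))
    for m in MONEY_RE.finditer(text)
]
```

`Decimal` is used instead of `float` so that cent values are not subject to binary rounding.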

Extract dates in multiple formats

(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{2}[/-]\d{2}|\w+ \d{1,2},?\s*\d{4})

Matches: 03/17/2026, 2026-03-17, March 17, 2026
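
The pattern only finds date-like strings; turning them into real dates is a separate step. A minimal sketch, assuming slashed dates are month-first and that month names are in English (both are assumptions, as are the helper name `to_date` and the sample text):

```python
import re
from datetime import date, datetime
from typing import Optional

DATE_RE = re.compile(
    r"(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|\d{4}[/-]\d{2}[/-]\d{2}"
    r"|\w+ \d{1,2},?\s*\d{4})"
)

# Candidate formats tried in order; month-first is an assumption.
FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")

def to_date(raw: str) -> Optional[date]:
    """Hypothetical helper: return the first successful parse, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None

text = "Filed 03/17/2026, revised on 2026-03-17 and published March 17, 2026."

raw_dates = [m["date"] for m in DATE_RE.finditer(text)]
parsed = [to_date(d) for d in raw_dates]
```

Strings that match the regex but are not valid dates (for example `99/99/2026`) come back as `None`, which keeps validation out of the pattern itself.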

Extract key-value pairs from configuration text

^(?P<key>[A-Z_][A-Z0-9_]*)\s*=\s*(?P<value>.+)$

With multiline flag, extracts all KEY=value pairs from a config file.
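
A small Python sketch of this pattern with `re.MULTILINE`, collecting the pairs into a dictionary. The sample config content is made up for illustration.

```python
import re

KV_RE = re.compile(
    r"^(?P<key>[A-Z_][A-Z0-9_]*)\s*=\s*(?P<value>.+)$",
    re.MULTILINE,  # lets ^ and $ match at every line boundary
)

# Hypothetical config text: a comment, a lowercase key, and trailing spaces.
config = (
    "# database settings\n"
    "DB_HOST = localhost\n"
    "DB_PORT=5432\n"
    "debug=true\n"
    "API_KEY = abc123  \n"
)

# The value group keeps trailing whitespace, so strip it afterwards.
settings = {m["key"]: m["value"].strip() for m in KV_RE.finditer(config)}
```

Note that `\s` includes newlines: with an empty value such as `KEY=` on one line, `\s*` can run onto the following line and capture it as the value. Replacing `\s*` with `[ \t]*` is one way to avoid that.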

Extract HTML tag attributes

(?P<attr>\w+)\s*=\s*"(?P<value>[^"]*)"

From <a href="https://example.com" class="link">, captures href/https://example.com and class/link.
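
In Python, the same example could be sketched as follows; the dictionary output is just one convenient shape for the result.

```python
import re

ATTR_RE = re.compile(r'(?P<attr>\w+)\s*=\s*"(?P<value>[^"]*)"')

tag = '<a href="https://example.com" class="link">'

# Named groups map directly onto attribute name and value.
attrs = {m["attr"]: m["value"] for m in ATTR_RE.finditer(tag)}
```

Two limits are worth keeping in mind: `\w+` does not include hyphens, so `data-id="7"` is captured as `id`, and single-quoted or unquoted values are not matched at all.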

Extract CSV fields (handling quoted fields)

(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))

Handles fields with commas inside quotes and escaped double quotes.
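
A sketch of how the two capture groups might be combined into a list of fields for one line. The helper name `split_csv_line` and the sample row are hypothetical.

```python
import re
from typing import List

FIELD_RE = re.compile(r'(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))')

def split_csv_line(line: str) -> List[str]:
    """Hypothetical helper: split one CSV line into fields."""
    fields = []
    for m in FIELD_RE.finditer(line):
        if m.group(1) is not None:
            # Quoted field: collapse doubled quotes into a single quote.
            fields.append(m.group(1).replace('""', '"'))
        else:
            # Bare field (possibly empty).
            fields.append(m.group(2))
    return fields

row = split_csv_line('1,"Smith, Jane","She said ""hi""",,end')
```

This works line by line only. For real files, where quoted fields may contain newlines, Python's standard `csv` module is the more robust choice.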

Extract code blocks from Markdown

```(\w*)\n([\s\S]*?)```

Captures the language identifier and code content.
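
A Python sketch of this pattern applied to a small document. The fence string is built with `"`" * 3` only so that the example can itself be shown inside a fenced block; the sample Markdown is made up.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly for display purposes

# Same pattern as above: optional language word, newline, lazy body, closing fence.
BLOCK_RE = re.compile(FENCE + r"(\w*)\n([\s\S]*?)" + FENCE)

# Hypothetical document with one labeled and one unlabeled block.
markdown = (
    "Intro text.\n"
    + FENCE + "python\n"
    + "print('hi')\n"
    + FENCE + "\n"
    + "More text.\n"
    + FENCE + "\n"
    + "plain block\n"
    + FENCE + "\n"
)

# Each result is a (language, code) tuple; language is "" when omitted.
blocks = BLOCK_RE.findall(markdown)
```

`[\s\S]` is used instead of `.` so the body can span lines without needing the DOTALL flag.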

Extract table data from plain text

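Pipe-delimited table rows, one match per row with the multiline flag (the cell classes exclude \n so a match cannot spill onto the next row):

^\|\s*(?P<cells>[^|\n]+(?:\|[^|\n]+)*)\s*\|$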
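
A minimal Python sketch that turns matched rows into lists of cells. The character classes here exclude `\n`, because a plain `[^|]` also matches newlines and lets one match run across several rows; that adjustment, the separator-row filter, and the sample table are assumptions of this sketch.

```python
import re

ROW_RE = re.compile(
    r"^\|\s*(?P<cells>[^|\n]+(?:\|[^|\n]+)*)\s*\|$",
    re.MULTILINE,
)

# Hypothetical pipe-delimited table with a header separator row.
table = (
    "| Name  | Qty |\n"
    "|-------|-----|\n"
    "| apple | 3   |\n"
    "| pear  | 12  |\n"
)

rows = []
for m in ROW_RE.finditer(table):
    cells = [c.strip() for c in m["cells"].split("|")]
    # Skip separator rows made only of dashes and colons.
    if all(c.strip("-:") == "" for c in cells):
        continue
    rows.append(cells)
```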

Extract structured data from natural language

Extract ages from biographical text:

(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)(?:,?\s+(?:age|aged)\s+)(?P<age>\d{1,3})

Matches "Jane Smith, age 34" and "John Doe aged 28".
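
As a sketch, the named groups can feed straight into structured records, with the age converted to an integer. The sample sentence is made up for illustration.

```python
import re

AGE_RE = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
    r"(?:,?\s+(?:age|aged)\s+)"
    r"(?P<age>\d{1,3})"
)

# Hypothetical biographical sentence.
bio = "Jane Smith, age 34, met John Doe aged 28 in Paris."

people = [
    {"name": m["name"], "age": int(m["age"])}
    for m in AGE_RE.finditer(bio)
]
```

The name part is deliberately narrow: it expects exactly two capitalized ASCII words, so names with apostrophes, hyphens, accents, or middle names are not matched.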

Extract version numbers

v?(?P<major>\d+)\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?(?:-(?P<label>[a-zA-Z0-9.]+))?

Matches: v2.1.0, 3.14, 1.0.0-beta.2
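
A short Python sketch that converts each match into a tuple of integers plus an optional label. Treating a missing patch number as `0` is an assumption of this example, as are the helper name `parse` and the sample text.

```python
import re

VERSION_RE = re.compile(
    r"v?(?P<major>\d+)\.(?P<minor>\d+)"
    r"(?:\.(?P<patch>\d+))?"
    r"(?:-(?P<label>[a-zA-Z0-9.]+))?"
)

def parse(m: "re.Match") -> tuple:
    """Hypothetical helper: (major, minor, patch, label) with patch defaulting to 0."""
    return (int(m["major"]), int(m["minor"]), int(m["patch"] or 0), m["label"])

text = "Upgraded from v2.1.0 to 3.14, then tested 1.0.0-beta.2"

versions = [parse(m) for m in VERSION_RE.finditer(text)]
```

Because the label class includes `.`, a label at the end of a sentence would also capture the final period; stripping trailing dots from the label is a reasonable post-processing step.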

Best Practices

  • Use findall or finditer (global matching) for extraction tasks. A single match or search only returns the first occurrence.
  • Prefer specific character classes over . to avoid capturing unintended content. [^\s]+ is better than .+? for a "word."
  • When extracting from HTML or XML, prefer a proper parser (BeautifulSoup, lxml). Regex is acceptable for quick extractions from fragments but unreliable for complex nested structures.
  • Test extraction patterns against real-world samples with varied formatting. Edge cases are common in unstructured text.
  • Post-process extracted data: strip whitespace, normalize case, convert types. Regex gets the raw text; your code cleans it.
  • Use named groups to create structured output directly, making downstream processing clearer.

Common Pitfalls

  • Over-matching with greedy quantifiers. Always consider whether .* should be .*? when content is between delimiters.
  • Extracting from HTML with regex. Patterns like <div>(.*?)</div> break with nested <div> tags. Use a parser for nested structures.
  • Assuming whitespace is a reliable delimiter. Text may contain tabs, non-breaking spaces, or multiple consecutive spaces.
  • Forgetting multiline mode when extracting from multi-line documents. Without re.MULTILINE, ^ matches only at the start of the string and $ only at the end (or just before a final trailing newline).
  • Not handling Unicode text. Names, addresses, and content in many languages contain characters outside ASCII. Use the Unicode flag or Unicode-aware character classes.
  • Capturing too much or too little context. Always verify extraction results against the source text.
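
Three of these pitfalls can be illustrated briefly in Python. The sample strings are made up; the non-ASCII characters are written as escapes so the example does not depend on file encoding.

```python
import re

doc = "first line\nsecond line\nthird line"

# Multiline: without the flag, ^ matches only at the very start of the string.
without_flag = re.findall(r"^\w+", doc)
with_flag = re.findall(r"^\w+", doc, flags=re.MULTILINE)

# Unicode: for str patterns in Python 3, \w is Unicode-aware by default;
# re.ASCII restricts it to [a-zA-Z0-9_] and splits accented names apart.
name = "Zo\u00eb M\u00fcller"
unicode_words = re.findall(r"\w+", name)
ascii_words = re.findall(r"\w+", name, flags=re.ASCII)

# Whitespace: a non-breaking space (U+00A0), a double space, and a tab.
messy = "a\u00a0b  c\td"
split_on_space = messy.split(" ")       # literal space only
split_on_ws = re.split(r"\s+", messy)   # any run of whitespace
```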

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
