
# Log Parsing — Regular Expressions

You are an expert in using regular expressions to parse, extract, and analyze data from log files produced by web servers, application frameworks, and system services.

## Core Philosophy

### Overview

Log files are semi-structured text. Each line typically follows a predictable format defined by the application or logging framework. Regular expressions are the primary tool for decomposing log lines into structured fields like timestamps, severity levels, source identifiers, and message bodies.

## Core Concepts

### Anatomy of a Log Line

Most log formats share common fields:

```
TIMESTAMP  LEVEL  SOURCE  MESSAGE
```

The challenge is that each component can vary in format, and the message field is free-form text that may contain characters that look like delimiters.

### Greedy vs. Lazy in Log Parsing

Log lines often have a free-text message at the end. Use greedy `(.+)` for the final field to capture everything remaining. Use lazy `(.+?)` or negated character classes for fields that are followed by known delimiters.
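A minimal Python sketch of this split, using a made-up line in the generic `TIMESTAMP LEVEL SOURCE MESSAGE` shape: `\S+` for the delimited fields, a greedy `.+` for the free-text tail.

```python
import re

# Hypothetical sample line in TIMESTAMP LEVEL SOURCE MESSAGE shape.
line = "2026-03-17 14:30:00 ERROR auth-service Login failed for user: admin"

pattern = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"  # fixed-shape field
    r"(?P<level>\S+)\s+"       # delimited field: stop at whitespace
    r"(?P<source>\S+)\s+"      # delimited field: stop at whitespace
    r"(?P<message>.+)$"        # greedy tail: swallow the rest, colons and all
)

m = pattern.match(line)
print(m.groupdict())
```

Note the message keeps its internal colon intact: the greedy tail only starts after the delimited fields have been consumed.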

## Implementation Patterns

### Apache/Nginx Combined Log Format

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0"
```

```
^(?P<ip>\S+)\s+\S+\s+(?P<user>\S+)\s+\[(?P<timestamp>[^\]]+)\]\s+"(?P<method>\w+)\s+(?P<path>\S+)\s+(?P<protocol>[^"]+)"\s+(?P<status>\d{3})\s+(?P<bytes>\d+|-)\s+"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)"$
```
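The same pattern applied in Python against the sample line above (`re.match` anchors at the start of the string):

```python
import re

COMBINED = re.compile(
    r'^(?P<ip>\S+)\s+\S+\s+(?P<user>\S+)\s+\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>\w+)\s+(?P<path>\S+)\s+(?P<protocol>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\d+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)"$'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/5.0"')

m = COMBINED.match(line)
print(m["ip"], m["status"], m["path"])
```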

### Syslog (RFC 3164)

```
Mar 17 14:30:00 webserver01 sshd[12345]: Accepted publickey for admin from 10.0.0.1 port 22 ssh2
```

```
^(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<program>[^\[]+)\[(?P<pid>\d+)\]:\s+(?P<message>.+)$
```

In real-world syslog the `[pid]` part is not always present; wrap it in `(?:\[(?P<pid>\d+)\])?` if your logs mix both shapes.
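Applied in Python against the sample sshd line:

```python
import re

SYSLOG = re.compile(
    r'^(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+'
    r'(?P<program>[^\[]+)\[(?P<pid>\d+)\]:\s+(?P<message>.+)$'
)

line = ("Mar 17 14:30:00 webserver01 sshd[12345]: "
        "Accepted publickey for admin from 10.0.0.1 port 22 ssh2")

m = SYSLOG.match(line)
# The negated class [^\[]+ stops the program name cleanly at the "[".
print(m["program"], m["pid"])
```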

### JSON-structured logs (NDJSON)

Many modern applications emit one JSON object per line. While a JSON parser is preferred, regex can extract individual fields without full parsing:

```
"level"\s*:\s*"(?P<level>[^"]+)"
"timestamp"\s*:\s*"(?P<ts>[^"]+)"
```

### Java/Log4j pattern

```
2026-03-17 14:30:00.123 ERROR [main] com.example.App - Connection refused
```

```
^(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>\w+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<logger>\S+)\s+-\s+(?P<message>.+)$
```
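The same pattern in Python, with the captured timestamp promoted to a `datetime` (`%f` accepts the 3-digit millisecond field):

```python
import re
from datetime import datetime

LOG4J = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+'
    r'(?P<level>\w+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<logger>\S+)\s+-\s+'
    r'(?P<message>.+)$'
)

line = "2026-03-17 14:30:00.123 ERROR [main] com.example.App - Connection refused"
m = LOG4J.match(line)

# Convert the raw string into a real datetime for sorting/filtering.
ts = datetime.strptime(m["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
print(m["level"], ts.isoformat())
```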

### Python logging default format

```
ERROR:root:Something went wrong
```

```
^(?P<level>\w+):(?P<logger>[^:]+):(?P<message>.+)$
```

With a format string that includes a timestamp:

```
2026-03-17 14:30:00,123 - myapp - ERROR - Database connection failed
```

```
^(?P<timestamp>[\d-]+\s+[\d:,]+)\s+-\s+(?P<logger>\S+)\s+-\s+(?P<level>\w+)\s+-\s+(?P<message>.+)$
```

### Kubernetes pod logs

```
2026-03-17T14:30:00.123456789Z stderr F Error: container crashed
```

```
^(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s+(?P<stream>stdout|stderr)\s+(?P<flag>\S+)\s+(?P<message>.+)$
```

### Multi-line stack trace extraction

Match a Java exception with its stack trace. Because the pattern spans multiple lines, run it against the full file content (or a buffered chunk), not line by line:

```
(?P<exception>[\w.]+(?:Exception|Error)):\s*(?P<message>[^\n]*)\n(?P<stacktrace>(?:\s+at\s+.+\n?)+)
```
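A sketch in Python with a made-up two-frame trace, searching the whole text with `re.search`:

```python
import re

TRACE = re.compile(
    r'(?P<exception>[\w.]+(?:Exception|Error)):\s*(?P<message>[^\n]*)\n'
    r'(?P<stacktrace>(?:\s+at\s+.+\n?)+)'
)

# Hypothetical log fragment containing one exception with its trace.
log = (
    "2026-03-17 14:30:00 ERROR request handler failed\n"
    "java.net.ConnectException: Connection refused\n"
    "    at java.net.Socket.connect(Socket.java:589)\n"
    "    at com.example.App.main(App.java:12)\n"
)

m = TRACE.search(log)
print(m["exception"])
print(m["stacktrace"].splitlines())
```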

### Extract IP addresses from any log

```
\b(?P<ip>(?:\d{1,3}\.){3}\d{1,3})\b
```
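A quick Python check against a hypothetical line. Note the pattern is deliberately loose: it also accepts impossible octets such as `999.999.999.999`, so validate separately if that matters:

```python
import re

IP = re.compile(r'\b(?P<ip>(?:\d{1,3}\.){3}\d{1,3})\b')

text = "Accepted publickey for admin from 10.0.0.1 port 22; forwarded for 192.168.1.50"

# findall returns the named group's text when the pattern has one group.
ips = IP.findall(text)
print(ips)
```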

### Extract HTTP status codes and group by class

```
\b(?P<status>[1-5]\d{2})\b
```

Filter for errors only:

```
\b(?P<status>[45]\d{2})\b
```
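One way to roll matches up by class in Python, keyed on the first digit of the status. (This sketch takes the first 3-digit token on each line, which could also hit a number in the URL path; in production, extract the status positionally with the full combined-log pattern instead.)

```python
import re
from collections import Counter

STATUS = re.compile(r'\b(?P<status>[1-5]\d{2})\b')

# Hypothetical, abbreviated access-log lines.
lines = [
    'GET / HTTP/1.0" 200 512',
    'GET /missing HTTP/1.0" 404 0',
    'POST /api HTTP/1.0" 500 0',
]

by_class = Counter()
for line in lines:
    m = STATUS.search(line)
    if m:
        by_class[m["status"][0] + "xx"] += 1  # 200 -> "2xx"

print(by_class)
```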

## Best Practices

- Build patterns incrementally. Start with the timestamp, verify it matches, then add the next field.
- Use named groups so extracted data is self-documenting and easy to load into data structures.
- Anchor patterns with `^` for line-start when processing line by line to avoid false matches within message text.
- Handle optional fields with `(?:...)?` rather than making them required, since log formats often have missing fields.
- Pre-compile regex patterns when processing large log files for significant performance gains.
- When processing multi-GB files, use line-by-line streaming rather than loading the entire file into memory.
- Test patterns against real log samples, including edge cases like empty fields, special characters in messages, and multi-line entries.
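Several of these points combine into one streaming sketch: compile once, accept any iterable of lines (such as an open file handle), and skip lines that don't match. The `TIMESTAMP LEVEL MESSAGE` layout here is hypothetical:

```python
import re

# Compile once; reuse across millions of lines.
LOG_LINE = re.compile(
    r'^(?P<timestamp>\S+ \S+)\s+(?P<level>\w+)\s+(?P<message>.+)$'
)

def iter_records(lines):
    """Yield parsed dicts from an iterable of log lines.

    Pass an open file handle to stream a multi-GB file without
    loading it into memory: iter_records(open("app.log")).
    """
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if m:  # silently skip malformed lines; count them in real code
            yield m.groupdict()

sample = [
    "2026-03-17 14:30:00 ERROR boom\n",
    "partial\n",  # malformed: does not match, gets skipped
]
records = list(iter_records(sample))
print(records)
```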

## Common Pitfalls

- Assuming consistent formatting. Log formats drift over time, especially when multiple applications write to the same file. Build in flexibility.
- Using `.*` to skip over fields instead of a negated character class. `[^\s]+` (equivalently `\S+`) is both faster and more precise for a non-whitespace field.
- Forgetting that log timestamps vary widely: ISO 8601, Unix epoch, custom formats. Always verify the actual format in sample data.
- Not handling multi-line log entries (stack traces, multi-line JSON). Standard line-by-line processing will split these across multiple matches.
- Trying to parse JSON logs with regex when a JSON parser is available and more reliable.
- Ignoring character encoding issues. Log files may contain UTF-8, Latin-1, or mixed encodings that cause regex engines to fail on malformed bytes.
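A small illustration of the encoding pitfall: a stray Latin-1 byte in a nominally UTF-8 log. Decoding with `errors="replace"` substitutes U+FFFD instead of aborting the parse:

```python
# A stray Latin-1 byte (0xE9, "é") inside a nominally UTF-8 log line.
raw = b"user caf\xe9 logged in\n"

# Strict decoding would raise UnicodeDecodeError; "replace" degrades
# gracefully by substituting U+FFFD for the malformed byte.
line = raw.decode("utf-8", errors="replace")
print(line)
```

The same handler is available when reading files: `open("app.log", encoding="utf-8", errors="replace")`.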

## Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
