Web Scraping at Scale

Build scrapers that run reliably across thousands of pages, handle blocking and site changes, and stay within legal and rate-limit bounds.

Single-page scraping is a few lines of code. Scraping at scale — thousands or millions of pages, across dozens of sites, running daily — is engineering. The patterns that work for the first hundred pages break at scale; the patterns that work for one site fail on the next.

This skill covers the architecture and operational practices that make scrapers reliable at scale.

Legal and Ethical Frame

Before scraping, consider:

  • Terms of Service. Many sites prohibit scraping in their ToS. Violation may not be illegal but creates legal exposure. Read the relevant ToS for each target.
  • Public vs. private data. Public-facing data is generally more defensible to scrape than data behind authentication.
  • Personal data and GDPR. Scraping personal data (names, emails, profiles) triggers data-protection regulations. The work may require legal review.
  • Rate limits. Even when the ToS doesn't prohibit scraping, hammering a site is rude and counterproductive. Stay below rates that affect the site's other users.
  • robots.txt. Not legally binding but operationally important. Sites flag aggressive scrapers; ignoring robots.txt invites IP bans and legal escalation.

When in doubt, talk to legal. The cost of a 30-minute consultation is small relative to the cost of a cease-and-desist or a regulatory inquiry.

Architecture

A scalable scraper has separable components:

1. URL Queue

The work the scraper must do. URLs to crawl, with metadata (depth, source, timestamp). The queue:

  • Persists across restarts.
  • Deduplicates URLs.
  • Supports priority (newer URLs first; failed URLs retried later).
  • Survives crashes; the in-flight URL doesn't get lost.

Implementations: Redis queues, PostgreSQL with FOR UPDATE SKIP LOCKED, dedicated queue services.
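
A minimal sketch of the PostgreSQL variant, assuming a crawl_queue table with status, priority, and enqueued_at columns (illustrative names, not a prescribed schema). FOR UPDATE SKIP LOCKED lets many workers claim URLs concurrently without handing the same row to two of them.

```python
# Claim one pending URL atomically; concurrent workers skip rows that are
# already locked instead of blocking on them. Table and column names are
# assumptions for the example.
CLAIM_SQL = """
UPDATE crawl_queue
   SET status = 'in_flight', claimed_at = now()
 WHERE id = (
       SELECT id
         FROM crawl_queue
        WHERE status = 'pending'
        ORDER BY priority DESC, enqueued_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
 )
RETURNING id, url, depth;
"""

def claim_next(conn):
    """conn is a psycopg2 connection; returns (id, url, depth) or None when empty."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row
```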

2. Fetcher

Pulls URLs from the queue, fetches them, returns raw responses. The fetcher:

  • Uses proxies (often a rotating pool).
  • Respects rate limits (per-domain, per-proxy).
  • Handles retries on transient failures.
  • Records timestamps and HTTP status for audit.

The fetcher is decoupled from the parser. Raw responses are stored; reprocessing happens against the stored data, not by re-fetching.
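
A hedged sketch of the fetch step: retries on transient failures, status and timestamp recorded for audit, and the raw body returned for storage. The proxies dict and the surrounding queue and rate-limit plumbing are assumed to exist elsewhere.

```python
# Fetch one URL with exponential backoff on transient failures. The returned
# record (including the raw body) is stored as-is; parsing happens later.
import time
import requests

TRANSIENT = {429, 500, 502, 503, 504}

def fetch(url, proxies=None, attempts=3):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, proxies=proxies, timeout=30)
            if resp.status_code not in TRANSIENT:
                return {"url": url, "status": resp.status_code,
                        "fetched_at": time.time(), "body": resp.content}
        except requests.RequestException:
            pass                      # network errors are treated as transient
        time.sleep(2 ** attempt)      # back off before the next attempt
    return {"url": url, "status": None, "fetched_at": time.time(), "body": None}
```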

3. Parser

Extracts structured data from raw responses. The parser:

  • Has site-specific extraction rules.
  • Validates output against a schema.
  • Reports parsing errors so they can be fixed.

The parser is where most maintenance happens. Sites change their HTML; selectors break; the parser updates. Decoupling parsing from fetching means you can re-parse the entire historical dataset after fixing a bug, without re-fetching.
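
An illustrative site-specific parser; the selectors and field names are invented for the example. It works from stored HTML rather than a live fetch, so re-parsing the historical dataset is just a loop over storage.

```python
# Extract structured fields from one stored response. Selectors are
# site-specific and will need updating when the site's markup changes.
from bs4 import BeautifulSoup

def parse_product(raw_html: bytes) -> dict:
    soup = BeautifulSoup(raw_html, "html.parser")
    title = soup.select_one("h1.product-title")        # illustrative selector
    price = soup.select_one("[data-testid='price']")   # illustrative selector
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```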

4. Storage

Where extracted data lives. Database, blob storage, or both.

For dynamic sites, store both the raw HTML and the parsed output. The raw HTML is the audit trail and the data source for re-parsing. The parsed output is what downstream systems use.
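
A sketch of the dual write, with local files standing in for blob storage and an assumed pages table holding the parsed output; in production the raw side is usually S3 or similar.

```python
# Store the raw HTML (audit trail, re-parse source) and the parsed output
# (what downstream systems read). Paths and table layout are assumptions.
import hashlib
import json
import pathlib

RAW_DIR = pathlib.Path("raw")
RAW_DIR.mkdir(exist_ok=True)

def store(url: str, raw_html: bytes, parsed: dict, conn) -> None:
    key = hashlib.sha256(url.encode()).hexdigest()
    (RAW_DIR / f"{key}.html").write_bytes(raw_html)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pages (url, raw_key, parsed) VALUES (%s, %s, %s)",
            (url, key, json.dumps(parsed)),
        )
    conn.commit()
```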

5. Monitoring

What's being scraped, what's failing, what's blocked. The dashboard shows:

  • URLs scraped per hour.
  • Success rate by site.
  • Parsing error rate by site.
  • Average fetch time and proxy health.

When a site starts blocking, the dashboard shows it before the data dries up.
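
One way to compute the first two dashboard numbers, assuming a fetch_log table with site, status, and fetched_at columns (an assumption, not a required schema).

```python
# Hourly fetch volume and success rate per site over the last 24 hours.
SUCCESS_RATE_SQL = """
SELECT site,
       date_trunc('hour', fetched_at)          AS hour,
       count(*)                                AS fetched,
       avg((status BETWEEN 200 AND 299)::int)  AS success_rate
  FROM fetch_log
 WHERE fetched_at > now() - interval '24 hours'
 GROUP BY site, hour
 ORDER BY hour DESC, site;
"""

def dashboard_rows(conn):
    with conn.cursor() as cur:
        cur.execute(SUCCESS_RATE_SQL)
        return cur.fetchall()
```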

Rate Limiting

Per-domain rate limiting is essential. Hammering a site is rude and gets you blocked.

Token bucket per domain:

  • Configure per-domain rates (1 req/sec for small sites; 10/sec for large; 0.1/sec for sensitive ones).
  • The fetcher acquires a token before each request.
  • Tokens replenish over time.
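
A minimal in-process sketch of a per-domain token bucket; the rates are illustrative, and a multi-worker deployment would keep this state in Redis or similar rather than in memory.

```python
# Per-domain token bucket: acquire() blocks until a token is available for
# the requested domain, replenishing tokens based on elapsed time.
import time
from collections import defaultdict

class DomainLimiter:
    def __init__(self, rates, default_rate=0.5, burst=1.0):
        self.rates = rates            # e.g. {"example.com": 1.0} requests/sec (illustrative)
        self.default = default_rate
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def acquire(self, domain):
        rate = self.rates.get(domain, self.default)
        while True:
            now = time.monotonic()
            self.tokens[domain] = min(
                self.burst, self.tokens[domain] + (now - self.last[domain]) * rate
            )
            self.last[domain] = now
            if self.tokens[domain] >= 1:
                self.tokens[domain] -= 1
                return
            time.sleep((1 - self.tokens[domain]) / rate)
```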

Stagger across the day. Don't run a scraper that hits a site only at noon every day; spread the load across hours.

For sites with stricter limits, work with their official APIs instead of scraping. Many provide free tiers for low-volume use.

Proxy Strategy

For sites that block IP ranges or have strict rate limits, proxies are necessary.

Proxy types:

  • Datacenter proxies. Cheap, fast, easily detected by anti-bot systems.
  • Residential proxies. Real consumer IPs, expensive, harder to detect.
  • Mobile proxies. Cellular IPs, very expensive, hardest to detect.

For most scraping, residential proxies are the right balance. For sites with aggressive anti-bot systems, mobile proxies.

Proxy rotation:

  • New IP for each request, or each session, depending on site behavior.
  • Health-check proxies; remove ones that fail or are blocked.
  • Geographic distribution if the target serves region-specific content.
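
A sketch of the rotation and health-check bookkeeping; real deployments usually delegate this to the proxy provider, and the failure threshold here is an arbitrary choice.

```python
# Pick a healthy proxy at random for each request; drop proxies that fail
# repeatedly. Proxy URLs would come from the provider, not be hardcoded.
import random

class ProxyPool:
    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def pick(self):
        healthy = [p for p, n in self.failures.items() if n < self.max_failures]
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        return random.choice(healthy)

    def report(self, proxy, ok):
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1
```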

Proxy services (Bright Data, Smartproxy, Oxylabs) handle the rotation. Pricing is typically per GB of traffic; budget accordingly.

CAPTCHAs

Sites challenge suspected bots with CAPTCHAs. Three responses:

  1. Avoid them. Scrape less aggressively; use better proxies; don't trigger the detection.
  2. Solve them. Services (2Captcha, Anti-Captcha) solve CAPTCHAs for a fee. Solving adds roughly 30-90 seconds of latency per CAPTCHA.
  3. Stop. If the site is determined to block automation, accept it. Look for an API or partnership instead.

CAPTCHAs are an anti-bot signal. Hitting them frequently means the scraper has been detected; back off, change strategy.

Evasion

For sites that aggressively detect scrapers:

  • User-agent rotation. Use realistic, current user-agent strings.
  • Headless browser detection. Use a stealth plugin (playwright-extra with puppeteer-extra-plugin-stealth, or similar) to mask automation signals.
  • Timing. Add jitter between actions; don't send requests at perfectly regular intervals.
  • Header completeness. Real browsers send many headers; scrapers often send few. Match real browser fingerprints.
  • TLS fingerprinting. Some sites detect via TLS handshake patterns (JA3 fingerprint). Specialized libraries (curl-impersonate) help.
  • Browser fingerprinting. Canvas, WebGL, font enumeration. Hardest to mask; high effort.
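
A sketch of the two cheapest measures, timing jitter and a fuller header set; the header values show the shape, not a guaranteed-undetectable fingerprint.

```python
# Jittered delays and browser-like headers. The user-agent string is an
# example and should be kept current with real browser releases.
import random
import time
import requests

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/124.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

def polite_get(url, base_delay=2.0):
    time.sleep(base_delay + random.uniform(0, base_delay))  # never a fixed interval
    return requests.get(url, headers=HEADERS, timeout=30)
```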

Evasion is an arms race. The cost of staying undetected grows over time; reassess whether the data is worth the effort.

Schema Validation

Every parser output is validated against a schema. The schema:

  • Specifies expected fields and types.
  • Allows missing fields when they're genuinely optional (and the parser flags this).
  • Rejects malformed output for human review.
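
One way to express the schema, here with pydantic; the field names are illustrative, and the rejection path just logs where a real pipeline would route the record to review.

```python
# Validate parser output before it reaches storage; malformed records are
# rejected and reported rather than written.
import logging
from typing import Optional
from pydantic import BaseModel, ValidationError

log = logging.getLogger("parser")

class Product(BaseModel):
    url: str
    title: str
    price: Optional[str] = None   # genuinely optional field

def validate(record: dict) -> Optional[Product]:
    try:
        return Product(**record)
    except ValidationError as exc:
        log.warning("schema rejection, needs review: %s", exc)
        return None
```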

When a site changes its HTML, the parser produces wrong output. Schema validation catches it before the bad data hits storage.

Trends in error rate are the canary. A parser that was 99% successful and is now 60% successful indicates a site change. Investigate; update the parser.

Operational Practices

For production scrapers:

  • Health checks. Test that each scraper still runs successfully on a small sample daily.
  • Alerting. When a scraper's success rate drops, alert the team.
  • Rollback path. Sometimes a parser update is wrong and produces worse output than the old one. Be able to revert.
  • Audit log. Who triggered what, when, with what configuration. For legal and operational reasons.
  • Cost tracking. Per-site, per-day costs. Some sites are expensive to scrape (lots of proxy traffic, lots of CAPTCHAs); knowing the cost lets you make business-level decisions.
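
A minimal daily health-check sketch; sample_pages, the parser registry, and the alert hook are assumptions standing in for real infrastructure.

```python
# Run each parser over a small sample of stored pages and alert on drops.
def daily_health_check(parsers, sample_pages, alert, threshold=0.9):
    for site, parse in parsers.items():
        pages = sample_pages(site, n=20)
        ok = sum(1 for page in pages if parse(page) is not None)
        rate = ok / len(pages) if pages else 0.0
        if rate < threshold:
            alert(f"{site}: health check success rate {rate:.0%}")
```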

When to Use APIs Instead

Many sites offer official APIs. Even paid APIs are usually cheaper, more reliable, and more legally defensible than scraping.

Switch to an API when:

  • The site has one and the cost is reasonable.
  • The data you need is available through the API.
  • The terms of service permit the use.

Scraping is for when the API doesn't exist, doesn't expose what you need, or costs more than scraping does. It's a last resort, not a default.

Anti-Patterns

Hardcoded selectors. Brittle CSS or XPath that breaks on every redesign. Use multiple fallback selectors; favor semantic anchors (data attributes, accessibility tree).
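
A sketch of the fallback pattern; the selectors are illustrative, ordered from the most semantic anchor to the most brittle.

```python
# Try a semantic anchor first, then progressively looser selectors.
from bs4 import BeautifulSoup

PRICE_SELECTORS = [
    "[data-testid='price']",    # semantic anchor, most stable
    "span.price",               # class-based fallback
    "div.product-info span",    # last resort, most brittle
]

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None
```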

No rate limiting. The scraper hammers a site; the IP is banned. Throttle per-domain.

Storing only parsed output. When the parser is buggy, you can't reprocess. Store raw HTML.

No schema validation. Bad data flows downstream silently. Validate output; alert on parsing failures.

No proxy rotation when needed. The IP is blocked; the scraper produces zero data. Use residential proxies for sensitive sites.

Ignoring robots.txt. Even when not legally binding, ignoring it invites bans and legal escalation. Respect it.

Scraping when an API exists. An API is more reliable, more legally defensible, and often cheaper. Default to the API when one is available.
