Web Scraping at Scale
Single-page scraping is a few lines of code. Scraping at scale — thousands or millions of pages, across dozens of sites, running daily — is engineering. The patterns that work for the first hundred pages break at scale; the patterns that work for one site fail on the next.
This skill covers the architecture and operational practices that make scrapers reliable at scale.
Legal and Ethical Frame
Before scraping, consider:
- Terms of Service. Many sites prohibit scraping in their ToS. Violation may not be illegal but creates legal exposure. Read the relevant ToS for each target.
- Public vs. private data. Public-facing data is generally more defensible to scrape than data behind authentication.
- Personal data and GDPR. Scraping personal data (names, emails, profiles) triggers data-protection regulations. The work may require legal review.
- Rate limits. Even when ToS doesn't prohibit scraping, hammering a site is rude and counterproductive. Stay below rates that affect the site's other users.
- robots.txt. Not legally binding but operationally important. Sites flag aggressive scrapers; ignoring robots.txt invites IP bans and legal escalation.
When in doubt, talk to legal. The cost of a 30-minute consultation is small relative to the cost of a cease-and-desist or a regulatory inquiry.
Architecture
A scalable scraper has separable components:
1. URL Queue
The work the scraper must do. URLs to crawl, with metadata (depth, source, timestamp). The queue:
- Persists across restarts.
- Deduplicates URLs.
- Supports priority (newer URLs first; failed URLs retried later).
- Survives crashes; the in-flight URL doesn't get lost.
Implementations: Redis queues, PostgreSQL with FOR UPDATE SKIP LOCKED, dedicated queue services.
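A minimal sketch of such a queue, using SQLite as a stand-in for the Redis/Postgres implementations above (the `UrlQueue` class, its column names, and the priority convention are illustrative, not from any specific library):

```python
import sqlite3
import time

class UrlQueue:
    """Persistent, deduplicating URL queue. SQLite stands in for
    Redis or Postgres; in Postgres, claim() would use
    SELECT ... FOR UPDATE SKIP LOCKED instead."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS queue (
            url TEXT PRIMARY KEY,                   -- dedup: each URL enqueued once
            priority INTEGER NOT NULL,              -- lower number = served first
            enqueued_at REAL NOT NULL,
            state TEXT NOT NULL DEFAULT 'pending'   -- pending | in_flight | done
        )""")

    def enqueue(self, url, priority=100):
        # INSERT OR IGNORE silently drops duplicate URLs
        self.db.execute(
            "INSERT OR IGNORE INTO queue (url, priority, enqueued_at) VALUES (?, ?, ?)",
            (url, priority, time.time()))
        self.db.commit()

    def claim(self):
        # Take the highest-priority pending URL and mark it in-flight,
        # so a crash mid-fetch doesn't lose it: it stays visible in the table.
        row = self.db.execute(
            "SELECT url FROM queue WHERE state = 'pending' "
            "ORDER BY priority, enqueued_at LIMIT 1").fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE queue SET state = 'in_flight' WHERE url = ?", (row[0],))
        self.db.commit()
        return row[0]

    def complete(self, url):
        self.db.execute("UPDATE queue SET state = 'done' WHERE url = ?", (url,))
        self.db.commit()

    def fail(self, url, retry_priority=1000):
        # Failed URLs return to pending at a lower priority: retried later
        self.db.execute(
            "UPDATE queue SET state = 'pending', priority = ? WHERE url = ?",
            (retry_priority, url))
        self.db.commit()
```

A single-file SQLite database already satisfies the persistence and dedup requirements for small crawls; the schema moves to Postgres unchanged when multiple workers need to claim concurrently.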
2. Fetcher
Pulls URLs from the queue, fetches them, returns raw responses. The fetcher:
- Uses proxies (often a rotating pool).
- Respects rate limits (per-domain, per-proxy).
- Handles retries on transient failures.
- Records timestamps and HTTP status for audit.
The fetcher is decoupled from the parser. Raw responses are stored; reprocessing happens against the stored data, not by re-fetching.
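The retry-and-audit behavior can be sketched as a small wrapper. The `do_fetch` callable and the shape of the audit entries are assumptions for illustration; a real fetcher would plug in an HTTP client and the proxy pool here:

```python
import time

# Status codes treated as transient: worth retrying with backoff
TRANSIENT = {429, 500, 502, 503, 504}

def fetch_with_retries(url, do_fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """do_fetch(url) -> (status_code, body). Retries transient failures
    with exponential backoff and records every attempt for the audit log."""
    attempts = []
    for attempt in range(max_attempts):
        status, body = do_fetch(url)
        attempts.append({"url": url, "status": status, "at": time.time()})
        if status not in TRANSIENT:
            return status, body, attempts
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body, attempts  # gave up; caller re-queues via fail()
```

Injecting `do_fetch` and `sleep` keeps the retry logic testable without a network; the returned attempt log is what gets written to the audit trail.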
3. Parser
Extracts structured data from raw responses. The parser:
- Has site-specific extraction rules.
- Validates output against a schema.
- Reports parsing errors so they can be fixed.
The parser is where most maintenance happens. Sites change their HTML; selectors break; the parser updates. Decoupling parsing from fetching means you can re-parse the entire historical dataset after fixing a bug, without re-fetching.
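That re-parse pass might look like the sketch below: run the fixed parser over stored raw responses, reporting failures instead of swallowing them. The `reparse` helper and its argument shapes are illustrative:

```python
def reparse(raw_store, parse, report_error):
    """Re-run a (possibly fixed) parser over stored raw responses.
    raw_store: iterable of (url, raw_html) pairs from storage.
    parse(raw_html) -> dict, or raises on extraction failure."""
    results, failures = [], []
    for url, html in raw_store:
        try:
            results.append((url, parse(html)))
        except Exception as exc:
            # Parsing errors are reported so they can be fixed, not dropped
            failures.append(url)
            report_error(url, exc)
    return results, failures
```

Because the loop reads from storage rather than the network, a parser fix can be validated against the full historical dataset in minutes.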
4. Storage
Where extracted data lives. Database, blob storage, or both.
For dynamic sites, store both the raw HTML and the parsed output. The raw HTML is the audit trail and the data source for re-parsing. The parsed output is what downstream systems use.
5. Monitoring
What's being scraped, what's failing, what's blocked. The dashboard shows:
- URLs scraped per hour.
- Success rate by site.
- Parsing error rate by site.
- Average fetch time and proxy health.
When a site starts blocking, the dashboard shows it before the data dries up.
Rate Limiting
Per-domain rate limiting is essential. Hammering a site is rude and gets you blocked.
Token bucket per domain:
- Configure per-domain rates (1 req/sec for small sites; 10/sec for large; 0.1/sec for sensitive ones).
- The fetcher acquires a token before each request.
- Tokens replenish over time.
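The per-domain token bucket can be sketched in a few lines; the `DomainRateLimiter` class and its injectable clock are illustrative, not from a specific library:

```python
import time

class DomainRateLimiter:
    """Token bucket per domain. acquire() returns how long the caller
    must wait before sending; 0.0 means a token was available now."""

    def __init__(self, rates, burst=1, clock=time.monotonic):
        self.rates = rates    # domain -> tokens replenished per second
        self.burst = burst    # maximum tokens a domain can accumulate
        self.clock = clock
        self.state = {}       # domain -> (tokens, last_refill_time)

    def acquire(self, domain):
        rate = self.rates[domain]
        tokens, last = self.state.get(domain, (self.burst, self.clock()))
        now = self.clock()
        # Tokens replenish over time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * rate)
        if tokens >= 1.0:
            self.state[domain] = (tokens - 1.0, now)
            return 0.0
        self.state[domain] = (tokens, now)
        return (1.0 - tokens) / rate  # seconds until the next token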
Stagger across the day. Don't run a scraper that hits a site only at noon every day; spread the load across hours.
For sites with stricter limits, work with their official APIs instead of scraping. Many provide free tiers for low-volume use.
Proxy Strategy
For sites that block IP ranges or have strict rate limits, proxies are necessary.
Proxy types:
- Datacenter proxies. Cheap, fast, easily detected by anti-bot systems.
- Residential proxies. Real consumer IPs, expensive, harder to detect.
- Mobile proxies. Cellular IPs, very expensive, hardest to detect.
For most scraping, residential proxies are the right balance. For aggressive anti-bot systems, mobile.
Proxy rotation:
- New IP for each request, or each session, depending on site behavior.
- Health-check proxies; remove ones that fail or are blocked.
- Geographic distribution if the target serves region-specific content.
Proxy services (Bright Data, Smartproxy, Oxylabs) handle the rotation. Their cost is per GB of traffic; budget accordingly.
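Rotation and health-checking reduce to a small bookkeeping problem; the `ProxyPool` sketch below shows the shape (services like those above do this server-side, so the class and its eviction threshold are illustrative):

```python
import random

class ProxyPool:
    """Rotating proxy pool with failure-based eviction."""

    def __init__(self, proxies, max_failures=3):
        self.healthy = list(proxies)
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def pick(self):
        # New IP per request: random choice over the healthy set
        if not self.healthy:
            raise RuntimeError("no healthy proxies left")
        return random.choice(self.healthy)

    def report(self, proxy, ok):
        # Health check: consecutive failures evict; one success resets
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self.healthy:
            self.healthy.remove(proxy)
```

For session-sticky rotation, `pick()` would key the choice on a session ID instead of choosing randomly per request.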
CAPTCHAs
Sites detect bots via CAPTCHAs. Three responses:
- Avoid them. Scrape less aggressively; use better proxies; don't trigger the detection.
- Solve them. Services (2Captcha, Anti-Captcha) solve CAPTCHAs for a fee. Solving adds 30-90 seconds of latency per CAPTCHA.
- Stop. If the site is determined to block automation, accept it. Look for an API or partnership instead.
CAPTCHAs are an anti-bot signal. Hitting them frequently means the scraper has been detected; back off, change strategy.
Evasion
For sites that aggressively detect scrapers:
- User-agent rotation. Use realistic, current user-agent strings.
- Headless browser detection. Use playwright-extra-plugin-stealth or similar to mask automation signals.
- Timing. Add jitter between actions; don't send requests at perfectly regular intervals.
- Header completeness. Real browsers send many headers; scrapers often send few. Match real browser fingerprints.
- TLS fingerprinting. Some sites detect via TLS handshake patterns (JA3 fingerprint). Specialized libraries (curl-impersonate) help.
- Browser fingerprinting. Canvas, WebGL, font enumeration. Hardest to mask; high effort.
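The timing point is the cheapest of these to implement. A sketch of a jittered delay (the function and its parameters are illustrative):

```python
import random

def jittered_delay(base_seconds, jitter_fraction=0.5, rng=random.random):
    """Delay between actions: base plus or minus up to jitter_fraction
    of the base, so requests never fall on a perfectly regular interval."""
    spread = base_seconds * jitter_fraction
    # rng() is uniform in [0, 1); map it to [-spread, +spread]
    return base_seconds + (2 * rng() - 1) * spread
```

With `base_seconds=2.0` the delay varies uniformly between 1 and 3 seconds, which defeats the simplest interval-based detectors.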
Evasion is an arms race. The cost of staying undetected grows over time; reassess whether the data is worth the effort.
Schema Validation
Every parser output is validated against a schema. The schema:
- Specifies expected fields and types.
- Allows missing fields when they're genuinely optional (and the parser flags this).
- Rejects malformed output for human review.
When a site changes its HTML, the parser produces wrong output. Schema validation catches it before the bad data hits storage.
Trends in error rate are the canary. A parser that was 99% successful and is now 60% successful indicates a site change. Investigate; update the parser.
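A schema check need not be elaborate; the sketch below covers the three requirements above in plain Python (the `validate` helper and the example schema are illustrative; real pipelines often use jsonschema or pydantic):

```python
def validate(record, schema):
    """schema: field -> (type, required). Returns (ok, problems);
    records that fail go to human review instead of storage."""
    problems = []
    for field, (ftype, required) in schema.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"missing required field: {field}")
            continue  # genuinely optional field; flagged upstream, not rejected
        if not isinstance(record[field], ftype):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return (not problems), problems

# Hypothetical schema for a product-listing scraper
PRODUCT_SCHEMA = {
    "title": (str, True),
    "price_cents": (int, True),
    "rating": (float, False),
}
```

A price that suddenly arrives as the string `"19.99"` instead of an integer is exactly the kind of silent site change this catches.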
Operational Practices
For production scrapers:
- Health checks. Test that each scraper still runs successfully on a small sample daily.
- Alerting. When a scraper's success rate drops, alert the team.
- Rollback path. Sometimes a parser update is wrong and produces worse output than the old one. Be able to revert.
- Audit log. Who triggered what, when, with what configuration. For legal and operational reasons.
- Cost tracking. Per-site, per-day costs. Some sites are expensive to scrape (lots of proxy traffic, lots of CAPTCHAs); knowing the cost lets you make business-level decisions.
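The alerting rule above can be as simple as a windowed average against a baseline; the `should_alert` helper and its thresholds are illustrative:

```python
def should_alert(history, window=24, baseline=0.95, drop=0.2):
    """history: per-hour success rates for one site, newest last.
    Alert when the recent window average falls more than `drop`
    below the expected baseline."""
    if len(history) < window:
        return False  # not enough data to judge yet
    recent = history[-window:]
    return (sum(recent) / window) < (baseline - drop)
```

The threshold is deliberately loose: transient proxy flakiness should not page anyone, but a parser broken by a site redesign (99% falling to 60%) trips it within a day.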
When to Use APIs Instead
Many sites offer official APIs. Even paid APIs are usually cheaper, more reliable, and more legally defensible than scraping.
Switch to an API when:
- The site has one and the cost is reasonable.
- The data you need is available through the API.
- The terms of service permit the use.
Scraping is for when the API doesn't exist, doesn't expose what you need, or costs more than scraping does. It's a last resort, not a default.
Anti-Patterns
Hardcoded selectors. Brittle CSS or XPath that breaks on every redesign. Use multiple fallback selectors; favor semantic anchors (data attributes, accessibility tree).
No rate limiting. The scraper hammers a site; the IP is banned. Throttle per-domain.
Storing only parsed output. When the parser is buggy, you can't reprocess. Store raw HTML.
No schema validation. Bad data flows downstream silently. Validate output; alert on parsing failures.
No proxy rotation when needed. The IP is blocked; the scraper produces zero data. Use residential proxies for sensitive sites.
Ignoring robots.txt. Even when not legally binding, ignoring it invites bans and legal escalation. Respect it.
Scraping when an API exists. An API is more reliable, more legally defensible, and often cheaper. Default to the API when one is available.