Anti Detection
Ethical techniques for handling CAPTCHAs, rate limiting, and bot detection while scraping responsibly
You are an expert in ethical web scraping practices including handling bot detection, CAPTCHAs, rate limiting, and fingerprinting countermeasures while respecting website terms of service.
Overview
Modern websites employ various bot detection mechanisms: CAPTCHAs, rate limiting, browser fingerprinting, IP reputation systems, and behavioral analysis. This skill covers ethical strategies to make scrapers resilient while respecting website resources and legal boundaries. The goal is reliable data collection, not circumventing security for malicious purposes.
Setup & Configuration
Common libraries for resilient scraping:
# Node.js
npm install puppeteer-extra puppeteer-extra-plugin-stealth
npm install proxy-chain
# Python
pip install undetected-chromedriver
pip install fake-useragent
pip install tenacity # retry logic
Core Patterns
Request header management
import requests
from fake_useragent import UserAgent
ua = UserAgent()
session = requests.Session()
session.headers.update({
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
})
Puppeteer stealth plugin
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// The stealth plugin automatically patches:
// - navigator.webdriver
// - chrome.runtime
// - WebGL vendor/renderer
// - language/platform inconsistencies
// - permission API
Undetected ChromeDriver (Python)
import undetected_chromedriver as uc
options = uc.ChromeOptions()
options.add_argument('--headless=new')
driver = uc.Chrome(options=options)
driver.get('https://example.com')
Rate limiting and polite delays
import time
import random
def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Add a random human-like delay between requests."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
# Minimum-interval rate limiter: spaces requests evenly over time
class RateLimiter:
    def __init__(self, requests_per_second=1.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_second=0.5)  # 1 request every 2 seconds
for url in urls:
    limiter.wait()
    response = session.get(url)
Retry logic with exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_result

def should_back_off(response):
    # Back off on both 429 (Too Many Requests) and 503 (Service Unavailable)
    return response.status_code in (429, 503)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_result(should_back_off),
)
def fetch_with_retry(url, session):
    return session.get(url, timeout=15)
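Some servers also send a Retry-After header on 429/503 responses; honoring it is politer than a blind backoff schedule. A minimal sketch, assuming a plain loop instead of tenacity (the fetch_politely helper name and 5-attempt cap are illustrative, not part of any library):

```python
import time

def fetch_politely(url, session, max_attempts=5):
    """Fetch with backoff on 429/503, honoring Retry-After when present."""
    for attempt in range(max_attempts):
        response = session.get(url, timeout=15)
        if response.status_code not in (429, 503):
            return response
        # Retry-After is usually an integer number of seconds; fall back
        # to exponential backoff when it is absent or not numeric.
        retry_after = response.headers.get('Retry-After', '')
        delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(delay)
    return response
```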
Proxy rotation
import itertools
proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
proxy_cycle = itertools.cycle(proxies_list)
def fetch_with_proxy(url, session):
    proxy = next(proxy_cycle)
    return session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=15)
Handling cookies and sessions properly
session = requests.Session()
# First visit the homepage to get initial cookies
session.get('https://example.com')
# Now requests carry the session cookies naturally
response = session.get('https://example.com/data')
Respecting robots.txt programmatically
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url, user_agent='*'):
    parsed = urlparse(url)
    robots_url = f'{parsed.scheme}://{parsed.netloc}/robots.txt'
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Check before scraping
if can_fetch('https://example.com/products'):
    response = session.get('https://example.com/products')
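RobotFileParser can also report a site's Crawl-delay directive via its crawl_delay() method. A small sketch; the 1-second default fallback is an assumption of this example, not anything robots.txt mandates:

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_for(rp, user_agent='*', default=1.0):
    """Return the Crawl-delay robots.txt declares for this agent,
    falling back to a default when none is declared."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

# Usage: after rp.set_url(...) and rp.read(), sleep crawl_delay_for(rp)
# seconds between requests to that host.
```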
Fingerprint randomization in Playwright
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  geolocation: { latitude: 40.7128, longitude: -74.0060 },
  permissions: ['geolocation'],
});
Best Practices
- Always check `robots.txt` and Terms of Service before scraping. Respect `Crawl-delay` directives and disallowed paths.
- Identify your scraper with a descriptive User-Agent string that includes contact information when scraping at scale: `MyScraper/1.0 (+https://mysite.com/bot)`.
- Implement exponential backoff on 429 (Too Many Requests) and 503 (Service Unavailable) responses. Never retry aggressively.
- Randomize request timing. Uniform delays are detectable. Add random jitter to appear more human-like.
- Rotate User-Agent strings across requests, but keep them consistent within a single session/context.
- Cache responses locally to avoid re-fetching pages you have already scraped. Use SQLite or a file-based cache.
- Prefer APIs when available. Many sites offer public or authenticated APIs that are faster, more reliable, and explicitly permitted.
- Monitor your impact. If a site starts returning errors or slowing down, reduce your request rate.
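The local-caching advice above can be sketched with a small SQLite-backed store (the `pages` table name and URL-to-body schema are illustrative choices for this example):

```python
import sqlite3

class ResponseCache:
    """A minimal URL -> body cache backed by SQLite."""

    def __init__(self, path='scrape_cache.db'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)'
        )

    def get(self, url):
        row = self.conn.execute(
            'SELECT body FROM pages WHERE url = ?', (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url, body):
        self.conn.execute(
            'INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)', (url, body)
        )
        self.conn.commit()

# Usage: check cache.get(url) before fetching; call cache.put(url, response.text)
# after a successful fetch so re-runs skip the network entirely.
```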
Common Pitfalls
- Ignoring legal boundaries. Scraping behind a login, circumventing access controls, or violating ToS can have legal consequences. Always consult legal guidance for commercial scraping.
- Using a single IP for high-volume requests. This leads to quick IP bans. Use rotating residential proxies for large-scale operations, or reduce volume.
- Inconsistent fingerprints. Setting a Windows User-Agent but having a Linux-based `navigator.platform` is a red flag. Ensure all browser properties are consistent.
- Solving CAPTCHAs at scale with third-party services. This is ethically questionable and often violates ToS. Consider whether the data is available through other means (API, data partnerships, public datasets).
- Over-engineering detection evasion. Many sites do not employ advanced detection. Start with simple polite scraping (proper headers, rate limits) before adding complexity.
- Not handling session expiry. Long-running scrapers may have sessions expire mid-crawl. Implement session refresh logic and monitor for authentication redirects.
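The session-expiry pitfall can be handled with a thin wrapper that watches for a login redirect and re-authenticates once. A sketch under stated assumptions: the `/login` URL marker and the `login()` callback are placeholders for site-specific logic, not a general-purpose API:

```python
def fetch_with_session_refresh(url, session, login, login_marker='/login'):
    """Fetch a URL; if the response landed on a login page,
    re-authenticate and retry once."""
    response = session.get(url, timeout=15)
    if login_marker in response.url:
        login(session)  # site-specific re-authentication hook
        response = session.get(url, timeout=15)
    return response
```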
Anti-Patterns
Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
Install this skill directly: skilldb add web-scraping-skills
Related Skills
Beautifulsoup
HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
Cheerio
Fast server-side HTML parsing and data extraction with Cheerio using jQuery-like syntax
Data Pipeline
Patterns for building robust scraping data pipelines with validation, deduplication, storage, and monitoring
Playwright Scraping
Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
Puppeteer
Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages
Scrapy
Production-grade web scraping framework in Python with built-in crawling, pipelines, and middleware