Anti Detection
Ethical techniques for handling CAPTCHAs, rate limiting, and bot detection while scraping responsibly
You are an expert in ethical web scraping practices including handling bot detection, CAPTCHAs, rate limiting, and fingerprinting countermeasures while respecting website terms of service.
Overview
Modern websites employ various bot detection mechanisms: CAPTCHAs, rate limiting, browser fingerprinting, IP reputation systems, and behavioral analysis. This skill covers ethical strategies to make scrapers resilient while respecting website resources and legal boundaries. The goal is reliable data collection, not circumventing security for malicious purposes.
Setup & Configuration
Common libraries for resilient scraping:
# Node.js
npm install puppeteer-extra puppeteer-extra-plugin-stealth
npm install proxy-chain
# Python
pip install undetected-chromedriver
pip install fake-useragent
pip install tenacity # retry logic
Core Patterns
Request header management
import requests
from fake_useragent import UserAgent
ua = UserAgent()
session = requests.Session()
session.headers.update({
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
})
Puppeteer stealth plugin
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// The stealth plugin automatically patches:
// - navigator.webdriver
// - chrome.runtime
// - WebGL vendor/renderer
// - language/platform inconsistencies
// - permission API
Undetected ChromeDriver (Python)
import undetected_chromedriver as uc
options = uc.ChromeOptions()
options.add_argument('--headless=new')
driver = uc.Chrome(options=options)
driver.get('https://example.com')
Rate limiting and polite delays
import time
import random
def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Add a random human-like delay between requests."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
# Minimum-interval rate limiter: spaces requests evenly over time
class RateLimiter:
    def __init__(self, requests_per_second=1.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_second=0.5)  # 1 request every 2 seconds
for url in urls:
    limiter.wait()
    response = session.get(url)
Retry logic with exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_result

def should_back_off(response):
    # Back off on both 429 (Too Many Requests) and 503 (Service Unavailable)
    return response.status_code in (429, 503)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    retry=retry_if_result(should_back_off),
)
def fetch_with_retry(url, session):
    return session.get(url, timeout=15)
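Some servers also send a Retry-After header on 429/503 responses; honoring it is politer than a blind backoff schedule. A minimal sketch, assuming a plain loop instead of tenacity (the fetch_politely helper name and 5-attempt cap are illustrative, not part of any library):

```python
import time

def fetch_politely(url, session, max_attempts=5):
    """Fetch with backoff on 429/503, honoring Retry-After when present."""
    for attempt in range(max_attempts):
        response = session.get(url, timeout=15)
        if response.status_code not in (429, 503):
            return response
        # Retry-After is usually an integer number of seconds; fall back
        # to exponential backoff when it is absent or not numeric.
        retry_after = response.headers.get('Retry-After', '')
        delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(delay)
    return response
```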
Proxy rotation
import itertools
proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
proxy_cycle = itertools.cycle(proxies_list)
def fetch_with_proxy(url, session):
    proxy = next(proxy_cycle)
    return session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=15)
Handling cookies and sessions properly
session = requests.Session()
# First visit the homepage to get initial cookies
session.get('https://example.com')
# Now requests carry the session cookies naturally
response = session.get('https://example.com/data')
Respecting robots.txt programmatically
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url, user_agent='*'):
    parsed = urlparse(url)
    robots_url = f'{parsed.scheme}://{parsed.netloc}/robots.txt'
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Check before scraping
if can_fetch('https://example.com/products'):
    response = session.get('https://example.com/products')
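RobotFileParser can also report a site's Crawl-delay directive via its crawl_delay() method. A small sketch; the 1-second default fallback is an assumption of this example, not anything robots.txt mandates:

```python
from urllib.robotparser import RobotFileParser

def crawl_delay_for(rp, user_agent='*', default=1.0):
    """Return the Crawl-delay robots.txt declares for this agent,
    falling back to a default when none is declared."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

# Usage: after rp.set_url(...) and rp.read(), sleep crawl_delay_for(rp)
# seconds between requests to that host.
```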
Fingerprint randomization in Playwright
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  geolocation: { latitude: 40.7128, longitude: -74.0060 },
  permissions: ['geolocation'],
});
Best Practices
- Always check `robots.txt` and Terms of Service before scraping. Respect `Crawl-delay` directives and disallowed paths.
- Identify your scraper with a descriptive User-Agent string that includes contact information when scraping at scale: `MyScraper/1.0 (+https://mysite.com/bot)`.
- Implement exponential backoff on 429 (Too Many Requests) and 503 (Service Unavailable) responses. Never retry aggressively.
- Randomize request timing. Uniform delays are detectable. Add random jitter to appear more human-like.
- Rotate User-Agent strings across requests, but keep them consistent within a single session/context.
- Cache responses locally to avoid re-fetching pages you have already scraped. Use SQLite or a file-based cache.
- Prefer APIs when available. Many sites offer public or authenticated APIs that are faster, more reliable, and explicitly permitted.
- Monitor your impact. If a site starts returning errors or slowing down, reduce your request rate.
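The local-caching advice above can be sketched with a small SQLite-backed store (the `pages` table name and URL-to-body schema are illustrative choices for this example):

```python
import sqlite3

class ResponseCache:
    """A minimal URL -> body cache backed by SQLite."""

    def __init__(self, path='scrape_cache.db'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)'
        )

    def get(self, url):
        row = self.conn.execute(
            'SELECT body FROM pages WHERE url = ?', (url,)
        ).fetchone()
        return row[0] if row else None

    def put(self, url, body):
        self.conn.execute(
            'INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)', (url, body)
        )
        self.conn.commit()

# Usage: check cache.get(url) before fetching; call cache.put(url, response.text)
# after a successful fetch so re-runs skip the network entirely.
```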
Common Pitfalls
- Ignoring legal boundaries. Scraping behind a login, circumventing access controls, or violating ToS can have legal consequences. Always consult legal guidance for commercial scraping.
- Using a single IP for high-volume requests. This leads to quick IP bans. Use rotating residential proxies for large-scale operations, or reduce volume.
- Inconsistent fingerprints. Setting a Windows User-Agent but having a Linux-based `navigator.platform` is a red flag. Ensure all browser properties are consistent.
- Solving CAPTCHAs at scale with third-party services. This is ethically questionable and often violates ToS. Consider whether the data is available through other means (API, data partnerships, public datasets).
- Over-engineering detection evasion. Many sites do not employ advanced detection. Start with simple polite scraping (proper headers, rate limits) before adding complexity.
- Not handling session expiry. Long-running scrapers may have sessions expire mid-crawl. Implement session refresh logic and monitor for authentication redirects.
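The session-expiry pitfall can be handled with a thin wrapper that watches for a login redirect and re-authenticates once. A sketch under stated assumptions: the `/login` URL marker and the `login()` callback are placeholders for site-specific logic, not a general-purpose API:

```python
def fetch_with_session_refresh(url, session, login, login_marker='/login'):
    """Fetch a URL; if the response landed on a login page,
    re-authenticate and retry once."""
    response = session.get(url, timeout=15)
    if login_marker in response.url:
        login(session)  # site-specific re-authentication hook
        response = session.get(url, timeout=15)
    return response
```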
Anti-Patterns
Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
Install this skill directly: skilldb add web-scraping-skills
Related Skills
Beautifulsoup
HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
Cheerio
Fast server-side HTML parsing and data extraction with Cheerio using jQuery-like syntax
Data Pipeline
Patterns for building robust scraping data pipelines with validation, deduplication, storage, and monitoring
Playwright Scraping
Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
Puppeteer
Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages
Scrapy
Production-grade web scraping framework in Python with built-in crawling, pipelines, and middleware