Playwright Scraping
Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
Core Philosophy
You are an expert in using Playwright for web scraping and automated data extraction across Chromium, Firefox, and WebKit browsers.
Overview
Playwright is a browser automation library from Microsoft that supports Chromium, Firefox, and WebKit with a single API. It offers auto-waiting, network interception, and multiple browser contexts, making it well-suited for robust scraping of modern web applications.
Setup & Configuration
npm install playwright
npx playwright install # downloads browser binaries
npx playwright install chromium # or install just one browser
For Python:
pip install playwright
playwright install
Basic launch (Node.js):
const { chromium } = require('playwright');

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1280, height: 800 },
  locale: 'en-US',
});
const page = await context.newPage();
Core Patterns
Extracting structured data
await page.goto('https://example.com/listings', { waitUntil: 'domcontentloaded' });

const items = await page.$$eval('.listing-card', cards =>
  cards.map(card => ({
    title: card.querySelector('h2')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    link: card.querySelector('a')?.href,
  }))
);
Auto-waiting and locators
Playwright auto-waits for elements before acting on them:
// Clicks once the button is visible and stable
await page.locator('button:has-text("Load More")').click();
// Wait for a specific element to appear
await page.locator('.results-container').waitFor({ state: 'visible' });
// Extract text from a locator
const heading = await page.locator('h1').textContent();
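The same structured extraction shown earlier with $$eval can stay entirely in the locator paradigm. A sketch, assuming the same placeholder selectors as the listings example:

```javascript
// Locator-only extraction of structured data, one card at a time.
// '.listing-card', 'h2', and '.price' are placeholder selectors.
async function extractListings(page) {
  const cards = page.locator('.listing-card');
  const count = await cards.count();
  const items = [];
  for (let i = 0; i < count; i++) {
    const card = cards.nth(i);
    items.push({
      title: (await card.locator('h2').textContent())?.trim(),
      price: (await card.locator('.price').textContent())?.trim(),
    });
  }
  return items;
}
```

This is slower than a single $$eval round-trip for large pages, but each locator call benefits from auto-waiting and clearer error messages.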
Pagination
async function scrapeAllPages(page) {
  const allItems = [];
  while (true) {
    const items = await page.$$eval('.item', els =>
      els.map(el => el.textContent.trim())
    );
    allItems.push(...items);

    // Assumes the next-page link is removed or hidden on the last page;
    // if it stays visible but disabled, check that state too to avoid looping.
    const nextBtn = page.locator('a.next-page');
    if (await nextBtn.isVisible()) {
      await nextBtn.click();
      await page.waitForLoadState('networkidle');
    } else {
      break;
    }
  }
  return allItems;
}
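When networkidle is too blunt (e.g. long-polling keeps the network busy), one alternative is to wait until the page content actually changes after clicking next. A sketch using the same placeholder selectors, with goToNextPage as a hypothetical helper name:

```javascript
// Click the next-page link, then wait until the first '.item' differs
// from what was there before the click.
async function goToNextPage(page) {
  const firstItemBefore = await page.locator('.item').first().textContent();
  await page.locator('a.next-page').click();
  await page.waitForFunction(
    prev => document.querySelector('.item')?.textContent !== prev,
    firstItemBefore
  );
}
```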
Network interception and route handling
// Block images and CSS for faster scraping
await context.route('**/*.{png,jpg,jpeg,gif,css,woff,woff2}', route => route.abort());
// Capture API responses
page.on('response', async response => {
if (response.url().includes('/api/products')) {
const json = await response.json();
console.log('Intercepted API data:', json);
}
});
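Instead of listening to every response, you can wait for one specific API call triggered by an action. A sketch, where the URL fragment, selector, and helper name are placeholders:

```javascript
// Start waiting for the response BEFORE clicking, so the response
// cannot arrive between the click and the listener being attached.
async function clickAndCaptureJson(page, urlFragment, buttonSelector) {
  const [response] = await Promise.all([
    page.waitForResponse(r => r.url().includes(urlFragment) && r.ok()),
    page.locator(buttonSelector).click(),
  ]);
  return response.json();
}
```

Usage might look like `await clickAndCaptureJson(page, '/api/products', 'button.load-more')`.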
Browser contexts for isolation
// Each context has its own cookies, storage, and cache
const context1 = await browser.newContext();
const context2 = await browser.newContext();
// Useful for scraping with different sessions simultaneously
const page1 = await context1.newPage();
const page2 = await context2.newPage();
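Putting contexts and the concurrency limit together, parallel scraping can be sketched as batches of short-lived contexts. The chunk helper and scrapeAll name are illustrative, and the default of 5 follows the concurrency guidance below:

```javascript
// Split an array into fixed-size batches.
function chunk(arr, size) {
  const out = [];
  for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
  return out;
}

// Scrape many URLs, never holding more than maxContexts contexts open.
async function scrapeAll(browser, urls, maxContexts = 5) {
  const results = [];
  for (const batch of chunk(urls, maxContexts)) {
    const batchResults = await Promise.all(batch.map(async url => {
      const context = await browser.newContext();
      try {
        const page = await context.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        return await page.title();
      } finally {
        await context.close(); // always release the context's memory
      }
    }));
    results.push(...batchResults);
  }
  return results;
}
```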
Python example
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    titles = page.locator("h2.title").all_text_contents()
    print(titles)
    browser.close()
Best Practices
- Use browser contexts instead of separate browser instances for parallel scraping. Contexts are lightweight and share the browser process.
- Rely on auto-waiting via locators instead of explicit waitForTimeout calls. Playwright waits for elements to be actionable automatically.
- Intercept and abort unnecessary resources (images, fonts, analytics) to speed up page loads.
- Use domcontentloaded for waitUntil when the data you need is in the initial HTML; use networkidle only when you must wait for async API calls.
- Persist storage state with context.storageState() to reuse login sessions across runs without re-authenticating.
- Set realistic browser fingerprints (locale, timezone, viewport, user agent) to reduce bot detection.
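The storage-state practice above can be sketched as a pair of helpers. The helper names, the 'state.json' path, and the loginFlow callback are all placeholders for your own setup:

```javascript
// First run: perform the login, then persist cookies + localStorage to disk.
async function saveSession(browser, loginFlow, path = 'state.json') {
  const context = await browser.newContext();
  const page = await context.newPage();
  await loginFlow(page);                  // e.g. fill credentials and submit
  await context.storageState({ path });   // write session state to disk
  await context.close();
}

// Later runs: the new context starts already authenticated.
async function restoreSession(browser, path = 'state.json') {
  return browser.newContext({ storageState: path });
}
```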
Common Pitfalls
- Forgetting to close contexts and browsers. Always call context.close() and browser.close() in a finally block to prevent zombie processes.
- Using page.waitForTimeout as a crutch. Fixed delays are fragile; wait for specific selectors, network responses, or load states instead.
- Mixing $$eval with locators. Stick to one paradigm per project. Locators are preferred in newer Playwright versions for their auto-waiting behavior.
- Not handling navigation triggered by clicks. If a click navigates the page, use await Promise.all([page.waitForNavigation(), element.click()]) or rely on locator auto-waiting after navigation.
- Running too many contexts at once. Each context consumes memory. Limit concurrency to 5-10 contexts per browser instance depending on system resources.
- Ignoring --disable-gpu in CI environments. Headless browsers in containers may fail without this flag.
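The cleanup pitfall above suggests a small wrapper so teardown can never be forgotten. A sketch (withBrowser is a hypothetical helper; the require is placed inside it so the definition stands alone):

```javascript
// Run a scraping callback with a fresh browser + context, guaranteeing
// cleanup even if the callback throws.
async function withBrowser(fn) {
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  try {
    return await fn(await context.newPage());
  } finally {
    await context.close(); // releases pages, cookies, cache
    await browser.close(); // prevents zombie browser processes
  }
}
```

Usage might be `const title = await withBrowser(async page => { await page.goto('https://example.com'); return page.title(); });`.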
Anti-Patterns
Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.