
Playwright Scraping

Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit


Playwright — Web Scraping

You are an expert in using Playwright for web scraping and automated data extraction across Chromium, Firefox, and WebKit browsers.

Overview

Playwright is a browser automation library from Microsoft that supports Chromium, Firefox, and WebKit with a single API. It offers auto-waiting, network interception, and multiple browser contexts, making it well-suited for robust scraping of modern web applications.

Setup & Configuration

npm install playwright
npx playwright install          # downloads browser binaries
npx playwright install chromium # or install just one browser

For Python:

pip install playwright
playwright install

Basic launch (Node.js):

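const { chromium } = require('playwright');

// CommonJS has no top-level await, so run inside an async function.
(async () => {
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      viewport: { width: 1280, height: 800 },
      locale: 'en-US',
    });
    const page = await context.newPage();
    // ... scraping code from the sections below goes here ...
  } finally {
    await browser.close(); // closing the browser also closes its contexts
  }
})();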

Core Patterns

Extracting structured data

await page.goto('https://example.com/listings', { waitUntil: 'domcontentloaded' });

const items = await page.$$eval('.listing-card', cards =>
  cards.map(card => ({
    title: card.querySelector('h2')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    link: card.querySelector('a')?.href,
  }))
);
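
The same extraction can also be written with locators. The following is a minimal sketch rather than a definitive implementation: it assumes the same illustrative .listing-card markup, and extractCards and scrapeListings are hypothetical names, not part of Playwright.

```javascript
// Runs inside the browser. It must be self-contained because Playwright
// serializes the function and cannot see Node-side variables.
function extractCards(cards) {
  return cards.map(card => ({
    title: card.querySelector('h2')?.textContent?.trim() ?? null,
    price: card.querySelector('.price')?.textContent?.trim() ?? null,
    link: card.querySelector('a')?.href ?? null,
  }));
}

// Hypothetical helper: waits for the first card, then extracts every card.
async function scrapeListings(page, url) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const cards = page.locator('.listing-card');
  // waitFor throws after the default timeout if no card ever appears.
  await cards.first().waitFor({ state: 'attached' });
  return cards.evaluateAll(extractCards);
}
```

Missing fields come back as null instead of undefined, which tends to survive JSON serialization more predictably.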

Auto-waiting and locators

Playwright auto-waits for elements before acting on them:

// Clicks once the button is visible, stable, and enabled
await page.locator('button:has-text("Load More")').click();

// Wait for a specific element to appear
await page.locator('.results-container').waitFor({ state: 'visible' });

// Extract text from a locator
const heading = await page.locator('h1').textContent();
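
When a click triggers an API call, waiting for that specific response is usually more reliable than a fixed delay. A minimal sketch, assuming a hypothetical /api/products endpoint and the "Load More" button from the example above; isProductsResponse and loadMoreAndCapture are illustrative names.

```javascript
// Predicate for the response we care about. The '/api/products' path is a
// placeholder assumption; substitute the endpoint the target page calls.
function isProductsResponse(response) {
  return response.url().includes('/api/products') && response.ok();
}

// Hypothetical helper: clicks "Load More" and returns the JSON it triggers.
async function loadMoreAndCapture(page) {
  // Register the wait before clicking so a fast response is not missed.
  const responsePromise = page.waitForResponse(isProductsResponse);
  await page.locator('button:has-text("Load More")').click();
  const response = await responsePromise;
  return response.json();
}
```

Starting the wait before the click is the important design choice here: registering it afterwards can miss a response that arrives quickly.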

Pagination

async function scrapeAllPages(page) {
  const allItems = [];

  while (true) {
    const items = await page.$$eval('.item', els =>
      els.map(el => el.textContent.trim())
    );
    allItems.push(...items);

    const nextBtn = page.locator('a.next-page');
    if (await nextBtn.isVisible()) {
      await nextBtn.click();
      await page.waitForLoadState('networkidle');
    } else {
      break;
    }
  }

  return allItems;
}
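
The loop above runs until the next link disappears, which can spin forever on sites that always render one. A bounded, locator-only variant might look like the sketch below. It assumes each click on the next link changes the URL; scrapePages and the maxPages option are hypothetical names introduced for illustration.

```javascript
// Hypothetical bounded variant of scrapeAllPages. Assumes each click on the
// next link changes the URL; '.item' and 'a.next-page' are the illustrative
// selectors used in this section.
async function scrapePages(page, { maxPages = 50 } = {}) {
  const allItems = [];
  for (let pageNo = 1; ; pageNo++) {
    const texts = await page.locator('.item').allTextContents();
    allItems.push(...texts.map(text => text.trim()));

    if (pageNo >= maxPages) break; // hard stop against endless pagination

    const nextBtn = page.locator('a.next-page').first();
    // count() does not wait, so a missing link ends the loop immediately.
    if ((await nextBtn.count()) === 0) break;

    const before = page.url();
    await nextBtn.click();
    // Wait until the address actually changes before reading the next page.
    await page.waitForURL(url => url.href !== before);
  }
  return allItems;
}
```

For pagination that swaps content without changing the URL, the waitForURL step would need to be replaced by a wait on a response or on a changed element.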

Network interception and route handling

// Block images, stylesheets, and fonts for faster scraping
await context.route('**/*.{png,jpg,jpeg,gif,css,woff,woff2}', route => route.abort());

// Capture API responses
page.on('response', async response => {
  if (response.url().includes('/api/products')) {
    const json = await response.json();
    console.log('Intercepted API data:', json);
  }
});
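
Extension-based globs miss resources served without a file extension. One alternative is to decide per request using the resource type that Playwright reports. This is a sketch under stated assumptions: the blocked types and the analytics hosts are example values to tune per site, and shouldBlock and blockHeavyResources are hypothetical helper names.

```javascript
// Resource types to skip; this list is an assumption to tune per site.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);
// Example analytics hosts, for illustration only.
const BLOCKED_HOSTS = ['google-analytics.com', 'googletagmanager.com'];

// Pure decision function, kept separate so it can be tested without a browser.
function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  const host = new URL(url).hostname;
  return BLOCKED_HOSTS.some(h => host === h || host.endsWith('.' + h));
}

// Installs one catch-all route on the context and aborts unwanted requests.
async function blockHeavyResources(context) {
  await context.route('**/*', route => {
    const request = route.request();
    return shouldBlock(request.resourceType(), request.url())
      ? route.abort()
      : route.continue();
  });
}
```

Call blockHeavyResources(context) once, before opening pages, so that every page in the context inherits the rule.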

Browser contexts for isolation

// Each context has its own cookies, storage, and cache
const context1 = await browser.newContext();
const context2 = await browser.newContext();

// Useful for scraping with different sessions simultaneously
const page1 = await context1.newPage();
const page2 = await context2.newPage();
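
To scrape many URLs in parallel without opening an unbounded number of contexts, a small concurrency limiter can wrap the per-URL work. The sketch below is one possible shape, not a Playwright feature: mapWithLimit and scrapeTitles are hypothetical helpers, and the default limit of 5 is an assumption.

```javascript
// Runs worker(item) for every item with at most `limit` running at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Hypothetical usage: one isolated context per URL, always closed afterwards.
async function scrapeTitles(browser, urls, limit = 5) {
  return mapWithLimit(urls, limit, async url => {
    const context = await browser.newContext();
    try {
      const page = await context.newPage();
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return await page.title();
    } finally {
      await context.close();
    }
  });
}
```

Results keep the order of the input URLs. If any worker throws, the returned promise rejects, so wrap the worker body in its own try/catch when partial results are acceptable.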

Python example

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    titles = page.locator("h2.title").all_text_contents()
    print(titles)

    browser.close()

Best Practices

  • Use browser contexts instead of separate browser instances for parallel scraping. Contexts are lightweight and share the browser process.
  • Rely on auto-waiting via locators instead of explicit waitForTimeout calls. Playwright waits for elements to be actionable automatically.
  • Intercept and abort unnecessary resources (images, fonts, analytics) to speed up page loads.
  • Use domcontentloaded for waitUntil when the data you need is in the initial HTML; use networkidle only as a last resort for content loaded by async API calls, since pages with polling or long-lived connections may never become idle.
  • Persist storage state with context.storageState() to reuse login sessions across runs without re-authenticating.
  • Set realistic browser fingerprints — locale, timezone, viewport, user agent — to reduce bot detection.
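
The storage-state and fingerprint practices above can be combined in one place. A minimal sketch, assuming a caller-supplied login function and a state file path; contextOptions, openSession, and the specific locale, timezone, and viewport values are illustrative assumptions.

```javascript
const fs = require('fs');

// Builds context options; reuses saved cookies and localStorage when present.
function contextOptions(statePath) {
  const base = {
    locale: 'en-US',
    timezoneId: 'America/New_York', // example value; match your target audience
    viewport: { width: 1280, height: 800 },
  };
  return fs.existsSync(statePath) ? { ...base, storageState: statePath } : base;
}

// Hypothetical helper: logs in once, then reuses the saved session on later runs.
async function openSession(browser, statePath, login) {
  const hadState = fs.existsSync(statePath);
  const context = await browser.newContext(contextOptions(statePath));
  if (!hadState) {
    const page = await context.newPage();
    await login(page); // caller-supplied login steps
    await page.close();
    await context.storageState({ path: statePath }); // persist for the next run
  }
  return context;
}
```

Saved state files contain session cookies, so they are best kept out of version control and deleted when a session expires.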

Common Pitfalls

  • Forgetting to close contexts and browsers. Always call context.close() and browser.close() in a finally block to prevent zombie processes.
  • Using page.waitForTimeout as a crutch. Fixed delays are fragile. Wait for specific selectors, network responses, or load states instead.
  • Mixing $$eval with locators. Stick to one paradigm per project. Locators are preferred in newer Playwright versions for their auto-waiting behavior.
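  • Not handling navigation triggered by clicks. If a click navigates the page, wait for the destination explicitly, for example with await page.waitForURL(pattern) after element.click(), or rely on locator auto-waiting on the new page. The older await Promise.all([page.waitForNavigation(), element.click()]) pattern still works, but the Playwright docs mark waitForNavigation as deprecated because it is prone to races.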
  • Running too many contexts at once. Each context consumes memory. Limit concurrency to 5-10 contexts per browser instance depending on system resources.
  • Ignoring --disable-gpu in CI environments. Headless Chromium in some containers may crash or fail to start without this flag; it can be passed with chromium.launch({ args: ['--disable-gpu'] }).
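
The cleanup pitfall at the top of this list can be avoided by routing every scrape through one wrapper that closes resources in finally blocks. A minimal sketch; withPage is a hypothetical helper name, and the default launch options are an assumption.

```javascript
// Hypothetical helper: guarantees cleanup even when the scrape throws.
// browserType is a Playwright browser type such as chromium, firefox, or webkit.
async function withPage(browserType, task, launchOptions = { headless: true }) {
  const browser = await browserType.launch(launchOptions);
  try {
    const context = await browser.newContext();
    try {
      // Hand a fresh page to the caller and return whatever the task returns.
      return await task(await context.newPage());
    } finally {
      await context.close();
    }
  } finally {
    await browser.close();
  }
}
```

A call such as withPage(chromium, page => page.title()), with chromium imported from the playwright package, then closes the context and browser whether the task resolves or throws.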

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
