
Puppeteer

Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages


Puppeteer — Web Scraping

You are an expert in Puppeteer for web scraping, browser automation, and data extraction from JavaScript-heavy websites.

Overview

Puppeteer is a Node.js library that provides a high-level API to control headless (or full) Chrome/Chromium browsers. It excels at scraping single-page applications and sites that rely on client-side rendering, where simple HTTP requests cannot retrieve the final DOM.

Setup & Configuration

Install Puppeteer (bundles a compatible Chromium binary):

npm install puppeteer

For a lighter install that uses an existing Chrome installation:

npm install puppeteer-core

Basic launch configuration:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  headless: 'new',          // use the new headless mode
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',  // prevents crashes in Docker
  ],
  defaultViewport: { width: 1280, height: 800 },
});

Core Patterns

Simple page scrape

async function scrapePage(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();

    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    );

    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    return await page.evaluate(() => {
      const title = document.querySelector('h1')?.textContent?.trim();
      const items = [...document.querySelectorAll('.product-card')].map(el => ({
        name: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }));
      return { title, items };
    });
  } finally {
    await browser.close();  // runs even if navigation or evaluate throws
  }
}
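Launching a fresh browser per URL is the expensive part; the Best Practices section below recommends reusing one browser across scrapes. A minimal sketch of that pattern (`scrapeAll` and the `page.title()` extraction are illustrative, not a fixed API):

```javascript
// One launch, many pages: each URL gets its own tab that is always closed.
async function scrapeAll(urls) {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const results = [];
    for (const url of urls) {
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'networkidle2' });
        results.push({ url, title: await page.title() });
      } finally {
        await page.close(); // release the tab even if this URL failed
      }
    }
    return results;
  } finally {
    await browser.close();
  }
}
```

The URLs are processed sequentially here for clarity; see the concurrency note under Best Practices for running several pages in parallel.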

Waiting for dynamic content

// Wait for a specific selector to appear
await page.waitForSelector('.results-loaded', { timeout: 10000 });

// Wait for a network request to complete
await page.waitForResponse(
  res => res.url().includes('/api/data') && res.status() === 200
);

// Wait for navigation after a click
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle0' }),
  page.click('button.load-more'),
]);
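When no single selector or response marks readiness, `page.waitForFunction` re-evaluates a predicate inside the page until it returns truthy. A sketch, where the `.result-card` selector and the threshold are placeholders:

```javascript
// The predicate runs in the browser context, so it must not close over
// Node.js variables; extra values cross the boundary as arguments.
const hasEnoughResults = (min) =>
  document.querySelectorAll('.result-card').length >= min;

// Poll until at least `min` cards exist, or time out after 10s.
async function waitForResults(page, min = 10) {
  await page.waitForFunction(hasEnoughResults, { timeout: 10000 }, min);
}
```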

Infinite scroll handling

async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let totalHeight = 0;
      const distance = 400;
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}
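On a true infinite feed, `document.body.scrollHeight` keeps growing as new content loads, so the loop above may never terminate. A bounded variant is safer; the default cap of 50 scroll steps here is an arbitrary illustration:

```javascript
async function autoScrollBounded(page, maxSteps = 50) {
  await page.evaluate(async (maxSteps) => {
    await new Promise(resolve => {
      let steps = 0;
      const distance = 400;
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        steps += 1;
        // Stop at the bottom OR after maxSteps, whichever comes first.
        if (steps >= maxSteps ||
            window.innerHeight + window.scrollY >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  }, maxSteps);
}
```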

Intercepting network requests

await page.setRequestInterception(true);
page.on('request', req => {
  // Every intercepted request must be aborted or continued exactly once,
  // otherwise the page hangs waiting for a resolution.
  const blocked = ['image', 'stylesheet', 'font'];
  if (blocked.includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});

Taking screenshots and PDFs

await page.screenshot({ path: 'page.png', fullPage: true });
await page.pdf({ path: 'page.pdf', format: 'A4' });  // PDF generation only works in headless mode

Best Practices

  • Reuse browser instances across multiple page scrapes instead of launching a new browser each time. Create pages with browser.newPage() and close them when done.
  • Block unnecessary resources (images, fonts, CSS) via request interception to speed up scrapes significantly.
  • Set realistic user agents and viewport sizes to reduce detection.
  • Use waitUntil: 'networkidle2' for most scraping scenarios; it waits until there are no more than 2 network connections for 500ms.
  • Handle errors and timeouts gracefully with try/catch and configurable timeout values.
  • Run in Docker using the node:slim image with the --no-sandbox flag and --disable-dev-shm-usage to avoid shared-memory issues.
  • Limit concurrency — open a bounded number of pages at once to avoid memory exhaustion.
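The "limit concurrency" advice above can be sketched as a small worker pool. `withConcurrency` is a hypothetical helper, not part of Puppeteer's API:

```javascript
// Run async tasks over `items` with at most `limit` in flight at once.
async function withConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++;           // claim the next index (synchronous, no race)
      results[i] = await worker(items[i]);
    }
  }
  // Start `limit` workers that all pull from the shared queue.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

// Usage sketch: scrape many URLs with at most 3 open pages on one browser.
// await withConcurrency(urls, 3, async url => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url, { waitUntil: 'networkidle2' });
//     return await page.title();
//   } finally {
//     await page.close();
//   }
// });
```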

Common Pitfalls

  • Memory leaks from unclosed pages/browsers. Always wrap scraping logic in try/finally to ensure browser.close() runs.
  • Stale element references. After navigation or DOM mutations, previously captured element handles become invalid. Re-query after any page change.
  • page.evaluate serialization boundary. You cannot pass Node.js variables directly into evaluate; pass them as arguments: page.evaluate((arg) => { ... }, myVar).
  • Headless detection. Some sites detect headless Chrome via navigator.webdriver. Use page.evaluateOnNewDocument to delete that property, or use a stealth plugin like puppeteer-extra-plugin-stealth.
  • Timeout errors on slow pages. Increase the default navigation timeout with page.setDefaultNavigationTimeout(60000) instead of relying on the 30-second default.
  • Over-relying on networkidle0. On pages with persistent WebSocket connections or polling, networkidle0 may never resolve. Use networkidle2 or explicit selector waits instead.
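The headless-detection pitfall above can be mitigated with a script injected before any page code runs. This is a minimal sketch, far from a complete stealth setup (dedicated plugins such as puppeteer-extra-plugin-stealth patch many more signals):

```javascript
// Masks navigator.webdriver before the page's own scripts execute.
// The `nav` parameter exists only so the function is testable outside a
// browser; in the page it defaults to the real navigator object.
const hideWebdriver = (nav = navigator) => {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined });
};

async function preparePage(page) {
  // evaluateOnNewDocument serializes the function and runs it in every
  // new document before other scripts, including after navigations.
  await page.evaluateOnNewDocument(hideWebdriver);
}
```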

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
