# Puppeteer

Headless Chrome browser automation with Puppeteer for scraping dynamic, JavaScript-rendered pages.
You are an expert in Puppeteer for web scraping, browser automation, and data extraction from JavaScript-heavy websites.
## Overview
Puppeteer is a Node.js library that provides a high-level API to control headless (or full) Chrome/Chromium browsers. It excels at scraping single-page applications and sites that rely on client-side rendering, where simple HTTP requests cannot retrieve the final DOM.
## Setup & Configuration

Install Puppeteer (bundles a compatible Chromium binary):

```bash
npm install puppeteer
```

For a lighter install that uses an existing Chrome installation:

```bash
npm install puppeteer-core
```

Basic launch configuration:

```js
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  headless: 'new', // use the new headless mode
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage', // prevents crashes in Docker
  ],
  defaultViewport: { width: 1280, height: 800 },
});
```
## Core Patterns

### Simple page scrape
```js
async function scrapePage(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    );
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    return await page.evaluate(() => {
      const title = document.querySelector('h1')?.textContent?.trim();
      const items = [...document.querySelectorAll('.product-card')].map(el => ({
        name: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }));
      return { title, items };
    });
  } finally {
    await browser.close(); // always close, even if navigation or extraction throws
  }
}
```
### Waiting for dynamic content

```js
// Wait for a specific selector to appear
await page.waitForSelector('.results-loaded', { timeout: 10000 });

// Wait for a network request to complete
await page.waitForResponse(
  res => res.url().includes('/api/data') && res.status() === 200
);

// Wait for navigation after a click
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle0' }),
  page.click('button.load-more'),
]);
```
### Infinite scroll handling

```js
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let totalHeight = 0;
      const distance = 400;
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        // scrollHeight is re-read each tick, so content loaded
        // during scrolling extends the run
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 200);
    });
  });
}
```
### Intercepting network requests

```js
await page.setRequestInterception(true);
page.on('request', req => {
  const blocked = ['image', 'stylesheet', 'font'];
  if (blocked.includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
```
### Taking screenshots and PDFs

```js
await page.screenshot({ path: 'page.png', fullPage: true });
await page.pdf({ path: 'page.pdf', format: 'A4' }); // page.pdf() only works in headless mode
```
## Best Practices

- **Reuse browser instances** across multiple page scrapes instead of launching a new browser each time. Create pages with `browser.newPage()` and close them when done.
- **Block unnecessary resources** (images, fonts, CSS) via request interception to speed up scrapes significantly.
- **Set realistic user agents** and viewport sizes to reduce detection.
- **Use `waitUntil: 'networkidle2'`** for most scraping scenarios; it waits until there are no more than 2 network connections for 500ms.
- **Handle errors and timeouts gracefully** with try/catch and configurable timeout values.
- **Run in Docker** using the `node:slim` image with the `--no-sandbox` flag and `--disable-dev-shm-usage` to avoid shared-memory issues.
- **Limit concurrency**: open a bounded number of pages at once to avoid memory exhaustion.
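The reuse-and-bounded-concurrency advice can be sketched with a small limiter. `runLimited` is a hypothetical helper, not part of Puppeteer's API; the commented usage assumes an installed `puppeteer` and a `urls` array of your own:

```js
// Generic bounded-concurrency runner: at most `limit` tasks in flight.
async function runLimited(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // claim the next index synchronously
      results[i] = await worker(items[i], i);
    }
  }
  // Start `limit` lanes that pull from the shared queue.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, lane));
  return results;
}

// Usage with Puppeteer: one shared browser, a bounded number of pages.
// const browser = await puppeteer.launch({ headless: 'new' });
// try {
//   const titles = await runLimited(urls, 4, async url => {
//     const page = await browser.newPage();
//     try {
//       await page.goto(url, { waitUntil: 'networkidle2' });
//       return await page.title();
//     } finally {
//       await page.close(); // always release the page back to the budget
//     }
//   });
// } finally {
//   await browser.close();
// }
```

Keeping the limiter separate from the scraping logic makes it easy to tune the page budget for the memory available to the container.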
## Common Pitfalls

- **Memory leaks from unclosed pages/browsers.** Always wrap scraping logic in try/finally to ensure `browser.close()` runs.
- **Stale element references.** After navigation or DOM mutations, previously captured element handles become invalid. Re-query after any page change.
- **`page.evaluate` serialization boundary.** You cannot pass Node.js variables directly into `evaluate`; pass them as arguments: `page.evaluate((arg) => { ... }, myVar)`.
- **Headless detection.** Some sites detect headless Chrome via `navigator.webdriver`. Use `page.evaluateOnNewDocument` to delete that property, or use a stealth plugin like `puppeteer-extra-plugin-stealth`.
- **Timeout errors on slow pages.** Increase the default navigation timeout with `page.setDefaultNavigationTimeout(60000)` instead of relying on the 30-second default.
- **Over-relying on `networkidle0`.** On pages with persistent WebSocket connections or polling, `networkidle0` may never resolve. Use `networkidle2` or explicit selector waits instead.
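The serialization pitfall can be made explicit with a guard. `assertSerializable` is a hypothetical helper (not a Puppeteer API), and structured clone only approximates Puppeteer's actual protocol serialization, but it catches the common offenders such as functions and other non-cloneable values:

```js
// Hypothetical guard: arguments to page.evaluate must be serializable
// to cross into the browser context. Structured clone approximates that
// constraint and rejects functions and similar non-transferable values.
function assertSerializable(...args) {
  for (const arg of args) {
    try {
      structuredClone(arg);
    } catch (err) {
      throw new TypeError(`argument cannot cross the evaluate boundary: ${err.message}`);
    }
  }
  return args;
}

// Usage (sketch, assumes an open `page`):
// const [sel] = assertSerializable('.product-card .title');
// const names = await page.evaluate(
//   s => [...document.querySelectorAll(s)].map(el => el.textContent.trim()),
//   sel
// );
```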
## Anti-Patterns

- **Over-engineering for hypothetical scale.** Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
- **Ignoring the existing ecosystem.** Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
- **Premature abstraction.** Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
- **Neglecting error handling at boundaries.** Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
- **Skipping documentation for obvious code.** What is obvious to you today will not be obvious to your colleague next month or to you next year.
## Related Skills

- **Anti Detection**: Ethical techniques for handling CAPTCHAs, rate limiting, and bot detection while scraping responsibly
- **BeautifulSoup**: HTML and XML parsing with Beautiful Soup in Python for flexible data extraction
- **Cheerio**: Fast server-side HTML parsing and data extraction with Cheerio using jQuery-like syntax
- **Data Pipeline**: Patterns for building robust scraping data pipelines with validation, deduplication, storage, and monitoring
- **Playwright Scraping**: Cross-browser web scraping with Playwright, supporting Chromium, Firefox, and WebKit
- **Scrapy**: Production-grade web scraping framework in Python with built-in crawling, pipelines, and middleware