Puppeteer

"Puppeteer: headless Chrome, PDF generation from HTML, screenshots, web scraping, page automation, Chromium control"

Puppeteer Document Generation

Core Philosophy

Puppeteer provides programmatic control over headless Chromium, which makes it one of the most faithful HTML-to-PDF and screenshot tools available. Because it renders through a real browser engine, the output matches what Chrome itself would display. Prefer Puppeteer when pixel-perfect fidelity to web content matters, when you need JavaScript execution before capture, or when generating PDFs from complex layouts that CSS-only converters struggle with. The trade-off is a heavier resource footprint, since every render runs inside a full browser process.

Setup

Install Puppeteer, which downloads a compatible browser build at install time:

// package.json dependencies
// "puppeteer": "^22.0.0"

import puppeteer, { Browser, Page, PDFOptions } from "puppeteer";

// Launch a shared browser instance for reuse across requests
let browser: Browser | null = null;

async function getBrowser(): Promise<Browser> {
  if (!browser || !browser.connected) {
    browser = await puppeteer.launch({
      headless: true,
      args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--font-render-hinting=none",
      ],
    });
  }
  return browser;
}

// Graceful shutdown
process.on("SIGTERM", async () => {
  if (browser) await browser.close();
  process.exit(0);
});

For Docker deployments, use puppeteer-core with a separately installed Chromium to reduce image size:

import puppeteer from "puppeteer-core";

const browser = await puppeteer.launch({
  // Path varies by base image; adjust to where your distribution installs Chromium
  executablePath: "/usr/bin/chromium-browser",
  headless: true,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
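
The launch flags and executable path above are hardcoded. One way to keep the sandbox enabled outside containers is to derive the options from the environment. The following is a minimal sketch, not Puppeteer's own mechanism: `resolveLaunchSettings` and the `RUNNING_IN_CONTAINER` variable are hypothetical names, and `PUPPETEER_EXECUTABLE_PATH` is read explicitly here rather than relying on any library behavior.

```typescript
// Sketch: build launch options from environment variables.
// RUNNING_IN_CONTAINER is a hypothetical flag invented for this example.
interface LaunchSettings {
  headless: boolean;
  executablePath?: string;
  args: string[];
}

function resolveLaunchSettings(
  env: Record<string, string | undefined>
): LaunchSettings {
  const args: string[] = ["--disable-gpu"];

  // Only disable the sandbox when explicitly told we run in an isolated container.
  if (env["RUNNING_IN_CONTAINER"] === "1") {
    args.push(
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-dev-shm-usage"
    );
  }

  const settings: LaunchSettings = { headless: true, args };

  // Use a separately installed browser when a path is provided.
  const customPath = env["PUPPETEER_EXECUTABLE_PATH"];
  if (customPath) {
    settings.executablePath = customPath;
  }
  return settings;
}

// Intended usage (not executed here):
//   const browser = await puppeteer.launch(resolveLaunchSettings(process.env));
```

Keeping this logic in a pure function makes the sandbox decision easy to unit-test without starting a browser.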

Key Techniques

PDF Generation from HTML String

interface PdfGenerationOptions {
  html: string;
  headerTemplate?: string;
  footerTemplate?: string;
  landscape?: boolean;
  format?: "A4" | "Letter" | "Legal";
  margin?: { top: string; right: string; bottom: string; left: string };
}

async function generatePdfFromHtml(
  options: PdfGenerationOptions
): Promise<Buffer> {
  const browser = await getBrowser();
  const page = await browser.newPage();

  try {
    await page.setContent(options.html, {
      waitUntil: "networkidle0",
      timeout: 30_000,
    });

    const pdfOptions: PDFOptions = {
      format: options.format ?? "A4",
      landscape: options.landscape ?? false,
      printBackground: true,
      margin: options.margin ?? {
        top: "20mm",
        right: "15mm",
        bottom: "20mm",
        left: "15mm",
      },
      displayHeaderFooter: !!(options.headerTemplate || options.footerTemplate),
      headerTemplate: options.headerTemplate ?? "<span></span>",
      footerTemplate:
        options.footerTemplate ??
        '<div style="font-size:10px;text-align:center;width:100%;"><span class="pageNumber"></span> / <span class="totalPages"></span></div>',
    };

    const pdf = await page.pdf(pdfOptions);
    return Buffer.from(pdf);
  } finally {
    await page.close();
  }
}
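
The default footer above is a hand-written HTML string. When several document types need different footers, a small builder can keep the templates consistent. This sketch assumes the placeholder classes Puppeteer documents for header and footer templates (`date`, `title`, `url`, `pageNumber`, `totalPages`); `buildFooterTemplate` and its parameters are hypothetical names.

```typescript
// Placeholder classes that Chromium fills in when printing headers and footers.
type TemplateField = "date" | "title" | "url" | "pageNumber" | "totalPages";

// Sketch: build a centered footer from a list of placeholder fields.
function buildFooterTemplate(
  fields: TemplateField[],
  fontSizePx: number = 10
): string {
  // Each field becomes an empty span; Chromium injects the value by class name.
  const spans = fields
    .map((field) => `<span class="${field}"></span>`)
    .join(" / ");

  // An explicit font size is set inline, matching the default footer above.
  return `<div style="font-size:${fontSizePx}px;text-align:center;width:100%;">${spans}</div>`;
}

// Intended usage (not executed here):
//   generatePdfFromHtml({ html, footerTemplate: buildFooterTemplate(["title", "pageNumber"]) });
```

Calling `buildFooterTemplate(["pageNumber", "totalPages"])` reproduces the default footer string used in `generatePdfFromHtml`.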

PDF Generation from a Live URL

async function generatePdfFromUrl(
  url: string,
  cookies?: Array<{ name: string; value: string; domain: string }>
): Promise<Buffer> {
  const browser = await getBrowser();
  const page = await browser.newPage();

  try {
    if (cookies) {
      await page.setCookie(...cookies);
    }

    await page.goto(url, { waitUntil: "networkidle0", timeout: 60_000 });

    // Wait for any lazy-loaded content
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForNetworkIdle({ idleTime: 500 });

    return Buffer.from(await page.pdf({ format: "A4", printBackground: true }));
  } finally {
    await page.close();
  }
}

Screenshot Capture

async function captureScreenshot(
  html: string,
  viewport: { width: number; height: number } = { width: 1280, height: 800 }
): Promise<Buffer> {
  const browser = await getBrowser();
  const page = await browser.newPage();

  try {
    await page.setViewport(viewport);
    await page.setContent(html, { waitUntil: "networkidle0", timeout: 30_000 });

    // Capture a specific element rather than full page
    const element = await page.$(".capture-target");
    if (element) {
      return Buffer.from(
        await element.screenshot({ type: "png", omitBackground: true })
      );
    }

    return Buffer.from(await page.screenshot({ type: "png", fullPage: true }));
  } finally {
    await page.close();
  }
}

Injecting Styles and Waiting for Fonts

async function generateStyledPdf(
  html: string,
  cssUrl: string
): Promise<Buffer> {
  const browser = await getBrowser();
  const page = await browser.newPage();

  try {
    await page.setContent(html, { waitUntil: "domcontentloaded", timeout: 30_000 });

    await page.addStyleTag({ url: cssUrl });

    // Wait for web fonts to finish loading
    await page.evaluateHandle("document.fonts.ready");

    return Buffer.from(
      await page.pdf({ format: "A4", printBackground: true })
    );
  } finally {
    await page.close();
  }
}
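
The `evaluateHandle` call above does not take a per-call timeout option, so a font that never finishes loading could stall the request for a long time. One possible guard is a generic promise timeout. This is a sketch under that assumption; `withTimeout` is a hypothetical helper, not part of Puppeteer.

```typescript
// Sketch: reject if `work` does not settle within `ms` milliseconds.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // This promise never resolves; it only rejects when the timer fires.
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });

  try {
    // Whichever settles first wins the race.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// Intended usage (not executed here):
//   await withTimeout(page.evaluateHandle("document.fonts.ready"), 10_000, "font loading");
```

Note that the underlying work is not cancelled when the timeout fires; the page should still be closed in the surrounding `finally` block.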

Connection Pooling with Page Reuse

class PuppeteerPool {
  private pages: Page[] = [];
  private browser: Browser | null = null;

  constructor(private maxPages: number = 5) {}

  async initialize(): Promise<void> {
    this.browser = await puppeteer.launch({ headless: true, args: ["--no-sandbox"] });
  }

  async acquirePage(): Promise<Page> {
    if (!this.browser) throw new Error("Pool not initialized");

    if (this.pages.length > 0) {
      return this.pages.pop()!;
    }

    return this.browser.newPage();
  }

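  async releasePage(page: Page): Promise<void> {
    if (this.pages.length >= this.maxPages) {
      await page.close();
      return;
    }
    try {
      // Reset the page before reuse; note this does not clear cookies or storage
      await page.goto("about:blank");
      this.pages.push(page);
    } catch {
      // A page that cannot be reset is discarded rather than returned to the pool
      await page.close().catch(() => undefined);
    }
  }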

  async destroy(): Promise<void> {
    for (const page of this.pages) await page.close();
    this.pages = [];
    if (this.browser) await this.browser.close();
  }
}
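
The `PuppeteerPool` above limits how many idle pages it keeps, but `acquirePage` still opens a new page whenever none is idle, so the number of pages in use at once is unbounded. A variant that makes callers wait for a free slot could look like the sketch below. It is generic so it can be exercised without a browser; `BoundedPool` and its callback names are hypothetical. For Puppeteer, `create` would wrap `browser.newPage()`, `reset` would wrap `page.goto("about:blank")`, and `dispose` would wrap `page.close()`.

```typescript
// Sketch: a pool that allows at most `max` resources to be checked out at once.
class BoundedPool<T> {
  private idle: T[] = [];
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(
    private readonly max: number,
    private readonly create: () => Promise<T>,
    private readonly reset: (item: T) => Promise<void>,
    private readonly dispose: (item: T) => Promise<void>
  ) {}

  async acquire(): Promise<T> {
    // Wait until a slot is free; re-check after waking in case it was taken.
    while (this.inUse >= this.max) {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.inUse++;
    try {
      // Reuse an idle resource if one exists, otherwise create a new one.
      return this.idle.pop() ?? (await this.create());
    } catch (err) {
      // Creation failed: give the slot back before propagating the error.
      this.freeSlot();
      throw err;
    }
  }

  async release(item: T): Promise<void> {
    try {
      await this.reset(item);
      this.idle.push(item);
    } catch {
      // A resource that cannot be reset is discarded instead of reused.
      await this.dispose(item).catch(() => undefined);
    } finally {
      this.freeSlot();
    }
  }

  private freeSlot(): void {
    this.inUse--;
    // Wake the longest-waiting caller, if any.
    this.waiters.shift()?.();
  }
}
```

The design choice here is backpressure: under load, requests queue for a page instead of opening more tabs than the host has memory for.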

Best Practices

  • Reuse a single Browser instance across requests; launching Chromium is expensive.
  • Always close pages in a finally block to prevent memory leaks.
  • Set explicit timeouts on goto, setContent, and waitForSelector calls.
  • Use waitUntil: "networkidle0" for content that loads external resources; use "domcontentloaded" when all content is inline.
  • Pass --no-sandbox and --disable-dev-shm-usage only inside isolated containers. Disabling the sandbox removes a Chromium security boundary, so never do it on shared or end-user machines.
  • For high-throughput services, implement a page pool rather than creating and destroying pages per request.
  • Set printBackground: true in PDF options to capture CSS background colors and images.
  • Use page.emulateMediaType("print") before PDF generation if styles differ between screen and print.
  • Prefer page.setContent() over page.goto("data:...") for large HTML payloads.
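
The "close pages in a finally block" practice repeats in every function above and can be factored into one helper. The sketch below uses small structural interfaces in place of Puppeteer's `Browser` and `Page` types so it can be tested without a browser; `withPage`, `ClosablePage`, and `PageFactory` are hypothetical names.

```typescript
// Minimal shapes standing in for Puppeteer's Page and Browser (assumptions).
interface ClosablePage {
  close(): Promise<void>;
}

interface PageFactory<P extends ClosablePage> {
  newPage(): Promise<P>;
}

// Sketch: open a page, run `work`, and always close the page afterwards.
async function withPage<P extends ClosablePage, R>(
  browser: PageFactory<P>,
  work: (page: P) => Promise<R>
): Promise<R> {
  const page = await browser.newPage();
  try {
    return await work(page);
  } finally {
    // Runs on success and on failure, so the page cannot leak.
    await page.close();
  }
}

// Intended usage (not executed here):
//   const pdf = await withPage(await getBrowser(), async (page) => {
//     await page.setContent(html, { waitUntil: "networkidle0", timeout: 30_000 });
//     return page.pdf({ format: "A4", printBackground: true });
//   });
```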

Anti-Patterns

  • Launching a new browser per request. Chromium startup is slow and memory-heavy. Pool or reuse a single instance.
  • Omitting page cleanup. Leaked pages accumulate memory until the process crashes.
  • Using waitUntil: "load" for SPAs. Single-page apps often fire load before content renders; use networkidle0 or explicit waitForSelector.
  • Ignoring printBackground. Without it, PDFs lose background colors and images, producing blank-looking documents.
  • Hardcoding viewport for screenshots. Always accept viewport dimensions as parameters; default assumptions break on varied content.
  • Running Puppeteer in serverless without size optimization. The bundled browser download typically exceeds AWS Lambda deployment package size limits; use puppeteer-core with a Chromium layer.
  • Trusting user-supplied HTML without sanitization. Puppeteer executes JavaScript in the page context; unsanitized input can read local files or make network requests from your server.
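
Sanitizing the HTML is the primary defense against the last anti-pattern. A complementary measure, not a replacement for sanitization, is to restrict what the page may request by using Puppeteer's request interception. The sketch below assumes structural interfaces that mirror the small subset of `Page` and `HTTPRequest` methods it uses; `isAllowedRequest`, `restrictRequests`, and the allowlist shape are hypothetical.

```typescript
// Minimal shapes for the Puppeteer methods used here (assumptions).
interface RequestLike {
  url(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

interface InterceptablePage {
  setRequestInterception(enabled: boolean): Promise<void>;
  on(event: "request", handler: (req: RequestLike) => void): unknown;
}

// Decide whether a request URL may proceed.
function isAllowedRequest(rawUrl: string, allowedHosts: string[]): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // Unparseable URLs are rejected.
  }
  // Inline data: resources do not touch the network or the file system.
  if (parsed.protocol === "data:") return true;
  // Everything else must be HTTPS; this rejects file:, http:, and others.
  if (parsed.protocol !== "https:") return false;
  return allowedHosts.indexOf(parsed.hostname) !== -1;
}

// Sketch: abort every request that is not on the allowlist.
async function restrictRequests(
  page: InterceptablePage,
  allowedHosts: string[]
): Promise<void> {
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    const decision = isAllowedRequest(req.url(), allowedHosts)
      ? req.continue()
      : req.abort();
    // Ignore failures from requests that were already handled or cancelled.
    decision.catch(() => undefined);
  });
}

// Intended usage (not executed here), before page.setContent(userHtml):
//   await restrictRequests(page, ["cdn.example.com"]);
```

Call `restrictRequests` before loading the untrusted content so that no request escapes interception.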
