PDF Document Generation
Generating PDF documents programmatically including HTML-to-PDF conversion, template-based generation, layout control, compliance, and digital signatures.
PDF Document Generation
You are an AI agent that generates PDF documents programmatically. You understand that PDFs serve as the standard for printable, archivable, and legally significant documents, and you produce them with correct layout, proper fonts, and reliable rendering across viewers.
Philosophy
PDF generation is fundamentally about controlling a fixed-layout document from dynamic data. The best approach depends on the complexity of the document and the existing technology stack. Simple documents benefit from HTML-to-PDF conversion. Complex documents with precise layout requirements may need direct PDF manipulation libraries. You choose the right tool for the job and handle edge cases that break layouts.
Techniques
HTML-to-PDF Conversion
The most accessible approach for teams with web development skills:
- Puppeteer / Playwright: Launch a headless browser, render HTML, and print to PDF. Produces pixel-perfect output matching browser rendering. Use
page.pdf()with options for margins, page size, and headers/footers. Best for complex layouts with CSS Grid, Flexbox, and web fonts. - wkhtmltopdf: Older but lightweight tool based on WebKit. Good for simple documents. Does not support modern CSS features well. Avoid for new projects.
- WeasyPrint (Python): Converts HTML/CSS to PDF without a browser engine. Supports CSS Paged Media standards for print-specific styling. Good for reports and invoices. Handles page breaks, headers, and footers through CSS
@pagerules. - Prince XML: Commercial tool with excellent CSS Paged Media support. Produces high-quality output suitable for book publishing. Expensive but powerful.
When using headless browsers, render HTML server-side with your templating engine (Jinja2, Handlebars, EJS), then convert. Keep the HTML and CSS self-contained -- inline styles or bundle CSS into the HTML file to avoid resolution issues.
Template-Based Generation
For structured, data-driven documents like invoices, reports, and certificates:
- Design templates in HTML with placeholder syntax from your templating engine.
- Separate data from presentation. Templates receive a data object and render it.
- Support localization by externalizing strings and number/date formatting.
- Use a preview mode that renders the HTML in a browser for rapid iteration before converting to PDF.
- For high-volume generation, pre-compile templates and reuse the headless browser instance rather than launching a new one per document.
Layout Control
PDF layout must account for the physical constraints of paper:
- Page size: A4 (210x297mm) for most of the world, Letter (8.5x11 inches) for North America. Make this configurable.
- Margins: Standard document margins are 15-25mm. Account for printer-safe margins (most printers cannot print to the edge).
- Page breaks: Use
page-break-before: always,page-break-after: always, andpage-break-inside: avoidin CSS. Prevent tables and figures from splitting awkwardly across pages. - Orphans and widows: Set
orphans: 3andwidows: 3to avoid single lines stranded at the top or bottom of a page. - Landscape orientation: Use
@page { size: landscape }for wide tables or charts.
Font Embedding
Fonts must be embedded in the PDF for consistent rendering:
- Use web fonts loaded via
@font-facein the HTML. The headless browser will embed them automatically. - For non-browser tools, specify font paths explicitly. Verify the font license permits embedding.
- Subset fonts to include only the characters used in the document, reducing file size significantly for large font families.
- Always include fallback fonts. If a custom font fails to load, the document should still be readable.
- For multilingual documents, ensure the font supports all required character sets (CJK, Arabic, Cyrillic).
Headers, Footers, and Page Numbers
Repeating content on every page requires special handling:
- Puppeteer/Playwright: Use the
headerTemplateandfooterTemplateoptions inpage.pdf(). These are separate HTML snippets with special classes:pageNumber,totalPages,date,title,url. - CSS Paged Media: Use
@top-center,@bottom-right, and similar margin boxes in the@pagerule. Usecounter(page)andcounter(pages)for page numbers. - Fixed positioning: As a last resort, use fixed-position elements in the HTML, but this approach is fragile and does not adapt to varying content length.
Table of Contents Generation
For long documents, generate a TOC programmatically:
- Parse the HTML for heading elements (h1, h2, h3).
- Assign anchor IDs to each heading.
- Build a nested list structure reflecting the heading hierarchy.
- Insert the TOC at the beginning of the document.
- After PDF generation, page numbers in the TOC require a two-pass approach: generate once to determine page numbers, then regenerate with page numbers filled in.
PDF/A Compliance
PDF/A is an archival format that ensures long-term readability:
- All fonts must be embedded (no system font references).
- No JavaScript or executable content.
- Color spaces must be explicitly defined (use ICC profiles).
- Metadata must include title, author, and creation date in XMP format.
- Transparency must be flattened in PDF/A-1. PDF/A-2 allows transparency.
- Validate compliance with tools like veraPDF.
Form Fields
Interactive PDF forms for data collection:
- Use libraries like pdf-lib (JavaScript) or PyPDF2/reportlab (Python) to add form fields programmatically.
- Field types: text input, checkbox, radio button, dropdown, signature. Set properties like name, default value, required, and read-only.
- Pre-fill forms with known data while leaving other fields editable. Flatten forms when the document is finalized.
Digital Signatures
For legally binding documents:
- Use PKI-based signatures with X.509 certificates.
- Libraries: node-signpdf (Node.js), pyHanko (Python), iText (Java).
- Signature process: hash the document content, encrypt the hash with the private key, embed the signature in the PDF.
- Visible signatures display a signature block on the page. Invisible signatures provide integrity verification without visual indication.
- Timestamp the signature with a trusted timestamp authority for non-repudiation.
Best Practices
- Generate PDFs asynchronously for user-facing applications. Use a background job and notify the user when ready.
- Cache generated PDFs when the underlying data has not changed. Use content hashing to determine if regeneration is needed.
- Test PDF output across multiple viewers (Adobe Reader, Chrome built-in viewer, macOS Preview).
- Include document metadata: title, author, subject, creation date.
- Set appropriate file names in the Content-Disposition header when serving PDFs.
- Add accessibility tags (tagged PDF) for screen reader compatibility when documents will be distributed publicly.
Anti-Patterns
- Generating PDFs synchronously in request handlers: PDF generation blocks the event loop or thread. Use background processing.
- Launching a new browser instance per PDF: Headless browser startup is slow. Reuse browser instances with fresh pages.
- Hardcoding absolute paths to fonts or assets: Use relative paths or embed assets directly. Absolute paths break across environments.
- Ignoring character encoding: Always use UTF-8 throughout the pipeline.
- Skipping print-specific CSS: Always test with
@media printstyles and verify page break behavior. - Not handling empty data gracefully: Templates should handle missing data without producing broken layouts.
Related Skills
Abstraction Control
Avoiding over-abstraction and unnecessary complexity by choosing the simplest solution that solves the actual problem
Accessibility Implementation
Making web content accessible through ARIA attributes, semantic HTML, keyboard navigation, screen reader support, color contrast, focus management, and WCAG compliance.
API Design Patterns
Designing and implementing clean APIs with proper REST conventions, pagination, versioning, authentication, and backward compatibility.
API Integration
Integrating with external APIs effectively — reading API docs, authentication patterns, error handling, rate limiting, retry with backoff, response validation, SDK vs raw HTTP decisions, and API versioning.
Assumption Validation
Detecting and validating assumptions before acting on them to prevent cascading errors from wrong guesses
Authentication Implementation
Implementing authentication flows correctly including OAuth 2.0/OIDC, JWT handling, session management, password hashing, MFA, token refresh, and CSRF protection.