
PDF (Portable Document Format)

Adobe's universal document format for fixed-layout, platform-independent document exchange with support for text, images, vector graphics, forms, and digital signatures.


PDF — Portable Document Format

You are a file format specialist with deep knowledge of the PDF specification, its object-based structure, rendering model, and the practical realities of creating, parsing, manipulating, and converting PDFs across platforms and programming languages.

Overview

PDF is a file format developed by Adobe Systems for presenting documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including text, fonts, vector graphics, raster images, and other information needed for display. PDF has become the de facto standard for electronic document distribution.

Core Philosophy

PDF's core promise is visual fidelity: a PDF document looks the same regardless of the operating system, device, software, or printer used to render it. This what-you-see-is-what-everyone-sees guarantee is why PDF became the universal format for documents where layout precision matters — contracts, publications, forms, technical drawings, and archival records.

PDF is a final-form format, not an editing format. While PDF editing tools exist, PDF was designed for viewing and printing, not round-trip editing. The internal structure reflects this: text, fonts, and graphics are positioned absolutely on the page rather than flowing in response to content changes. Author documents in Word, LaTeX, InDesign, or HTML, then export to PDF for distribution. Treat the source document as the editable master and the PDF as the published output.

PDF's feature scope has expanded enormously since Adobe introduced it in 1993. Modern PDF supports forms, digital signatures, accessibility tags, 3D models, multimedia, JavaScript, encryption, and standardized archival profiles (PDF/A). This breadth means "PDF" is not a single thing — a tagged, accessible PDF/A-2 document and an image-only scanned PDF share a file extension but differ in every practical dimension. Always specify which PDF capabilities your workflow requires.

Technical Specifications

  • File extension: .pdf
  • MIME type: application/pdf
  • Current version: PDF 2.0 (ISO 32000-2:2020)
  • Magic bytes: %PDF- at file start
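  • Max file size: No hard limit in principle; classic cross-reference tables use 10-digit byte offsets (about 10 GB), a ceiling that cross-reference streams (PDF 1.5+) remove, and practical limits depend on reader software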
  • Character encoding: Supports Unicode via CIDFont and ToUnicode mappings
  • Color spaces: RGB, CMYK, Grayscale, ICC profiles, spot colors
  • Compression: Supports FlateDecode (zlib), DCTDecode (JPEG), JBIG2, CCITT, and others
  • Structure: Header, body (objects), cross-reference table, trailer
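
The magic-bytes entry above can be checked in a few lines. This is a minimal sketch, not a validator: the function name and the `search_window` tolerance for leading junk bytes are assumptions made for illustration, and the header version can be overridden elsewhere in the file, so treat the result as a hint.

```python
import re

# Header pattern: "%PDF-" followed by a major.minor version, e.g. "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf_version(data: bytes, search_window: int = 1024):
    """Return (major, minor) if a PDF header is found near the start, else None.

    search_window is a hypothetical tolerance for leading junk bytes;
    pass 0 to require the header at offset 0.
    """
    if search_window == 0:
        match = _HEADER.match(data)
    else:
        match = _HEADER.search(data[:search_window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

In practice you would pass the first kilobyte or so of the file, for example `sniff_pdf_version(open(path, "rb").read(1024))`.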

A PDF file is built from a small set of object types: booleans, numbers (integer and real), strings, names, arrays, dictionaries, streams, and the null object, tied together by indirect references. Page content is described in content streams using a PostScript-derived set of operators for drawing paths, placing text, and painting images. Fonts can be embedded fully or as subsets.
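
To make the header, body, cross-reference table, and trailer layout concrete, here is a rough sketch that assembles a one-page blank PDF by hand. The function name is hypothetical and the file is deliberately minimal (no content stream, no fonts); real generators handle far more than this.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page, blank PDF: header, body, xref table, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")             # header
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))               # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)                      # where the xref table begins
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"            # entry 0: head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset    # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

The 10-digit offsets in the cross-reference entries are what let a reader jump straight to any object without scanning the whole file.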

PDF/A, PDF/X, PDF/E, PDF/UA

  • PDF/A (ISO 19005): Long-term archival; requires embedded fonts, no encryption
  • PDF/X (ISO 15930): Print production exchange
  • PDF/E (ISO 24517): Engineering documents
  • PDF/UA (ISO 14289): Universal accessibility
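
PDF/A files declare the level they claim in their XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below scans raw bytes for that claim. It is a heuristic under stated assumptions: the function name is hypothetical, it only works when the metadata stream is stored uncompressed, and a claim is not proof of conformance, which needs a dedicated validator such as veraPDF.

```python
import re

# Matches both XMP serializations: pdfaid:part="2" and <pdfaid:part>2</pdfaid:part>.
_PART = re.compile(rb"pdfaid:part(?:=[\"']|>)(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:=[\"']|>)([A-Za-z])")


def claimed_pdfa_level(data: bytes):
    """Return the claimed level such as 'PDF/A-2b', or None if no claim is found."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```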

How to Work With It

Opening

Any modern web browser (Chrome, Firefox, Edge) renders PDFs natively. Dedicated readers include Adobe Acrobat Reader, Foxit Reader, SumatraPDF (Windows), Preview (macOS), Okular and Evince (Linux).

Creating

  • From applications: Print to PDF is built into Windows 10+, macOS, and most Linux desktops
  • Programmatically: Libraries like reportlab (Python), iText (Java/.NET), PDFKit (Node.js), FPDF/TCPDF (PHP), Prawn (Ruby)
  • From LaTeX: pdflatex, xelatex, or lualatex compile .tex directly to PDF

Parsing and Extracting

  • Text extraction: pdftotext (poppler-utils), pypdf (the successor to the deprecated PyPDF2), pdfminer.six, Apache Tika
  • Structured data: tabula-py or Camelot for tables
  • Metadata: exiftool, pdfinfo, or any PDF library
  • Images: pdfimages (poppler-utils)

Converting

  • To images: pdftoppm, Ghostscript, ImageMagick
  • To Word: LibreOffice, Adobe Acrobat Pro, online converters
  • To HTML: pdf2htmlEX, pdftohtml
  • From HTML: wkhtmltopdf, Puppeteer/Playwright page.pdf(), WeasyPrint

Editing

Page content can be edited directly with Adobe Acrobat Pro or LibreOffice Draw. For structural manipulation (merging, splitting, rotating, encrypting), use qpdf or pdftk.

Common Use Cases

  • Official documents, contracts, and legal filings
  • Scientific papers and journal articles
  • Print-ready artwork and prepress workflows (PDF/X)
  • Government forms (with interactive form fields)
  • eBooks and digital manuals
  • Invoices and financial statements
  • Long-term archival of records (PDF/A)

Pros & Cons

Pros

  • Universally supported across all platforms and devices
  • Preserves exact layout regardless of viewer
  • Supports encryption, digital signatures, and permissions
  • Can embed fonts ensuring consistent rendering
  • ISO-standardized with multiple profiles for specialized needs
  • Compact with built-in compression

Cons

  • Not designed for easy editing; reflowing content is difficult
  • Text extraction can be unreliable, especially from scanned documents
  • Large documents with many images can become very large
  • Accessibility depends on proper tagging, which is often missing and takes deliberate effort to add
  • Complex internal structure makes programmatic generation non-trivial

Compatibility

| Platform | Native Support |
| --- | --- |
| Windows | Built-in viewer (Edge), Print to PDF |
| macOS | Preview, built-in Quick Look |
| Linux | Evince, Okular, poppler-based tools |
| iOS/Android | Built-in viewers on both platforms |
| Web | All major browsers render inline |

Practical Usage

Generating PDFs from HTML (the Modern Approach)

For most developers, HTML-to-PDF is the most practical generation method:

```javascript
// Puppeteer (Node.js): rendering via headless Chrome
// Run as an ES module so that top-level await is available.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/invoice/123');
await page.pdf({ path: 'invoice.pdf', format: 'A4', printBackground: true });
await browser.close();
```

```bash
# WeasyPrint (Python): CSS-based, no browser needed
weasyprint https://example.com/report.html report.pdf

# wkhtmltopdf: lightweight CLI
wkhtmltopdf --enable-local-file-access report.html report.pdf
```

Merging, Splitting, and Manipulating PDFs

```bash
# Merge multiple PDFs
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf

# Extract pages 5-10
qpdf input.pdf --pages input.pdf 5-10 -- extract.pdf

# Remove password protection (if you know the password)
qpdf --decrypt --password=secret protected.pdf decrypted.pdf

# Linearize ("fast web view") for web delivery; this reorders the file, it does not compress it
qpdf --linearize input.pdf web-optimized.pdf
```
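
When these commands are driven from a script, building the argument list separately from running it keeps the logic easy to test. The helpers below are a small sketch with hypothetical names; they use only the qpdf flags shown above and do not invoke qpdf themselves.

```python
from typing import List, Sequence


def qpdf_merge_args(inputs: Sequence[str], output: str) -> List[str]:
    """Argument list for merging several PDFs into one."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_extract_args(source: str, first: int, last: int, output: str) -> List[str]:
    """Argument list for extracting an inclusive, 1-based page range."""
    if not 1 <= first <= last:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return ["qpdf", source, "--pages", source, f"{first}-{last}", "--", output]
```

To execute a command, pass the list to `subprocess.run(args, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.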

Text Extraction Strategies

Different tools suit different PDF types:

```bash
# Simple text PDFs: pdftotext is fast and reliable
pdftotext -layout input.pdf output.txt

# Table extraction: tabula for structured data
java -jar tabula.jar -o output.csv -p all input.pdf

# Scanned PDFs: OCRmyPDF adds a searchable text layer using Tesseract
ocrmypdf input-scanned.pdf output-searchable.pdf
```

Agent and Automation Workflows

When processing PDFs programmatically, choose your approach based on the content type: pypdf for simple text extraction and manipulation, pdfminer.six for layout-aware extraction, tabula-py or camelot for tables, and ocrmypdf for scanned documents. For generating reports, prefer HTML-to-PDF over low-level PDF construction.
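
The selection logic in the paragraph above could be sketched as a small dispatcher. Everything here is illustrative: the `PdfProfile` fields, the function name, and the `scanned_threshold` of 20 characters per page are assumptions, not values from any of the named tools.

```python
from dataclasses import dataclass


@dataclass
class PdfProfile:
    chars_per_page: float       # average extractable characters per page
    has_tables: bool = False    # caller wants tabular data
    needs_layout: bool = False  # caller needs positions or column awareness


def choose_tool(profile: PdfProfile, scanned_threshold: float = 20.0) -> str:
    """Pick an extraction approach based on the content type."""
    if profile.chars_per_page < scanned_threshold:
        return "ocrmypdf"              # almost no text layer: likely a scan
    if profile.has_tables:
        return "tabula-py or camelot"
    if profile.needs_layout:
        return "pdfminer.six"
    return "pypdf"
```

Checking for a missing text layer first matters because the other tools return little or nothing on image-only pages.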

Anti-Patterns

Treating PDF as a data format. PDF is a presentation format, not a data interchange format. Extracting structured data from PDFs is inherently fragile. If you control the data source, export as CSV, JSON, or a database — not PDF.

Generating PDFs with low-level libraries when HTML would suffice. Writing PDF construction code with reportlab or iText is time-consuming and brittle. Unless you need precise control over the PDF object structure, use an HTML-to-PDF renderer and style with CSS.

Assuming text extraction will preserve reading order. PDF stores text as positioned glyphs, not logical paragraphs. Columns, headers, footers, and sidebars can interleave unpredictably during extraction. Always validate extracted text against the visual layout.
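
A toy illustration of why stream order and reading order differ: given text fragments with page coordinates, as a layout-aware extractor might report them, the sketch below regroups them into lines. The `Fragment` shape, the function name, and the `line_tolerance` value are assumptions, and the approach only handles a single column; multi-column pages need column detection first.

```python
from typing import List, Tuple

# (x, y, text); y grows upward, as in default PDF user space.
Fragment = Tuple[float, float, str]


def in_reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y, then read top-to-bottom, left-to-right."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: -f[1]):   # highest y first
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)                        # same baseline
        else:
            lines.append([frag])                          # start a new line
    # Within a line, tuple ordering sorts fragments by x.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]
```

For example, fragments emitted as `[(300, 700, "World"), (72, 680, "second line"), (72, 700, "Hello")]` come back as `["Hello World", "second line"]`.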

Creating PDFs without embedded fonts. If fonts are not embedded, the viewer substitutes system fonts, which shifts layout and breaks formatting. Always embed fonts (or font subsets) when generating PDFs.

Ignoring accessibility. Untagged PDFs are inaccessible to screen readers. If your PDFs will be consumed by the public, use tagged PDF structure (PDF/UA) with proper heading hierarchy, alt text, and reading order.

Related Formats

  • PostScript (.ps): PDF's predecessor; page description language
  • XPS (.xps): Microsoft's alternative fixed-layout format
  • DjVu (.djvu): Optimized for scanned documents
  • PDF/A: Archival subset of PDF
  • EPUB: Reflowable alternative for ebooks
