PDF (Portable Document Format)

Adobe's universal document format for fixed-layout, platform-independent document exchange with support for text, images, vector graphics, forms, and digital signatures.

Quick Summary

You are a file format specialist with deep knowledge of the PDF specification, its object-based structure, rendering model, and the practical realities of creating, parsing, manipulating, and converting PDFs across platforms and programming languages.

## Key Points

- **File extension:** `.pdf`
- **MIME type:** `application/pdf`
- **Current version:** PDF 2.0 (ISO 32000-2:2020)
- **Magic bytes:** `%PDF-` at file start
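- **Max file size:** No fixed limit in the format; classic cross-reference tables store 10-digit byte offsets (roughly 10 GB), a ceiling that cross-reference streams remove, so practical limits depend on reader software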
- **Character encoding:** Supports Unicode via CIDFont and ToUnicode mappings
- **Color spaces:** RGB, CMYK, Grayscale, ICC profiles, spot colors
- **Compression:** Supports FlateDecode (zlib), DCTDecode (JPEG), JBIG2, CCITT, and others
- **Structure:** Header, body (objects), cross-reference table, trailer
- **PDF/A** (ISO 19005): Long-term archival; requires embedded fonts, no encryption
- **PDF/X** (ISO 15930): Print production exchange
- **PDF/E** (ISO 24517): Engineering documents

## Quick Example

```bash
# wkhtmltopdf — lightweight CLI
wkhtmltopdf --enable-local-file-access report.html report.pdf
```

PDF — Portable Document Format

You are a file format specialist with deep knowledge of the PDF specification, its object-based structure, rendering model, and the practical realities of creating, parsing, manipulating, and converting PDFs across platforms and programming languages.

Overview

PDF is a file format developed by Adobe Systems for presenting documents in a manner independent of application software, hardware, and operating systems. Each PDF file encapsulates a complete description of a fixed-layout flat document, including text, fonts, vector graphics, raster images, and other information needed for display. PDF has become the de facto standard for electronic document distribution.

Core Philosophy

PDF's core promise is visual fidelity: a PDF document looks the same regardless of the operating system, device, software, or printer used to render it. This what-you-see-is-what-everyone-sees guarantee is why PDF became the universal format for documents where layout precision matters — contracts, publications, forms, technical drawings, and archival records.

PDF is a final-form format, not an editing format. While PDF editing tools exist, PDF was designed for viewing and printing, not round-trip editing. The internal structure reflects this: text, fonts, and graphics are positioned absolutely on the page rather than flowing in response to content changes. Author documents in Word, LaTeX, InDesign, or HTML, then export to PDF for distribution. Treat the source document as the editable master and the PDF as the published output.

PDF's feature scope has expanded enormously since Adobe introduced it in 1993. Modern PDF supports forms, digital signatures, accessibility tags, 3D models, multimedia, JavaScript, encryption, and standardized archival profiles (PDF/A). This breadth means "PDF" is not a single thing — a tagged, accessible PDF/A-2 document and an image-only scanned PDF share a file extension but differ in every practical dimension. Always specify which PDF capabilities your workflow requires.

Technical Specifications

File extension: .pdf
MIME type: application/pdf
Current version: PDF 2.0 (ISO 32000-2:2020)
Magic bytes: %PDF- at file start
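Max file size: No fixed limit in the format; classic cross-reference tables store 10-digit byte offsets (roughly 10 GB), a ceiling that cross-reference streams remove, so practical limits depend on reader software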
Character encoding: Supports Unicode via CIDFont and ToUnicode mappings
Color spaces: RGB, CMYK, Grayscale, ICC profiles, spot colors
Compression: Supports FlateDecode (zlib), DCTDecode (JPEG), JBIG2, CCITT, and others
Structure: Header, body (objects), cross-reference table, trailer
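
The magic bytes and version listed above can be checked with a few lines of Python. This is a minimal sketch, not a validator: the function name `sniff_pdf_version` and the 1024-byte search window are assumptions made for illustration (some readers tolerate leading junk before the header; a strict check would require `%PDF-` at offset 0).

```python
import re

# Header pattern: "%PDF-" followed by a major.minor version, e.g. "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf_version(data: bytes, search_window: int = 1024):
    """Return (major, minor) if a PDF header is found, else None.

    search_window is an illustrative tolerance for leading junk,
    not a value taken from the text above.
    """
    match = _HEADER.search(data[:search_window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Usage: read the first kilobyte of a file in binary mode and pass it in. A `None` result means the file is probably not a PDF, regardless of its extension.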

A PDF file is built from a graph of objects: booleans, numbers, strings, names, arrays, dictionaries, streams, and null, linked together by indirect references. Page content is described in content streams using a PostScript-derived operator set for drawing paths, placing text, and embedding images. Fonts can be embedded fully or as subsets.
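
To make the header / body / cross-reference table / trailer layout concrete, the sketch below assembles a one-page blank PDF by hand using only the standard library. It assumes the simplest possible case: an uncompressed file with a classic cross-reference table. The function name `build_minimal_pdf`, the object numbering, and the US Letter `MediaBox` are illustrative choices; real generators add fonts, content streams, and metadata.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a one-page, blank PDF: header, body, xref table, trailer."""
    # Body: three indirect objects (catalog -> page tree -> single page).
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]

    out = bytearray(b"%PDF-1.4\n")  # header with the magic bytes
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus entry 0,
    # which is the head of the free list.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer: object count, the root (catalog) reference, and a pointer
    # back to the xref table so readers can start from the end of the file.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the returned bytes to a file in binary mode should give something most viewers open as a single empty page. The point is the shape of the file, not a recommendation to generate PDFs this way.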

PDF/A, PDF/X, PDF/E, PDF/UA

PDF/A (ISO 19005): Long-term archival; requires embedded fonts, no encryption
PDF/X (ISO 15930): Print production exchange
PDF/E (ISO 24517): Engineering documents
PDF/UA (ISO 14289): Universal accessibility

How to Work With It

Opening

Any modern web browser (Chrome, Firefox, Edge) renders PDFs natively. Dedicated readers include Adobe Acrobat Reader, Foxit Reader, SumatraPDF (Windows), Preview (macOS), Okular and Evince (Linux).

Creating

From applications: Print to PDF is built into Windows 10+, macOS, and most Linux desktops
Programmatically: Libraries like reportlab (Python), iText (Java/.NET), PDFKit (Node.js), FPDF/TCPDF (PHP), Prawn (Ruby)
From LaTeX: pdflatex, xelatex, or lualatex compile .tex directly to PDF

Parsing and Extracting

Text extraction: pdftotext (poppler-utils), pypdf (the successor to the deprecated PyPDF2), pdfminer.six, Apache Tika
Structured data: tabula-py or Camelot for tables
Metadata: exiftool, pdfinfo, or any PDF library
Images: pdfimages (poppler-utils)

Converting

To images: pdftoppm, Ghostscript, ImageMagick
To Word: LibreOffice, Adobe Acrobat Pro, online converters
To HTML: pdf2htmlEX, pdftohtml
From HTML: wkhtmltopdf, Puppeteer/Playwright page.pdf(), WeasyPrint

Editing

Direct content editing is possible with Adobe Acrobat Pro or LibreOffice Draw. For structural manipulation (merging, splitting, rotating, encrypting), use qpdf or pdftk.

Common Use Cases

Official documents, contracts, and legal filings
Scientific papers and journal articles
Print-ready artwork and prepress workflows (PDF/X)
Government forms (with interactive form fields)
eBooks and digital manuals
Invoices and financial statements
Long-term archival of records (PDF/A)

Pros & Cons

Pros

Universally supported across all platforms and devices
Preserves exact layout regardless of viewer
Supports encryption, digital signatures, and permissions
Can embed fonts ensuring consistent rendering
ISO-standardized with multiple profiles for specialized needs
Compact with built-in compression

Cons

Not designed for easy editing; reflowing content is difficult
Text extraction can be unreliable, especially from scanned documents
Image-heavy documents can produce very large files
Accessibility depends on proper tagging, which is often missing and takes deliberate effort to add
Complex internal structure makes programmatic generation non-trivial

Compatibility

| Platform | Native Support |
| --- | --- |
| Windows | Built-in viewer (Edge), Print to PDF |
| macOS | Preview, built-in QuickLook |
| Linux | Evince, Okular, poppler-based tools |
| iOS/Android | Built-in viewers in both platforms |
| Web | All major browsers render inline |

Practical Usage

Generating PDFs from HTML (the Modern Approach)

For most developers, HTML-to-PDF is the most practical generation method:

```javascript
// Puppeteer (Node.js): renders the page through headless Chrome.
// Save as an ES module (.mjs) so top-level await works.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/invoice/123');
await page.pdf({ path: 'invoice.pdf', format: 'A4', printBackground: true });
await browser.close(); // release the Chrome process
```

```bash
# WeasyPrint (Python): CSS-based, no browser needed
weasyprint https://example.com/report.html report.pdf

# wkhtmltopdf: lightweight CLI
wkhtmltopdf --enable-local-file-access report.html report.pdf
```

Merging, Splitting, and Manipulating PDFs

```bash
# Merge multiple PDFs
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf

# Extract pages 5-10
qpdf input.pdf --pages input.pdf 5-10 -- extract.pdf

# Remove password protection (if you know the password)
qpdf --decrypt --password=secret protected.pdf decrypted.pdf

# Linearize for web delivery (reorders the file so viewers can show
# the first page before the download finishes; it does not shrink the file)
qpdf --linearize input.pdf web-optimized.pdf
```
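
When these qpdf commands are driven from a script, it helps to build the argument list separately from running it, so the command can be logged or tested without qpdf installed. The sketch below mirrors the merge and extract commands above; the helper names (`qpdf_merge_args`, `qpdf_extract_args`, `run_qpdf`) are hypothetical, and treating exit code 3 as "succeeded with warnings" reflects qpdf's documented exit codes as I understand them.

```python
import shutil
import subprocess


def qpdf_merge_args(inputs, output):
    """Build argv for: qpdf --empty --pages in1 in2 ... -- output"""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_extract_args(source, first, last, output):
    """Build argv for: qpdf source --pages source first-last -- output"""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return ["qpdf", source, "--pages", source, f"{first}-{last}", "--", output]


def run_qpdf(args):
    """Run a prepared qpdf command, failing loudly on real errors."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("qpdf is not on PATH")
    result = subprocess.run(args)
    # Assumption: 0 means success and 3 means success with warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
```

Passing a list to `subprocess.run` (rather than a shell string) avoids quoting problems with file names that contain spaces.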

Text Extraction Strategies

Different tools suit different PDF types:

```bash
# Simple text PDFs: pdftotext is fast and reliable
pdftotext -layout input.pdf output.txt

# Table extraction: tabula for structured data
java -jar tabula.jar -o output.csv -p all input.pdf

# Scanned PDFs: OCRmyPDF adds a searchable text layer (it runs Tesseract internally)
ocrmypdf input-scanned.pdf output-searchable.pdf
```

Agent and Automation Workflows

When processing PDFs programmatically, choose your approach based on the content type: pypdf for simple text extraction and manipulation, pdfminer.six for layout-aware extraction, tabula-py or camelot for tables, and ocrmypdf for scanned documents. For generating reports, prefer HTML-to-PDF over low-level PDF construction.
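
The selection logic above can be sketched as a small dispatcher. It assumes you have already attempted a cheap text extraction and hold one string per page; the function names, the 20-character threshold, and the "more than half the pages" rule are illustrative assumptions rather than established cutoffs.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: if most pages yield almost no text, the PDF is probably images."""
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


def choose_extractor(page_texts, wants_tables=False, needs_layout=False):
    """Map the content type to one of the tools named in the text."""
    if looks_scanned(page_texts):
        return "ocrmypdf"      # scanned documents need OCR first
    if wants_tables:
        return "tabula-py"     # or camelot
    if needs_layout:
        return "pdfminer.six"  # layout-aware extraction
    return "pypdf"             # simple text extraction and manipulation
```

Checking for scanned pages first matters: table and layout tools return nothing useful when there is no text layer to work with.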

Anti-Patterns

Treating PDF as a data format. PDF is a presentation format, not a data interchange format. Extracting structured data from PDFs is inherently fragile. If you control the data source, export as CSV, JSON, or a database — not PDF.

Generating PDFs with low-level libraries when HTML would suffice. Writing PDF construction code with reportlab or iText is time-consuming and brittle. Unless you need precise control over the PDF object structure, use an HTML-to-PDF renderer and style with CSS.
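
As a rough illustration of the HTML-first approach, the sketch below builds a print-ready HTML page in Python and leaves PDF conversion to whichever renderer you use. The function name `render_invoice_html`, the invoice shape, and the CSS values are all assumptions for the example; the `@page` rule is standard CSS paged media, though how fully it is honored varies by renderer.

```python
from html import escape

# Print-oriented CSS: page size and margins live in the stylesheet,
# not in PDF construction code.
PRINT_CSS = """
@page { size: A4; margin: 20mm; }
body { font-family: sans-serif; font-size: 11pt; }
table { width: 100%; border-collapse: collapse; }
th, td { border-bottom: 1px solid #ccc; padding: 4pt; text-align: left; }
tr { break-inside: avoid; }
"""


def render_invoice_html(title, rows):
    """Return a complete HTML document; rows is a list of (description, amount)."""
    body_rows = "\n".join(
        f"<tr><td>{escape(str(desc))}</td><td>{escape(str(amount))}</td></tr>"
        for desc, amount in rows
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        f"<style>{PRINT_CSS}</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n"
        "<table><tr><th>Item</th><th>Amount</th></tr>\n"
        f"{body_rows}\n</table></body></html>\n"
    )
```

Save the result as `invoice.html` and convert it with a renderer such as `weasyprint invoice.html invoice.pdf`. Layout changes then become CSS edits instead of coordinate arithmetic.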

Assuming text extraction will preserve reading order. PDF stores text as positioned glyphs, not logical paragraphs. Columns, headers, footers, and sidebars can interleave unpredictably during extraction. Always validate extracted text against the visual layout.
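
A toy model shows why this happens. The sketch below represents extracted text as positioned fragments and compares a naive top-to-bottom sort with a column-aware one. The `Fragment` type, the coordinates, and the fixed column boundary are hypothetical; real extractors infer columns from the layout, and this example measures `y` from the top of the page for readability.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position of the fragment
    y: float   # distance from the top of the page (assumption for this sketch)
    text: str


def naive_order(fragments):
    """Sort purely by position: top to bottom, then left to right."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, column_split_x):
    """Read the whole left column before the right one."""
    left = sorted((f for f in fragments if f.x < column_split_x), key=lambda f: f.y)
    right = sorted((f for f in fragments if f.x >= column_split_x), key=lambda f: f.y)
    return [f.text for f in left + right]


# Two columns, two lines each, as they might sit on a page.
page = [
    Fragment(50, 100, "Left line 1"),
    Fragment(300, 100, "Right line 1"),
    Fragment(50, 120, "Left line 2"),
    Fragment(300, 120, "Right line 2"),
]
```

Here `naive_order(page)` alternates between the columns line by line, while `column_aware_order(page, 200)` keeps each column together, which is the reading order a person would follow.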

Creating PDFs without embedded fonts. If fonts are not embedded, the viewer substitutes system fonts, which shifts layout and breaks formatting. Always embed fonts (or font subsets) when generating PDFs.

Ignoring accessibility. Untagged PDFs are inaccessible to screen readers. If your PDFs will be consumed by the public, use tagged PDF structure (PDF/UA) with proper heading hierarchy, alt text, and reading order.

Related Formats

PostScript (.ps): PDF's predecessor; page description language
XPS (.xps): Microsoft's alternative fixed-layout format
DjVu (.djvu): Optimized for scanned documents
PDF/A: Archival subset of PDF
EPUB: Reflowable alternative for ebooks

Details

Pack: file-formats-skills
File: pdf.md
Lines: 178
Category: Technology & Engineering
