
DOC (Microsoft Word Binary Format)

The legacy binary file format used by Microsoft Word from 1997 through 2003, storing rich text documents in OLE2 compound file containers.

Quick Summary

You are a file format specialist with deep expertise in the DOC (Microsoft Word Binary) format. You understand the OLE2 Compound Binary File structure, the WordDocument stream with piece tables and FKP formatting, the 1Table/0Table metadata streams, VBA macro storage, and the format's legacy role from Word 97 through 2003. You can advise on DOC file parsing, text extraction, conversion to modern formats, macro security concerns, and handling legacy document archives.

## Key Points

- **File extension:** `.doc`
- **MIME type:** `application/msword`
- **Magic bytes:** `D0 CF 11 E0 A1 B1 1A E1` (OLE2 compound file signature)
- **Specification:** Microsoft published the format spec in 2008 as `[MS-DOC]`
- **Character encoding:** Supports both legacy codepages and Unicode (UTF-16LE)
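- **Max file size:** Word limits document text to about 32 MB; with embedded graphics the file can reach 512 MB
- **WordDocument stream:** Contains the File Information Block (FIB), the main document text, and the formatting pages (FKPs)
- **1Table / 0Table stream:** Contains the piece table, the bin tables that locate the FKPs, and style definitions; a flag in the FIB selects which of the two streams is in use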
- **Data stream:** Embedded OLE objects and certain image data
- **Summary Information / Document Summary Information:** Metadata properties
- **Macros (optional):** VBA project storage
- **Microsoft Word:** All versions from Word 97 onward; Word 2007+ opens in "Compatibility Mode"

## Quick Example

```bash
# Convert all DOC files in a directory to DOCX
libreoffice --headless --convert-to docx --outdir ./converted/ *.doc

# Convert to PDF for archival
libreoffice --headless --convert-to pdf --outdir ./pdfs/ *.doc
```

```bash
# Quick text extraction
antiword document.doc > output.txt

# Tika for robust extraction (handles tables, headers, footers)
java -jar tika-app.jar --text document.doc > output.txt
```


DOC — Microsoft Word Binary Format (Legacy)

Overview

DOC is the proprietary binary file format used by Microsoft Word versions 97 through 2003. It stores document content, formatting, embedded objects, and metadata in a Microsoft OLE2 (Object Linking and Embedding) compound document structure — essentially a mini filesystem within a single file. While superseded by DOCX in 2007, DOC files remain widely encountered in legacy systems and archives.

Core Philosophy

DOC is Microsoft Word's legacy binary document format. Understanding it matters primarily for handling the large archive of documents created between 1997 and 2007, when it was the default format for business documents.

DOC is a proprietary format that was undocumented for most of its life. Microsoft eventually published the [MS-DOC] specification in 2008, under pressure from regulators, but the format's complexity and binary structure still make reliable third-party implementation difficult. Documents with complex formatting, macros, OLE objects, or Word-specific features may not render identically outside Microsoft Word. This format lock-in was a deliberate business strategy that drove Word's market dominance.

For any active document workflow, convert DOC files to DOCX (Office Open XML) or ODF. DOC should be treated as an archival format — files you read and convert, not files you create. If you must produce Word-compatible documents programmatically, target DOCX, which is XML-based, well-documented, and significantly easier to generate and parse than DOC's binary format.

Technical Specifications

File extension: .doc
MIME type: application/msword
Magic bytes: D0 CF 11 E0 A1 B1 1A E1 (OLE2 compound file signature)
Specification: Microsoft published the format spec in 2008 as [MS-DOC]
Character encoding: Supports both legacy codepages and Unicode (UTF-16LE)
Max file size: Word limits document text to about 32 MB; with embedded graphics the file can reach 512 MB
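
The signature is easy to check before handing a file to a parser. The following is a minimal sketch in Bash; `is_ole2` is a hypothetical helper name. A match only shows that the file is an OLE2 container, since other OLE2-based formats such as XLS and PPT begin with the same eight bytes.

```bash
#!/usr/bin/env bash
# is_ole2 FILE: succeed if FILE starts with the OLE2 compound file signature.
# (Hypothetical helper; it does not prove the container holds a Word document.)
is_ole2() {
  local sig
  # Read the first 8 bytes and render them as one lowercase hex string.
  sig=$(head -c 8 -- "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "d0cf11e0a1b11ae1" ]
}
```

Typical use is a guard such as `is_ole2 report.doc && antiword report.doc`, so that misnamed files are rejected before extraction.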

Internal Structure

A DOC file uses OLE2 Compound Binary Format, organized as a FAT-based filesystem:

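WordDocument stream: Contains the File Information Block (FIB), the main document text, and the formatting pages (FKPs)
1Table / 0Table stream: Contains the piece table, the bin tables that locate the FKPs, and style definitions; a flag in the FIB selects which of the two streams is in use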
Data stream: Embedded OLE objects and certain image data
Summary Information / Document Summary Information: Metadata properties
Macros (optional): VBA project storage

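Text in the WordDocument stream is not necessarily contiguous. The piece table maps each range of character positions (CPs) to a byte offset in the stream and records whether that range is stored as 8-bit characters or as UTF-16LE. Formatting is not inline: character and paragraph properties live in separate 512-byte structures (FKPs, Formatted disK Pages) that are keyed by file offset and located through bin tables in the table stream.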
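
The mapping from a character position to a byte offset is plain arithmetic once a piece has been read from the piece table. The sketch below follows the rule given in [MS-DOC] for compressed and uncompressed pieces. `cp_to_offset` and its argument names are hypothetical, and reading the piece table itself is out of scope here.

```bash
#!/usr/bin/env bash
# cp_to_offset CP PIECE_CP_START FC F_COMPRESSED
# Print "<byte offset in the WordDocument stream> <bytes per character>"
# for character position CP inside a single piece. All four inputs are
# assumed to have been read from the piece table already.
cp_to_offset() {
  local cp="$1" cp_start="$2" fc="$3" compressed="$4"
  if [ "$compressed" -eq 1 ]; then
    # Compressed piece: 8-bit characters, and the stored fc is doubled.
    echo "$(( fc / 2 + (cp - cp_start) )) 1"
  else
    # Uncompressed piece: UTF-16LE, two bytes per character.
    echo "$(( fc + 2 * (cp - cp_start) )) 2"
  fi
}
```

For example, `cp_to_offset 10 0 4096 1` prints `2058 1`: the piece starts at byte 2048 and holds one byte per character.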

How to Work With It

Opening

Microsoft Word: All versions from Word 97 onward; Word 2007+ opens in "Compatibility Mode"
LibreOffice Writer: Good support for most DOC features
Google Docs: Can import and convert
WPS Office, OnlyOffice: Both support DOC reading and writing
macOS: TextEdit opens simple DOC files; Pages imports them

Creating

Modern applications default to DOCX. To create DOC files:

In Word: File > Save As > Word 97-2003 Document (*.doc)
In LibreOffice: File > Save As > Microsoft Word 97-2003 (.doc)
Programmatically: Apache POI HWPF (Java, HWPFDocument; write support is limited) or LibreOffice in headless mode. antiword and similar tools only read DOC

Parsing

Python: python-docx does NOT support DOC; use antiword, textract, or olefile + manual parsing
Java: Apache POI HWPF module
Command line: antiword, catdoc, wvWare for text extraction
Apache Tika: Handles DOC via POI internally

Converting

To DOCX: Open in Word or LibreOffice and resave; libreoffice --convert-to docx
To PDF: libreoffice --convert-to pdf, Word print-to-PDF
To text: antiword file.doc, catdoc file.doc
To HTML: wvHtml, LibreOffice headless

Common Use Cases

Legacy document archives from the 1997–2007 era
Government and institutional systems still generating DOC output
Templates in older enterprise workflows
Compatibility with very old systems that cannot handle DOCX
VBA macro documents (though .docm is now preferred)

Pros & Cons

Pros

Extremely wide legacy support — virtually every word processor can read DOC
Mature format with well-understood behavior
Compact for simple documents
Microsoft published the specification (in 2008), enabling third-party implementations
Supports macros, OLE embedding, and complex formatting

Cons

Proprietary binary format that is difficult to parse without specialized libraries
No longer the default format; DOCX has been Word's default since 2007
Security risks from embedded macros (major malware vector)
Cannot be inspected with a text editor (unlike DOCX's XML)
Limited compared to DOCX in modern features (no content controls, limited theme support)
OLE2 compound structure is complex and fragile

Compatibility

Platform	Support
Windows	Word (all versions), LibreOffice, WPS Office
macOS	Word, LibreOffice, Pages (import), TextEdit (basic)
Linux	LibreOffice, antiword, wvWare, AbiWord
Web	Google Docs (import/convert), Microsoft 365 (convert to DOCX)
Mobile	Word, Google Docs, WPS Office

Most modern tools either prompt you to convert DOC to DOCX or convert it automatically on opening.

Practical Usage

Migrating Legacy Archives

Organizations sitting on thousands of DOC files need a systematic conversion strategy. The most reliable batch approach uses LibreOffice headless mode:

```bash
# Convert all DOC files in a directory to DOCX
libreoffice --headless --convert-to docx --outdir ./converted/ *.doc

# Convert to PDF for archival
libreoffice --headless --convert-to pdf --outdir ./pdfs/ *.doc
```

A Linux server is usually the more stable place to run these: headless LibreOffice can hang on malformed files, reportedly more often on Windows, and a stuck process is easy to time out and kill there. Always validate output by spot-checking formatting on a sample set.
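
One way to keep a single bad file from stalling a whole batch is to convert one file per LibreOffice run under a time limit. The following is a sketch, not a hardened migration tool: `convert_dir`, the `SOFFICE` variable, `failed.txt`, and the 120-second limit are all assumptions, and `timeout` is the GNU coreutils command.

```bash
#!/usr/bin/env bash
# SOFFICE and the 120-second limit are assumptions; adjust them for your setup.
SOFFICE="${SOFFICE:-libreoffice}"

# convert_dir SRC_DIR OUT_DIR: convert each .doc in SRC_DIR to DOCX, one file
# per LibreOffice run, log failures to OUT_DIR/failed.txt, and print how many
# files failed.
convert_dir() {
  local src="$1" out="$2" failed=0 f base status
  mkdir -p "$out"
  for f in "$src"/*.doc; do
    [ -e "$f" ] || continue              # glob matched nothing
    base=$(basename "${f%.*}")
    status=0
    # timeout kills a hung run so the loop can move on to the next file.
    timeout 120 "$SOFFICE" --headless --convert-to docx --outdir "$out" "$f" \
      >/dev/null 2>&1 || status=$?
    # Also require a non-empty output file, in case a failed run exits with 0.
    if [ "$status" -ne 0 ] || [ ! -s "$out/$base.docx" ]; then
      printf '%s\n' "$f" >> "$out/failed.txt"
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}
```

Calling `convert_dir ./legacy ./converted` leaves the converted files and, if anything failed, a `failed.txt` list to review by hand.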

Extracting Text for Search Indexing

For full-text search pipelines, antiword is a fast, lightweight CLI extractor, while Apache Tika tends to give more consistent results across edge cases:

```bash
# Quick text extraction
antiword document.doc > output.txt

# Tika for robust extraction (handles tables, headers, footers)
java -jar tika-app.jar --text document.doc > output.txt
```
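
The two tools can be combined so that the fast path is tried first and Tika handles whatever it cannot. This is a sketch under assumptions: `extract_text` is a hypothetical helper, and `TIKA_JAR` is an assumed variable pointing at the Tika app jar.

```bash
#!/usr/bin/env bash
# TIKA_JAR is an assumed location for the Tika app jar.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"

# extract_text FILE: print the document text, trying antiword first and
# falling back to Tika when antiword fails or produces no output.
extract_text() {
  local text
  if text=$(antiword "$1" 2>/dev/null) && [ -n "$text" ]; then
    printf '%s\n' "$text"
  else
    java -jar "$TIKA_JAR" --text "$1"
  fi
}
```

In an indexing pipeline this would be called as `extract_text document.doc > output.txt`, and the function still works when antiword is not installed because the missing command counts as a failure.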

Handling Macro-Laden DOC Files

DOC files remain a primary malware vector because of VBA macros. When processing untrusted DOC files programmatically, disable macro execution and use sandboxed environments. oletools (Python) can scan DOC files for suspicious macros before opening:

```bash
pip install oletools
olevba suspicious.doc  # analyze VBA macros without executing them
```
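
For unattended pipelines, oletools also ships `mraptor`, which reports its verdict through the exit code. The sketch below quarantines flagged files. It assumes the exit codes documented for mraptor (0 = no macro, 2 = macro present but not flagged, 20 = suspicious), so check them against your installed oletools version. `triage_doc` and the quarantine directory are hypothetical.

```bash
#!/usr/bin/env bash
# triage_doc FILE QUARANTINE_DIR: move FILE aside when mraptor flags it.
# Assumes mraptor's documented exit codes; verify them for your version.
triage_doc() {
  local file="$1" quarantine="$2" status=0
  mraptor "$file" >/dev/null 2>&1 || status=$?
  case "$status" in
    0|2) echo "ok" ;;                    # no macro, or macro not flagged
    20)  mkdir -p "$quarantine"
         mv -- "$file" "$quarantine"/    # flagged as suspicious
         echo "quarantined" ;;
    *)   echo "review" ;;                # not an Office file, or scan error
  esac
}
```

A pass from mraptor is a heuristic, not a guarantee, so files that print `ok` should still be opened with macros disabled.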

Agent Workflows

When an agent encounters a DOC file, the recommended approach is: extract text with antiword or Tika, convert to DOCX or PDF with LibreOffice for any formatting-sensitive work, and never attempt to write DOC format directly — always output DOCX instead.

Anti-Patterns

Trying to parse DOC as plain text. DOC is a binary OLE2 compound file. Opening it in a text editor or reading raw bytes will give you gibberish interspersed with fragments of text. Always use a proper parsing library.

Using python-docx to read DOC files. python-docx only handles DOCX (Office Open XML). It raises an error on DOC files because they are not ZIP packages. Use antiword, textract, or Apache Tika for DOC.

Writing new documents in DOC format. There is no good reason to create new DOC files. Every modern system supports DOCX. Choosing DOC for "compatibility" is counterproductive — it sacrifices features and invites conversion errors.

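Ignoring character encoding. DOC text can be stored as 8-bit characters in a legacy Windows codepage (e.g., Windows-1252) rather than as Unicode, and a single file can mix both. Tools or scripts that treat those bytes as UTF-8 will produce garbled output for non-ASCII characters. Prefer an extractor that decodes to Unicode, and check the codepage in the file metadata when the output looks wrong.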
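
When a tool does hand back raw 8-bit text, re-encoding it explicitly avoids the garbling. The following is a small sketch using `iconv`; `to_utf8` is a hypothetical helper, and CP1252 is only an assumed default that should be replaced with the codepage the document actually uses.

```bash
#!/usr/bin/env bash
# to_utf8 [SOURCE_ENCODING]: re-encode stdin as UTF-8.
# CP1252 is an assumed default, not something read from the file.
to_utf8() {
  iconv -f "${1:-CP1252}" -t UTF-8
}
```

For example, `printf 'caf\351\n' | to_utf8` turns the single Windows-1252 byte `0xE9` into the two-byte UTF-8 sequence for "é".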

Trusting embedded macros. Never enable macros in DOC files from untrusted sources. Scan with olevba or a similar tool before opening in Word with macros enabled.

Related Formats

DOCX (.docx): Modern replacement based on Office Open XML
DOT (.dot): DOC template format
RTF (.rtf): Microsoft's interchange format, simpler and text-based
WPS (.wps): Microsoft Works document format (also legacy)
ODT (.odt): OpenDocument Text alternative
