DOC (Microsoft Word Binary Format)
The legacy binary file format used by Microsoft Word from 1997 through 2003, storing rich text documents in OLE2 compound file containers.
You are a file format specialist with deep expertise in the DOC (Microsoft Word Binary) format. You understand the OLE2 Compound Binary File structure, the WordDocument stream with piece tables and FKP formatting, the 1Table/0Table metadata streams, VBA macro storage, and the format's legacy role from Word 97 through 2003. You can advise on DOC file parsing, text extraction, conversion to modern formats, macro security concerns, and handling legacy document archives.
DOC — Microsoft Word Binary Format (Legacy)
Overview
DOC is the proprietary binary file format used by Microsoft Word versions 97 through 2003. It stores document content, formatting, embedded objects, and metadata in a Microsoft OLE2 (Object Linking and Embedding) compound document structure — essentially a mini filesystem within a single file. While superseded by DOCX in 2007, DOC files remain widely encountered in legacy systems and archives.
Core Philosophy
DOC is Microsoft Word's legacy binary document format, used from Word 97 through Word 2003. It stores documents in the OLE2 (Object Linking and Embedding) compound file format — essentially a miniature filesystem within a file. Understanding DOC matters primarily for handling the vast archive of documents created during its two-decade dominance of business computing.
DOC is a proprietary format. Microsoft did eventually publish the specification as [MS-DOC] in 2008, but the format's complexity and binary structure still make reliable third-party implementation difficult. Documents with complex formatting, macros, OLE objects, or Word-specific features may not render identically outside Microsoft Word, and that lock-in helped sustain Word's market dominance.
For any active document workflow, convert DOC files to DOCX (Office Open XML) or ODF. DOC should be treated as an archival format — files you read and convert, not files you create. If you must produce Word-compatible documents programmatically, target DOCX, which is XML-based, well-documented, and significantly easier to generate and parse than DOC's binary format.
Technical Specifications
- File extension: `.doc`
- MIME type: `application/msword`
- Magic bytes: `D0 CF 11 E0 A1 B1 1A E1` (OLE2 compound file signature)
- Specification: Microsoft published the format spec in 2008 as `[MS-DOC]`
- Character encoding: Supports both legacy codepages and Unicode (UTF-16LE)
- Max file size: Practical limit around 32-512 MB depending on Word version
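The magic bytes listed above can be used to tell a DOC container from a DOCX package before choosing a parser. The following is a minimal sketch: the function names are hypothetical, and the ZIP signature `PK\x03\x04` is included only because DOCX files are ZIP archives. A match on the OLE2 signature means "possibly DOC", since other formats share the same container.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header (DOCX is a ZIP)


def sniff_word_container(header: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"   # candidate DOC; other OLE2-based formats also match
    if header.startswith(ZIP_MAGIC):
        return "zip"    # candidate DOCX; any other ZIP archive also matches
    return "unknown"


def sniff_path(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_word_container(fh.read(8))
```

Checking the signature rather than the file extension helps with archives where `.doc` files are really RTF or DOCX that were renamed at some point.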
Internal Structure
A DOC file uses OLE2 Compound Binary Format, organized as a FAT-based filesystem:
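- WordDocument stream: Contains the File Information Block (FIB), the main document text, and the FKPs that hold character and paragraph formatting
- 1Table / 0Table stream: Contains the piece table, the tables that locate the FKPs, and style definitions; the FIB indicates which of the two streams is in use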
- Data stream: Embedded OLE objects and certain image data
- Summary Information / Document Summary Information: Metadata properties
- Macros (optional): VBA project storage
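Text in the WordDocument stream is not necessarily stored contiguously. A piece table maps each range of character positions to a byte offset in the stream and records whether that range is stored as 8-bit or 16-bit characters. Formatting is not inline: character and paragraph formatting is stored in separate structures (FKPs, Formatted disK Pages) that are keyed by position rather than embedded in the text.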
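To make the piece-table idea concrete, here is a deliberately simplified model. It does not parse the real binary structures. The `Piece` class and its field names are hypothetical, each piece's offset is assumed to be already resolved to a plain byte offset, and decoding 8-bit pieces as Windows-1252 is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    cp_start: int     # first character position covered by this piece
    cp_end: int       # one past the last character position
    fc: int           # byte offset of the piece's text in the stream
    compressed: bool  # True: 1 byte per character, False: UTF-16LE


def read_text(stream: bytes, pieces: list[Piece]) -> str:
    """Concatenate the text of all pieces in character-position order."""
    out = []
    for p in sorted(pieces, key=lambda piece: piece.cp_start):
        n_chars = p.cp_end - p.cp_start
        if p.compressed:
            raw = stream[p.fc : p.fc + n_chars]
            out.append(raw.decode("cp1252", errors="replace"))
        else:
            raw = stream[p.fc : p.fc + 2 * n_chars]
            out.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(out)
```

The point of the model is that the order of text in the file need not match reading order, which is why scanning the raw stream for strings tends to yield fragments rather than the document.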
How to Work With It
Opening
- Microsoft Word: All versions from Word 97 onward; Word 2007+ opens in "Compatibility Mode"
- LibreOffice Writer: Good support for most DOC features
- Google Docs: Can import and convert
- WPS Office, OnlyOffice: Both support DOC reading and writing
- macOS: TextEdit opens simple DOC files; Pages imports them
Creating
Modern applications default to DOCX. To create DOC files:
- In Word: File > Save As > Word 97-2003 Document (*.doc)
- In LibreOffice: File > Save As > Microsoft Word 97-2003 (.doc)
- Programmatically: Apache POI (Java, `HWPFDocument`); `antiword` and similar tools only read DOC and cannot write it
Parsing
- Python: `python-docx` does NOT support DOC; use `antiword`, `textract`, or `olefile` + manual parsing
- Java: Apache POI HWPF module
- Command line: `antiword`, `catdoc`, `wvWare` for text extraction
- Apache Tika: Handles DOC via POI internally
Converting
- To DOCX: Open in Word or LibreOffice and resave; `libreoffice --convert-to docx`
- To PDF: `libreoffice --convert-to pdf`, Word print-to-PDF
- To text: `antiword file.doc`, `catdoc file.doc`
- To HTML: `wvHtml`, LibreOffice headless
Common Use Cases
- Legacy document archives from the 1997–2007 era
- Government and institutional systems still generating DOC output
- Templates in older enterprise workflows
- Compatibility with very old systems that cannot handle DOCX
- VBA macro documents (though .docm is now preferred)
Pros & Cons
Pros
- Extremely wide legacy support — virtually every word processor can read DOC
- Mature format with well-understood behavior
- Compact for simple documents
- Microsoft published the specification (in 2008), enabling third-party implementations
- Supports macros, OLE embedding, and complex formatting
Cons
- Proprietary binary format that is difficult to parse without specialized libraries
- No longer the default format — DOCX is preferred since 2007
- Security risks from embedded macros (major malware vector)
- Cannot be inspected with a text editor (unlike DOCX's XML)
- Limited compared to DOCX in modern features (no content controls, limited theme support)
- OLE2 compound structure is complex and fragile
Compatibility
| Platform | Support |
|---|---|
| Windows | Word (all versions), LibreOffice, WPS Office |
| macOS | Word, LibreOffice, Pages (import), TextEdit (basic) |
| Linux | LibreOffice, antiword, wvWare, AbiWord |
| Web | Google Docs (import/convert), Microsoft 365 (convert to DOCX) |
| Mobile | Word, Google Docs, WPS Office |
Most modern tools will encourage or automatically convert DOC to DOCX upon opening.
Practical Usage
Migrating Legacy Archives
Organizations sitting on thousands of DOC files need a systematic conversion strategy. The most reliable batch approach uses LibreOffice headless mode:
```bash
# Convert all DOC files in a directory to DOCX
libreoffice --headless --convert-to docx --outdir ./converted/ *.doc

# Convert to PDF for archival
libreoffice --headless --convert-to pdf --outdir ./pdfs/ *.doc
```
Run these on a Linux server for stability — LibreOffice headless on Windows can hang on malformed files. Always validate output by spot-checking formatting on a sample set.
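Because a single malformed file can stall a whole batch, one option is to convert files one at a time with a timeout. The sketch below wraps the same `libreoffice` flags shown above. The function names, the 120-second default, and the result categories are assumptions made for illustration, and the `libreoffice` binary is assumed to be on `PATH`.

```python
import subprocess
from pathlib import Path


def convert_command(src: Path, outdir: Path, fmt: str = "docx") -> list[str]:
    """Build the LibreOffice headless command for a single file."""
    return ["libreoffice", "--headless", "--convert-to", fmt,
            "--outdir", str(outdir), str(src)]


def convert_all(srcdir: Path, outdir: Path, fmt: str = "docx",
                timeout_s: int = 120) -> dict[str, list[Path]]:
    """Convert each .doc in srcdir, recording failures and timeouts."""
    results: dict[str, list[Path]] = {"ok": [], "failed": [], "timeout": []}
    outdir.mkdir(parents=True, exist_ok=True)
    for src in sorted(srcdir.glob("*.doc")):
        try:
            proc = subprocess.run(convert_command(src, outdir, fmt),
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            results["timeout"].append(src)  # likely a hang on a bad file
            continue
        target = outdir / (src.stem + "." + fmt)
        ok = proc.returncode == 0 and target.exists()
        results["ok" if ok else "failed"].append(src)
    return results
```

Launching one process per file is slower than passing `*.doc` in a single call, but it isolates failures and produces a list of files to inspect by hand.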
Extracting Text for Search Indexing
For full-text search pipelines, antiword is a fast, lightweight CLI extractor, while Apache Tika tends to give more consistent results across edge cases:
```bash
# Quick text extraction
antiword document.doc > output.txt

# Tika for robust extraction (handles tables, headers, footers)
java -jar tika-app.jar --text document.doc > output.txt
```
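In an indexing pipeline it can help to pick whichever extractor is installed rather than hard-coding one. The sketch below tries `antiword` first and falls back to `catdoc`. The function names and preference order are hypothetical, and decoding the tool's output as UTF-8 is an assumption that depends on the locale and the tool's settings.

```python
import shutil
import subprocess
from typing import Callable, Optional

# Candidate extractors in preference order; both print text to stdout.
EXTRACTORS = ["antiword", "catdoc"]


def pick_extractor(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first extractor found on PATH, or None."""
    for tool in EXTRACTORS:
        if which(tool):
            return tool
    return None


def extract_text(path: str, timeout_s: int = 60) -> str:
    """Run the chosen extractor on one file and return its stdout as text."""
    tool = pick_extractor()
    if tool is None:
        raise RuntimeError("no DOC text extractor found on PATH")
    proc = subprocess.run([tool, path], capture_output=True,
                          timeout=timeout_s, check=True)
    # Assumption: the tool emits UTF-8 in the current environment.
    return proc.stdout.decode("utf-8", errors="replace")
```

The `which` parameter exists only so the selection logic can be exercised without either tool installed.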
Handling Macro-Laden DOC Files
DOC files remain a primary malware vector because of VBA macros. When processing untrusted DOC files programmatically, disable macro execution and use sandboxed environments. oletools (Python) can scan DOC files for suspicious macros before opening:
```bash
pip install oletools
olevba suspicious.doc  # analyze VBA macros without executing them
```
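The same scan can be driven from Python through the `VBA_Parser` class in `oletools.olevba`. This is a sketch under assumptions: `list_macros` and `needs_quarantine` are hypothetical names, and the "any macro means quarantine" policy is only an example of what a pipeline might choose.

```python
def list_macros(path: str) -> list[tuple[str, str]]:
    """Return (stream_path, vba_code) pairs without executing any macro."""
    # Third-party dependency, imported lazily: pip install oletools
    from oletools.olevba import VBA_Parser

    parser = VBA_Parser(path)
    try:
        if not parser.detect_vba_macros():
            return []
        return [(stream_path, code)
                for _file, stream_path, _vba_name, code
                in parser.extract_macros()]
    finally:
        parser.close()


def needs_quarantine(macros: list[tuple[str, str]]) -> bool:
    """Policy assumed for this sketch: any macro at all means quarantine."""
    return len(macros) > 0
```

A typical call would be `needs_quarantine(list_macros("suspicious.doc"))`, run before the file is handed to any converter.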
Agent Workflows
When an agent encounters a DOC file, the recommended approach is: extract text with antiword or Tika, convert to DOCX or PDF with LibreOffice for any formatting-sensitive work, and never attempt to write DOC format directly — always output DOCX instead.
Anti-Patterns
Trying to parse DOC as plain text. DOC is a binary OLE2 compound file. Opening it in a text editor or reading raw bytes will give you gibberish interspersed with fragments of text. Always use a proper parsing library.
Using python-docx to read DOC files. python-docx only handles DOCX (Open XML). A DOC file is not a ZIP package, so python-docx will raise an error when asked to open one. Use antiword, textract, or Apache Tika for DOC.
Writing new documents in DOC format. There is no good reason to create new DOC files. Every modern system supports DOCX. Choosing DOC for "compatibility" is counterproductive — it sacrifices features and invites conversion errors.
Ignoring character encoding. Older DOC files may use legacy Windows codepages (e.g., Windows-1252) rather than Unicode. Text extraction tools that assume UTF-8 will produce garbled output for non-ASCII characters. Check the codepage in the file metadata.
Trusting embedded macros. Never enable macros in DOC files from untrusted sources. Scan with olevba or a similar tool before opening in Word with macros enabled.
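The character-encoding pitfall described above is easy to reproduce. The following sketch decodes the same bytes twice, once with the correct legacy codepage and once under a wrong UTF-8 assumption. The `decode_legacy` helper is hypothetical, and Windows-1252 is used only as the example codepage.

```python
def decode_legacy(raw: bytes, codepage: str = "cp1252") -> str:
    """Decode 8-bit text using an explicit legacy codepage."""
    return raw.decode(codepage, errors="replace")


raw = b"caf\xe9 \x80 5"  # 'café € 5' encoded as Windows-1252

print(decode_legacy(raw))                      # decoded with the right codepage
print(raw.decode("utf-8", errors="replace"))   # wrong guess: replacement characters
```

The second line prints U+FFFD replacement characters in place of the accented letter and the euro sign, which is the garbling a search index would then store.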
Related Formats
- DOCX (.docx): Modern replacement based on Open XML
- DOT (.dot): DOC template format
- RTF (.rtf): Microsoft's interchange format, simpler and text-based
- WPS (.wps): Microsoft Works document format (also legacy)
- ODT (.odt): OpenDocument Text alternative