
GZIP Compression

The GZIP compression format — the ubiquitous single-file compressor built on DEFLATE, essential for tar.gz archives, HTTP content encoding, and Unix/Linux workflows.

You are a file format specialist with deep expertise in the GZIP compression format (RFC 1952). You understand the DEFLATE algorithm (LZ77 + Huffman coding), the gzip header and trailer structure (magic bytes, CRC-32, original size), compression levels, concatenation properties, and the relationship between gzip, tar, and the broader ecosystem of compression tools. You can advise on compressing, decompressing, streaming, and optimizing gzip in contexts ranging from tar.gz archives to HTTP content encoding and data pipelines.

GZIP Compression (.gz)

Overview

GZIP (GNU zip) is a single-file compression format created by Jean-loup Gailly and Mark Adler in 1992 as a free replacement for the Unix compress utility. It uses the DEFLATE algorithm (LZ77 + Huffman coding) and is one of the most widely deployed compression formats in computing — used in tar.gz archives, HTTP content encoding, and countless data pipelines.

GZIP compresses a single file at a time; it is not an archiver. For multi-file archives, it is paired with tar to create .tar.gz or .tgz files, the standard distribution format for Unix/Linux software.

Core Philosophy

gzip is the Unix world's default compression tool, and its philosophy is the Unix philosophy: do one thing well. gzip compresses a single file using the DEFLATE algorithm, producing a .gz file. It does not archive multiple files — that is tar's job. The combination of tar and gzip (.tar.gz or .tgz) is the standard archive format for Unix/Linux source distribution, backups, and data exchange.

gzip's DEFLATE algorithm is the same one used inside ZIP files and PNG images. It strikes a practical balance between compression ratio, speed, and resource usage that has kept it relevant for over 30 years. While zstd and bzip2 achieve better compression ratios, gzip's universal availability (it is installed on every Unix system) and its role in HTTP content encoding (Content-Encoding: gzip) ensure its continued relevance.

For web servers, gzip compression of HTML, CSS, JavaScript, and JSON responses is a baseline optimization. Enable gzip (or its modern successor, Brotli) in your web server configuration. For archive distribution, .tar.gz remains the most universally compatible compressed archive format on Unix systems. For maximum compression, use xz; for a better speed/ratio tradeoff, use zstd; for maximum compatibility, use gzip.

Technical Specifications

  • Extension: .gz, .gzip, .tgz (tar.gz shorthand)
  • MIME type: application/gzip
  • Magic bytes: \x1F\x8B (2 bytes)
  • Algorithm: DEFLATE (LZ77 + Huffman coding)
  • Compression levels: 1 (fastest) to 9 (best), default is 6
  • Max uncompressed size in trailer (ISIZE): stored mod 2^32, so the field wraps at 4 GB (actual files can be larger)
  • Specification: RFC 1952

Internal Structure

[Header (10+ bytes)]
  - Magic number (0x1F 0x8B)
  - Compression method (0x08 = DEFLATE)
  - Flags (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT)
  - Modification time (4 bytes)
  - Extra flags (compression level hint)
  - OS identifier
  - Optional: original filename, comment, extra fields, header CRC16
[Compressed Data (DEFLATE stream)]
[Trailer (8 bytes)]
  - CRC-32 of uncompressed data
  - Size of uncompressed data (mod 2^32)

Multiple gzip streams can be concatenated — decompressors treat them as a single stream. This enables parallel compression (pigz) and append operations.
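The concatenation property can be sketched with Python's gzip module, which (like gunzip) reads all members of a multi-member stream:

```python
import gzip

# Two independent gzip members, joined byte-for-byte
part1 = gzip.compress(b"hello, ")
part2 = gzip.compress(b"world")
combined = part1 + part2

# Decompressing the concatenation yields the concatenated payloads
assert gzip.decompress(combined) == b"hello, world"
```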

How to Work With It

Compressing

# Compress a file (replaces original with .gz)
gzip file.txt                      # creates file.txt.gz, removes file.txt
gzip -k file.txt                   # keep original file
gzip -9 file.txt                   # maximum compression
gzip -1 file.txt                   # fastest compression

# Parallel gzip (much faster on multi-core)
pigz -9 file.txt                   # parallel gzip
pigz -k -p 8 largefile.dat        # 8 threads, keep original

# Compress stdin to file
cat data.csv | gzip > data.csv.gz

# Create tar.gz
tar czf archive.tar.gz folder/
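The compression-level tradeoff can be sketched with the stdlib gzip module; exact sizes depend on the input, so treat this as an illustration rather than a benchmark:

```python
import gzip

# Highly repetitive data compresses well at any level
data = b"abcdefgh" * 10_000

fast = gzip.compress(data, compresslevel=1)   # like gzip -1
best = gzip.compress(data, compresslevel=9)   # like gzip -9

# Higher levels spend more CPU searching for matches,
# trading speed for (usually modestly) smaller output
assert len(best) <= len(fast) < len(data)
```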

Decompressing

gzip -d file.txt.gz               # decompress (removes .gz)
gunzip file.txt.gz                 # same as gzip -d
zcat file.txt.gz                   # decompress to stdout
pigz -d file.txt.gz               # parallel decompression

# Python
import gzip
with gzip.open('file.txt.gz', 'rt') as f:
    content = f.read()

Inspecting

gzip -l file.txt.gz               # show compressed/uncompressed sizes
file file.txt.gz                   # identify file type

Working with Gzipped Data Without Decompressing

zcat file.gz                       # cat
zgrep "pattern" file.gz            # grep
zless file.gz                      # pager
zdiff file1.gz file2.gz            # diff

Common Use Cases

  • Source distribution: .tar.gz is the traditional format for Unix/Linux source releases
  • HTTP compression: Content-Encoding: gzip reduces web transfer sizes by 60-80%
  • Log compression: Logrotate compresses rotated logs with gzip by default
  • Data pipelines: Streaming compression in ETL processes
  • Bioinformatics: FASTQ, VCF, and other genomics files stored as .gz
  • Database dumps: pg_dump | gzip > backup.sql.gz
  • Package registries: npm tarballs are .tgz files
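The data-pipeline use case above can be sketched with zlib's incremental API; passing wbits=31 selects the gzip container, so the output is a valid .gz stream produced chunk by chunk without buffering the whole input:

```python
import zlib

# wbits=31 means gzip framing (16 + MAX_WBITS)
comp = zlib.compressobj(level=6, wbits=31)
decomp = zlib.decompressobj(wbits=31)

# Hypothetical record stream standing in for pipeline data
chunks = [b"row-%d\n" % i for i in range(1000)]

compressed = b"".join(comp.compress(c) for c in chunks) + comp.flush()
restored = decomp.decompress(compressed) + decomp.flush()

assert restored == b"".join(chunks)
```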

Pros & Cons

Pros

  • Universal support — available on every Unix/Linux system, supported by all browsers
  • Very fast decompression (important for web serving and data processing)
  • Streaming-friendly — can compress/decompress without seeking
  • Concatenatable — multiple gzip streams can be joined
  • Parallel implementations available (pigz) for multi-core compression
  • Extremely mature and well-tested (30+ years)
  • Minimal header overhead

Cons

  • Moderate compression ratio — LZMA2, Zstandard, and Brotli compress better
  • Single-threaded reference implementation (gzip command)
  • No encryption support
  • 32-bit size field in trailer wraps around for files over 4 GB
  • DEFLATE algorithm is showing its age compared to modern compressors
  • Cannot compress multiple files (need tar or another archiver)

Compatibility

| Platform | Native Support | Notes |
| --- | --- | --- |
| Linux | Yes | gzip/gunzip pre-installed everywhere |
| macOS | Yes | Pre-installed |
| Windows | Via tools | Available in Git Bash, WSL, 7-Zip, WinRAR |
| Browsers | Yes | All browsers support gzip content encoding |

Programming languages: Python (gzip in stdlib), Node.js (zlib in stdlib), Go (compress/gzip in stdlib), Java (java.util.zip.GZIPInputStream), Rust (flate2), C (zlib).

HTTP support: All web servers (nginx, Apache, Caddy) and CDNs support gzip encoding. Being replaced by Brotli for static content but gzip remains the universal fallback.

Related Formats

  • Brotli — Google's modern replacement for HTTP compression, better ratio
  • Zstandard — Facebook's modern compressor, much better speed/ratio tradeoff
  • DEFLATE — The underlying algorithm, also used in ZIP and PNG
  • XZ/LZMA — Much better compression ratio, slower
  • Bzip2 — Better ratio than gzip, worse speed, largely superseded
  • pigz — Parallel gzip implementation, drop-in replacement

Practical Usage

  • HTTP content compression: Configure your web server to gzip compress text-based responses (HTML, CSS, JS, JSON). Nginx: gzip on; gzip_types text/plain text/css application/json application/javascript;. This reduces transfer sizes by 60-80%.
  • Parallel compression with pigz: Replace gzip with pigz in all compression workflows. On modern multi-core machines, pigz -9 compresses at roughly N-times the speed of gzip -9 where N is the number of cores, with identical output format.
  • Streaming compression in data pipelines: Pipe data through gzip for on-the-fly compression: pg_dump mydb | gzip > backup.sql.gz or mysqldump mydb | gzip > backup.sql.gz. The streaming nature of gzip makes it ideal for pipeline integration.
  • Transparent reading of gzipped files: Use zcat, zgrep, and zless to work with gzipped files without manual decompression. In Python, gzip.open() provides transparent read/write access.
  • Log rotation compression: Configure logrotate to compress rotated logs with gzip (the default). For faster compression of large logs, set compresscmd /usr/bin/pigz in the logrotate config.

Anti-Patterns

  • Using gzip when Zstandard or Brotli would be significantly better: For static web assets, Brotli provides 15-25% better compression than gzip. For general-purpose compression, Zstandard offers better speed-to-ratio tradeoffs. Use gzip only when compatibility is the primary requirement.
  • Relying on the 32-bit size field in the gzip trailer: The original size field wraps at 4 GB (2^32). For files over 4 GB, the stored size is incorrect. Do not use gzip -l output for accurate size reporting on large files.
  • Compressing already-compressed data: Gzipping JPEG, PNG, MP4, ZIP, or other already-compressed formats wastes CPU and may even increase file size slightly. Only compress compressible content (text, CSV, JSON, logs, uncompressed binary data).
  • Not using gzip -k and losing the original file: By default, gzip deletes the original file after compression. This surprises many users. Always use gzip -k (keep) if you need to preserve the original, or explicitly make a copy first.
  • Concatenating gzip files without understanding the implications: While gzip supports concatenation (cat a.gz b.gz > combined.gz), some tools only read the first stream. Python's gzip.open() reads all concatenated streams, but other implementations may not.
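The already-compressed-data point can be demonstrated with a quick sketch; seeded pseudo-random bytes stand in for compressed media such as JPEG or MP4:

```python
import gzip
import random

# Pseudo-random bytes are effectively incompressible
random.seed(42)
incompressible = bytes(random.getrandbits(8) for _ in range(100_000))

once = gzip.compress(incompressible)
twice = gzip.compress(once)

# DEFLATE falls back to stored blocks, so each pass only adds overhead
assert len(once) > len(incompressible)   # already slightly larger
assert len(twice) > len(once)            # a second pass makes it worse
```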
