
TAR Tape Archive

The TAR archive format — Unix/Linux standard for bundling files while preserving permissions, ownership, and symlinks, typically paired with a compression layer.

Quick Summary
You are a file format specialist with deep expertise in TAR (Tape Archive), including POSIX/GNU/pax format variants, compression pairing (gzip, xz, zstd), Unix metadata preservation, streaming archive creation, incremental backups, and Docker layer internals.

## Key Points

- **Extensions:** `.tar`, `.tar.gz`/`.tgz`, `.tar.bz2`/`.tbz2`, `.tar.xz`/`.txz`, `.tar.zst`
- **MIME type:** `application/x-tar`
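- **Magic bytes:** `ustar` at offset 257 (followed by NUL and version `00` in POSIX archives, or by two spaces and NUL in old GNU archives)
- **Max filename:** 255 bytes in ustar (100-byte name plus 155-byte prefix); effectively unlimited with GNU `././@LongLink` entries or pax `path` records
- **Max file size:** 8 GiB with the original 11-digit octal size field; far larger with GNU base-256 sizes or pax `size` records (about 8 EiB in practice with 64-bit offsets)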
- **Block size:** 512 bytes (derived from tape block size)
- **V7** — Original 1979 format, 100-char filename limit
- **POSIX.1-1988 (ustar)** — Extended headers, 255-char paths
- **GNU tar** — Long filenames, sparse files, incremental backups
- **POSIX.1-2001 (pax)** — Extended attributes via pax headers, UTF-8, arbitrary metadata
- **Linux/Unix distribution:** Source code tarballs (`.tar.gz`, `.tar.xz`)
- **System backups:** Full filesystem backup preserving all metadata

## Quick Example

```
[Header Block (512 bytes)] [File Data (padded to 512-byte boundary)]
[Header Block (512 bytes)] [File Data (padded to 512-byte boundary)]
...
[Two 512-byte blocks of zeros (end of archive marker)]
```

```bash
# Full backup (creates snapshot file)
tar czf backup-full.tar.gz --listed-incremental=snapshot.snar /data/

# Incremental backup (only changed files)
tar czf backup-inc1.tar.gz --listed-incremental=snapshot.snar /data/
```

TAR Tape Archive (.tar)

Overview

TAR (Tape ARchive) is a Unix file archival format dating back to 1979, originally designed for writing data to sequential tape drives. Unlike ZIP, TAR is purely an archival format — it bundles files into a single stream without compression. In practice, TAR is almost always paired with a compression utility (gzip, bzip2, xz, or zstandard) to create .tar.gz, .tar.bz2, .tar.xz, or .tar.zst files.

TAR is the dominant archive format in Unix/Linux ecosystems because it faithfully preserves file permissions, ownership, symbolic links, and other filesystem metadata that ZIP cannot reliably represent.

Core Philosophy

tar (tape archive) is an archiver, not a compressor. This distinction is fundamental: tar bundles multiple files and directories into a single file while preserving Unix file permissions, ownership, symlinks, and directory structure. It does not reduce file size — a tar archive is roughly the same size as the sum of its contents. Compression is handled by a separate tool (gzip, bzip2, xz, zstd) applied on top of the tar archive.

This separation of archiving and compression reflects the Unix philosophy of composable, single-purpose tools. tar handles the file bundling; gzip/xz/zstd handles the compression. The resulting .tar.gz, .tar.xz, or .tar.zst files are the standard archive format for source code distribution, system backups, and data packaging across the Unix and Linux ecosystem.
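
The layering described above can be sketched in Python. This is a minimal illustration, assuming in-memory files; the helper name `make_plain_tar` and the sample file names are hypothetical. The archive is built first with no compression, then gzip is applied as a separate step, and the result is read back as an ordinary .tar.gz.

```python
import gzip
import io
import tarfile


def make_plain_tar(files: dict) -> bytes:
    """Bundle in-memory files into an uncompressed tar byte string."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical sample content, highly repetitive so it compresses well.
files = {"a.txt": b"hello\n" * 1000, "b.txt": b"world\n" * 1000}

plain = make_plain_tar(files)       # archiving step: bundles, does not shrink
compressed = gzip.compress(plain)   # compression step: a separate layer on top

# The layered result is a normal .tar.gz that tarfile can read back.
with tarfile.open(fileobj=io.BytesIO(compressed), mode="r:gz") as tf:
    names = tf.getnames()

print(len(plain), len(compressed), names)
```

The plain archive is slightly larger than the 12,000 bytes of content because of headers and block padding; only the gzip step reduces the size.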

tar preserves Unix filesystem metadata that ZIP cannot: symbolic links, hard links, file permissions (including setuid/setgid), ownership (uid/gid), and extended attributes. When archiving data that must maintain its Unix filesystem semantics — server backups, deployment packages, system migrations — tar is the correct tool. When archiving for cross-platform sharing or distribution to non-Unix users, ZIP is more accessible.

Technical Specifications

  • Extensions: .tar, .tar.gz/.tgz, .tar.bz2/.tbz2, .tar.xz/.txz, .tar.zst
  • MIME type: application/x-tar
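  • Magic bytes: ustar at offset 257 (followed by NUL and version 00 in POSIX archives, or by two spaces and NUL in old GNU archives)
  • Max filename: 255 bytes in ustar (100-byte name plus 155-byte prefix); effectively unlimited with GNU ././@LongLink entries or pax path records
  • Max file size: 8 GiB with the original 11-digit octal size field; far larger with GNU base-256 sizes or pax size records (about 8 EiB in practice with 64-bit offsets)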
  • Block size: 512 bytes (derived from tape block size)
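
Two of the figures above, the magic at offset 257 and the 512-byte block size, can be checked with a few lines of Python. This is a sketch under assumptions: `has_ustar_magic` and `padded_size` are hypothetical helper names, and the sample archive is generated in memory with Python's `tarfile` in ustar format.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def has_ustar_magic(data: bytes) -> bool:
    """True if the first header carries the 'ustar' magic at offset 257."""
    return data[257:262] == b"ustar"


def padded_size(size: int) -> int:
    """Bytes a member's data occupies once rounded up to whole blocks."""
    return -(-size // BLOCK) * BLOCK


# Build a one-member archive in memory to inspect.
payload = b"x" * 700
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("sample.bin")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

print(has_ustar_magic(archive), padded_size(len(payload)))
```

The 700-byte payload starts right after the 512-byte header and is padded with NUL bytes up to the next block boundary.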

Internal Structure

[Header Block (512 bytes)] [File Data (padded to 512-byte boundary)]
[Header Block (512 bytes)] [File Data (padded to 512-byte boundary)]
...
[Two 512-byte blocks of zeros (end of archive marker)]

Each header contains: filename, permissions (mode), owner/group IDs, file size, modification time, checksum, file type (regular, directory, symlink, device, etc.), and link target.
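
The header fields listed above sit at fixed offsets, stored mostly as NUL-terminated octal text. The following is a rough sketch of a parser for the classic fields, not a full implementation: `parse_header` is a hypothetical helper, and it deliberately ignores GNU base-256 numbers, the ustar prefix field, and pax extended records. The checksum is the byte sum of the header with the checksum field itself counted as eight spaces.

```python
import io
import tarfile


def parse_header(block: bytes) -> dict:
    """Decode the classic fields of one 512-byte tar header block."""
    if len(block) != 512:
        raise ValueError("a tar header is exactly 512 bytes")

    def text(lo, hi):
        # Text fields are NUL-terminated (or fill the whole field).
        return block[lo:hi].split(b"\0", 1)[0].decode("utf-8", "replace")

    def octal(lo, hi):
        # Numeric fields are octal ASCII padded with spaces and/or NULs.
        raw = block[lo:hi].strip(b" \0")
        return int(raw.decode("ascii"), 8) if raw else 0

    # Checksum: sum of all header bytes, with bytes 148..155 taken as spaces.
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:])
    return {
        "name": text(0, 100),
        "mode": octal(100, 108),
        "uid": octal(108, 116),
        "gid": octal(116, 124),
        "size": octal(124, 136),
        "mtime": octal(136, 148),
        "typeflag": text(156, 157),
        "linkname": text(157, 257),
        "checksum_ok": octal(148, 156) == computed,
    }


# Generate a small archive in memory and parse its first header.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("hello.txt")
    info.size = 9
    tf.addfile(info, io.BytesIO(b"hi there\n"))
header = parse_header(buf.getvalue()[:512])
print(header)
```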

Format Variants

  • V7 — Original 1979 format, 100-char filename limit
  • POSIX.1-1988 (ustar) — Extended headers, 255-char paths
  • GNU tar — Long filenames, sparse files, incremental backups
  • POSIX.1-2001 (pax) — Extended attributes via pax headers, UTF-8, arbitrary metadata
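
The practical difference between the variants shows up with long names. A small sketch using the format constants from Python's `tarfile` module; the helper `try_format` and the 300-character name are illustrative assumptions. A single path component longer than the ustar limits is rejected in ustar format but round-trips in GNU and pax formats.

```python
import io
import tarfile

# One path component far longer than ustar's name and prefix fields allow.
long_name = "d" * 300 + ".txt"


def try_format(fmt: int) -> str:
    """Write long_name in the given format; return the stored name or the error."""
    buf = io.BytesIO()
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
            tf.addfile(tarfile.TarInfo(name=long_name))
    except ValueError as exc:
        return f"rejected: {exc}"
    buf.seek(0)
    with tarfile.open(fileobj=buf) as tf:
        return tf.getnames()[0]


for label, fmt in [("ustar", tarfile.USTAR_FORMAT),
                   ("gnu", tarfile.GNU_FORMAT),
                   ("pax", tarfile.PAX_FORMAT)]:
    print(label, try_format(fmt)[:40])
```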

How to Work With It

Creating TAR Archives

# Plain tar (no compression)
tar cf archive.tar folder/

# With gzip compression
tar czf archive.tar.gz folder/

# With bzip2 compression
tar cjf archive.tar.bz2 folder/

# With xz compression (usually the smallest output, but slow to compress)
tar cJf archive.tar.xz folder/

# With zstandard (good speed/ratio balance; needs GNU tar 1.31+ or a bsdtar built with zstd support)
tar --zstd -cf archive.tar.zst folder/

# Exclude patterns
tar czf archive.tar.gz --exclude='*.log' --exclude='node_modules' folder/

# Preserve everything (GNU tar flags; restoring ownership on extraction requires root)
tar cpzf backup.tar.gz --acls --xattrs --selinux /important/data/
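
A rough Python counterpart to the gzip and --exclude commands above, for scripts that cannot shell out. This is a sketch: `make_targz` is a hypothetical helper, and its pattern matching (shell-style globs against each member's base name) is a simplification that does not reproduce every nuance of tar's --exclude.

```python
import fnmatch
import posixpath
import tarfile
from pathlib import Path


def make_targz(src_dir, out_path, exclude=("*.log", "node_modules")):
    """Create a gzip-compressed tar of src_dir, skipping excluded names."""

    def skip_excluded(info: tarfile.TarInfo):
        base = posixpath.basename(info.name)
        if any(fnmatch.fnmatch(base, pattern) for pattern in exclude):
            return None  # None drops the member; a dropped directory is not descended into
        return info

    with tarfile.open(out_path, "w:gz") as tf:
        tf.add(src_dir, arcname=Path(src_dir).name, filter=skip_excluded)
```

Calling `make_targz("folder", "archive.tar.gz")` would store members under `folder/`, mirroring the shell examples.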

Extracting

tar xf archive.tar.gz                    # auto-detects compression
tar xf archive.tar.gz -C /target/dir/    # extract to specific directory
tar tf archive.tar.gz                     # list contents
tar xf archive.tar.gz specific/file.txt  # extract single file

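# Python (the filter argument needs Python 3.12+ or a patched security release of an older version)
import tarfile
with tarfile.open('archive.tar.gz', 'r:gz') as tf:
    # 'data' blocks members that would land outside the target directory, plus device files
    tf.extractall('/target/dir', filter='data')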
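
The listing and single-file commands above also have `tarfile` counterparts that never write to disk. A minimal sketch, with `list_members` and `read_member` as hypothetical helper names; mode `r:*` lets `tarfile` detect the compression itself.

```python
import tarfile


def list_members(archive_path):
    """Return member names, like `tar tf`."""
    with tarfile.open(archive_path, "r:*") as tf:
        return tf.getnames()


def read_member(archive_path, member):
    """Return the bytes of one regular file without extracting anything."""
    with tarfile.open(archive_path, "r:*") as tf:
        fileobj = tf.extractfile(member)  # raises KeyError if the name is absent
        if fileobj is None:
            raise ValueError(f"{member!r} is not a regular file")
        return fileobj.read()
```

Because tar has no index, both helpers still scan the archive sequentially; they only avoid touching the filesystem.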

Incremental Backups

# Full backup (creates snapshot file)
tar czf backup-full.tar.gz --listed-incremental=snapshot.snar /data/

# Incremental backup (only changed files)
tar czf backup-inc1.tar.gz --listed-incremental=snapshot.snar /data/
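
The idea behind an incremental run, archiving only what changed since the previous run, can be illustrated with a much simpler mechanism than GNU tar's snapshot file. The sketch below is an assumption-laden stand-in: `archive_changed_since` is a hypothetical helper that compares modification times against a caller-supplied timestamp. It does not read or write the .snar format and, unlike --listed-incremental, does not track deleted or renamed files.

```python
import os
import tarfile
from pathlib import Path


def archive_changed_since(src_dir, out_path, since: float) -> list:
    """Archive regular files under src_dir modified after `since` (epoch seconds)."""
    src = Path(src_dir)
    added = []
    with tarfile.open(out_path, "w:gz") as tf:
        for path in sorted(src.rglob("*")):
            if path.is_file() and path.stat().st_mtime > since:
                # Store paths relative to the parent so names start with the directory name.
                arcname = path.relative_to(src.parent).as_posix()
                tf.add(path, arcname=arcname)
                added.append(arcname)
    return added
```

A caller would record the start time of each run and pass it as `since` on the next one; the output archive should live outside `src_dir` so it is not picked up by the scan.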

Appending and Updating

tar rf archive.tar newfile.txt       # append to uncompressed tar
tar uf archive.tar modified.txt      # update if newer
# Note: cannot append to compressed tar archives
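
Appending works the same way from Python, with the same restriction: `tarfile` offers an append mode only for uncompressed archives. A minimal sketch, where `append_bytes` is a hypothetical helper that adds one in-memory file to an existing plain .tar.

```python
import io
import tarfile


def append_bytes(tar_path, name, data: bytes):
    """Append one in-memory file to an existing uncompressed tar archive."""
    # Mode "a" positions the writer just before the end-of-archive zero blocks.
    with tarfile.open(tar_path, "a") as tf:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
```

To add files to a .tar.gz, the usual approach is to decompress, append, and recompress, which is the full rewrite the note above refers to.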

Common Use Cases

  • Linux/Unix distribution: Source code tarballs (.tar.gz, .tar.xz)
  • System backups: Full filesystem backup preserving all metadata
  • Docker images: Container layers are tar archives
  • Package building: Source packages for dpkg, RPM, and Homebrew
  • Data transfer: Moving directory trees between Unix systems
  • Streaming pipelines: tar cf - dir/ | ssh remote 'tar xf -' for network transfer
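
The streaming use case relies on tar never needing to seek. Python's `tarfile` exposes this through its pipe-style modes (`w|gz`, `r|gz`). The sketch below uses two hypothetical stand-in classes, `WriteOnly` and `ReadOnly`, that offer only write() or read(), in place of a real pipe or socket; `send` and `receive` are likewise illustrative names.

```python
import io
import tarfile


class WriteOnly:
    """Stand-in for a pipe or socket: supports write() only, no seek()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


class ReadOnly:
    """Stand-in for the receiving end: supports read() only."""

    def __init__(self, data: bytes):
        self._src = io.BytesIO(data)

    def read(self, n=-1):
        return self._src.read(n)


def send(files: dict, sink) -> None:
    """Write files as a gzip-compressed tar stream to a non-seekable sink."""
    with tarfile.open(fileobj=sink, mode="w|gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))


def receive(source) -> dict:
    """Read members in order from a non-seekable gzip-compressed tar stream."""
    out = {}
    with tarfile.open(fileobj=source, mode="r|gz") as tf:
        for member in tf:  # stream mode: members must be consumed in order
            if member.isfile():
                out[member.name] = tf.extractfile(member).read()
    return out


sink = WriteOnly()
send({"a.txt": b"alpha", "dir/b.txt": b"beta"}, sink)
received = receive(ReadOnly(b"".join(sink.chunks)))
print(sorted(received))
```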

Pros & Cons

Pros

  • Preserves Unix permissions, ownership, timestamps, symlinks, hard links, devices
  • Streaming format — can create/extract without seeking (works with pipes)
  • Separation of archival and compression allows choosing the best compressor
  • Compression applies to the entire stream (solid compression by default)
  • Ubiquitous on Unix/Linux: installed by default on virtually every general-purpose system
  • Supports incremental backups natively

Cons

  • No random access — must scan sequentially to find a specific file
  • Cannot update or delete files in a compressed tar without full rewrite
  • No built-in encryption (must use GPG or openssl externally)
  • No error recovery or checksums beyond basic header checksum
  • Limited native Windows support: a built-in tar is available only since Windows 10 1803; older versions need third-party tools or WSL
  • Filename encoding historically inconsistent (pax format fixes this)

Compatibility

| Platform | Native Support | Notes |
|----------|----------------|-------|
| Linux | Yes (tar, gtar) | Pre-installed on virtually all distributions |
| macOS | Yes (bsdtar) | Pre-installed, uses libarchive-based tar |
| Windows | Partial | tar available since Windows 10 1803, or use 7-Zip/WSL |
| BSD | Yes (bsdtar) | Pre-installed |

Programming languages: Python (tarfile in stdlib), Node.js (tar, tar-stream), Go (archive/tar in stdlib), Java (Apache Commons Compress), Rust (tar crate), C (libarchive).

Related Formats

  • ZIP — Cross-platform but doesn't preserve Unix metadata well
  • cpio — Older Unix archive format, used internally by RPM
  • ar — Simple Unix archive, used for .deb packages and static libraries
  • pax — POSIX.1-2001 extended tar, preferred modern format
  • shar — Shell archive, self-extracting via shell script (largely obsolete)

Practical Usage

  • Use tar --zstd (Zstandard) for a strong compression-speed tradeoff in modern workflows -- it is significantly faster than gzip at comparable ratios and decompresses much faster than xz, although xz at high levels usually still produces the smallest files.
  • Use tar czf archive.tar.gz --exclude='node_modules' --exclude='.git' to exclude unnecessary directories and keep archive sizes manageable.
  • Pipe tar directly over SSH for fast network transfers: tar czf - /data | ssh remote 'tar xzf - -C /target' -- this avoids creating intermediate files.
  • Use incremental backups (--listed-incremental) for regular backup schedules -- only changed files are archived after the initial full backup.
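  • Always extract tarballs from untrusted sources into a dedicated directory (tar xf archive.tar.gz -C ./untrusted/) to contain the damage from unexpected contents. Current GNU tar and bsdtar strip leading / and refuse ../ members by default, but other or older implementations may not, so the dedicated directory alone does not rule out path traversal.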
  • Use the pax format (--format=posix) for archives that need long filenames, extended attributes, or UTF-8 path names reliably.
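
The advice on untrusted archives can be backed by a programmatic pre-check. The following is a conservative sketch, not a complete defense: `unsafe_members` is a hypothetical helper that flags absolute names, `..` path components, and links whose targets are absolute or contain `..`. It may flag some harmless relative symlinks, and it complements rather than replaces the extraction safeguards in tar itself or in `tarfile` extraction filters.

```python
import tarfile


def unsafe_members(tf: tarfile.TarFile) -> list:
    """Return names of members that look like path-traversal attempts."""

    def escapes(path: str) -> bool:
        # Absolute paths and any '..' component are treated as suspicious.
        return path.startswith("/") or ".." in path.split("/")

    flagged = []
    for member in tf.getmembers():
        if escapes(member.name):
            flagged.append(member.name)
        elif (member.issym() or member.islnk()) and escapes(member.linkname):
            flagged.append(member.name)
    return flagged
```

A caller would open the archive, refuse to extract if the returned list is non-empty, and otherwise extract into a dedicated directory.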

Anti-Patterns

  • Extracting tarballs without inspecting contents first -- Always run tar tf archive.tar.gz before extracting to check for absolute paths, ../ traversal, or unexpected file counts that could indicate a malicious archive.
  • Trying to append files to a compressed tar -- You cannot append to .tar.gz, .tar.xz, or other compressed tars without full rewrite; only uncompressed .tar supports appending.
  • Using tar without compression and expecting small files -- Tar is purely an archival format with no compression; always pair it with gzip, xz, or zstd for actual size reduction.
  • Assuming tar preserves metadata on non-Unix filesystems -- Extracting tar archives on FAT32, NTFS, or other filesystems may silently drop Unix permissions, symlinks, and ownership information.
  • Using gzip when zstd is available -- Zstandard provides better compression ratios and dramatically faster decompression than gzip; prefer tar.zst for new archives where compatibility allows.
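
The point about plain tar not shrinking anything is easy to measure. This sketch, with the hypothetical helper `archive_sizes`, writes the same in-memory files as plain tar, tar.gz, and tar.xz and reports the resulting sizes. Zstandard is left out only because this example sticks to compressors available in `tarfile` across current Python versions; the sample data is artificial and highly repetitive, so real ratios will differ.

```python
import io
import tarfile


def archive_sizes(files: dict) -> dict:
    """Return the archive size in bytes for each container/compressor pairing."""
    sizes = {}
    for label, mode in [("tar", "w"), ("tar.gz", "w:gz"), ("tar.xz", "w:xz")]:
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode=mode) as tf:
            for name, data in files.items():
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tf.addfile(info, io.BytesIO(data))
        sizes[label] = len(buf.getvalue())
    return sizes


# Hypothetical sample: 200 KB of repetitive log-like text.
sample = {"app.log": b"2024-01-01 INFO request served\n" * 6400}
sizes = archive_sizes(sample)
print(sizes)
```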
