# HDF5
Hierarchical Data Format version 5 — a binary file format for storing large-scale numerical data, widely used in scientific computing, machine learning, and simulation.
You are a file format specialist with deep expertise in the HDF5 (Hierarchical Data Format version 5) binary format. You understand the group/dataset/attribute data model, the rich type system (compound types, variable-length data, references), chunking strategies, compression filters (gzip, LZF, Blosc, Zstd), and parallel I/O via MPI. You can advise on creating, reading, and optimizing HDF5 files with h5py, PyTables, and the HDF5 C library for scientific computing, machine learning, and large-scale simulation workflows.
## Key Points
- **Groups**: Container objects (like directories). Each has a name and can contain datasets and sub-groups.
- **Datasets**: Multidimensional arrays of a homogeneous type. The primary data container.
- **Attributes**: Small metadata items attached to groups or datasets (key-value pairs).
- **Datatypes**: Rich type system — integers, floats, strings, compound types, enums, arrays, references.
- **Dataspaces**: Define the dimensionality and size of datasets (including unlimited dimensions).
- **Chunking**: Datasets can be divided into chunks for partial I/O and compression.
- **Compression**: Per-dataset compression via filters (gzip, LZF, SZIP, Blosc, Zstd).
## Data Types
- **Atomic**: Integer (1–8 bytes, signed/unsigned), float (16/32/64-bit), string (fixed/variable length).
- **Compound**: Structs with named fields of different types (like a C struct or database row).
- **Array**: Fixed-size array type as a single element.
- **Enum**: Named integer values.
- **Variable-length**: Ragged arrays, variable-length strings.
## Quick Example
```
Dataset (1000 x 1000 x 100 floats)
├── Chunk [0:250, 0:250, 0:100] → compressed block on disk
├── Chunk [0:250, 250:500, 0:100] → compressed block on disk
├── ...
└── Chunk [750:1000, 750:1000, 0:100] → compressed block on disk
```
```python
# HDF5 is the legacy format for Keras model weights
from tensorflow import keras

# `model` is an existing keras.Model
model.save("model.h5")  # save full model
model.save_weights("weights.h5") # save weights only
loaded_model = keras.models.load_model("model.h5")
```
# HDF5 — Hierarchical Data Format
## Overview
HDF5 (Hierarchical Data Format version 5) is a binary file format and library developed by The HDF Group for storing and managing large, complex datasets. Originally created at the National Center for Supercomputing Applications (NCSA) in the 1990s, HDF5 is the standard format in scientific computing, satellite/climate data, genomics, particle physics, and increasingly in machine learning (Keras model weights). A single HDF5 file can contain datasets ranging from kilobytes to petabytes, organized in a filesystem-like hierarchy.
## Core Philosophy
HDF5 (Hierarchical Data Format version 5) is designed to store and organize large, complex scientific datasets. Its philosophy is that scientific data is inherently hierarchical and multidimensional, and the file format should reflect this structure. An HDF5 file is a self-contained database with a filesystem-like hierarchy of groups and datasets, capable of storing everything from single scalars to petabyte-scale arrays with rich metadata at every level.
HDF5 excels at storing dense numerical arrays — the kind of data produced by scientific instruments, simulations, and machine learning pipelines. Its chunked storage and built-in compression enable efficient partial I/O: you can read a slice of a terabyte-scale dataset without loading the entire file. This makes HDF5 practical for datasets that far exceed available memory.
Use HDF5 for scientific computing, numerical simulations, satellite imagery, genomics data, and any workflow that produces large multidimensional arrays with associated metadata. For tabular analytical data, Parquet is more appropriate. For small datasets or web API interchange, JSON or CSV are simpler. HDF5's strength is its ability to handle the scale, dimensionality, and organizational complexity of scientific data that simpler formats cannot express.
## Technical Specifications
### File Structure
HDF5 files organize data like a filesystem with groups (directories) and datasets (files):
```
/                           # Root group
├── metadata                # Group
│   ├── experiment_name     # Dataset (scalar string)
│   ├── timestamp           # Dataset (scalar)
│   └── parameters          # Dataset (compound type)
├── simulation              # Group
│   ├── timestep_0001       # Dataset (3D array)
│   ├── timestep_0002       # Dataset (3D array)
│   └── mesh                # Group
│       ├── vertices        # Dataset (2D array)
│       └── connectivity    # Dataset (2D array)
└── results                 # Group
    ├── temperature         # Dataset (4D array, chunked, compressed)
    └── pressure            # Dataset (4D array, chunked, compressed)
```
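A hierarchy like the one above can be built in a few lines of h5py. This is a minimal sketch with shrunken, illustrative shapes (file name, dataset sizes, and values are assumptions, not part of any real experiment):

```python
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    meta = f.create_group("metadata")
    meta.create_dataset("experiment_name", data="demo_run")  # scalar string
    meta.create_dataset("timestamp", data=1705334400.0)      # scalar

    sim = f.create_group("simulation")
    sim.create_dataset("timestep_0001", data=np.zeros((4, 4, 4)))
    mesh = sim.create_group("mesh")
    mesh.create_dataset("vertices", data=np.zeros((8, 3)))

    results = f.create_group("results")
    results.create_dataset("temperature", data=np.zeros((2, 4, 4, 4)),
                           chunks=(1, 4, 4, 4), compression="gzip")

# Objects are addressed with filesystem-style paths
with h5py.File("experiment.h5", "r") as f:
    print(f["simulation/mesh/vertices"].shape)  # (8, 3)
    print(f["results/temperature"].chunks)      # (1, 4, 4, 4)
```

Note that groups nest arbitrarily deep, and any group or dataset along a path can carry its own attributes.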
### Core Concepts
- **Groups**: Container objects (like directories). Each has a name and can contain datasets and sub-groups.
- **Datasets**: Multidimensional arrays of a homogeneous type. The primary data container.
- **Attributes**: Small metadata items attached to groups or datasets (key-value pairs).
- **Datatypes**: Rich type system — integers, floats, strings, compound types, enums, arrays, references.
- **Dataspaces**: Define the dimensionality and size of datasets (including unlimited dimensions).
- **Chunking**: Datasets can be divided into chunks for partial I/O and compression.
- **Compression**: Per-dataset compression via filters (gzip, LZF, SZIP, Blosc, Zstd).
### Data Types
- **Atomic**: Integer (1–8 bytes, signed/unsigned), float (16/32/64-bit), string (fixed/variable length).
- **Compound**: Structs with named fields of different types (like a C struct or database row).
- **Array**: Fixed-size array type as a single element.
- **Enum**: Named integer values.
- **Variable-length**: Ragged arrays, variable-length strings.
- **Reference**: Pointers to other objects or dataset regions within the file.
- **Opaque**: Raw bytes with no interpretation.
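In h5py, a compound type maps directly onto a NumPy structured dtype, and variable-length strings use a dedicated string dtype. A small sketch (field names and values are illustrative):

```python
import h5py
import numpy as np

# Compound type: a struct-like record with named, mixed-type fields
reading = np.dtype([("timestamp", "f8"),
                    ("sensor_id", "i4"),
                    ("value", "f4")])

data = np.array([(1705334400.0, 42, 23.5),
                 (1705334460.0, 42, 23.7)], dtype=reading)

with h5py.File("readings.h5", "w") as f:
    f.create_dataset("readings", data=data)
    # Variable-length strings need a special dtype, not fixed-size char arrays
    f.create_dataset("labels", data=["probe-a", "probe-b"],
                     dtype=h5py.string_dtype())

with h5py.File("readings.h5", "r") as f:
    rows = f["readings"][()]
    print(rows["value"])  # per-field access, like a NumPy structured array
```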
### Chunking and Compression
```
Dataset (1000 x 1000 x 100 floats)
├── Chunk [0:250, 0:250, 0:100]       → compressed block on disk
├── Chunk [0:250, 250:500, 0:100]     → compressed block on disk
├── ...
└── Chunk [750:1000, 750:1000, 0:100] → compressed block on disk
```
Chunking enables:
- Reading subsets without loading entire dataset.
- Per-chunk compression.
- Extensible datasets (unlimited dimensions).
- Parallel I/O in MPI environments.
## How to Work With It
### Python (h5py)
```python
import h5py
import numpy as np

# Write
with h5py.File("data.h5", "w") as f:
    # Create groups
    sim = f.create_group("simulation")

    # Create datasets
    data = np.random.randn(1000, 1000)
    ds = sim.create_dataset("temperature", data=data,
                            chunks=(100, 100),    # chunk shape
                            compression="gzip",   # or "lzf" for speed
                            compression_opts=4,   # compression level (1-9)
                            shuffle=True,         # byte shuffle filter (improves compression)
                            fletcher32=True)      # checksum

    # Attributes (metadata)
    ds.attrs["units"] = "Kelvin"
    ds.attrs["description"] = "Surface temperature field"
    f.attrs["experiment"] = "climate_sim_2025"

    # Resizable dataset
    maxshape = (None, 100)  # unlimited first dimension
    ds2 = f.create_dataset("timeseries", shape=(0, 100),
                           maxshape=maxshape, dtype="float32")
    ds2.resize(1000, axis=0)  # grow later

# Read
with h5py.File("data.h5", "r") as f:
    # Lazy loading — no data read until sliced
    ds = f["simulation/temperature"]
    print(ds.shape, ds.dtype)  # (1000, 1000) float64

    # Partial read (only loads the requested slice)
    subset = ds[100:200, 300:400]  # 100x100 array

    # Read all
    full = ds[()]  # or ds[...]

    # Iterate groups
    def visitor(name, obj):
        print(name, type(obj))
    f.visititems(visitor)
```
### Python (PyTables)
```python
import tables

# PyTables offers higher-level abstractions and query support
with tables.open_file("data.h5", "w") as f:
    group = f.create_group("/", "experiment")
    table = f.create_table(group, "readings", {
        "timestamp": tables.Float64Col(),
        "sensor_id": tables.Int32Col(),
        "value": tables.Float32Col(),
    })
    row = table.row
    row["timestamp"] = 1705334400.0
    row["sensor_id"] = 42
    row["value"] = 23.5
    row.append()
    table.flush()

# Query — materialize results while the file is open
# (where() returns an iterator that is invalid after the file closes)
with tables.open_file("data.h5", "r") as f:
    table = f.root.experiment.readings
    results = [r["value"] for r in table.where("(sensor_id == 42) & (value > 20)")]
```
### Command-Line Tools
```shell
# HDF5 tools (install: apt install hdf5-tools / brew install hdf5)
h5dump data.h5                            # dump entire file contents
h5dump -H data.h5                         # header/structure only
h5dump -d /simulation/temperature data.h5 # specific dataset
h5ls data.h5                              # list contents
h5ls -r data.h5                           # recursive listing
h5stat data.h5                            # file statistics
h5diff file1.h5 file2.h5                  # compare files
h5repack -f GZIP=6 in.h5 out.h5           # recompress
```
### Keras/TensorFlow
```python
# HDF5 is the legacy format for Keras model weights
# (newer Keras versions default to the native .keras format)
from tensorflow import keras

# `model` is an existing keras.Model
model.save("model.h5")            # save full model
model.save_weights("weights.h5")  # save weights only
loaded_model = keras.models.load_model("model.h5")
```
## Common Use Cases
- Scientific computing: Climate models (NetCDF-4 is built on HDF5), astronomy (FITS alternative), genomics.
- Particle physics: detector and analysis pipelines store data in HDF5 (often alongside CERN's ROOT format).
- Satellite/remote sensing: NASA Earth Observing System (EOS) uses HDF-EOS.
- Machine learning: Keras model weights, training data storage.
- Financial modeling: Time series storage for quantitative analysis.
- Simulation: CFD, FEM, molecular dynamics output data.
- Medical imaging: DICOM alternatives, microscopy data.
## Pros & Cons
### Pros
- Handles massive datasets — tested up to exabyte scale.
- Partial I/O — read slices without loading entire datasets.
- Rich type system including compound types, variable-length data, and references.
- Built-in compression with multiple filter options.
- Self-describing — metadata, types, and structure stored in the file.
- Parallel I/O support via MPI (parallel HDF5).
- Mature and battle-tested — 25+ years of development.
- Single file contains complex hierarchical data.
### Cons
- Not human-readable — binary format requires tools to inspect.
- Single-writer limitation — no concurrent writes without parallel HDF5 (MPI).
- File corruption risk — not journaled like a database. Crashes during writes can corrupt.
- Complex C library — error handling and memory management are tricky at the C level.
- Not cloud-native — random access over HTTP/S3 requires specialized drivers (ros3, fsspec).
- Large overhead for small files — file structure adds minimum ~2KB.
- Not suitable for streaming or append-heavy workloads.
- Deleting data doesn't reclaim space (run `h5repack` to compact).
## Compatibility
| Language | Library |
|---|---|
| Python | h5py, PyTables, pandas |
| C/C++ | HDF5 C library (reference) |
| Java | HDF5 Java (JNI wrapper) |
| Julia | HDF5.jl |
| MATLAB | Built-in (h5read, h5write) |
| R | rhdf5, hdf5r |
| Fortran | HDF5 Fortran API |
| Rust | hdf5-rust |
MIME type: `application/x-hdf5`. File extensions: `.h5`, `.hdf5`, `.he5`.
## Related Formats
- NetCDF-4: Built on HDF5 — adds conventions for geoscience data.
- Zarr: Cloud-native chunked array format — HDF5 alternative for cloud storage.
- Apache Parquet: Columnar format for tabular analytics data.
- FITS: Astronomy standard — simpler than HDF5 but less flexible.
- TileDB: Array database — better for cloud-native multidimensional data.
- Arrow: In-memory columnar format — complements HDF5 for analytics.
- NumPy .npy/.npz: Simpler array storage for single arrays.
## Practical Usage
- **Chunking strategy for read patterns**: Choose chunk shapes that align with your most common access pattern. If you primarily read time slices of a 3D array `(time, x, y)`, chunk as `(1, x, y)` for fast time-series access, not `(time, 1, 1)`.
- **Compression filter selection**: Use `gzip` (`compression_opts=4`) for a good balance of speed and compression. Use `lzf` for fast read/write with moderate compression. Use `blosc` (via `hdf5plugin`) for the best speed-to-compression ratio on numerical data.
- **Lazy loading with h5py**: h5py datasets are not loaded into memory until sliced. Use `ds[100:200, :]` to read only the data you need. Avoid `ds[()]` or `ds[...]` unless you truly need the entire dataset in memory.
- **Appending to resizable datasets**: Create datasets with `maxshape=(None, ...)` for an unlimited first dimension, then use `ds.resize()` to grow as data arrives. This is essential for streaming or incremental data collection.
- **File compaction after deletions**: Deleting datasets from an HDF5 file does not reclaim disk space. Run `h5repack input.h5 output.h5` periodically to compact the file and reclaim freed space.
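The chunking, compression, and resizing advice above can be combined in one short h5py sketch (file name and array sizes are illustrative):

```python
import h5py
import numpy as np

with h5py.File("stream.h5", "w") as f:
    # Time-slice access pattern → chunk one full (x, y) plane per time step
    ds = f.create_dataset("frames",
                          shape=(0, 64, 64),
                          maxshape=(None, 64, 64),  # unlimited time axis
                          chunks=(1, 64, 64),       # one chunk per time slice
                          dtype="float32",
                          compression="gzip",
                          compression_opts=4,
                          shuffle=True)             # byte-shuffle before gzip
    # Append frames as they arrive (streaming / incremental collection)
    for t in range(10):
        ds.resize(t + 1, axis=0)
        ds[t] = np.full((64, 64), float(t), dtype="float32")

with h5py.File("stream.h5", "r") as f:
    # Reading one time slice touches exactly one compressed chunk
    frame = f["frames"][7]
    print(frame.shape, float(frame[0, 0]))  # (64, 64) 7.0
```

With the `(1, 64, 64)` chunk shape, a time-slice read decompresses a single chunk; chunking as `(10, 1, 1)` instead would force every chunk in the file to be read for the same query.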
## Anti-Patterns
- **Using HDF5 for concurrent write access without MPI**: The standard HDF5 library supports only single-writer access. Multiple processes writing to the same file simultaneously will corrupt it. Use parallel HDF5 with MPI, or serialize writes through a single process.
- **Storing small datasets without chunking then trying to compress**: Compression filters only apply to chunked datasets; contiguous layouts cannot be compressed. Always enable chunking (e.g. `chunks=True`) when using compression — h5py turns chunking on automatically when a compression filter is requested at creation time.
- **Opening HDF5 files over network filesystems (NFS/SMB) for write**: HDF5 uses file locking and low-level I/O that is unreliable on network filesystems. Write locally and copy to network storage, or use a cloud-optimized approach (fsspec, ros3 driver for S3).
- **Treating HDF5 as a database replacement**: HDF5 lacks transactions, journaling, indexing, and concurrent access. If your workload requires frequent updates, deletes, or multi-user access, use a proper database (PostgreSQL, TileDB, DuckDB) instead.
- **Ignoring the shuffle filter for numerical data**: The byte shuffle filter (`shuffle=True`) rearranges bytes to improve compression of numerical arrays significantly (often 20-40% better). Always enable shuffle alongside compression for numeric datasets.
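To verify that chunking and filters actually took effect, h5py exposes each dataset's storage layout as properties; a small sketch (file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("check.h5", "w") as f:
    # Requesting compression makes h5py pick a chunk shape automatically
    auto = f.create_dataset("auto", data=np.zeros((100, 100)),
                            compression="gzip", shuffle=True)
    # A plain contiguous dataset has no chunks and no filters
    plain = f.create_dataset("plain", data=np.zeros((100, 100)))

    print(auto.chunks, auto.compression, auto.shuffle)  # auto-chosen chunks, "gzip", True
    print(plain.chunks, plain.compression)              # None None
```

The same layout information is visible from the command line via `h5dump -pH check.h5`, which prints chunk dimensions and the filter pipeline per dataset.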