# HDF5
Hierarchical Data Format version 5 — a binary file format for storing large-scale numerical data, widely used in scientific computing, machine learning, and simulation.
You are a file format specialist with deep expertise in the HDF5 (Hierarchical Data Format version 5) binary format. You understand the group/dataset/attribute data model, the rich type system (compound types, variable-length data, references), chunking strategies, compression filters (gzip, LZF, Blosc, Zstd), and parallel I/O via MPI. You can advise on creating, reading, and optimizing HDF5 files with h5py, PyTables, and the HDF5 C library for scientific computing, machine learning, and large-scale simulation workflows.
## Key Points
- **Groups**: Container objects (like directories). Each has a name and can contain datasets and sub-groups.
- **Datasets**: Multidimensional arrays of a homogeneous type. The primary data container.
- **Attributes**: Small metadata items attached to groups or datasets (key-value pairs).
- **Datatypes**: Rich type system — integers, floats, strings, compound types, enums, arrays, references.
- **Dataspaces**: Define the dimensionality and size of datasets (including unlimited dimensions).
- **Chunking**: Datasets can be divided into chunks for partial I/O and compression.
- **Compression**: Per-dataset compression via filters (gzip, LZF, SZIP, Blosc, Zstd).
## Data Types
- **Atomic**: Integer (1–8 bytes, signed/unsigned), float (16/32/64-bit), string (fixed/variable length).
- **Compound**: Structs with named fields of different types (like a C struct or database row).
- **Array**: Fixed-size array type as a single element.
- **Enum**: Named integer values.
- **Variable-length**: Ragged arrays, variable-length strings.
## Quick Example
```
Dataset (1000 x 1000 x 100 floats)
├── Chunk [0:250, 0:250, 0:100] → compressed block on disk
├── Chunk [0:250, 250:500, 0:100] → compressed block on disk
├── ...
└── Chunk [750:1000, 750:1000, 0:100] → compressed block on disk
```
```python
# HDF5 is the legacy format for Keras model weights
from tensorflow import keras

# `model` is an existing keras.Model
model.save("model.h5")  # save full model
model.save_weights("weights.h5") # save weights only
loaded_model = keras.models.load_model("model.h5")
```
# HDF5 — Hierarchical Data Format
## Overview
HDF5 (Hierarchical Data Format version 5) is a binary file format and library developed by The HDF Group for storing and managing large, complex datasets. Originally created at the National Center for Supercomputing Applications (NCSA) in the 1990s, HDF5 is the standard format in scientific computing, satellite/climate data, genomics, particle physics, and increasingly in machine learning (Keras model weights). A single HDF5 file can contain datasets ranging from kilobytes to petabytes, organized in a filesystem-like hierarchy.
## Core Philosophy
HDF5 (Hierarchical Data Format version 5) is designed to store and organize large, complex scientific datasets. Its philosophy is that scientific data is inherently hierarchical and multidimensional, and the file format should reflect this structure. An HDF5 file is a self-contained database with a filesystem-like hierarchy of groups and datasets, capable of storing everything from single scalars to petabyte-scale arrays with rich metadata at every level.
HDF5 excels at storing dense numerical arrays — the kind of data produced by scientific instruments, simulations, and machine learning pipelines. Its chunked storage and built-in compression enable efficient partial I/O: you can read a slice of a terabyte-scale dataset without loading the entire file. This makes HDF5 practical for datasets that far exceed available memory.
Use HDF5 for scientific computing, numerical simulations, satellite imagery, genomics data, and any workflow that produces large multidimensional arrays with associated metadata. For tabular analytical data, Parquet is more appropriate. For small datasets or web API interchange, JSON or CSV are simpler. HDF5's strength is its ability to handle the scale, dimensionality, and organizational complexity of scientific data that simpler formats cannot express.
## Technical Specifications
### File Structure
HDF5 files organize data like a filesystem with groups (directories) and datasets (files):
```
/                           # Root group
├── metadata                # Group
│   ├── experiment_name     # Dataset (scalar string)
│   ├── timestamp           # Dataset (scalar)
│   └── parameters          # Dataset (compound type)
├── simulation              # Group
│   ├── timestep_0001       # Dataset (3D array)
│   ├── timestep_0002       # Dataset (3D array)
│   └── mesh                # Group
│       ├── vertices        # Dataset (2D array)
│       └── connectivity    # Dataset (2D array)
└── results                 # Group
    ├── temperature         # Dataset (4D array, chunked, compressed)
    └── pressure            # Dataset (4D array, chunked, compressed)
```
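A hierarchy like the one above can be built in a few lines of h5py. This is a minimal sketch with shrunken, illustrative shapes (file name, dataset sizes, and values are assumptions, not part of any real experiment):

```python
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    meta = f.create_group("metadata")
    meta.create_dataset("experiment_name", data="demo_run")  # scalar string
    meta.create_dataset("timestamp", data=1705334400.0)      # scalar

    sim = f.create_group("simulation")
    sim.create_dataset("timestep_0001", data=np.zeros((4, 4, 4)))
    mesh = sim.create_group("mesh")
    mesh.create_dataset("vertices", data=np.zeros((8, 3)))

    results = f.create_group("results")
    results.create_dataset("temperature", data=np.zeros((2, 4, 4, 4)),
                           chunks=(1, 4, 4, 4), compression="gzip")

# Objects are addressed with filesystem-style paths
with h5py.File("experiment.h5", "r") as f:
    print(f["simulation/mesh/vertices"].shape)  # (8, 3)
    print(f["results/temperature"].chunks)      # (1, 4, 4, 4)
```

Note that groups nest arbitrarily deep, and any group or dataset along a path can carry its own attributes.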
### Core Concepts
- **Groups**: Container objects (like directories). Each has a name and can contain datasets and sub-groups.
- **Datasets**: Multidimensional arrays of a homogeneous type. The primary data container.
- **Attributes**: Small metadata items attached to groups or datasets (key-value pairs).
- **Datatypes**: Rich type system — integers, floats, strings, compound types, enums, arrays, references.
- **Dataspaces**: Define the dimensionality and size of datasets (including unlimited dimensions).
- **Chunking**: Datasets can be divided into chunks for partial I/O and compression.
- **Compression**: Per-dataset compression via filters (gzip, LZF, SZIP, Blosc, Zstd).
### Data Types
- **Atomic**: Integer (1–8 bytes, signed/unsigned), float (16/32/64-bit), string (fixed/variable length).
- **Compound**: Structs with named fields of different types (like a C struct or database row).
- **Array**: Fixed-size array type as a single element.
- **Enum**: Named integer values.
- **Variable-length**: Ragged arrays, variable-length strings.
- **Reference**: Pointers to other objects or dataset regions within the file.
- **Opaque**: Raw bytes with no interpretation.
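In h5py, a compound type maps directly onto a NumPy structured dtype, and variable-length strings use a dedicated string dtype. A small sketch (field names and values are illustrative):

```python
import h5py
import numpy as np

# Compound type: a struct-like record with named, mixed-type fields
reading = np.dtype([("timestamp", "f8"),
                    ("sensor_id", "i4"),
                    ("value", "f4")])

data = np.array([(1705334400.0, 42, 23.5),
                 (1705334460.0, 42, 23.7)], dtype=reading)

with h5py.File("readings.h5", "w") as f:
    f.create_dataset("readings", data=data)
    # Variable-length strings need a special dtype, not fixed-size char arrays
    f.create_dataset("labels", data=["probe-a", "probe-b"],
                     dtype=h5py.string_dtype())

with h5py.File("readings.h5", "r") as f:
    rows = f["readings"][()]
    print(rows["value"])  # per-field access, like a NumPy structured array
```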
### Chunking and Compression
```
Dataset (1000 x 1000 x 100 floats)
├── Chunk [0:250, 0:250, 0:100]       → compressed block on disk
├── Chunk [0:250, 250:500, 0:100]     → compressed block on disk
├── ...
└── Chunk [750:1000, 750:1000, 0:100] → compressed block on disk
```
Chunking enables:
- Reading subsets without loading entire dataset.
- Per-chunk compression.
- Extensible datasets (unlimited dimensions).
- Parallel I/O in MPI environments.
## How to Work With It
### Python (h5py)
```python
import h5py
import numpy as np

# Write
with h5py.File("data.h5", "w") as f:
    # Create groups
    sim = f.create_group("simulation")

    # Create datasets
    data = np.random.randn(1000, 1000)
    ds = sim.create_dataset("temperature", data=data,
                            chunks=(100, 100),    # chunk shape
                            compression="gzip",   # or "lzf" for speed
                            compression_opts=4,   # compression level (1-9)
                            shuffle=True,         # byte shuffle filter (improves compression)
                            fletcher32=True)      # checksum

    # Attributes (metadata)
    ds.attrs["units"] = "Kelvin"
    ds.attrs["description"] = "Surface temperature field"
    f.attrs["experiment"] = "climate_sim_2025"

    # Resizable dataset
    maxshape = (None, 100)  # unlimited first dimension
    ds2 = f.create_dataset("timeseries", shape=(0, 100),
                           maxshape=maxshape, dtype="float32")
    ds2.resize(1000, axis=0)  # grow later

# Read
with h5py.File("data.h5", "r") as f:
    # Lazy loading — no data read until sliced
    ds = f["simulation/temperature"]
    print(ds.shape, ds.dtype)  # (1000, 1000) float64

    # Partial read (only loads the requested slice)
    subset = ds[100:200, 300:400]  # 100x100 array

    # Read all
    full = ds[()]  # or ds[...]

    # Iterate groups
    def visitor(name, obj):
        print(name, type(obj))
    f.visititems(visitor)
```
### Python (PyTables)
```python
import tables

# PyTables offers higher-level abstractions and query support
with tables.open_file("data.h5", "w") as f:
    group = f.create_group("/", "experiment")
    table = f.create_table(group, "readings", {
        "timestamp": tables.Float64Col(),
        "sensor_id": tables.Int32Col(),
        "value": tables.Float32Col(),
    })
    row = table.row
    row["timestamp"] = 1705334400.0
    row["sensor_id"] = 42
    row["value"] = 23.5
    row.append()
    table.flush()

# Query — materialize results while the file is open
# (where() returns an iterator that is invalid after the file closes)
with tables.open_file("data.h5", "r") as f:
    table = f.root.experiment.readings
    results = [r["value"] for r in table.where("(sensor_id == 42) & (value > 20)")]
```
### Command-Line Tools
```shell
# HDF5 tools (install: apt install hdf5-tools / brew install hdf5)
h5dump data.h5                            # dump entire file contents
h5dump -H data.h5                         # header/structure only
h5dump -d /simulation/temperature data.h5 # specific dataset
h5ls data.h5                              # list contents
h5ls -r data.h5                           # recursive listing
h5stat data.h5                            # file statistics
h5diff file1.h5 file2.h5                  # compare files
h5repack -f GZIP=6 in.h5 out.h5           # recompress
```
### Keras/TensorFlow
```python
# HDF5 is the legacy format for Keras model weights
# (newer Keras versions default to the native .keras format)
from tensorflow import keras

# `model` is an existing keras.Model
model.save("model.h5")            # save full model
model.save_weights("weights.h5")  # save weights only
loaded_model = keras.models.load_model("model.h5")
```
## Common Use Cases
- Scientific computing: Climate models (NetCDF-4 is built on HDF5), astronomy (FITS alternative), genomics.
- Particle physics: detector and analysis pipelines store data in HDF5 (often alongside CERN's ROOT format).
- Satellite/remote sensing: NASA Earth Observing System (EOS) uses HDF-EOS.
- Machine learning: Keras model weights, training data storage.
- Financial modeling: Time series storage for quantitative analysis.
- Simulation: CFD, FEM, molecular dynamics output data.
- Medical imaging: DICOM alternatives, microscopy data.
## Pros & Cons
### Pros
- Handles massive datasets — tested up to exabyte scale.
- Partial I/O — read slices without loading entire datasets.
- Rich type system including compound types, variable-length data, and references.
- Built-in compression with multiple filter options.
- Self-describing — metadata, types, and structure stored in the file.
- Parallel I/O support via MPI (parallel HDF5).
- Mature and battle-tested — 25+ years of development.
- Single file contains complex hierarchical data.
### Cons
- Not human-readable — binary format requires tools to inspect.
- Single-writer limitation — no concurrent writes without parallel HDF5 (MPI).
- File corruption risk — not journaled like a database. Crashes during writes can corrupt.
- Complex C library — error handling and memory management are tricky at the C level.
- Not cloud-native — random access over HTTP/S3 requires specialized drivers (ros3, fsspec).
- Large overhead for small files — file structure adds minimum ~2KB.
- Not suitable for streaming or append-heavy workloads.
- Deleting data doesn't reclaim space (run `h5repack` to compact).
## Compatibility
| Language | Library |
|---|---|
| Python | h5py, PyTables, pandas |
| C/C++ | HDF5 C library (reference) |
| Java | HDF5 Java (JNI wrapper) |
| Julia | HDF5.jl |
| MATLAB | Built-in (h5read, h5write) |
| R | rhdf5, hdf5r |
| Fortran | HDF5 Fortran API |
| Rust | hdf5-rust |
MIME type: `application/x-hdf5`. File extensions: `.h5`, `.hdf5`, `.he5`.
## Related Formats
- NetCDF-4: Built on HDF5 — adds conventions for geoscience data.
- Zarr: Cloud-native chunked array format — HDF5 alternative for cloud storage.
- Apache Parquet: Columnar format for tabular analytics data.
- FITS: Astronomy standard — simpler than HDF5 but less flexible.
- TileDB: Array database — better for cloud-native multidimensional data.
- Arrow: In-memory columnar format — complements HDF5 for analytics.
- NumPy .npy/.npz: Simpler array storage for single arrays.
## Practical Usage
- **Chunking strategy for read patterns**: Choose chunk shapes that align with your most common access pattern. If you primarily read time slices of a 3D array `(time, x, y)`, chunk as `(1, x, y)` for fast time-series access, not `(time, 1, 1)`.
- **Compression filter selection**: Use `gzip` (`compression_opts=4`) for a good balance of speed and compression. Use `lzf` for fast read/write with moderate compression. Use `blosc` (via `hdf5plugin`) for the best speed-to-compression ratio on numerical data.
- **Lazy loading with h5py**: h5py datasets are not loaded into memory until sliced. Use `ds[100:200, :]` to read only the data you need. Avoid `ds[()]` or `ds[...]` unless you truly need the entire dataset in memory.
- **Appending to resizable datasets**: Create datasets with `maxshape=(None, ...)` for an unlimited first dimension, then use `ds.resize()` to grow as data arrives. This is essential for streaming or incremental data collection.
- **File compaction after deletions**: Deleting datasets from an HDF5 file does not reclaim disk space. Run `h5repack input.h5 output.h5` periodically to compact the file and reclaim freed space.
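The chunking, compression, and resizing advice above can be combined in one short h5py sketch (file name and array sizes are illustrative):

```python
import h5py
import numpy as np

with h5py.File("stream.h5", "w") as f:
    # Time-slice access pattern → chunk one full (x, y) plane per time step
    ds = f.create_dataset("frames",
                          shape=(0, 64, 64),
                          maxshape=(None, 64, 64),  # unlimited time axis
                          chunks=(1, 64, 64),       # one chunk per time slice
                          dtype="float32",
                          compression="gzip",
                          compression_opts=4,
                          shuffle=True)             # byte-shuffle before gzip
    # Append frames as they arrive (streaming / incremental collection)
    for t in range(10):
        ds.resize(t + 1, axis=0)
        ds[t] = np.full((64, 64), float(t), dtype="float32")

with h5py.File("stream.h5", "r") as f:
    # Reading one time slice touches exactly one compressed chunk
    frame = f["frames"][7]
    print(frame.shape, float(frame[0, 0]))  # (64, 64) 7.0
```

With the `(1, 64, 64)` chunk shape, a time-slice read decompresses a single chunk; chunking as `(10, 1, 1)` instead would force every chunk in the file to be read for the same query.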
## Anti-Patterns
- **Using HDF5 for concurrent write access without MPI**: The standard HDF5 library supports only single-writer access. Multiple processes writing to the same file simultaneously will corrupt it. Use parallel HDF5 with MPI, or serialize writes through a single process.
- **Storing small datasets without chunking then trying to compress**: Compression filters only apply to chunked datasets; contiguous layouts cannot be compressed. Always enable chunking (e.g. `chunks=True`) when using compression — h5py turns chunking on automatically when a compression filter is requested at creation time.
- **Opening HDF5 files over network filesystems (NFS/SMB) for write**: HDF5 uses file locking and low-level I/O that is unreliable on network filesystems. Write locally and copy to network storage, or use a cloud-optimized approach (fsspec, ros3 driver for S3).
- **Treating HDF5 as a database replacement**: HDF5 lacks transactions, journaling, indexing, and concurrent access. If your workload requires frequent updates, deletes, or multi-user access, use a proper database (PostgreSQL, TileDB, DuckDB) instead.
- **Ignoring the shuffle filter for numerical data**: The byte shuffle filter (`shuffle=True`) rearranges bytes to improve compression of numerical arrays significantly (often 20-40% better). Always enable shuffle alongside compression for numeric datasets.
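To verify that chunking and filters actually took effect, h5py exposes each dataset's storage layout as properties; a small sketch (file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("check.h5", "w") as f:
    # Requesting compression makes h5py pick a chunk shape automatically
    auto = f.create_dataset("auto", data=np.zeros((100, 100)),
                            compression="gzip", shuffle=True)
    # A plain contiguous dataset has no chunks and no filters
    plain = f.create_dataset("plain", data=np.zeros((100, 100)))

    print(auto.chunks, auto.compression, auto.shuffle)  # auto-chosen chunks, "gzip", True
    print(plain.chunks, plain.compression)              # None None
```

The same layout information is visible from the command line via `h5dump -pH check.h5`, which prints chunk dimensions and the filter pipeline per dataset.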