Apache Parquet
Apache Parquet columnar storage format — a highly efficient binary format for analytics workloads, supporting compression and encoding optimizations.
You are a file format specialist with deep expertise in Apache Parquet, including columnar storage internals, row group and page structure, encoding and compression strategies, schema design with nested types, and efficient reading with PyArrow, Polars, DuckDB, and Spark.
Apache Parquet — Columnar Storage Format
Overview
Apache Parquet is an open-source, column-oriented binary data file format designed for efficient data storage and retrieval in analytics workloads. Created in 2013 as a collaboration between Twitter and Cloudera (inspired by Google's Dremel paper), Parquet has become the standard storage format for data lakes, data warehouses, and big data processing. Its columnar layout enables dramatic compression ratios and query performance improvements over row-oriented formats like CSV or JSON.
Core Philosophy
Apache Parquet is designed around a single insight: analytical queries almost never need all columns of a dataset. By storing data in columnar format — all values for a single column stored contiguously — Parquet enables queries to read only the columns they need, skipping irrelevant data entirely. For analytical workloads that touch a few columns of wide tables, this translates to 10-100x I/O reduction compared to row-oriented formats like CSV or JSON.
Parquet combines columnar storage with sophisticated encoding and compression to achieve remarkable efficiency. Dictionary encoding, run-length encoding, delta encoding, and general-purpose compression (Snappy, Zstd, Gzip) work together to produce files that are typically 5-10x smaller than equivalent CSV data while being faster to query. The format also embeds schema information and column-level statistics that enable query engines to skip entire row groups without reading them.
Use Parquet for analytical data storage, data lake formats, and any scenario where data is written once and queried many times. CSV is better for human inspection, quick data exchange, and write-heavy append workloads. JSON is better for hierarchical data and API interchange. Parquet is the right format when your priority is query performance and storage efficiency for tabular analytical data.
Technical Specifications
File Structure
A Parquet file is organized as:
┌──────────────────────────┐
│ Magic Number │ 4 bytes: "PAR1"
├──────────────────────────┤
│ Row Group 1 │ A horizontal partition of rows
│ ┌────────┬────────────┐ │
│ │ Col A │ Col B ... │ │ Column chunks within row group
│ │ Chunk │ Chunk │ │
│ └────────┴────────────┘ │
├──────────────────────────┤
│ Row Group 2 │
│ ┌────────┬────────────┐ │
│ │ Col A │ Col B ... │ │
│ │ Chunk │ Chunk │ │
│ └────────┴────────────┘ │
├──────────────────────────┤
│ Footer Metadata │ Schema, row group locations, stats
├──────────────────────────┤
│ Footer Length (4B) │
├──────────────────────────┤
│ Magic Number │ 4 bytes: "PAR1"
└──────────────────────────┘
- Row Groups: Horizontal partitions containing a configurable number of rows (typically 128MB–1GB).
- Column Chunks: All values for one column within a row group, stored contiguously.
- Pages: Column chunks are divided into pages (typically 1MB) — the unit of compression and encoding.
- Footer: Contains the full schema, row group metadata, column statistics (min/max/null count), and byte offsets.
Schema and Types
Parquet uses a schema definition with primitive and logical types:
message User {
  required int64 id;
  required binary name (UTF8);
  optional binary email (UTF8);
  required int32 age;
  optional double salary;
  required int64 created_at (TIMESTAMP_MILLIS);
  optional group address {
    required binary street (UTF8);
    required binary city (UTF8);
    optional binary zip (UTF8);
  }
  repeated binary tags (UTF8); -- array/list
}
Primitive types: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY.
Logical types: UTF8, DECIMAL, DATE, TIME, TIMESTAMP, JSON, UUID, LIST, MAP.
Encoding and Compression
Encodings (applied per column):
- Plain: Raw values.
- Dictionary: Maps values to integer codes — excellent for low-cardinality columns.
- Run-Length Encoding (RLE): Compresses repeated values.
- Delta encoding: Stores differences between consecutive values.
- Bit-packing: Packs small integers into minimal bits.
Compression codecs (applied per page):
- Snappy: Fast, moderate compression (default in many systems).
- Gzip/Zlib: Better compression, slower.
- Zstd: Best balance of compression ratio and speed.
- LZ4: Fastest decompression.
- Brotli: High compression ratio.
How to Work With It
Reading and Writing
import pandas as pd
import pyarrow.parquet as pq
# Read
df = pd.read_parquet("data.parquet")
df = pd.read_parquet("data.parquet", columns=["name", "age"]) # column pruning
# Write
df.to_parquet("output.parquet", compression="zstd", index=False)
# PyArrow for more control
table = pq.read_table("data.parquet")
pq.write_table(
    table, "output.parquet",
    compression="zstd",
    row_group_size=1_000_000,
    use_dictionary=True,
)
# Read metadata without loading data
metadata = pq.read_metadata("data.parquet")
schema = pq.read_schema("data.parquet")
print(f"Rows: {metadata.num_rows}, Row Groups: {metadata.num_row_groups}")
# Polars (faster for large datasets)
import polars as pl
df = pl.read_parquet("data.parquet")
df = (
    pl.scan_parquet("data.parquet")  # lazy — only reads what's needed
    .filter(pl.col("age") > 25)
    .select(["name", "age"])
    .collect()
)
# DuckDB — query Parquet files directly with SQL
import duckdb
result = duckdb.sql("""
SELECT name, AVG(salary)
FROM 'data.parquet'
WHERE age > 25
GROUP BY name
""").fetchdf()
# Query multiple files / partitioned datasets
duckdb.sql("SELECT * FROM 'data/**/*.parquet'")
Partitioned Datasets
# Write partitioned dataset (hive-style)
df.to_parquet("output/", partition_cols=["year", "month"])
# Creates: output/year=2025/month=01/part-0.parquet
# Read partitioned dataset
df = pd.read_parquet("output/") # automatic partition discovery
Inspecting
# parquet-tools (pip install parquet-tools)
parquet-tools show data.parquet
parquet-tools schema data.parquet
parquet-tools rowcount data.parquet
parquet-tools inspect data.parquet # detailed metadata
# DuckDB CLI
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet'"
duckdb -c "SELECT * FROM parquet_metadata('data.parquet')"
Common Use Cases
- Data lakes: S3/GCS/ADLS storage for Spark, Trino, Athena, BigQuery.
- Data warehouses: Snowflake, Databricks, Redshift Spectrum external tables.
- ETL pipelines: Intermediate and output format for data transformations.
- Machine learning: Feature stores, training dataset storage.
- Log analytics: Compressed storage of structured log data.
- Data exchange: Sharing large datasets between teams and organizations.
- Local analytics: DuckDB, Polars, and pandas for laptop-scale analysis.
Pros & Cons
Pros
- Extreme compression — often 5-10x smaller than CSV for the same data.
- Column pruning — read only the columns you need, skipping the rest.
- Predicate pushdown — skip row groups based on column statistics (min/max).
- Schema enforcement with rich type system including nested structures.
- Splittable — row groups can be processed in parallel.
- Language-agnostic binary format with broad ecosystem support.
- Self-describing — schema embedded in the file.
Cons
- Not human-readable — binary format requires tooling to inspect.
- Append-unfriendly — adding rows requires rewriting (use Delta Lake/Iceberg for mutability).
- Not suitable for row-level updates or deletes (use table formats like Delta/Iceberg).
- Small files have disproportionate overhead — not ideal for tiny datasets.
- Write performance is slower than CSV for simple dumps.
- Complex nested schemas can be tricky to work with.
Compatibility
| Tool/Language | Library / Support |
|---|---|
| Python | pyarrow, fastparquet, polars |
| Java/Scala | parquet-mr (reference impl), Spark |
| Rust | parquet crate (arrow-rs), Polars |
| Go | xitongsys/parquet-go, apache/arrow-go |
| JavaScript | parquetjs, hyparquet, DuckDB-WASM |
| C++ | Apache Arrow C++ library |
| SQL engines | DuckDB, Spark, Trino, Athena, BigQuery |
MIME type: application/vnd.apache.parquet. File extension: .parquet.
Related Formats
- Apache ORC: Alternative columnar format, popular in Hive ecosystem.
- Apache Avro: Row-oriented binary format for serialization/messaging.
- CSV: Simple text-based tabular format (much larger, no types).
- Arrow (IPC): In-memory columnar format; Parquet is Arrow's on-disk partner.
- Delta Lake / Iceberg / Hudi: Table formats built on top of Parquet adding ACID, time travel, and schema evolution.
- Lance: Modern columnar format optimized for ML/vector data.
Practical Usage
- Use Zstandard (zstd) compression for the best balance of compression ratio and read/write speed -- it outperforms Snappy and Gzip in most workloads.
- Always specify columns= when reading Parquet files to leverage column pruning -- reading only the columns you need can be 10-100x faster than reading the full file.
- Use pl.scan_parquet() (Polars) or DuckDB for lazy evaluation that pushes filters and projections down to the Parquet reader, avoiding loading unnecessary data into memory.
- Set row group size to match your query patterns -- larger row groups (128 MB+) improve compression and sequential scan performance; smaller row groups improve predicate pushdown granularity.
- Use Hive-style partitioning (partition_cols=["year", "month"]) for datasets queried by date or category to enable partition pruning and avoid scanning irrelevant files.
- Inspect Parquet metadata with parquet-tools inspect or pq.read_metadata() before processing to understand schema, row counts, and compression without reading any data.
Anti-Patterns
- Using Parquet for small datasets under 10,000 rows -- The format overhead (footer, metadata, page headers) makes Parquet less efficient than CSV or JSON for tiny datasets.
- Appending rows to existing Parquet files -- Parquet is immutable by design; use Delta Lake, Iceberg, or Hudi for mutable table semantics on top of Parquet.
- Creating many small Parquet files -- Small files cause excessive metadata overhead and slow query planning; aim for files between 128 MB and 1 GB, and use compaction to merge small files.
- Storing data in Parquet without considering column cardinality -- High-cardinality string columns (UUIDs, free text) compress poorly and negate Parquet's columnar advantages; consider dictionary encoding limits.
- Using Parquet as an interchange format for non-analytical workloads -- Parquet is optimized for columnar reads; for row-oriented access patterns, streaming, or message passing, use Avro, NDJSON, or MessagePack instead.