Apache Parquet
Apache Parquet columnar storage format — a highly efficient binary format for analytics workloads, supporting compression and encoding optimizations.
You are a file format specialist with deep expertise in Apache Parquet, including columnar storage internals, row group and page structure, encoding and compression strategies, schema design with nested types, and efficient reading with PyArrow, Polars, DuckDB, and Spark.
Apache Parquet — Columnar Storage Format
Overview
Apache Parquet is an open-source, column-oriented binary data file format designed for efficient data storage and retrieval in analytics workloads. Created in 2013 as a collaboration between Twitter and Cloudera (inspired by Google's Dremel paper), Parquet has become the standard storage format for data lakes, data warehouses, and big data processing. Its columnar layout enables dramatic compression ratios and query performance improvements over row-oriented formats like CSV or JSON.
Core Philosophy
Apache Parquet is designed around a single insight: analytical queries almost never need all columns of a dataset. By storing data in columnar format — all values for a single column stored contiguously — Parquet enables queries to read only the columns they need, skipping irrelevant data entirely. For analytical workloads that touch a few columns of wide tables, this translates to 10-100x I/O reduction compared to row-oriented formats like CSV or JSON.
Parquet combines columnar storage with sophisticated encoding and compression to achieve remarkable efficiency. Dictionary encoding, run-length encoding, delta encoding, and general-purpose compression (Snappy, Zstd, Gzip) work together to produce files that are typically 5-10x smaller than equivalent CSV data while being faster to query. The format also embeds schema information and column-level statistics that enable query engines to skip entire row groups without reading them.
Use Parquet for analytical data storage, data lake formats, and any scenario where data is written once and queried many times. CSV is better for human inspection, quick data exchange, and write-heavy append workloads. JSON is better for hierarchical data and API interchange. Parquet is the right format when your priority is query performance and storage efficiency for tabular analytical data.
Technical Specifications
File Structure
A Parquet file is organized as:
┌──────────────────────────┐
│ Magic Number │ 4 bytes: "PAR1"
├──────────────────────────┤
│ Row Group 1 │ A horizontal partition of rows
│ ┌────────┬────────────┐ │
│ │ Col A │ Col B ... │ │ Column chunks within row group
│ │ Chunk │ Chunk │ │
│ └────────┴────────────┘ │
├──────────────────────────┤
│ Row Group 2 │
│ ┌────────┬────────────┐ │
│ │ Col A │ Col B ... │ │
│ │ Chunk │ Chunk │ │
│ └────────┴────────────┘ │
├──────────────────────────┤
│ Footer Metadata │ Schema, row group locations, stats
├──────────────────────────┤
│ Footer Length (4B) │
├──────────────────────────┤
│ Magic Number │ 4 bytes: "PAR1"
└──────────────────────────┘
- Row Groups: Horizontal partitions containing a configurable number of rows (typically 128MB–1GB).
- Column Chunks: All values for one column within a row group, stored contiguously.
- Pages: Column chunks are divided into pages (typically 1MB) — the unit of compression and encoding.
- Footer: Contains the full schema, row group metadata, column statistics (min/max/null count), and byte offsets.
Schema and Types
Parquet uses a schema definition with primitive and logical types:
message User {
  required int64 id;
  required binary name (UTF8);
  optional binary email (UTF8);
  required int32 age;
  optional double salary;
  required int64 created_at (TIMESTAMP_MILLIS);
  optional group address {
    required binary street (UTF8);
    required binary city (UTF8);
    optional binary zip (UTF8);
  }
  repeated binary tags (UTF8); -- array/list
}
Primitive types: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY.
Logical types: UTF8, DECIMAL, DATE, TIME, TIMESTAMP, JSON, UUID, LIST, MAP.
Encoding and Compression
Encodings (applied per column):
- Plain: Raw values.
- Dictionary: Maps values to integer codes — excellent for low-cardinality columns.
- Run-Length Encoding (RLE): Compresses repeated values.
- Delta encoding: Stores differences between consecutive values.
- Bit-packing: Packs small integers into minimal bits.
Compression codecs (applied per page):
- Snappy: Fast, moderate compression (default in many systems).
- Gzip/Zlib: Better compression, slower.
- Zstd: Best balance of compression ratio and speed.
- LZ4: Fastest decompression.
- Brotli: High compression ratio.
How to Work With It
Reading and Writing
import pandas as pd
import pyarrow.parquet as pq
# Read
df = pd.read_parquet("data.parquet")
df = pd.read_parquet("data.parquet", columns=["name", "age"]) # column pruning
# Write
df.to_parquet("output.parquet", compression="zstd", index=False)
# PyArrow for more control
table = pq.read_table("data.parquet")
pq.write_table(
    table, "output.parquet",
    compression="zstd",
    row_group_size=1_000_000,
    use_dictionary=True,
)
# Read metadata without loading data
metadata = pq.read_metadata("data.parquet")
schema = pq.read_schema("data.parquet")
print(f"Rows: {metadata.num_rows}, Row Groups: {metadata.num_row_groups}")
# Polars (faster for large datasets)
import polars as pl
df = pl.read_parquet("data.parquet")
df = (
    pl.scan_parquet("data.parquet")  # lazy — only reads what's needed
    .filter(pl.col("age") > 25)
    .select(["name", "age"])
    .collect()
)
# DuckDB — query Parquet files directly with SQL
import duckdb
result = duckdb.sql("""
SELECT name, AVG(salary)
FROM 'data.parquet'
WHERE age > 25
GROUP BY name
""").fetchdf()
# Query multiple files / partitioned datasets
duckdb.sql("SELECT * FROM 'data/**/*.parquet'")
Partitioned Datasets
# Write partitioned dataset (hive-style)
df.to_parquet("output/", partition_cols=["year", "month"])
# Creates: output/year=2025/month=01/part-0.parquet
# Read partitioned dataset
df = pd.read_parquet("output/") # automatic partition discovery
Inspecting
# parquet-tools (pip install parquet-tools)
parquet-tools show data.parquet
parquet-tools schema data.parquet
parquet-tools rowcount data.parquet
parquet-tools inspect data.parquet # detailed metadata
# DuckDB CLI
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet'"
duckdb -c "SELECT * FROM parquet_metadata('data.parquet')"
Common Use Cases
- Data lakes: S3/GCS/ADLS storage for Spark, Trino, Athena, BigQuery.
- Data warehouses: Snowflake, Databricks, Redshift Spectrum external tables.
- ETL pipelines: Intermediate and output format for data transformations.
- Machine learning: Feature stores, training dataset storage.
- Log analytics: Compressed storage of structured log data.
- Data exchange: Sharing large datasets between teams and organizations.
- Local analytics: DuckDB, Polars, and pandas for laptop-scale analysis.
Pros & Cons
Pros
- Extreme compression — often 5-10x smaller than CSV for the same data.
- Column pruning — read only the columns you need, skipping the rest.
- Predicate pushdown — skip row groups based on column statistics (min/max).
- Schema enforcement with rich type system including nested structures.
- Splittable — row groups can be processed in parallel.
- Language-agnostic binary format with broad ecosystem support.
- Self-describing — schema embedded in the file.
Cons
- Not human-readable — binary format requires tooling to inspect.
- Append-unfriendly — adding rows requires rewriting (use Delta Lake/Iceberg for mutability).
- Not suitable for row-level updates or deletes (use table formats like Delta/Iceberg).
- Small files have disproportionate overhead — not ideal for tiny datasets.
- Write performance is slower than CSV for simple dumps.
- Complex nested schemas can be tricky to work with.
Compatibility
| Tool/Language | Library / Support |
|---|---|
| Python | pyarrow, fastparquet, polars |
| Java/Scala | parquet-mr (reference impl), Spark |
| Rust | parquet crate (arrow-rs), Polars |
| Go | xitongsys/parquet-go, apache/arrow-go |
| JavaScript | parquetjs, hyparquet, DuckDB-WASM |
| C++ | Apache Arrow C++ library |
| SQL engines | DuckDB, Spark, Trino, Athena, BigQuery |
MIME type: application/vnd.apache.parquet. File extension: .parquet.
Related Formats
- Apache ORC: Alternative columnar format, popular in Hive ecosystem.
- Apache Avro: Row-oriented binary format for serialization/messaging.
- CSV: Simple text-based tabular format (much larger, no types).
- Arrow (IPC): In-memory columnar format; Parquet is Arrow's on-disk partner.
- Delta Lake / Iceberg / Hudi: Table formats built on top of Parquet adding ACID, time travel, and schema evolution.
- Lance: Modern columnar format optimized for ML/vector data.
Practical Usage
- Use Zstandard (zstd) compression for the best balance of compression ratio and read/write speed -- it outperforms Snappy and Gzip in most workloads.
- Always specify columns= when reading Parquet files to leverage column pruning -- reading only the columns you need can be 10-100x faster than reading the full file.
- Use pl.scan_parquet() (Polars) or DuckDB for lazy evaluation that pushes filters and projections down to the Parquet reader, avoiding loading unnecessary data into memory.
- Set row group size to match your query patterns -- larger row groups (128 MB+) improve compression and sequential scan performance; smaller row groups improve predicate pushdown granularity.
- Use Hive-style partitioning (partition_cols=["year", "month"]) for datasets queried by date or category to enable partition pruning and avoid scanning irrelevant files.
- Inspect Parquet metadata with parquet-tools inspect or pq.read_metadata() before processing to understand schema, row counts, and compression without reading any data.
Anti-Patterns
- Using Parquet for small datasets under 10,000 rows -- The format overhead (footer, metadata, page headers) makes Parquet less efficient than CSV or JSON for tiny datasets.
- Appending rows to existing Parquet files -- Parquet is immutable by design; use Delta Lake, Iceberg, or Hudi for mutable table semantics on top of Parquet.
- Creating many small Parquet files -- Small files cause excessive metadata overhead and slow query planning; aim for files between 128 MB and 1 GB, and use compaction to merge small files.
- Storing data in Parquet without considering column cardinality -- High-cardinality string columns (UUIDs, free text) compress poorly and negate Parquet's columnar advantages; consider dictionary encoding limits.
- Using Parquet as an interchange format for non-analytical workloads -- Parquet is optimized for columnar reads; for row-oriented access patterns, streaming, or message passing, use Avro, NDJSON, or MessagePack instead.