
Apache Avro

Apache Avro data serialization system — a compact binary format with embedded schema, designed for data exchange, messaging, and long-term storage with schema evolution.

Quick Summary
You are a file format specialist with deep expertise in Apache Avro. You understand its JSON-defined schemas, row-oriented binary encoding with variable-length zigzag integers, schema evolution rules for backward and forward compatibility, Object Container File format with sync markers, and its central role in the Kafka/Confluent Schema Registry ecosystem. You can advise on schema design, evolution strategies, Avro vs. Parquet vs. Protobuf selection, efficient serialization patterns, and integration with data lakes and streaming platforms.


Apache Avro — Data Serialization System

Overview

Apache Avro is a row-oriented binary serialization system created by Doug Cutting (Hadoop creator) in 2009. Unlike Protocol Buffers or Thrift, Avro stores the schema alongside the data, enabling dynamic typing and schema evolution without code generation. Avro is the primary serialization format in the Kafka ecosystem, a standard for Hadoop data storage, and widely used in event-driven architectures where schema compatibility matters.

Core Philosophy

Apache Avro is a data serialization system built around two principles: the schema is part of the data, and schemas evolve over time. Every Avro file embeds its schema in the file header, making the data self-describing — a reader can always determine how to interpret the data without external schema registries or documentation. This self-containment prevents the class of errors where data and schema fall out of sync.

Schema evolution is Avro's killer feature for long-lived data systems. Avro's resolution rules let readers use a different schema than the writer's, enabling forward and backward compatibility. You can add fields with defaults, remove optional fields, and promote types without breaking existing consumers. This makes Avro the natural serialization format for systems where producers and consumers evolve independently — Kafka topics, data lake ingestion, and microservice interfaces.

Use Avro when schema evolution and self-describing data are important — event streaming (Kafka), data pipeline interchange, and long-term data storage. For analytical queries, Parquet (columnar) is more efficient. For cross-language RPC, Protobuf is more established. For human-readable data exchange, JSON is simpler. Avro's strength is the combination of compact binary encoding, embedded schemas, and robust evolution rules.

Technical Specifications

Schema Definition

Avro schemas are defined in JSON:

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A user record",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "age", "type": "int"},
    {"name": "roles", "type": {"type": "array", "items": "string"}},
    {"name": "address", "type": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "zip", "type": ["null", "string"], "default": null}
      ]
    }},
    {"name": "status", "type": {
      "type": "enum",
      "name": "Status",
      "symbols": ["ACTIVE", "INACTIVE", "SUSPENDED"]
    }}
  ]
}

Primitive Types

null, boolean, int (32-bit), long (64-bit), float, double, bytes, string.

Complex Types

  • record: Named collection of fields (like a struct).
  • enum: Enumeration of named values.
  • array: Ordered collection of one type.
  • map: Key-value pairs (keys are always strings).
  • union: Value can be one of several types — ["null", "string"] for nullable.
  • fixed: Fixed-size byte array.
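The complex types compose freely. A small illustrative schema (a hypothetical `Payment` record, not from the text above) showing a map, a fixed, and a nullable union together:

```json
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "amounts", "type": {"type": "map", "values": "double"}},
    {"name": "md5", "type": {"type": "fixed", "name": "MD5", "size": 16}},
    {"name": "note", "type": ["null", "string"], "default": null}
  ]
}
```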

Logical Types

date, time-millis, time-micros, timestamp-millis, timestamp-micros, duration, decimal, uuid.

File Format (Object Container Files)

┌─────────────────────────┐
│  File Header            │
│  - Magic: "Obj\x01"    │
│  - File metadata (map)  │
│  - Schema (JSON)        │
│  - Sync marker (16B)    │
├─────────────────────────┤
│  Data Block 1           │
│  - Object count         │
│  - Serialized objects   │
│  - Sync marker          │
├─────────────────────────┤
│  Data Block 2           │
│  - Object count         │
│  - Serialized objects   │
│  - Sync marker          │
└─────────────────────────┘
  • Schema is embedded in the file header — readers always know the schema.
  • Data blocks can be individually compressed (deflate, snappy, zstd, bzip2).
  • Sync markers enable block-level splitting for parallel processing.
  • Variable-length encoding: integers use zigzag + variable-length encoding for compactness.
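The zigzag step maps small positive and negative integers alike to small unsigned codes before varint encoding, so values near zero cost one byte regardless of sign. A pure-Python illustration of the idea (not the code any Avro library actually ships):

```python
def zigzag_encode(n: int) -> bytes:
    """Avro's int/long encoding: zigzag, then little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps ...,-2,-1,0,1,2,... -> 3,1,0,2,4,...
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)  # high bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def zigzag_decode(data: bytes) -> int:
    z = shift = 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):  # high bit clear: last byte
            break
    return (z >> 1) ^ -(z & 1)  # undo zigzag, restoring the sign

# small magnitudes get short encodings, regardless of sign
assert zigzag_encode(1) == b"\x02" and zigzag_encode(-1) == b"\x01"
```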

Schema Evolution Rules

Avro supports backward and forward compatible schema changes:

  • Add a field: Must have a default value (backward compatible).
  • Remove a field: Old field must have had a default value (forward compatible).
  • Rename a field: Use aliases array.
  • Widen a type: int → long/float/double, long → float/double, float → double (string and bytes are also mutually promotable).
  • Cannot: Change a field's type incompatibly, remove a required field without default.

How to Work With It

Python

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Write
schema = avro.schema.parse(open("user.avsc").read())
with DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema, codec="deflate") as writer:
    # Union values are passed plain in Python ({"string": ...} wrappers belong
    # to Avro's JSON encoding, not the datum API), and every field without a
    # default -- here address -- must be present.
    writer.append({
        "id": 1, "name": "Alice", "email": "alice@ex.com", "age": 30,
        "roles": ["admin"],
        "address": {"street": "1 Main St", "city": "Springfield", "zip": None},
        "status": "ACTIVE",
    })

# Read
with DataFileReader(open("users.avro", "rb"), DatumReader()) as reader:
    for user in reader:
        print(user)

# Using fastavro (much faster, recommended)
import fastavro
with open("users.avro", "rb") as f:
    reader = fastavro.reader(f)
    schema = reader.writer_schema
    records = list(reader)

Java

// Avro with code generation (User, Address, Status generated from user.avsc)
User user = User.newBuilder()
    .setId(1L)
    .setName("Alice")
    .setAge(30)
    .setRoles(List.of("admin"))
    .setStatus(Status.ACTIVE)
    .setAddress(Address.newBuilder()
        .setStreet("1 Main St")
        .setCity("Springfield")
        .build())  // fields with defaults (email, zip) may be omitted
    .build();

DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
try (DataFileWriter<User> fileWriter = new DataFileWriter<>(writer)) {
    fileWriter.create(user.getSchema(), new File("users.avro"));
    fileWriter.append(user);
}

Schema Registry (Kafka)

# confluent-kafka with Schema Registry (AvroProducer is the legacy helper;
# newer releases use SerializingProducer with AvroSerializer)
from confluent_kafka.avro import AvroProducer

producer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}, default_value_schema=schema)

producer.produce(topic='users', value=user_dict)
producer.flush()

# Register a schema via the registry's REST API
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{...}"}' \
  http://localhost:8081/subjects/users-value/versions

# Check compatibility
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{...}"}' \
  http://localhost:8081/compatibility/subjects/users-value/versions/latest
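The most common stumbling block with the registry API is that the schema must be sent as a JSON-encoded string inside a JSON body, i.e. encoded twice. A small helper sketch (the function name, subject, and URL are illustrative):

```python
import json

def compat_check_request(schema: dict, subject: str, base_url: str):
    """Build a Schema Registry compatibility-check request.

    The registry expects {"schema": "<schema as a JSON *string*>"}, so the
    schema dict is JSON-encoded twice; forgetting the inner encode is the
    usual cause of 422 errors.
    """
    url = f"{base_url}/compatibility/subjects/{subject}/versions/latest"
    headers = {"Content-Type": "application/vnd.schemaregistry.v1+json"}
    body = json.dumps({"schema": json.dumps(schema)})
    return url, headers, body

url, headers, body = compat_check_request(
    {"type": "string"}, "users-value", "http://localhost:8081")
```

The returned triple can be handed to any HTTP client (curl, requests, urllib).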

Tools

# avro-tools CLI
java -jar avro-tools.jar tojson users.avro           # convert to JSON
java -jar avro-tools.jar getschema users.avro         # extract schema
java -jar avro-tools.jar fromjson --schema-file user.avsc data.json  # JSON to Avro

# fastavro CLI
python -m fastavro users.avro                          # dump as JSON

Common Use Cases

  • Apache Kafka: Default serialization with Confluent Schema Registry.
  • Data lakes: Long-term storage in Hadoop/HDFS/S3.
  • Event sourcing: Event serialization with schema evolution guarantees.
  • ETL pipelines: Compact intermediate format with schema.
  • RPC: Avro RPC protocol for service communication.
  • Data archival: Self-describing format suitable for long-term storage.

Pros & Cons

Pros

  • Schema embedded in file — fully self-describing.
  • Excellent schema evolution with compatibility checking.
  • Compact binary encoding — smaller than JSON, competitive with Protobuf.
  • No code generation required (unlike Protobuf/Thrift) — dynamic typing supported.
  • Splittable — blocks can be processed in parallel (MapReduce-friendly).
  • First-class Kafka ecosystem integration via Schema Registry.
  • Rich type system including unions for nullable types.

Cons

  • Not human-readable — binary format requires tools to inspect.
  • Row-oriented — slower than Parquet/ORC for analytical queries.
  • Schema must be JSON — verbose compared to Protobuf's .proto IDL.
  • Union types are awkward: {"string": "value"} wrapper in JSON encoding.
  • Smaller ecosystem outside Java/Python compared to Protobuf.
  • No random access within a block — must read sequentially.
  • Schema Registry adds operational complexity.

Compatibility

  • Python: fastavro (fast), apache-avro
  • Java: avro (reference implementation)
  • Go: linkedin/goavro, hamba/avro
  • C#: Apache.Avro
  • Rust: apache-avro
  • JavaScript: avsc (fast), avro-js
  • C/C++: avro-c (official)

MIME type: application/avro. File extensions: .avro (data), .avsc (schema).

Related Formats

  • Protocol Buffers: Google's binary serialization — requires code generation, no embedded schema.
  • Apache Parquet: Columnar format for analytics (Avro schemas can define Parquet files).
  • MessagePack: Lightweight binary JSON — no schema.
  • Thrift: Facebook's serialization/RPC framework — similar to Protobuf.
  • Apache Arrow: In-memory columnar format — converts to/from Avro.
  • JSON: Text-based alternative — human-readable but larger.

Practical Usage

  • Kafka event serialization: Register your Avro schema with Confluent Schema Registry, then produce/consume messages using AvroSerializer/AvroDeserializer to get compact binary encoding with automatic schema evolution checking on every message.
  • Schema evolution in production: When adding a new field to a record, always provide a default value ("default": null for optional fields) so that consumers running the old schema can still read messages produced with the new schema.
  • Data lake storage: Write Avro files to S3/HDFS for long-term storage with codec="snappy" or codec="zstd" compression, taking advantage of Avro's self-describing nature so files remain readable without external schema files decades later.
  • Quick data inspection: Use python -m fastavro data.avro or java -jar avro-tools.jar tojson data.avro to dump Avro files as readable JSON for debugging or validation without writing custom code.
  • Schema compatibility testing: Before deploying a schema change, test compatibility with curl -X POST http://registry:8081/compatibility/subjects/topic-value/versions/latest to catch breaking changes before they reach production.

Anti-Patterns

  • Using Avro for analytical queries on large datasets — Avro is row-oriented and requires reading entire records. For column-selective analytical queries, use Parquet or ORC, which read only the columns needed and support predicate pushdown.
  • Adding required fields without defaults during schema evolution — This breaks backward compatibility. Consumers with the old schema cannot read new messages, causing deserialization failures across your pipeline. Always add fields with default values.
  • Embedding large blobs in Avro records — Storing multi-megabyte binary payloads (images, files) inside Avro records defeats the purpose of compact serialization. Store blobs externally and reference them by URI in the Avro record.
  • Using Avro without a Schema Registry in Kafka — Without a registry, every message must carry the full schema, massively inflating message size. The registry stores schemas once and references them by ID, adding only 5 bytes of overhead per message.
  • Ignoring union type awkwardness in JSON encoding — Avro unions require wrapper objects like {"string": "value"} in JSON representation. Design schemas to minimize unions where possible, or use fastavro which handles the wrapping automatically.
