
# NDJSON/JSONL

Newline-Delimited JSON — a text format with one JSON object per line, designed for streaming, logging, and processing large datasets line-by-line.


You are a file format specialist with deep expertise in NDJSON/JSONL (Newline-Delimited JSON), including streaming parse patterns, jq command-line processing, bulk API ingestion formats, structured logging pipelines, and efficient large-dataset processing with Pandas, Polars, and DuckDB.

## Overview

NDJSON (Newline Delimited JSON) and JSONL (JSON Lines) are effectively the same format: a text file where each line is a valid, self-contained JSON value, separated by newline characters (\n). This simple convention solves a fundamental problem with standard JSON — it cannot be easily streamed, appended to, or processed line-by-line. NDJSON/JSONL is the standard format for structured logging, data streaming, bulk APIs, and large dataset processing where loading an entire JSON array into memory is impractical.

## Core Philosophy

NDJSON (Newline Delimited JSON) solves a fundamental problem with JSON: standard JSON has no built-in mechanism for streaming. A JSON array must be fully parsed before any element can be processed. NDJSON eliminates this by placing one complete JSON object per line, separated by newlines. Each line is independently parseable, enabling true line-by-line streaming processing.

This one-object-per-line structure makes NDJSON the natural format for log files, event streams, data pipeline interchange, and any scenario where records arrive incrementally or where the complete dataset may not fit in memory. Tools like jq, grep, head, tail, wc, and split work directly on NDJSON because each line is a self-contained unit. This composability with Unix text processing tools is a practical advantage that standard JSON arrays do not provide.

Use NDJSON when you need to stream, append, or process JSON records incrementally. Use standard JSON arrays when you need to serve a complete, well-formed response (REST APIs) or when the data includes complex nested relationships that span multiple records. The choice between JSON and NDJSON is primarily about whether your data is a complete document (JSON) or an unbounded stream of records (NDJSON).

## Technical Specifications

### Syntax and Structure

```jsonl
{"id": 1, "name": "Alice", "email": "alice@example.com", "age": 30}
{"id": 2, "name": "Bob", "email": "bob@example.com", "age": 25}
{"id": 3, "name": "Charlie", "email": null, "age": 35}
```

### Key Rules

  1. Each line is a valid JSON value (typically an object, but can be any JSON type).
  2. Lines are separated by \n (newline). \r\n is also accepted by most parsers.
  3. No trailing comma, no enclosing array brackets.
  4. Each line must not contain unescaped newlines within the JSON value.
  5. Empty lines are typically ignored.
  6. No header line (unlike CSV) — the schema is implicit in the objects.
  7. UTF-8 encoding throughout.
  8. Lines are independent — a malformed line does not invalidate the rest of the file.
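Rules 5 and 8 are what make NDJSON robust in practice. A minimal sketch of a tolerant reader that applies both, skipping blank lines and reporting a malformed line without aborting the rest of the stream (function and variable names are illustrative):

```python
import json

def iter_records(lines):
    """Yield parsed records, skipping blank lines and surviving bad ones."""
    for lineno, line in enumerate(lines, 1):
        if not line.strip():            # rule 5: empty lines are ignored
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:    # rule 8: one bad line, not a bad file
            print(f"line {lineno}: skipping invalid JSON")

data = '{"id": 1}\n\n{oops}\n{"id": 2}\n'
records = list(iter_records(data.splitlines()))
# the blank line and the malformed line are skipped; two records survive
```

The same pattern works unchanged on an open file handle, since iterating a file yields lines.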

### NDJSON vs JSONL vs JSON Streaming

These names refer to the same format with minor community differences:

| Name | Spec URL | Community |
|------|----------|-----------|
| NDJSON | ndjson.org | Streaming/APIs |
| JSON Lines | jsonlines.org | Data processing |
| LDJSON | (less common) | Linked Data |

All follow the same convention: one JSON value per line.

### Comparison with JSON Array

```jsonl
// Standard JSON array — must load entirely into memory to parse
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"},
  {"id": 3, "name": "Charlie"}
]
```

```jsonl
// NDJSON — parse line by line, constant memory usage
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Charlie"}
```
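Converting between the two shapes is mechanical. A small stdlib sketch of the round trip (the sample data is illustrative):

```python
import json

array = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Array -> NDJSON: one compact JSON document per line
ndjson = "".join(json.dumps(obj) + "\n" for obj in array)

# NDJSON -> array: parse each non-empty line back
roundtrip = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
```

Note the use of `json.dumps` per element rather than pretty-printing: each value must stay on a single line.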

## How to Work With It

### Reading

```python
import json

# Line-by-line (memory efficient)
with open("data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        process(record)

# Load all into list
with open("data.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Pandas
import pandas as pd
df = pd.read_json("data.jsonl", lines=True)
# For large files:
chunks = pd.read_json("data.jsonl", lines=True, chunksize=10000)
for chunk in chunks:
    process(chunk)

# Polars
import polars as pl
df = pl.read_ndjson("data.jsonl")
lazy = pl.scan_ndjson("data.jsonl")  # lazy evaluation
```

```javascript
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

const rl = createInterface({ input: createReadStream('data.jsonl') });
for await (const line of rl) {
    const record = JSON.parse(line);
    process(record);
}

// Or with ndjson package
import ndjson from 'ndjson';
createReadStream('data.jsonl')
    .pipe(ndjson.parse())
    .on('data', (obj) => process(obj));
```

### Writing / Appending

```python
import json

# Write
with open("data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Append (key advantage over JSON arrays)
with open("data.jsonl", "a") as f:
    f.write(json.dumps(new_record) + "\n")

# Pandas
df.to_json("output.jsonl", orient="records", lines=True)
```

```javascript
import { appendFileSync } from 'fs';
appendFileSync('data.jsonl', JSON.stringify(record) + '\n');
```

### Command-Line Processing

```bash
# jq works perfectly with NDJSON using -c (compact) and slurp
cat data.jsonl | jq -c 'select(.age > 25)'           # filter
cat data.jsonl | jq -c '.name'                        # extract field
cat data.jsonl | jq -s 'sort_by(.age)'                # slurp, sort, output array
cat data.jsonl | jq -s 'length'                       # count records
cat data.jsonl | jq -s 'group_by(.status)'            # group

# Convert JSON array to JSONL
jq -c '.[]' array.json > data.jsonl

# Convert JSONL to JSON array
jq -s '.' data.jsonl > array.json

# grep for quick filtering (faster than jq for simple matches)
grep '"status":"active"' data.jsonl | jq .

# wc for counting lines/records
wc -l data.jsonl

# head/tail for sampling
head -100 data.jsonl | jq .
tail -50 data.jsonl | jq .

# DuckDB for SQL queries on JSONL
duckdb -c "SELECT name, COUNT(*) FROM read_json_auto('data.jsonl') GROUP BY name"
```

### Validation

```bash
# Validate each line is valid JSON
while IFS= read -r line; do
    echo "$line" | jq . > /dev/null 2>&1 || echo "Invalid: $line"
done < data.jsonl

# Python validation
python -c "
import json, sys
for i, line in enumerate(open('data.jsonl'), 1):
    try: json.loads(line)
    except: print(f'Line {i}: invalid JSON')
"
```

## Common Use Cases

- **Structured logging**: Application logs in JSON format (ELK stack, Datadog, CloudWatch).
- **Bulk APIs**: Elasticsearch `_bulk` API, OpenAI batch API, BigQuery load jobs.
- **Data pipelines**: Intermediate format between extraction and loading (ETL).
- **Streaming data**: Server-Sent Events, WebSocket messages, Kafka message dumps.
- **Database exports**: MongoDB `mongoexport`, PostgreSQL `COPY ... FORMAT json`.
- **Machine learning**: Training data, model evaluation results.
- **Log analysis**: Splunk, Loki, and other log aggregation systems.
- **Data exchange**: Large dataset transfers where streaming is needed.
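For the bulk-API case, Elasticsearch's `_bulk` endpoint expects NDJSON directly: alternating action and source lines, each a compact JSON value, with a required trailing newline. A sketch of assembling such a body (the index name and documents are illustrative; a real request would typically also set the `Content-Type: application/x-ndjson` header):

```python
import json

def bulk_payload(index, docs):
    """Build an Elasticsearch _bulk body: action line, then source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # _bulk requires a trailing newline

payload = bulk_payload("users", [{"id": 1, "name": "Alice"},
                                 {"id": 2, "name": "Bob"}])
# payload contains four lines: two action lines interleaved with two documents
```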

## Pros & Cons

### Pros

- Streamable — process records one at a time with constant memory.
- Appendable — just add a line to the end (impossible with JSON arrays).
- Line-independent — corrupt lines do not invalidate the entire file.
- Simple — just JSON with newlines. No new parser needed.
- Unix-friendly — works with grep, wc, head, tail, sort, jq.
- Splittable — easy to split files for parallel processing.
- Compresses well — especially with gzip (.jsonl.gz).
- Human-readable (each line is valid JSON).
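The streamable and compresses-well points combine nicely: gzip is itself a stream format, so a `.jsonl.gz` file can be decompressed and parsed line by line without ever holding the whole dataset in memory. A stdlib sketch (the path is illustrative):

```python
import gzip
import json
import os
import tempfile

records = [{"id": i} for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), "data.jsonl.gz")

# Write gzipped NDJSON; "wt" gives a text-mode stream over the gzip file
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Stream it back one record at a time, constant memory
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```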

### Cons

- No schema — no validation, no type enforcement.
- Verbose — JSON overhead on every line (repeated key names).
- Not as compact as binary formats (Parquet, Avro, MessagePack).
- No random access — must scan from the beginning to find a specific record.
- Heterogeneous lines are valid — nothing enforces consistent structure across lines.
- No built-in indexing or query optimization.
- Large files can be slow compared to columnar formats for analytical queries.
- Multi-line JSON values must be flattened to single lines.

## Compatibility

| Tool/Language | Support |
|---------------|---------|
| Python | `json` stdlib, pandas, polars |
| JavaScript | `JSON.parse` per line, `ndjson` pkg |
| Go | `json.Decoder`, `bufio.Scanner` |
| Rust | `serde_json`, line-by-line |
| jq | Native support (default input mode) |
| DuckDB | `read_json_auto()`, `read_ndjson_auto()` |
| Elasticsearch | `_bulk` API format |
| BigQuery | Native JSONL load support |
| Spark | `spark.read.json("path.jsonl")` |
| AWS | Athena, Glue, Kinesis support |

MIME type: application/x-ndjson or application/jsonl. File extensions: .jsonl, .ndjson, .json (ambiguous).

## Related Formats

- **JSON**: Standard format — NDJSON/JSONL is the streaming-friendly variant.
- **CSV**: Simpler tabular text format — no nesting, no types.
- **Parquet**: Columnar binary format — much better for analytics on large data.
- **Avro**: Binary row format with schema — better for schema-enforced streaming.
- **JSON Sequence (RFC 7464)**: Uses ASCII Record Separator (0x1E) as delimiter.
- **Server-Sent Events**: HTTP streaming format that can carry JSON payloads.
- **MessagePack**: Binary alternative when size/speed matters more than readability.

## Practical Usage

- Use `jq -c` for all NDJSON transformations -- the `-c` (compact) flag ensures output remains one JSON value per line, preserving the NDJSON format.
- Compress NDJSON files with gzip (`.jsonl.gz`) for storage and transfer -- NDJSON compresses exceptionally well due to repeated key names, and most tools (pandas, DuckDB, jq) can read gzipped NDJSON directly.
- Use `pd.read_json("file.jsonl", lines=True, chunksize=10000)` or `pl.scan_ndjson()` for memory-efficient processing of large NDJSON files without loading everything into memory.
- Prefer NDJSON over JSON arrays for append-only data (logs, event streams) since appending is a simple file append operation rather than requiring the file to be rewritten.
- Validate NDJSON files before ingestion by checking that each line parses independently -- a single malformed line should not invalidate the entire dataset.
- Use DuckDB for SQL queries directly on NDJSON files (`read_json_auto()`) when you need analytical queries without loading data into a database.
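Line independence also makes splitting for parallel workers trivial: on the shell, `split -l` does it by line count; in code, dealing lines out round-robin works just as well. A stdlib sketch (function and variable names are illustrative):

```python
import json

def split_round_robin(lines, n_workers):
    """Deal NDJSON lines out to n_workers independently processable chunks."""
    chunks = [[] for _ in range(n_workers)]
    for i, line in enumerate(l for l in lines if l.strip()):
        chunks[i % n_workers].append(line)
    return chunks

lines = [json.dumps({"id": i}) for i in range(5)]
chunks = split_round_robin(lines, 2)
# every chunk is itself a valid NDJSON document
```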

## Anti-Patterns

- **Wrapping NDJSON lines in a JSON array for streaming** -- The entire point of NDJSON is to avoid wrapping in an array; adding brackets defeats the streaming and append advantages.
- **Pretty-printing NDJSON with multi-line JSON** -- Each JSON value must be on a single line; multi-line formatting breaks every tool that reads NDJSON line-by-line.
- **Using NDJSON for large-scale analytical workloads** -- NDJSON is row-oriented and repeats key names on every line; for analytical queries on large datasets, use Parquet or another columnar format instead.
- **Assuming consistent schema across all lines** -- NDJSON has no schema enforcement; always validate or handle missing/extra fields defensively in your processing code.
- **Ignoring encoding issues** -- NDJSON must be UTF-8 throughout; mixing encodings or including unescaped control characters within JSON values will corrupt the file.
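The schema-assumption anti-pattern is easiest to defuse with defensive field access: read with `dict.get` and defaults rather than indexing directly, so a heterogeneous line degrades gracefully instead of raising `KeyError`. A sketch (field names are illustrative):

```python
import json

lines = [
    '{"id": 1, "name": "Alice", "age": 30}',
    '{"id": 2, "name": "Bob"}',              # "age" missing; still valid NDJSON
]

records = []
for line in lines:
    obj = json.loads(line)
    records.append({
        "id": obj.get("id"),
        "name": obj.get("name", "unknown"),  # default instead of KeyError
        "age": obj.get("age"),               # None when the field is absent
    })
```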
