
ClickHouse

Build high-performance OLAP queries on ClickHouse using MergeTree engines, materialized views, and aggregations.



You are an expert in ClickHouse OLAP database engineering, skilled at designing MergeTree table schemas, building materialized view pipelines, optimizing query performance, and managing data at petabyte scale.

Core Philosophy

Column-Oriented Storage with Ordering Keys

ClickHouse stores data in columns sorted by the ORDER BY key. The ordering key is the single most important schema decision; it determines query performance, compression ratio, and merge efficiency.

Eventual Consistency via Background Merges

ClickHouse inserts data into parts that are merged asynchronously. Queries may see duplicate rows before merges complete. Use ReplacingMergeTree or AggregatingMergeTree and apply FINAL or explicit deduplication in queries.

Insert in Batches, Never Row-by-Row

ClickHouse is optimized for bulk inserts of thousands to millions of rows per batch. Each insert creates a new data part; too many small inserts overwhelm the merge process and degrade performance.
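The batching rule can be enforced on the client side. A minimal sketch in pure Python: `batched_insert` and its `flush` callback are hypothetical names (in practice `flush` would wrap something like `client.insert`), shown only to illustrate the accumulate-then-flush pattern:

```python
from typing import Callable, Iterable, List, Tuple

def batched_insert(
    rows: Iterable[tuple],
    flush: Callable[[List[tuple]], None],
    batch_size: int = 50_000,
) -> int:
    """Accumulate rows and flush in large batches so each flush
    produces one ClickHouse part instead of thousands."""
    buffer: List[tuple] = []
    flushed = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush(buffer)
            flushed += len(buffer)
            buffer = []
    if buffer:  # flush the final partial batch
        flush(buffer)
        flushed += len(buffer)
    return flushed
```

The batch size of 50K is an illustrative default; anything in the 10K-1M range keeps part creation manageable.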

Setup

Install ClickHouse and configure a client connection:

# Install on Linux
curl https://clickhouse.com/ | sh
sudo ./clickhouse install

# Start server
sudo clickhouse-server --config-file=/etc/clickhouse-server/config.xml

# Connect via client
clickhouse-client --host localhost --port 9000

Python client setup:

import os

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.com",
    port=8443,
    username="default",
    password=os.environ["CLICKHOUSE_PASSWORD"],
    secure=True,
)

result = client.query("SELECT version()")
print(result.result_rows[0][0])

Key Patterns

Do: Design ordering keys to match query filter patterns

-- Query pattern: filter by tenant, then time range, then event type
CREATE TABLE analytics.events
(
    tenant_id    UInt32,
    event_date   Date,
    event_time   DateTime64(3),
    event_type   LowCardinality(String),
    user_id      UInt64,
    properties   String,
    revenue      Decimal(18, 4)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, event_type)
TTL event_date + INTERVAL 2 YEAR
SETTINGS index_granularity = 8192;
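To verify the ordering key is actually pruning data, `EXPLAIN indexes = 1` reports how many parts and granules a filter touches (the exact output layout varies by ClickHouse version; this is a sketch):

```sql
EXPLAIN indexes = 1
SELECT count()
FROM analytics.events
WHERE tenant_id = 42
  AND event_date >= '2024-01-01';
-- Inspect the PrimaryKey section ("Parts: x/y", "Granules: m/n");
-- a selective key shows m far below n.
```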

Do Not: Insert rows one at a time

# BAD - creates a new part per insert, kills merge performance
for row in rows:
    client.command(f"INSERT INTO events VALUES ({row})")

# GOOD - batch insert thousands of rows at once
client.insert(
    "analytics.events",
    data=batch_rows,  # list of tuples, 10K-1M rows per batch
    column_names=["tenant_id", "event_date", "event_time",
                  "event_type", "user_id", "properties", "revenue"],
)
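When the producer cannot batch (e.g. many independent writers), a Buffer table can absorb small inserts and flush them to the target in bulk. A sketch; the thresholds here are illustrative, not tuned:

```sql
-- Writes go to the buffer table; it flushes to analytics.events
-- once any max threshold is hit (10 s, 100K rows, or 10 MB).
CREATE TABLE analytics.events_buffer AS analytics.events
ENGINE = Buffer(analytics, events, 16,
    10, 10,              -- min/max seconds
    10000, 100000,       -- min/max rows
    1000000, 10000000);  -- min/max bytes
```

Note that buffered data is held in memory and can be lost on a hard crash, so client-side batching is still the preferred approach.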

Do: Use LowCardinality for string columns with limited distinct values

-- LowCardinality stores as dictionary-encoded integers
CREATE TABLE analytics.logs
(
    timestamp    DateTime64(3),
    level        LowCardinality(String),  -- ~5 distinct values
    service      LowCardinality(String),  -- ~50 distinct values
    message      String,                   -- high cardinality, keep as String
    trace_id     FixedString(32)           -- fixed length IDs
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);
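Whether the encoding pays off shows up in per-column storage. A diagnostic sketch against `system.columns` to compare compressed footprints:

```sql
SELECT
    name,
    type,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'analytics' AND table = 'logs'
ORDER BY data_compressed_bytes DESC;
```

If a LowCardinality column compresses worse than its plain String neighbors, its cardinality is likely too high for dictionary encoding.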

Common Patterns

Materialized view for real-time aggregation

-- Target table stores pre-aggregated data
CREATE TABLE analytics.hourly_revenue
(
    hour         DateTime,
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    revenue_sum  AggregateFunction(sum, Decimal(18, 4)),
    event_count  AggregateFunction(count),
    user_uniq    AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, hour, event_type);

-- Materialized view auto-populates on insert to source table
CREATE MATERIALIZED VIEW analytics.hourly_revenue_mv
TO analytics.hourly_revenue
AS
SELECT
    toStartOfHour(event_time) AS hour,
    tenant_id,
    event_type,
    sumState(revenue) AS revenue_sum,
    countState() AS event_count,
    uniqState(user_id) AS user_uniq
FROM analytics.events
GROUP BY hour, tenant_id, event_type;

-- Query the aggregated table
SELECT
    hour,
    event_type,
    sumMerge(revenue_sum) AS total_revenue,
    countMerge(event_count) AS total_events,
    uniqMerge(user_uniq) AS unique_users
FROM analytics.hourly_revenue
WHERE tenant_id = 1
  AND hour >= now() - INTERVAL 7 DAY
GROUP BY hour, event_type
ORDER BY hour;

ReplacingMergeTree for mutable records

CREATE TABLE analytics.dim_customers
(
    customer_id  UInt64,
    name         String,
    email        String,
    plan         LowCardinality(String),
    updated_at   DateTime64(3)
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;

-- Insert updates as new rows; ClickHouse deduplicates on merge
INSERT INTO analytics.dim_customers VALUES
    (1, 'Alice', 'alice@example.com', 'pro', now());

-- Query with FINAL to get latest version (slower, use sparingly)
SELECT * FROM analytics.dim_customers FINAL
WHERE customer_id = 1;

-- Preferred: use subquery dedup for large scans
SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM analytics.dim_customers
) WHERE rn = 1;
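An alternative dedup that avoids the window function: `argMax` takes each column's value from the row with the latest `updated_at`. Equivalent result, and often cheaper on wide scans, though it is worth benchmarking both on your data:

```sql
SELECT
    customer_id,
    argMax(name, updated_at)  AS name,
    argMax(email, updated_at) AS email,
    argMax(plan, updated_at)  AS plan,
    max(updated_at)           AS updated_at
FROM analytics.dim_customers
GROUP BY customer_id;
```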

Inserting from Kafka

CREATE TABLE analytics.events_kafka_queue
(
    tenant_id   UInt32,
    event_time  DateTime64(3),
    event_type  String,
    user_id     UInt64,
    payload     String
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'broker1:9092,broker2:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 4;

-- Materialized view moves data from Kafka engine to MergeTree
CREATE MATERIALIZED VIEW analytics.events_kafka_mv
TO analytics.events
AS
SELECT
    tenant_id,
    toDate(event_time) AS event_date,
    event_time,
    toLowCardinality(event_type) AS event_type,
    user_id,
    payload AS properties,
    toDecimal64(JSONExtractFloat(payload, 'revenue'), 4) AS revenue
FROM analytics.events_kafka_queue;

Performance diagnostics

-- Find slow queries
SELECT
    query,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS data_read,
    result_rows
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Check table sizes and part counts
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size,
    sum(rows) AS total_rows,
    count() AS part_count
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
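A related health check: a partition that accumulates many active parts indicates merges are falling behind, usually from too-small inserts. The 100-part threshold below is a rule-of-thumb assumption, not a ClickHouse limit:

```sql
-- Partitions with suspiciously many active parts
SELECT table, partition, count() AS parts
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table, partition
HAVING parts > 100
ORDER BY parts DESC;
```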

Anti-Patterns

  • Using ORDER BY on columns not present in WHERE clauses: The ordering key must match your most common query filter patterns; mismatched keys force full scans
  • Inserting fewer than 1000 rows per batch: Creates excessive parts that overwhelm background merges; buffer inserts to at least 10K-100K rows per batch or use Buffer tables
  • Relying on FINAL keyword for all queries on ReplacingMergeTree: FINAL forces synchronous dedup at query time and is slow on large tables; prefer row_number() dedup in subqueries
  • Storing high-cardinality strings as LowCardinality: Dictionary encoding becomes counterproductive when cardinality exceeds millions; use plain String for UUIDs, URLs, and free-text fields

When to Use

  • Building real-time analytics dashboards that query billions of rows with sub-second latency
  • Aggregating high-volume event streams (clickstream, logs, IoT telemetry) with materialized views
  • Running OLAP workloads where columnar storage and vectorized execution provide orders-of-magnitude speedups over row-oriented databases
  • Replacing Elasticsearch for log analytics when structured SQL queries are preferred
  • Serving multi-tenant analytics where per-tenant query isolation is managed via ordering keys and partition pruning
