ClickHouse
Build high-performance OLAP queries on ClickHouse using MergeTree engines, materialized views, and aggregations.
You are an expert in ClickHouse OLAP database engineering, skilled at designing MergeTree table schemas, building materialized view pipelines, optimizing query performance, and managing data at petabyte scale.
Core Philosophy
Column-Oriented Storage with Ordering Keys
ClickHouse stores data in columns sorted by the ORDER BY key. The ordering key is the single most important schema decision; it determines query performance, compression ratio, and merge efficiency.
Eventual Consistency via Background Merges
ClickHouse inserts data into parts that are merged asynchronously. Queries may see duplicate rows before merges complete. Use ReplacingMergeTree or AggregatingMergeTree and apply FINAL or explicit deduplication in queries.
Insert in Batches, Never Row-by-Row
ClickHouse is optimized for bulk inserts of thousands to millions of rows per batch. Each insert creates a new data part; too many small inserts overwhelm the merge process and degrade performance.
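The batching rule above can be sketched as a small client-side buffer. This is an illustrative assumption, not a ClickHouse API: `flush_fn` would typically wrap `client.insert` from clickhouse_connect, and the 50K default is a reasonable starting point within the 10K-1M range mentioned below.

```python
# Minimal sketch of a client-side insert buffer (hypothetical helper, not a
# ClickHouse or clickhouse_connect API). Rows accumulate in memory and are
# handed to flush_fn in one batch once the threshold is reached.
class InsertBuffer:
    def __init__(self, flush_fn, max_rows=50_000):
        # flush_fn receives a list of row tuples, e.g.
        # lambda rows: client.insert("analytics.events", rows)
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        # Send whatever has accumulated as a single batch insert.
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
```

In production you would also flush on a timer so a slow trickle of events does not sit in memory indefinitely; async_insert on the server side is another option.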
Setup
Install ClickHouse and configure a client connection:
# Install on Linux
curl https://clickhouse.com/ | sh
sudo ./clickhouse install
# Start server
sudo clickhouse-server --config-file=/etc/clickhouse-server/config.xml
# Connect via client
clickhouse-client --host localhost --port 9000
Python client setup:
import os

import clickhouse_connect
client = clickhouse_connect.get_client(
host="clickhouse.example.com",
port=8443,
username="default",
password=os.environ["CLICKHOUSE_PASSWORD"],
secure=True,
)
result = client.query("SELECT version()")
print(result.result_rows[0][0])
Key Patterns
Do: Design ordering keys to match query filter patterns
-- Query pattern: filter by tenant, then time range, then event type
CREATE TABLE analytics.events
(
tenant_id UInt32,
event_date Date,
event_time DateTime64(3),
event_type LowCardinality(String),
user_id UInt64,
properties String,
revenue Decimal(18, 4)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, event_type)
TTL event_date + INTERVAL 2 YEAR
SETTINGS index_granularity = 8192;
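To reason about how PARTITION BY toYYYYMM(event_date) prunes, it can help to compute which monthly partitions a date-range filter touches. This is pure client-side arithmetic, a sketch rather than anything ClickHouse exposes:

```python
from datetime import date

# Hypothetical helper: list the toYYYYMM partition IDs a closed date range
# would touch on the table above. ClickHouse prunes all other partitions
# before reading any granules.
def partitions_touched(start: date, end: date) -> list[int]:
    parts = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        parts.append(y * 100 + m)  # same encoding as toYYYYMM
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return parts
```

A one-week dashboard query therefore reads at most two partitions, which is why monthly partitioning pairs well with time-bounded filters.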
Do Not: Insert rows one at a time
# BAD - creates a new part per insert, kills merge performance
for row in rows:
client.command(f"INSERT INTO events VALUES ({row})")
# GOOD - batch insert thousands of rows at once
client.insert(
"analytics.events",
data=batch_rows, # list of tuples, 10K-1M rows per batch
column_names=["tenant_id", "event_date", "event_time",
"event_type", "user_id", "properties", "revenue"],
)
Do: Use LowCardinality for string columns with limited distinct values
-- LowCardinality stores as dictionary-encoded integers
CREATE TABLE analytics.logs
(
timestamp DateTime64(3),
level LowCardinality(String), -- ~5 distinct values
service LowCardinality(String), -- ~50 distinct values
message String, -- high cardinality, keep as String
trace_id FixedString(32) -- fixed length IDs
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);
Common Patterns
Materialized view for real-time aggregation
-- Target table stores pre-aggregated data
CREATE TABLE analytics.hourly_revenue
(
hour DateTime,
tenant_id UInt32,
event_type LowCardinality(String),
revenue_sum AggregateFunction(sum, Decimal(18, 4)),
event_count AggregateFunction(count),
user_uniq AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, hour, event_type);
-- Materialized view auto-populates on insert to source table
CREATE MATERIALIZED VIEW analytics.hourly_revenue_mv
TO analytics.hourly_revenue
AS
SELECT
toStartOfHour(event_time) AS hour,
tenant_id,
event_type,
sumState(revenue) AS revenue_sum,
countState() AS event_count,
uniqState(user_id) AS user_uniq
FROM analytics.events
GROUP BY hour, tenant_id, event_type;
-- Query the aggregated table
SELECT
hour,
event_type,
sumMerge(revenue_sum) AS total_revenue,
countMerge(event_count) AS total_events,
uniqMerge(user_uniq) AS unique_users
FROM analytics.hourly_revenue
WHERE tenant_id = 1
AND hour >= now() - INTERVAL 7 DAY
GROUP BY hour, event_type
ORDER BY hour;
ReplacingMergeTree for mutable records
CREATE TABLE analytics.dim_customers
(
customer_id UInt64,
name String,
email String,
plan LowCardinality(String),
updated_at DateTime64(3)
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;
-- Insert updates as new rows; ClickHouse deduplicates on merge
INSERT INTO analytics.dim_customers VALUES
(1, 'Alice', 'alice@example.com', 'pro', now());
-- Query with FINAL to get latest version (slower, use sparingly)
SELECT * FROM analytics.dim_customers FINAL
WHERE customer_id = 1;
-- Preferred: use subquery dedup for large scans
SELECT * FROM (
SELECT *, row_number() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM analytics.dim_customers
) WHERE rn = 1;
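If many tables follow this ReplacingMergeTree pattern, the dedup subquery can be templated. A minimal sketch, assuming caller-supplied table and column names (do not pass untrusted input into it, since the names are interpolated directly):

```python
# Hypothetical helper that renders the row_number() dedup query above for any
# ReplacingMergeTree table. table/key/version_col are assumptions supplied by
# the caller, not discovered from ClickHouse.
def latest_version_query(table: str, key: str, version_col: str) -> str:
    return (
        "SELECT * FROM ("
        "SELECT *, row_number() OVER "
        f"(PARTITION BY {key} ORDER BY {version_col} DESC) AS rn "
        f"FROM {table}"
        ") WHERE rn = 1"
    )
```

The rendered string can then be passed to client.query() wherever FINAL would be too slow.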
Inserting from Kafka
CREATE TABLE analytics.events_kafka_queue
(
tenant_id UInt32,
event_time DateTime64(3),
event_type String,
user_id UInt64,
payload String
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'broker1:9092,broker2:9092',
kafka_topic_list = 'events',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4;
-- Materialized view moves data from Kafka engine to MergeTree
CREATE MATERIALIZED VIEW analytics.events_kafka_mv
TO analytics.events
AS
SELECT
tenant_id,
toDate(event_time) AS event_date,
event_time,
toLowCardinality(event_type) AS event_type,
user_id,
payload AS properties,
toDecimal64(JSONExtractFloat(payload, 'revenue'), 4) AS revenue
FROM analytics.events_kafka_queue;
Performance diagnostics
-- Find slow queries
SELECT
query,
query_duration_ms,
read_rows,
formatReadableSize(read_bytes) AS data_read,
result_rows
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_date = today()
ORDER BY query_duration_ms DESC
LIMIT 10;
-- Check table sizes and part counts
SELECT
table,
formatReadableSize(sum(bytes_on_disk)) AS disk_size,
sum(rows) AS total_rows,
count() AS part_count
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
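The part-count query above is the raw signal; deciding when to act on it is client-side logic. A sketch, where the rows are (table, part_count) pairs from that query and the threshold is an assumed warning level, not an official ClickHouse limit (the server's own parts_to_throw_insert guard is configurable):

```python
# Hypothetical monitoring helper: flag tables whose active part count
# suggests background merges are falling behind incoming inserts.
def tables_with_too_many_parts(rows, threshold=300):
    # rows: iterable of (table, part_count) tuples from system.parts
    return [table for table, part_count in rows if part_count > threshold]
```

Tables it flags are usually victims of too-small insert batches; fix the producer rather than raising the threshold.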
Anti-Patterns
- Using ORDER BY on columns not present in WHERE clauses: The ordering key must match your most common query filter patterns; mismatched keys force full scans
- Inserting fewer than 1000 rows per batch: Creates excessive parts that overwhelm background merges; buffer inserts to at least 10K-100K rows per batch or use Buffer tables
- Relying on FINAL keyword for all queries on ReplacingMergeTree: FINAL forces synchronous dedup at query time and is slow on large tables; prefer row_number() dedup in subqueries
- Storing high-cardinality strings as LowCardinality: Dictionary encoding becomes counterproductive when cardinality exceeds millions; use plain String for UUIDs, URLs, and free-text fields
When to Use
- Building real-time analytics dashboards that query billions of rows with sub-second latency
- Aggregating high-volume event streams (clickstream, logs, IoT telemetry) with materialized views
- Running OLAP workloads where columnar storage and vectorized execution provide orders-of-magnitude speedups over row-oriented databases
- Replacing Elasticsearch for log analytics when structured SQL queries are preferred
- Serving multi-tenant analytics where per-tenant query isolation is managed via ordering keys and partition pruning
Install this skill directly: skilldb add data-pipeline-services-skills
Related Skills
Airbyte
Configure Airbyte open-source data integration with custom connectors, destinations, and CDC replication.
Apache Airflow
Orchestrate data pipelines using Apache Airflow DAGs, operators, sensors, and XCom.
Apache Spark
Process large-scale data with Apache Spark using PySpark DataFrames, Spark SQL, and structured streaming.
BigQuery
Build analytical pipelines on Google BigQuery using SQL, streaming inserts, and federated queries.
DBT
Build and test data transformation pipelines using dbt models, macros, and incremental strategies.
Fivetran
Configure and manage Fivetran connectors for automated data ingestion into warehouses.