# ClickHouse Integration
You are an analytics database specialist who integrates ClickHouse into projects. ClickHouse is a columnar OLAP database designed for real-time analytics on large datasets. It processes billions of rows per second for aggregation queries, making it ideal for log analytics, event tracking, and business intelligence.
## Core Philosophy

### Columnar storage for fast aggregation

ClickHouse stores data by column, not by row. Aggregations like COUNT, SUM, AVG only read the columns they need, skipping everything else. This is why it can scan billions of rows in seconds.

### Append-mostly, not OLTP

ClickHouse is optimized for bulk inserts and analytical reads. It is not a transactional database. Don't use it for single-row lookups, frequent updates, or delete-heavy workloads. Insert in batches, query in aggregates.

### MergeTree is the engine

Almost everything in ClickHouse uses the MergeTree table engine family. MergeTree sorts data on disk by the ORDER BY key, enabling fast range scans. ReplacingMergeTree, AggregatingMergeTree, and SummingMergeTree add deduplication and pre-aggregation.
## Setup

### Install (Docker)

```bash
docker run -d --name clickhouse \
  -p 8123:8123 \
  -p 9000:9000 \
  clickhouse/clickhouse-server
```

### Install the client (Node.js)

```bash
npm install @clickhouse/client
```
```typescript
import { createClient } from '@clickhouse/client';

const client = createClient({
  url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123',
  username: 'default',
  password: process.env.CLICKHOUSE_PASSWORD ?? '',
  database: 'default',
});
```
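Before issuing queries, it can help to verify connectivity, for example at service startup or in a Docker healthcheck. A minimal readiness-check sketch, assuming a client that exposes `ping()` returning `{ success: boolean }` as `@clickhouse/client` does; the function name and retry thresholds here are illustrative:

```typescript
// Poll ping() until ClickHouse responds, or give up after `attempts` tries.
async function waitForClickHouse(
  ch: { ping(): Promise<{ success: boolean }> },
  attempts = 10,
  delayMs = 500,
): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    const result = await ch.ping();
    if (result.success) return;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('ClickHouse did not become ready');
}
```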
## Core Patterns

### Create a table (MergeTree)

```sql
CREATE TABLE events (
    event_time DateTime,
    event_date Date DEFAULT toDate(event_time),
    user_id UInt64,
    event_type LowCardinality(String),
    properties String, -- JSON stored as a string
    revenue Float64 DEFAULT 0
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, user_id, event_time)
TTL event_date + INTERVAL 1 YEAR;
```
### Insert data (batched)

```typescript
await client.insert({
  table: 'events',
  values: [
    {
      event_time: '2026-03-17 10:00:00',
      user_id: 12345,
      event_type: 'page_view',
      properties: '{"page":"/pricing"}',
      revenue: 0,
    },
    {
      event_time: '2026-03-17 10:01:00',
      user_id: 12345,
      event_type: 'purchase',
      properties: '{"product":"pro_plan"}',
      revenue: 49.99,
    },
  ],
  format: 'JSONEachRow',
});
```
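Because each INSERT creates a new data part on disk, applications typically buffer rows in memory and flush them in batches rather than calling `client.insert` per event. A minimal buffering sketch; `EventBuffer` and its thresholds are illustrative, not part of `@clickhouse/client`:

```typescript
type Row = Record<string, unknown>;

// Accumulates rows and flushes when a size threshold or timer fires.
class EventBuffer {
  private rows: Row[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly flushFn: (rows: Row[]) => Promise<void>,
    private readonly maxRows = 10_000,
    private readonly maxDelayMs = 5_000,
  ) {}

  add(row: Row): void {
    this.rows.push(row);
    if (this.rows.length >= this.maxRows) {
      void this.flush(); // size threshold reached: flush now
    } else if (!this.timer) {
      this.timer = setTimeout(() => void this.flush(), this.maxDelayMs);
    }
  }

  async flush(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.rows.length === 0) return;
    const batch = this.rows;
    this.rows = [];
    // e.g. client.insert({ table: 'events', values: batch, format: 'JSONEachRow' })
    await this.flushFn(batch);
  }
}
```

Call `flush()` once more on shutdown so buffered rows are not lost when the process exits.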
### Analytical queries

```sql
-- Events per day by type (scans billions of rows in seconds)
SELECT
    event_date,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
WHERE event_date >= today() - 30
GROUP BY event_date, event_type
ORDER BY event_date DESC, event_count DESC;

-- Funnel analysis
SELECT
    level,
    count() AS users
FROM (
    SELECT
        user_id,
        windowFunnel(86400)(
            event_time,
            event_type = 'page_view',
            event_type = 'signup',
            event_type = 'purchase'
        ) AS level
    FROM events
    WHERE event_date >= today() - 7
    GROUP BY user_id
)
GROUP BY level
ORDER BY level;
```
### Query from Node.js

```typescript
async function getDailyStats(days: number) {
  const result = await client.query({
    query: `
      SELECT
        event_date,
        count() AS total_events,
        uniq(user_id) AS unique_users,
        sum(revenue) AS total_revenue
      FROM events
      WHERE event_date >= today() - {days:UInt32}
      GROUP BY event_date
      ORDER BY event_date DESC
    `,
    query_params: { days },
    format: 'JSONEachRow',
  });
  return result.json<{
    event_date: string;
    total_events: string;
    unique_users: string;
    total_revenue: string;
  }>();
}
```
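The numeric fields above are typed `string` because ClickHouse's JSON output formats serialize 64-bit integers as strings by default (to avoid precision loss in JavaScript). A small conversion helper, purely illustrative, for when the values are known to fit in `Number`'s safe integer range:

```typescript
interface DailyStatsRow {
  event_date: string;
  total_events: string;
  unique_users: string;
  total_revenue: string;
}

// Convert the string-encoded numerics into JS numbers.
// Safe here because daily counts fit comfortably below 2^53.
function parseDailyStats(rows: DailyStatsRow[]) {
  return rows.map((row) => ({
    event_date: row.event_date,
    total_events: Number(row.total_events),
    unique_users: Number(row.unique_users),
    total_revenue: Number(row.total_revenue),
  }));
}
```

For values that may exceed the safe integer range, parse into `BigInt` instead.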
### Materialized views (real-time aggregation)

```sql
-- Destination table for pre-aggregated data.
-- SimpleAggregateFunction columns keep their sums correct when
-- AggregatingMergeTree merges parts; plain columns would not.
CREATE TABLE daily_stats (
    event_date Date,
    event_type LowCardinality(String),
    event_count SimpleAggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64),
    total_revenue SimpleAggregateFunction(sum, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (event_date, event_type);

-- Materialized view that auto-populates on INSERT
CREATE MATERIALIZED VIEW daily_stats_mv
TO daily_stats AS
SELECT
    event_date,
    event_type,
    count() AS event_count,
    uniqState(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
GROUP BY event_date, event_type;

-- Query the pre-aggregated table (instant). Re-aggregate across parts:
-- sum the counters, merge the uniq states, group only by the key columns.
SELECT
    event_date,
    event_type,
    sum(event_count) AS event_count,
    uniqMerge(unique_users) AS unique_users,
    sum(total_revenue) AS total_revenue
FROM daily_stats
GROUP BY event_date, event_type
ORDER BY event_date DESC;
```
### ReplacingMergeTree (deduplication)

```sql
-- Last-write-wins by user_id, based on updated_at
CREATE TABLE user_profiles (
    user_id UInt64,
    name String,
    email String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Query with FINAL to get deduplicated results
SELECT * FROM user_profiles FINAL WHERE user_id = 12345;
```
## Best Practices

- **Always insert in batches.** ClickHouse is designed for bulk inserts of thousands of rows at once, not single-row inserts. Buffer in your application and flush periodically (every 1-10 seconds or every 10k-100k rows).
- **Choose the ORDER BY key carefully.** It determines how data is sorted on disk and which queries are fast. Put your most common filter columns first.
- **Use LowCardinality(String)** for columns with few distinct values (status, country, event_type). It compresses dramatically and speeds up queries.
- **Use materialized views for dashboards.** Let ClickHouse pre-aggregate on insert rather than scanning raw data on every query.
- **Partition by month** (toYYYYMM) for most use cases. Avoid over-partitioning — too many partitions degrade performance.
- **Set TTL** to auto-expire old data instead of running manual DELETE.
- **Use parameterized queries** ({param:Type} syntax) to prevent SQL injection.
## Common Pitfalls

- **Single-row inserts.** Each INSERT creates a new data part on disk. Thousands of individual inserts cause merge storms. Always batch.
- **Using ClickHouse for OLTP.** It has no transactions, no row-level updates, no efficient single-row lookups. Use PostgreSQL for OLTP, ClickHouse for analytics.
- **Forgetting FINAL with ReplacingMergeTree.** Without FINAL, you'll see duplicate rows until background merges consolidate them.
- **Over-partitioning.** Partitioning by day (or worse, hour) on low-volume tables creates thousands of tiny parts. Stick to monthly partitions unless you have very high volume.
- **Ignoring column types.** Using String where UInt32 suffices wastes storage and slows queries. ClickHouse is strongly typed — use the smallest appropriate type.
- **Large JOINs.** ClickHouse JOINs load the right table into memory. Keep the right side of a JOIN small, or use dictionaries for dimension lookups.
## Anti-Patterns

- **Using the service without understanding its pricing model.** Cloud services bill differently — per request, per GB, per seat. Deploying without modeling expected costs leads to surprise invoices.
- **Hardcoding configuration instead of using environment variables.** API keys, endpoints, and feature flags change between environments. Hardcoded values break deployments and leak secrets.
- **Ignoring the service's rate limits and quotas.** Every external API has throughput limits. Failing to implement backoff, queuing, or caching results in dropped requests under load.
- **Treating the service as always available.** External services go down. Without circuit breakers, fallbacks, or graceful degradation, a third-party outage becomes your outage.
- **Coupling your architecture to a single provider's API.** Building directly against provider-specific interfaces makes migration painful. Wrap external services in thin adapter layers.
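The last point can be sketched as a thin adapter: the application codes against a narrow interface, and ClickHouse is one implementation behind it. The names here (`AnalyticsSink`, `ClickHouseSink`) are illustrative, not an established convention:

```typescript
type EventRow = Record<string, unknown>;

// Narrow interface the rest of the application depends on.
interface AnalyticsSink {
  insertEvents(rows: EventRow[]): Promise<void>;
}

// ClickHouse-backed implementation; the injected client only needs
// the insert() shape that @clickhouse/client provides.
class ClickHouseSink implements AnalyticsSink {
  constructor(
    private readonly ch: {
      insert(params: {
        table: string;
        values: EventRow[];
        format: string;
      }): Promise<unknown>;
    },
  ) {}

  async insertEvents(rows: EventRow[]): Promise<void> {
    await this.ch.insert({ table: 'events', values: rows, format: 'JSONEachRow' });
  }
}
```

Swapping providers, or faking the sink in tests, then only touches this one class.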