ClickHouse Schema Design (Core Workflow A)
Overview
Design ClickHouse tables with correct engine selection, ORDER BY keys, partitioning, and codec choices for analytical workloads.
Prerequisites
@clickhouse/clientconnected (seeclickhouse-install-auth)- Understanding of your query patterns (what you filter and group on)
Instructions
Step 1: Choose the Right Engine
| Engine | Best For | Dedup? | Example |
|--------|----------|--------|---------|
| MergeTree | General analytics, append-only logs | No | Clickstream, IoT |
| ReplacingMergeTree | Mutable rows (upserts) | Yes (on merge) | User profiles, state |
| SummingMergeTree | Pre-aggregated counters | Sums numerics | Page view counts |
| AggregatingMergeTree | Materialized view targets | Merges states | Dashboards |
| CollapsingMergeTree | Stateful row updates | Collapses +-1 | Shopping carts |
ClickHouse Cloud uses SharedMergeTree — it is a drop-in replacement for
MergeTree on Cloud. You do not need to change your DDL.
Step 2: Design the ORDER BY (Sort Key)
The ORDER BY clause is the single most important schema decision. It defines:
- Primary index — sparse index over sort-key granules (8192 rows default)
- Data layout on disk — rows sorted physically by these columns
- Query speed — queries filtering on ORDER BY prefix columns hit fewer granules
Rules of thumb:
- Put low-cardinality filter columns first (
event_type,status) - Then high-cardinality columns you filter on (
user_id,tenant_id) - End with a time column if you use range filters (
created_at) - Do NOT put high-cardinality columns you never filter on in ORDER BY
-- Good: filter by tenant, then by time ranges
ORDER BY (tenant_id, event_type, created_at)
-- Bad: UUID first means every query scans the full index
ORDER BY (event_id, created_at) -- event_id is random UUID
Step 3: Schema Examples
Event Analytics Table
CREATE TABLE analytics.events (
event_id UUID DEFAULT generateUUIDv4(),
tenant_id UInt32,
event_type LowCardinality(String),
user_id UInt64,
session_id String,
properties String CODEC(ZSTD(3)), -- JSON blob, compress well
url String CODEC(ZSTD(1)),
ip_address IPv4,
country LowCardinality(FixedString(2)),
created_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, toDate(created_at), user_id)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;
User Profile Table (Upserts)
CREATE TABLE analytics.users (
user_id UInt64,
email String,
plan LowCardinality(String),
mrr_cents UInt32,
properties String CODEC(ZSTD(3)),
updated_at DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at) -- keeps latest row per ORDER BY key
ORDER BY user_id;
-- Query with FINAL to get deduplicated results
SELECT * FROM analytics.users FINAL WHERE user_id = 42;
Daily Aggregation Table
CREATE TABLE analytics.daily_stats (
date Date,
tenant_id UInt32,
event_type LowCardinality(String),
event_count UInt64,
unique_users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, event_type, date);
Step 4: Partitioning Guidelines
| Partition Expression | Typical Use | Parts Per Partition |
|---------------------|-------------|---------------------|
| toYYYYMM(date) | Most common — monthly | Target 10-1000 |
| toMonday(date) | Weekly rollups | More parts, finer drops |
| toYYYYMMDD(date) | Daily TTL drops | Many parts — use carefully |
| None | Small tables (<1M rows) | Fine |
Warning: Each partition creates separate parts on disk. Over-partitioning
(e.g., by user_id) creates millions of tiny parts and kills performance.
Step 5: Codecs and Compression
-- Column-level compression codecs
column1 UInt64 CODEC(Delta, ZSTD(3)), -- Time series / sequential IDs
column2 Float64 CODEC(Gorilla, ZSTD(1)), -- Floating point (similar values)
column3 String CODEC(ZSTD(3)), -- General text / JSON
column4 DateTime CODEC(DoubleDelta, ZSTD), -- Timestamps (near-sequential)
Applying Schema via Node.js
import { createClient } from '@clickhouse/client';
const client = createClient({ url: process.env.CLICKHOUSE_HOST! });
async function applySchema() {
await client.command({ query: 'CREATE DATABASE IF NOT EXISTS analytics' });
await client.command({
query: `
CREATE TABLE IF NOT EXISTS analytics.events (
event_id UUID DEFAULT generateUUIDv4(),
tenant_id UInt32,
event_type LowCardinality(String),
user_id UInt64,
payload String CODEC(ZSTD(3)),
created_at DateTime DEFAULT now()
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, created_at)
PARTITION BY toYYYYMM(created_at)
`,
});
console.log('Schema applied.');
}
Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| ORDER BY expression not in primary key | PRIMARY KEY != ORDER BY | Remove explicit PRIMARY KEY or align |
| Too many parts (300+) | Over-partitioning | Use coarser partition expression |
| Cannot convert String to UInt64 | Wrong data type | Match insert types to schema |
| TTL expression type mismatch | TTL on non-date column | TTL must reference DateTime column |
Resources
Next Steps
For inserting and querying data, see clickhouse-core-workflow-b.