Agent Skills: ClickHouse Schema Design (Core Workflow A)

|

UncategorizedID: jeremylongshore/claude-code-plugins-plus-skills/clickhouse-core-workflow-a

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/clickhouse-pack/skills/clickhouse-core-workflow-a

Skill Files

Browse the full folder contents for clickhouse-core-workflow-a.

Download Skill

Loading file tree…

plugins/saas-packs/clickhouse-pack/skills/clickhouse-core-workflow-a/SKILL.md

Skill Metadata

Name
clickhouse-core-workflow-a
Description
|

ClickHouse Schema Design (Core Workflow A)

Overview

Design ClickHouse tables with correct engine selection, ORDER BY keys, partitioning, and codec choices for analytical workloads.

Prerequisites

  • @clickhouse/client connected (see clickhouse-install-auth)
  • Understanding of your query patterns (what you filter and group on)

Instructions

Step 1: Choose the Right Engine

| Engine | Best For | Dedup? | Example | |--------|----------|--------|---------| | MergeTree | General analytics, append-only logs | No | Clickstream, IoT | | ReplacingMergeTree | Mutable rows (upserts) | Yes (on merge) | User profiles, state | | SummingMergeTree | Pre-aggregated counters | Sums numerics | Page view counts | | AggregatingMergeTree | Materialized view targets | Merges states | Dashboards | | CollapsingMergeTree | Stateful row updates | Collapses +-1 | Shopping carts |

ClickHouse Cloud uses SharedMergeTree — it is a drop-in replacement for MergeTree on Cloud. You do not need to change your DDL.

Step 2: Design the ORDER BY (Sort Key)

The ORDER BY clause is the single most important schema decision. It defines:

  • Primary index — sparse index over sort-key granules (8192 rows default)
  • Data layout on disk — rows sorted physically by these columns
  • Query speed — queries filtering on ORDER BY prefix columns hit fewer granules

Rules of thumb:

  1. Put low-cardinality filter columns first (event_type, status)
  2. Then high-cardinality columns you filter on (user_id, tenant_id)
  3. End with a time column if you use range filters (created_at)
  4. Do NOT put high-cardinality columns you never filter on in ORDER BY
-- Good: filter by tenant, then by time ranges
ORDER BY (tenant_id, event_type, created_at)

-- Bad: UUID first means every query scans the full index
ORDER BY (event_id, created_at)  -- event_id is random UUID

Step 3: Schema Examples

Event Analytics Table

CREATE TABLE analytics.events (
    event_id     UUID DEFAULT generateUUIDv4(),
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    user_id      UInt64,
    session_id   String,
    properties   String CODEC(ZSTD(3)),  -- JSON blob, compress well
    url          String CODEC(ZSTD(1)),
    ip_address   IPv4,
    country      LowCardinality(FixedString(2)),
    created_at   DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, toDate(created_at), user_id)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;

User Profile Table (Upserts)

CREATE TABLE analytics.users (
    user_id      UInt64,
    email        String,
    plan         LowCardinality(String),
    mrr_cents    UInt32,
    properties   String CODEC(ZSTD(3)),
    updated_at   DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at)   -- keeps latest row per ORDER BY key
ORDER BY user_id;

-- Query with FINAL to get deduplicated results
SELECT * FROM analytics.users FINAL WHERE user_id = 42;

Daily Aggregation Table

CREATE TABLE analytics.daily_stats (
    date         Date,
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    event_count  UInt64,
    unique_users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, event_type, date);

Step 4: Partitioning Guidelines

| Partition Expression | Typical Use | Parts Per Partition | |---------------------|-------------|---------------------| | toYYYYMM(date) | Most common — monthly | Target 10-1000 | | toMonday(date) | Weekly rollups | More parts, finer drops | | toYYYYMMDD(date) | Daily TTL drops | Many parts — use carefully | | None | Small tables (<1M rows) | Fine |

Warning: Each partition creates separate parts on disk. Over-partitioning (e.g., by user_id) creates millions of tiny parts and kills performance.

Step 5: Codecs and Compression

-- Column-level compression codecs
column1  UInt64 CODEC(Delta, ZSTD(3)),      -- Time series / sequential IDs
column2  Float64 CODEC(Gorilla, ZSTD(1)),   -- Floating point (similar values)
column3  String CODEC(ZSTD(3)),              -- General text / JSON
column4  DateTime CODEC(DoubleDelta, ZSTD),  -- Timestamps (near-sequential)

Applying Schema via Node.js

import { createClient } from '@clickhouse/client';

const client = createClient({ url: process.env.CLICKHOUSE_HOST! });

async function applySchema() {
  await client.command({ query: 'CREATE DATABASE IF NOT EXISTS analytics' });

  await client.command({
    query: `
      CREATE TABLE IF NOT EXISTS analytics.events (
        event_id   UUID DEFAULT generateUUIDv4(),
        tenant_id  UInt32,
        event_type LowCardinality(String),
        user_id    UInt64,
        payload    String CODEC(ZSTD(3)),
        created_at DateTime DEFAULT now()
      )
      ENGINE = MergeTree()
      ORDER BY (tenant_id, event_type, created_at)
      PARTITION BY toYYYYMM(created_at)
    `,
  });

  console.log('Schema applied.');
}

Error Handling

| Error | Cause | Solution | |-------|-------|----------| | ORDER BY expression not in primary key | PRIMARY KEY != ORDER BY | Remove explicit PRIMARY KEY or align | | Too many parts (300+) | Over-partitioning | Use coarser partition expression | | Cannot convert String to UInt64 | Wrong data type | Match insert types to schema | | TTL expression type mismatch | TTL on non-date column | TTL must reference DateTime column |

Resources

Next Steps

For inserting and querying data, see clickhouse-core-workflow-b.