ClickHouse Query Optimization Skill

ClickHouse Query Optimization

Load when writing, debugging, or optimizing ClickHouse queries.

Goals

10-100x faster query execution
90%+ reduction in memory usage
Eliminate full table scans (read <1% of data)
JOINs that complete in milliseconds instead of minutes

Reference Documentation

Search terms: query optimization, PREWHERE, JOIN, dictionary, EXPLAIN, performance, slow query, memory limit, timeout, full table scan, dictGet, query cache, DELETE, UPDATE, mutation, aggregation, GROUP BY, COUNT DISTINCT, uniq, query_log

Critical Rules

[CRITICAL]

Filter on ORDER BY prefix columns. Queries that skip the prefix scan the entire table.
*Avoid SELECT . Only select columns you need.
Always check EXPLAIN indexes=1. Verify index usage and row estimates before optimizing.

[HIGH]

Use PREWHERE for selective filters. Reduces I/O before main WHERE clause.
Prefer dictionaries over JOINs. For key-value lookups on tables <10M rows.
Use IN subqueries for existence checks. Faster than JOIN when you don't need columns from the other table.

[MEDIUM]

Limit JOINs when possible. Each JOIN increases complexity exponentially.

Querying basics: ORDER BY, indexes and PREWHERE

ORDER BY

The ORDER BY clause in table definition determines query performance. ClickHouse stores data sorted by these columns and uses a sparse primary index to skip irrelevant data.

How it works:

Data is divided into granules (default 8192 rows), can vary
Primary index stores first value of each granule for ORDER BY columns
Queries filtering on ORDER BY prefix in exact order or a prefix of it skip entire granules

-- Table: ORDER BY (tenant_id, event_date, user_id)

-- FAST: Filters on ORDER BY prefix (left to right)
WHERE tenant_id = 1                              -- Uses index
WHERE tenant_id = 1 AND event_date = '2024-01-15' -- Uses index
WHERE tenant_id = 1 AND event_date >= '2024-01-01' AND user_id = 123 -- Uses index

-- SLOW: Skips columns or filters only on later columns
WHERE event_date = '2024-01-15'                  -- Skips tenant_id, full scan
WHERE user_id = 123                              -- Skips tenant_id and event_date, full scan

Key insight: Always filter on the leftmost ORDER BY columns first for best performance.

Indexes

Skip indexes (data skipping indexes) help when filtering on columns NOT in ORDER BY:

-- Table has ORDER BY (tenant_id, event_date, user_id)
-- But you need to query by session_id

-- Add bloom filter index
ALTER TABLE events
ADD INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 4;

-- Now this query can skip granules instead of full scan
SELECT * FROM events WHERE session_id = 'abc123';

Check if index is used:

EXPLAIN indexes = 1
SELECT * FROM events WHERE session_id = 'abc123';

-- Look for "Skip" section showing granules skipped

Common index types:

bloom_filter - High-cardinality columns (UUIDs, session IDs)
minmax - Range queries on numeric/date columns
set(N) - Low-cardinality columns (<N unique values)

PREWHERE vs WHERE

| Clause | When columns are read | Use for | | -------- | ----------------------------------- | ---------------------------------- | | PREWHERE | Before other columns | Small columns, selective filters | | WHERE | After all selected columns are read | Large columns, complex expressions |

-- Small status column in PREWHERE, large error_message in WHERE
SELECT * FROM events
PREWHERE status = 'error'
WHERE error_message LIKE '%timeout%';

## Column Selection

### Always Specify Columns

```sql
-- BAD: Reads all columns from disk
SELECT * FROM events WHERE user_id = 123;

-- GOOD: Reads only needed columns
SELECT event_time, event_type, properties
FROM events
WHERE user_id = 123;

Full Example

-- Table definition
CREATE TABLE events (
    tenant_id UInt32,
    event_date Date,
    user_id UInt64,
    status LowCardinality(String),    -- 4 bytes per row
    error_message String,              -- 500 bytes avg per row
    payload String                     -- 2KB avg per row
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_date, user_id);

-- Table has 100M rows across 12,000 granules (8192 rows each)

-- Query
SELECT user_id, error_message, payload
FROM events
PREWHERE tenant_id = 5 AND event_date = '2024-01-15' AND status = 'error'
WHERE error_message LIKE '%timeout%';

Step 1: Primary index (granule filtering)

Primary index stores first row of each granule:
  Granule 0:    tenant_id=1,  event_date=2024-01-01
  Granule 1:    tenant_id=1,  event_date=2024-01-01
  ...
  Granule 846:  tenant_id=5,  event_date=2024-01-14
  Granule 847:  tenant_id=5,  event_date=2024-01-15  ← Match
  Granule 848:  tenant_id=5,  event_date=2024-01-15  ← Match
  Granule 849:  tenant_id=5,  event_date=2024-01-15  ← Match
  Granule 850:  tenant_id=5,  event_date=2024-01-16
  ...

Result: Read 3 granules out of 12,000 (skip 99.97% of data)

Note: Skip indexes work similarly—they allow ClickHouse to skip granules for columns not in ORDER BY. Both mechanisms achieve the same goal: read less data by eliminating irrelevant granules before touching the actual rows.

Step 2: PREWHERE (column filtering within granules)

For the 3 matching granules (24,576 rows):

  1. Read tenant_id column      →  98 KB   (4 bytes × 24,576)
  2. Read event_date column     →  98 KB   (4 bytes × 24,576)
  3. Read status column         →  98 KB   (4 bytes × 24,576)
  4. Filter: 1,247 rows match PREWHERE conditions

  5. Read user_id column        →  10 KB   (8 bytes × 1,247 rows only)
  6. Read error_message column  →  623 KB  (500 bytes × 1,247 rows only)
  7. Read payload column        →  2.4 MB  (2KB × 1,247 rows only)

Total read: ~3.3 MB
Without PREWHERE: ~62 MB (read all columns for all 24,576 rows)

Step 3: WHERE (final filtering)

Apply: error_message LIKE '%timeout%'
Result: 43 rows returned

Impact

Reading fewer columns:

Reduces disk I/O
Improves compression efficiency
Lowers memory usage

JOIN Optimization

Decision Tree

Need data from another table?
│
├─ Only filtering rows, don't need columns from other table?
│  Example: "Get orders, but only for premium users"
│  └─ Use IN subquery
│
├─ Need columns from other table?
│  Example: "Get orders WITH user name and email"
│  │
│  ├─ Small lookup table (<10M rows), staleness OK?
│  │  └─ Use Dictionary
│  │
│  └─ Need real-time data or large table?
│     └─ Use JOIN with filtered subqueries
│
└─ Same join needed on every query?
   └─ Denormalize at insert time

Full Example: Dictionary vs JOIN

-- Tables
-- events: 500M rows, ORDER BY (tenant_id, event_date, user_id)
-- users: 2M rows (user_id, name, tier, country)

-- Task: Get user name and tier for today's events

Slow: JOIN

SELECT e.event_time, e.event_type, u.name, u.tier
FROM events e
JOIN users u ON e.user_id = u.user_id
WHERE e.tenant_id = 1 AND e.event_date = today();

-- Execution:
-- 1. Scan events: 850K rows
-- 2. Build hash table from users: 2M rows loaded into memory
-- 3. Probe hash table for each event row
-- Time: 3.2s, Memory: 890MB

Fast: Dictionary

-- Create once
CREATE DICTIONARY user_dict (
    user_id UInt64,
    name String,
    tier String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'users'))
LAYOUT(HASHED())
LIFETIME(300);

-- Query
SELECT
    event_time,
    event_type,
    dictGet('user_dict', 'name', user_id) AS name,
    dictGet('user_dict', 'tier', user_id) AS tier
FROM events
WHERE tenant_id = 1 AND event_date = today();

-- Execution:
-- 1. Scan events: 850K rows
-- 2. Dictionary lookup: O(1) hash lookup per row, already in memory
-- Time: 0.4s, Memory: 120MB

Tradeoff: Dictionaries are eventually consistent. LIFETIME(300) means data can be up to 5 minutes stale. For real-time accuracy, use JOIN. Always ask the user if this delay is acceptable before implementing a dictioary.

-- Check when dictionary was last updated
SELECT name, last_successful_update_time, loading_duration
FROM system.dictionaries;

-- Force immediate reload
SYSTEM RELOAD DICTIONARY user_dict;

IN Subquery for Filtering

When you only need to filter (not fetch columns):

-- Get orders from premium users
SELECT *
FROM orders
WHERE user_id IN (
    SELECT user_id FROM users WHERE tier = 'premium'
);

-- NOT this (builds unnecessary hash table)
SELECT o.*
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE u.tier = 'premium';

When You Must JOIN

Filter both sides first, smaller table on the right:

SELECT e.event_time, u.name, u.email
FROM (
    SELECT event_time, user_id
    FROM events
    WHERE tenant_id = 1 AND event_date = today()
) e
JOIN (
    SELECT user_id, name, email
    FROM users
    WHERE active = 1
) u ON e.user_id = u.user_id;

## Query Analysis

### EXPLAIN Queries

```sql
-- Basic execution plan
EXPLAIN SELECT ...;

-- Show index usage (most useful)
EXPLAIN indexes = 1 SELECT ...;

-- Show execution pipeline
EXPLAIN PIPELINE SELECT ...;

-- Show query AST
EXPLAIN AST SELECT ...;

Reading EXPLAIN indexes Output

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 1 AND event_date = '2024-01-15';

-- Output interpretation:
-- ReadFromMergeTree
--   Indexes:
--     PrimaryKey
--       Keys: tenant_id, event_date
--       Condition: (tenant_id = 1) AND (event_date = '2024-01-15')
--       Parts: 5/120           <- 5 parts selected out of 120
--       Granules: 42/15000     <- 42 granules out of 15000 (good!)

Find Slow Queries

-- Recent slow queries
SELECT
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 1000
  AND event_date >= today() - 1
ORDER BY query_duration_ms DESC
LIMIT 20;

-- Queries with full table scans
SELECT
    query,
    read_rows,
    total_rows_approx
FROM system.query_log
WHERE type = 'QueryFinish'
  AND read_rows > 0.9 * total_rows_approx  -- Reading >90% of table
  AND event_date >= today() - 1
ORDER BY read_rows DESC
LIMIT 10;

Profile Query Execution

-- Enable query profiling
SET log_queries = 1;
SET log_query_threads = 1;

-- Run your query
SELECT ...;

-- Check thread-level metrics
SELECT
    thread_name,
    sum(read_rows) AS rows,
    sum(read_bytes) AS bytes,
    max(peak_memory_usage) AS peak_memory
FROM system.query_thread_log
WHERE query_id = '<your-query-id>'
GROUP BY thread_name;

Query Cache (22.8+)

Cache query results for repeated identical queries:

-- Enable query cache for a session
SET use_query_cache = 1;

-- Run query (first execution populates cache)
SELECT count(*), avg(amount)
FROM orders
WHERE order_date >= '2024-01-01';

-- Subsequent identical queries return cached result instantly
-- Cache invalidated automatically when underlying data changes

Configuration:

-- Set cache TTL (default: 60 seconds)
SET query_cache_ttl = 300;  -- 5 minutes

-- Minimum rows to cache (skip small results)
SET query_cache_min_query_runs = 2;  -- Cache after 2nd run

-- Check cache status
SELECT * FROM system.query_cache;

Best for:

Dashboard queries hitting same data repeatedly
Reports regenerated frequently
Multi-tenant queries with shared dimensions
Queries that do not need real-time data

Update rows without full table rewrite:

-- Update specific columns
ALTER TABLE events
UPDATE status = 'processed'
WHERE event_id = 12345;

-- Can update multiple columns
ALTER TABLE events
UPDATE
    status = 'cancelled',
    cancelled_at = now()
WHERE order_id IN (SELECT order_id FROM cancelled_orders);

Limitations:

Cannot update columns in ORDER BY or PARTITION BY
Uses mutations (async, check progress in system.mutations)

-- Check mutation progress
SELECT table, command, is_done, latest_fail_reason
FROM system.mutations
WHERE database = 'default'
ORDER BY create_time DESC;

EXPLAIN ESTIMATE

Get row/byte estimates without running query:

-- Estimate data to be read
EXPLAIN ESTIMATE
SELECT * FROM events
WHERE tenant_id = 1 AND event_date = '2024-01-15';

-- Output shows:
--   estimated_rows
--   estimated_bytes
--   number of parts/granules to read

Use for:

Validating filter effectiveness before running
Cost comparison between query approaches
Capacity planning

Troubleshooting

Always ask for user confirmation before applying schema changes (adding indexes, creating MVs, altering tables).

Memory Limit Exceeded

Problem: Memory limit exceeded, DB::Exception: Memory limit exceeded, query killed by OOM

Diagnose:

EXPLAIN indexes = 1 SELECT ...;  -- Check if filtering is effective

Solutions:

| Cause | Fix | | ------------------------- | ------------------------------------------------------------- | | High-cardinality GROUP BY | Use approximate: uniqHLL12() instead of COUNT(DISTINCT) | | Large aggregation | Pre-aggregate with AggregatingMergeTree + Materialized View | | Full table scan | Add filters on ORDER BY columns (see below) | | Too many columns | Select only needed columns |

-- BAD: Exact distinct on high cardinality
SELECT COUNT(DISTINCT user_id) FROM events;

-- GOOD: Approximate (2% error, 10x less memory)
SELECT uniqHLL12(user_id) FROM events;

-- BETTER: Pre-aggregate with MV (see clickhouse-materialized-views skill)

Full Table Scan Despite Filters

Problem: Query reads all rows, slow despite WHERE clause, Granules: N/N in EXPLAIN

Diagnose:

EXPLAIN indexes = 1
SELECT * FROM events WHERE user_id = 123;
-- If "Granules: N/N" → index not used

Solutions:

| Cause | Fix | | ----------------------------- | ------------------------------------------------ | | Filter not on ORDER BY prefix | Add ORDER BY columns to WHERE, or add skip index | | Function on column | Store computed column, filter on that | | OR across ORDER BY columns | Rewrite as UNION ALL | | Type mismatch | Match filter type to column type |

-- BAD: Function prevents index use
WHERE toDate(event_time) = '2024-01-15'

-- GOOD: Store date column, filter on it
WHERE event_date = '2024-01-15'

-- BAD: Skips ORDER BY prefix (tenant_id, event_date, user_id)
WHERE user_id = 123

-- GOOD: Include prefix columns
WHERE tenant_id = 1 AND event_date = today() AND user_id = 123

-- ALT: Add skip index if you can't change query
ALTER TABLE events ADD INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4;

Slow JOINs

Problem: JOINs take minutes, high memory during JOIN, query hangs on JOIN step

Solutions (in order of preference):

| Approach | When to Use | | ------------------ | --------------------------------------------------- | | Dictionary | Lookup table <10M rows, staleness OK | | IN subquery | Only filtering, don't need columns from other table | | Filter before JOIN | Must JOIN, but can reduce rows first | | Denormalize | Same JOIN on every query |

-- BAD: JOIN builds hash table of entire users table
SELECT e.*, u.name FROM events e JOIN users u ON e.user_id = u.user_id;

-- GOOD: Dictionary lookup (see "Dictionary vs JOIN" section)
SELECT *, dictGet('user_dict', 'name', user_id) AS name FROM events;

-- GOOD: IN subquery when only filtering
SELECT * FROM events WHERE user_id IN (SELECT user_id FROM users WHERE tier = 'premium');

-- GOOD: Filter both sides before JOIN
SELECT e.*, u.name
FROM (SELECT * FROM events WHERE tenant_id = 1 AND event_date = today()) e
JOIN (SELECT user_id, name FROM users WHERE active = 1) u ON e.user_id = u.user_id;

High Disk I/O

Problem: Queries slow with high read_bytes, disk wait times high, reading GBs for simple queries

Solutions:

| Cause | Fix | | ------------------- | ------------------------------------------------ | | Reading all columns | Select only needed columns | | No early filtering | Use PREWHERE on small, selective columns | | Missing skip index | Add bloom_filter for high-cardinality lookups | | Poor compression | Use appropriate codecs (see schema-design skill) |

-- BAD: Reads all columns, filters late
SELECT * FROM events WHERE status = 'error';

-- GOOD: PREWHERE filters before reading large columns
SELECT event_time, error_message
FROM events
PREWHERE status = 'error'
WHERE error_message LIKE '%timeout%';

Agent Skills: ClickHouse Query Optimization

Install this agent skill to your local

Skill Files