ClickHouse Materialized Views Skill

Load when creating Materialized Views for real-time aggregation, ETL pipelines, or query optimization.

Prerequisite: Understand MergeTree engine variants in clickhouse-schema-design skill.

Reference Documentation

Search terms: materialized view, MV, real-time aggregation, AggregateFunction, -State, -Merge, SummingMergeTree, pre-aggregation, incremental

Critical Rules

[CRITICAL]

MVs are triggers, not caches. They process INSERT data, not query results.
Use correct engine. AggregatingMergeTree for complex aggregates, SummingMergeTree for simple counters.
Query with -Merge functions or argMax. Aggregation completes at query time, not insert time.
Chained MVs see the pre-aggregated block, NOT the merged table state. A downstream MV (A→B→C) receives the block just inserted into B's target table — already grouped by the upstream MV's SELECT, but not merged with existing data. This means each insert batch is processed independently and -MergeState is required when chaining AggregatingMergeTree MVs.

[HIGH]

MV sees INSERT only. No backfill; existing data must be inserted manually.
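ORDER BY in target must match GROUP BY in MV. Summing/Aggregating engines collapse rows by the target's sorting key, so a mismatch either leaves same-key rows unmerged or merges rows that should stay separate.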

[MEDIUM]

Avoid too many MVs on one source table when possible. Each MV adds overhead to every INSERT.

How Materialized Views Work

Source Table ──INSERT──► MV Transform ──► Target Table
                │
                └─ MV executes SELECT for each inserted block

Key insight: The MV's SELECT query runs on each INSERT batch. Results go to the target table. The MV does NOT query historical data.
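To make the per-block behavior concrete, here is a minimal sketch. It assumes the events table and daily_event_counts_mv from Pattern 1 already exist; the inserted values are made up for illustration.

```sql
-- Two separate INSERTs trigger two separate executions of the MV's SELECT.
INSERT INTO events VALUES ('2024-03-01 10:00:00', 1, 'click', 100);
INSERT INTO events VALUES ('2024-03-01 11:00:00', 1, 'click', 101);

-- Each execution writes its own part to the target. Until a background
-- merge combines those parts, this may return more than one row for the
-- same (tenant_id, date, event_type) key.
SELECT date, tenant_id, event_type, event_count
FROM daily_event_counts
WHERE tenant_id = 1;
```

This is why reads against MV targets re-aggregate (sum, -Merge, or argMax) instead of selecting rows directly.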

Pattern 1: Real-Time Counters (SummingMergeTree)

Best for simple sums and counts that need real-time updates.

-- Source: raw events
CREATE TABLE events (
    event_time DateTime,
    tenant_id UInt32,
    event_type String,
    user_id UInt64
) ENGINE = MergeTree()
ORDER BY (tenant_id, event_time);

-- Target: daily counters
CREATE TABLE daily_event_counts (
    date Date,
    tenant_id UInt32,
    event_type LowCardinality(String),
    event_count UInt64,
    unique_users UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (tenant_id, date, event_type);

-- MV: transform inserts
CREATE MATERIALIZED VIEW daily_event_counts_mv
TO daily_event_counts AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users  -- WARNING: Not additive!
FROM events
GROUP BY date, tenant_id, event_type;

Warning: uniq() in a SummingMergeTree column is not accurate. Merges add the per-block unique counts together, so a user who appears in two insert blocks is counted twice. Use AggregatingMergeTree for unique counts.
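A counters-only variant avoids the problem by keeping only additive columns. This is a sketch, not the original schema: the names daily_event_counts_additive and daily_event_counts_additive_mv are hypothetical, and it assumes the events table defined above.

```sql
-- Target: every non-key column can be safely summed.
CREATE TABLE daily_event_counts_additive (
    date Date,
    tenant_id UInt32,
    event_type LowCardinality(String),
    event_count UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (tenant_id, date, event_type);

CREATE MATERIALIZED VIEW daily_event_counts_additive_mv
TO daily_event_counts_additive AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count
FROM events
GROUP BY date, tenant_id, event_type;

-- Read with sum() + GROUP BY: rows for one key may not be merged yet.
SELECT date, event_type, sum(event_count) AS event_count
FROM daily_event_counts_additive
WHERE tenant_id = 1
GROUP BY date, event_type;
```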

Pattern 2: Complex Aggregates (AggregatingMergeTree)

For accurate uniq, quantiles, or any non-additive aggregate.

-- Target: uses AggregateFunction types
CREATE TABLE user_metrics_agg (
    date Date,
    tenant_id UInt32,
    total_events AggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64),
    p95_duration AggregateFunction(quantile(0.95), Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);

-- MV: use -State functions
CREATE MATERIALIZED VIEW user_metrics_mv
TO user_metrics_agg AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
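    sumState(toUInt64(1)) AS total_events,  -- cast needed: sumState(1) yields a state over UInt8, not UInt64
    uniqState(user_id) AS unique_users,
    quantileState(0.95)(duration_ms) AS p95_duration  -- assumes events has a duration_ms Float32 column (not in the Pattern 1 schema)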
FROM events
GROUP BY date, tenant_id;

-- Query: use -Merge functions
SELECT
    date,
    tenant_id,
    sumMerge(total_events) AS events,
    uniqMerge(unique_users) AS users,
    quantileMerge(0.95)(p95_duration) AS p95
FROM user_metrics_agg
WHERE tenant_id = 1
GROUP BY date, tenant_id;

Key pattern: -State to insert, -Merge to query.
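Because the target stores aggregation states and not final numbers, the same table can also answer coarser-grained questions correctly. The sketch below assumes the user_metrics_agg table above and rolls daily states up to months; the alias month is illustrative.

```sql
-- Monthly rollup from daily states.
-- uniqMerge combines the underlying sets, so a user active on several
-- days in a month is still counted once (summing daily uniques would not be).
SELECT
    toStartOfMonth(date) AS month,
    tenant_id,
    sumMerge(total_events) AS events,
    uniqMerge(unique_users) AS users,
    quantileMerge(0.95)(p95_duration) AS p95
FROM user_metrics_agg
GROUP BY month, tenant_id;
```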

Pattern 3: Data Transformation Pipeline

Transform/enrich data as it arrives.

-- Source: raw JSON logs
CREATE TABLE raw_logs (
    timestamp DateTime,
    raw_json String
) ENGINE = MergeTree()
ORDER BY timestamp;

-- Target: parsed structured data
CREATE TABLE parsed_logs (
    timestamp DateTime,
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    trace_id String
)
ENGINE = MergeTree()
ORDER BY (service, timestamp);

-- MV: parse JSON on insert
CREATE MATERIALIZED VIEW parsed_logs_mv
TO parsed_logs AS
SELECT
    timestamp,
    JSONExtractString(raw_json, 'level') AS level,
    JSONExtractString(raw_json, 'service') AS service,
    JSONExtractString(raw_json, 'message') AS message,
    JSONExtractString(raw_json, 'trace_id') AS trace_id
FROM raw_logs;

Pattern 4: Last Value Tracking (ReplacingMergeTree)

Track latest state per entity.

-- Target: latest user state
CREATE TABLE user_latest_state (
    user_id UInt64,
    last_seen DateTime,
    last_action LowCardinality(String),
    total_actions UInt64
)
ENGINE = ReplacingMergeTree(last_seen)
ORDER BY user_id;

-- MV: update on each event
CREATE MATERIALIZED VIEW user_state_mv
TO user_latest_state AS
SELECT
    user_id,
    max(event_time) AS last_seen,
    argMax(event_type, event_time) AS last_action,
    count() AS total_actions  -- counts this insert block only; not a lifetime total
FROM events
GROUP BY user_id;

Query with argMax pattern (avoid FINAL on large tables):

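SELECT
    user_id,
    -- Aliases differ from column names so that last_seen inside argMax()
    -- refers to the column, not to another aggregate's alias.
    max(last_seen) AS latest_seen,
    argMax(last_action, last_seen) AS latest_action,
    argMax(total_actions, last_seen) AS latest_block_actions  -- count from the most recent insert block
FROM user_latest_state
WHERE user_id = 123
GROUP BY user_id;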
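For comparison, the same lookup with FINAL is sketched below. It assumes the user_latest_state table above; FINAL deduplicates at read time by the engine's version column (last_seen), which is convenient for small tables or narrow key lookups but can be expensive on large scans.

```sql
-- FINAL collapses rows with the same ORDER BY key at query time,
-- keeping the row with the highest last_seen.
SELECT user_id, last_seen, last_action
FROM user_latest_state FINAL
WHERE user_id = 123;
```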

Backfilling Existing Data

MVs don't process existing data. Backfill manually:

-- Insert historical data into target table
INSERT INTO daily_event_counts
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users
FROM events
WHERE event_time < '2024-01-01'  -- Before MV was created
GROUP BY date, tenant_id, event_type;
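On large source tables it can help to backfill in bounded chunks and then verify the totals. The following is a sketch under the same assumptions as the backfill above (the MV took over at 2024-01-01); the month boundaries are illustrative.

```sql
-- Backfill one month at a time to bound memory and make retries cheap.
INSERT INTO daily_event_counts
SELECT
    toDate(event_time) AS date,
    tenant_id,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users  -- same non-additive caveat as in Pattern 1
FROM events
WHERE event_time >= '2023-12-01 00:00:00'
  AND event_time <  '2024-01-01 00:00:00'
GROUP BY date, tenant_id, event_type;

-- After all chunks are loaded, the two numbers should match.
SELECT
    (SELECT count() FROM events
     WHERE event_time < '2024-01-01 00:00:00') AS raw_events,
    (SELECT sum(event_count) FROM daily_event_counts
     WHERE date < '2024-01-01') AS backfilled_events;
```

Re-running a chunk that already succeeded double-counts it, so track which ranges have been loaded.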

MV Management

Check MV Status

-- List all MVs
SELECT name, engine, create_table_query
FROM system.tables
WHERE engine = 'MaterializedView';

-- Check target table size
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active AND table = 'daily_event_counts'
GROUP BY table;

Pause/Resume MV

-- Pause (stop processing inserts).
-- Rows inserted into events while the MV is detached are NOT processed later;
-- backfill that window after resuming.
DETACH TABLE daily_event_counts_mv;

-- Resume
ATTACH TABLE daily_event_counts_mv;

Modify MV

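The simplest way to change an MV's SELECT is to drop and recreate it (ALTER TABLE ... MODIFY QUERY also exists, but drop-and-recreate behaves the same everywhere and leaves the target table untouched). Inserts that arrive between the DROP and the CREATE are not processed: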

DROP VIEW daily_event_counts_mv;
-- Optionally truncate target: TRUNCATE TABLE daily_event_counts;
CREATE MATERIALIZED VIEW daily_event_counts_mv TO daily_event_counts AS ...;
-- Backfill if needed

Common Pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Using uniq() with SummingMergeTree | Sums don't equal uniques | Use AggregatingMergeTree with uniqState/uniqMerge |
| Forgetting argMax or -Merge | Incomplete aggregation results | Use argMax pattern for Replacing/Collapsing, -Merge for Aggregating |
| No backfill after MV creation | Missing historical data | Manually INSERT historical aggregates |
| MV on wrong table | Inserts to wrong source ignored | Ensure MV is on the table receiving INSERTs |
| Too many MVs on one source | Slow inserts | Consider fewer MVs or async processing |
| Assuming chain is cheaper than parallel | Wrong topology choice | Chain and parallel have nearly identical resource cost; choose topology based on query granularity needs, not performance |
| Chained MV reads final table state | Wrong aggregation logic, missing data | Downstream MVs see the pre-aggregated block, not merged state; use -MergeState when chaining AggregatingMergeTree |

Decision Tree

Need to aggregate data at query time?
│
├─ Yes, and data changes frequently → Query raw data
│
├─ Yes, but queries are slow → Create MV
│  │
│  ├─ Simple sums/counts only?
│  │  └─ SummingMergeTree
│  │
│  ├─ Need uniq, quantile, or complex aggregates?
│  │  └─ AggregatingMergeTree with -State/-Merge
│  │
│  └─ Need latest value per key?
│     └─ ReplacingMergeTree with argMax pattern
│
└─ Need to transform/parse data on insert?
   └─ MV with regular MergeTree target

Complete Example: Multi-Level Aggregation

Chain MVs when each level serves different query granularity (hourly → daily). The downstream MV sees the pre-aggregated block (output of the upstream MV's SELECT), not the merged table state — use -MergeState to handle partial aggregation states.

Cost note: Chain (A→B→C) vs parallel (A→B, A→C) topologies have nearly identical resource consumption. The bottleneck is total data written to target tables, not the topology. Choose based on query needs, not performance assumptions.

-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY ...;

-- Level 2: Hourly aggregates
CREATE TABLE hourly_stats (
    hour DateTime,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, hour);

CREATE MATERIALIZED VIEW hourly_mv TO hourly_stats AS
SELECT
    toStartOfHour(event_time) AS hour,
    tenant_id,
    sumState(toUInt64(1)) AS events,  -- cast so the state type matches AggregateFunction(sum, UInt64)
    uniqState(user_id) AS users
FROM events GROUP BY hour, tenant_id;

-- Level 3: Daily aggregates (from hourly)
-- daily_mv receives the pre-aggregated block inserted into hourly_stats
-- (already grouped by hour/tenant with -State values), but NOT merged
-- with existing hourly_stats data. sumMergeState/uniqMergeState handles
-- this: merges partial -State values within the block, then re-wraps as -State.
CREATE TABLE daily_stats (
    date Date,
    tenant_id UInt32,
    events AggregateFunction(sum, UInt64),
    users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, date);

CREATE MATERIALIZED VIEW daily_mv TO daily_stats AS
SELECT
    toDate(hour) AS date,
    tenant_id,
    sumMergeState(events) AS events,   -- Merge partial states from block, then re-State
    uniqMergeState(users) AS users
FROM hourly_stats GROUP BY date, tenant_id;
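For comparison, the parallel topology (A→B, A→C) from the cost note would feed daily_stats straight from events. This is a sketch using the tables above; the name daily_direct_mv is hypothetical, and it would be created instead of daily_mv, never alongside it, since both write to daily_stats.

```sql
-- Parallel alternative: daily aggregates computed directly from raw events.
-- Uses -State (not -MergeState) because the input is raw rows, not states.
CREATE MATERIALIZED VIEW daily_direct_mv TO daily_stats AS
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(toUInt64(1)) AS events,
    uniqState(user_id) AS users
FROM events
GROUP BY date, tenant_id;
```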

Query any level:

-- Fast daily query
SELECT date, sumMerge(events), uniqMerge(users)
FROM daily_stats WHERE tenant_id = 1 GROUP BY date;

-- Drill down to hourly
SELECT hour, sumMerge(events), uniqMerge(users)
FROM hourly_stats WHERE tenant_id = 1 AND toDate(hour) = today() GROUP BY hour;
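A quick way to check that the chain stays consistent is to compare the daily level against the hourly level rolled up to days. The sketch below assumes the hourly_stats and daily_stats tables above and that both levels cover the same time range; the aliases are illustrative.

```sql
-- Returns the days where the two levels disagree on event totals.
SELECT date, daily_events, hourly_events
FROM
(
    SELECT date, sumMerge(events) AS daily_events
    FROM daily_stats
    WHERE tenant_id = 1
    GROUP BY date
) AS d
INNER JOIN
(
    SELECT toDate(hour) AS date, sumMerge(events) AS hourly_events
    FROM hourly_stats
    WHERE tenant_id = 1
    GROUP BY date
) AS h USING (date)
WHERE daily_events != hourly_events;
```

An empty result suggests the levels agree; mismatches usually point to a level that was created or backfilled at a different time.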

Troubleshooting

Always ask for user confirmation before creating/modifying MVs or target tables.

Wrong Aggregation Results

Problem: Counts are too high, uniq values don't match raw data, aggregates seem doubled

Diagnose:

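-- Compare MV result vs raw query.
-- Aggregate the target's measure column: count() on the target only counts stored rows.
SELECT sum(event_count) FROM target_table;  -- or sumMerge(...) for AggregatingMergeTree
SELECT count() FROM source_table WHERE <same_filters>;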

-- Check for duplicate keys in target
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
HAVING rows > 1;

Solutions:

| Cause | Fix |
| --- | --- |
| Using uniq() with SummingMergeTree | Switch to AggregatingMergeTree with uniqState/uniqMerge |
| Forgetting -Merge in query | Always use sumMerge(), uniqMerge() for AggregatingMergeTree |
| Forgetting argMax for ReplacingMergeTree | Use argMax pattern: SELECT key, argMax(col, version) ... GROUP BY key |
| Duplicate inserts to source | Deduplicate source or use ReplacingMergeTree for target |

-- BAD: uniq in SummingMergeTree (sums don't work)
ENGINE = SummingMergeTree()
-- SELECT uniq(user_id) AS users  -- Wrong!

-- GOOD: AggregatingMergeTree with State/Merge
ENGINE = AggregatingMergeTree()
-- MV: uniqState(user_id) AS users
-- Query: uniqMerge(users)

MV Not Updating / Missing Data

Problem: Target table not receiving new data, counts stuck at old values

Diagnose:

-- Check if MV is attached
SELECT name, engine FROM system.tables WHERE engine = 'MaterializedView';

-- Check target table recent data
SELECT max(date), count() FROM target_table;

-- Verify source table is receiving inserts
SELECT max(event_time), count() FROM source_table WHERE event_time > now() - INTERVAL 1 HOUR;

Solutions:

| Cause | Fix |
| --- | --- |
| MV detached | ATTACH TABLE mv_name |
| MV on wrong source table | Drop MV, recreate with correct source |
| Historical data not backfilled | Manually INSERT aggregated historical data |
| Inserts going to different table | Ensure app inserts to the MV's source table |

-- Backfill historical data
INSERT INTO target_table
SELECT
    toDate(event_time) AS date,
    tenant_id,
    sumState(toUInt64(1)) AS events,  -- cast so the state type matches AggregateFunction(sum, UInt64)
    uniqState(user_id) AS users
FROM source_table
WHERE event_time < '2024-01-01'  -- Before MV existed
GROUP BY date, tenant_id;

Slow Inserts After Adding MV

Problem: INSERT performance degraded after creating MV, insert latency increased

Diagnose:

-- Check MVs on this table
SELECT name, as_select FROM system.tables
WHERE engine = 'MaterializedView' AND as_select LIKE '%source_table%';

-- Check MV query complexity
EXPLAIN SELECT ... FROM source_table ...;  -- Use MV's SELECT query
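Per-view timing can also be read from the system.query_views_log table. The sketch below assumes that log is enabled in the server configuration; the one-hour window is illustrative.

```sql
-- Which MVs add the most time to inserts over the last hour?
SELECT
    view_name,
    count() AS executions,
    avg(view_duration_ms) AS avg_ms,
    max(view_duration_ms) AS max_ms,
    sum(written_rows) AS rows_written
FROM system.query_views_log
WHERE event_time > now() - INTERVAL 1 HOUR
GROUP BY view_name
ORDER BY avg_ms DESC;
```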

Solutions:

| Cause | Fix |
| --- | --- |
| Too many MVs on one source | Consolidate MVs or use async inserts |
| Complex MV query (JOINs, heavy transforms) | Simplify MV, move complexity to query time |
| MV target table has wrong ORDER BY | Match target ORDER BY to MV's GROUP BY |

-- Example: MV groups by (tenant_id, date)
-- Target table ORDER BY should match
CREATE TABLE target (...)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, date);  -- Matches GROUP BY in MV

Target Table Growing Too Large

Problem: MV target table larger than expected, not aggregating properly

Diagnose:

-- Check rows per key (should be 1 after merge for same GROUP BY)
SELECT date, tenant_id, count() AS rows
FROM target_table
GROUP BY date, tenant_id
ORDER BY rows DESC
LIMIT 10;

Solutions:

| Cause | Fix |
| --- | --- |
| ORDER BY doesn't match GROUP BY | Recreate target with ORDER BY matching MV's GROUP BY |
| Background merges haven't run | Wait for automatic merge, or aggregate at read time (sum() or -Merge with GROUP BY; argMax for ReplacingMergeTree) |
| Wrong engine (MergeTree instead of Summing/Aggregating) | Recreate with correct engine |

-- ORDER BY must match GROUP BY columns for proper aggregation
-- MV: GROUP BY (tenant_id, date, event_type)
-- Target: ORDER BY (tenant_id, date, event_type)  -- Must match!
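The two sides of that comparison can be read from system.tables. This sketch assumes the Pattern 1 names (daily_event_counts and daily_event_counts_mv) and that both live in the current database.

```sql
-- Target side: the sorting key the engine collapses rows by.
SELECT name, engine, sorting_key
FROM system.tables
WHERE database = currentDatabase() AND name = 'daily_event_counts';

-- MV side: the SELECT (including its GROUP BY) that feeds the target.
SELECT name, as_select
FROM system.tables
WHERE database = currentDatabase() AND name = 'daily_event_counts_mv';
```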
