Agent Skills: SQL Cookbook

Craft and optimize SQL for any warehouse dialect — Snowflake, BigQuery, Databricks, Redshift, PostgreSQL. Covers window functions, CTEs, aggregations, joins, funnel queries, cohort retention, deduplication, dialect translation, slow query tuning, and cross-platform syntax differences.

UncategorizedID: vm0-ai/vm0-skills/sql-cookbook

Install this agent skill to your local

pnpm dlx add-skill https://github.com/vm0-ai/vm0-skills/tree/HEAD/sql-cookbook

Skill Files

Browse the full folder contents for sql-cookbook.

Download Skill

Loading file tree…

sql-cookbook/SKILL.md

Skill Metadata

Name
sql-cookbook
Description
Craft and optimize SQL for any warehouse dialect — Snowflake, BigQuery, Databricks, Redshift, PostgreSQL. Covers window functions, CTEs, aggregations, joins, funnel queries, cohort retention, deduplication, dialect translation, slow query tuning, and cross-platform syntax differences.

SQL Cookbook

Practical reference for building analytical SQL across every major data warehouse platform.

Platform-Specific Syntax Guide

PostgreSQL (Aurora, RDS, Supabase, Neon)

Working with dates and times:

-- Today and now
CURRENT_DATE, CURRENT_TIMESTAMP, NOW()

-- Adding and subtracting intervals
some_date + INTERVAL '7 days'
some_date - INTERVAL '1 month'

-- Round down to a time boundary
DATE_TRUNC('month', event_ts)

-- Pull out individual components
EXTRACT(YEAR FROM event_ts)
EXTRACT(DOW FROM event_ts)  -- Sunday = 0

-- Render as formatted text
TO_CHAR(event_ts, 'YYYY-MM-DD')

Text operations:

-- Joining strings together
first_name || ' ' || last_name
CONCAT(first_name, ' ', last_name)

-- Flexible matching
col ILIKE '%term%'  -- ignores case
col ~ '^regex$'     -- POSIX regex

-- Splitting and extracting
LEFT(str, n), RIGHT(str, n)
SPLIT_PART(str, delimiter, part_num)
REGEXP_REPLACE(str, pattern, replacement)

JSON and array handling:

-- Accessing JSON fields
payload->>'key'             -- returns text
payload->'nested'->'field'  -- returns json object
payload#>>'{a,b,c}'        -- deep path as text

-- Array utilities
ARRAY_AGG(col)
ANY(arr_col)
arr_col @> ARRAY['val']

Tuning guidance:

  • Run EXPLAIN ANALYZE to inspect actual execution plans
  • Add indexes on columns used in WHERE and JOIN conditions
  • Prefer EXISTS over IN for correlated lookups
  • Consider partial indexes for frequently used filter predicates
  • Deploy connection pooling when handling concurrent workloads

Snowflake

Date and time functions:

-- Getting current moments
CURRENT_DATE(), CURRENT_TIMESTAMP(), SYSDATE()

-- Shifting dates
DATEADD(day, 7, some_date)
DATEDIFF(day, start_dt, end_dt)

-- Truncation
DATE_TRUNC('month', event_ts)

-- Component extraction
YEAR(event_ts), MONTH(event_ts), DAY(event_ts)
DAYOFWEEK(event_ts)

-- Formatting
TO_CHAR(event_ts, 'YYYY-MM-DD')

Text and pattern matching:

-- Case-insensitive matching (default collation dependent)
col ILIKE '%term%'
REGEXP_LIKE(col, 'pattern')

-- Working with VARIANT / semi-structured data
col:key::string
PARSE_JSON('{"a": 1}')
GET_PATH(variant_col, 'path.to.field')

-- Expanding nested arrays
SELECT f.value FROM tbl, LATERAL FLATTEN(input => arr_col) f

Navigating semi-structured columns:

-- Dot-path access on VARIANT
payload:customer:name::STRING
payload:items[0]:price::NUMBER

-- Exploding nested arrays into rows
SELECT
    t.id,
    elem.value:name::STRING AS item_name,
    elem.value:qty::NUMBER AS item_qty
FROM my_table t,
LATERAL FLATTEN(input => t.payload:items) elem

Tuning guidance:

  • Rely on clustering keys rather than traditional indexes for large tables
  • Target clustering key columns in filters to maximize partition pruning
  • Size the virtual warehouse appropriately for query complexity
  • Reuse results via RESULT_SCAN(LAST_QUERY_ID()) to skip re-execution
  • Use transient tables for ephemeral staging data

BigQuery

Date and time functions:

-- Current moments
CURRENT_DATE(), CURRENT_TIMESTAMP()

-- Shifting dates forward and backward
DATE_ADD(some_date, INTERVAL 7 DAY)
DATE_SUB(some_date, INTERVAL 1 MONTH)
DATE_DIFF(end_dt, start_dt, DAY)
TIMESTAMP_DIFF(end_ts, start_ts, HOUR)

-- Truncation
DATE_TRUNC(event_ts, MONTH)
TIMESTAMP_TRUNC(event_ts, HOUR)

-- Component extraction
EXTRACT(YEAR FROM event_ts)
EXTRACT(DAYOFWEEK FROM event_ts)  -- Sunday = 1

-- Display formatting
FORMAT_DATE('%Y-%m-%d', date_col)
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', ts_col)

Text operations:

-- No native ILIKE; lowercase first
LOWER(col) LIKE '%term%'
REGEXP_CONTAINS(col, r'pattern')
REGEXP_EXTRACT(col, r'pattern')

-- Splitting and reassembling
SPLIT(str, delimiter)  -- produces an ARRAY
ARRAY_TO_STRING(arr, delimiter)

Arrays and structs:

-- Array utilities
ARRAY_AGG(col)
UNNEST(arr_col)
ARRAY_LENGTH(arr_col)
val IN UNNEST(arr_col)

-- Struct field access
struct_col.field_name

Tuning guidance:

  • Always apply filters on partition columns (typically date) to minimize bytes scanned
  • Add clustering on columns that appear frequently in WHERE clauses
  • Swap COUNT(DISTINCT ...) for APPROX_COUNT_DISTINCT() on high-cardinality data
  • Avoid SELECT * since billing scales with bytes read
  • Use DECLARE / SET for parameterized script logic
  • Run a dry-run estimate before executing expensive queries

Redshift

Date and time functions:

-- Current moments
CURRENT_DATE, GETDATE(), SYSDATE

-- Shifting dates
DATEADD(day, 7, some_date)
DATEDIFF(day, start_dt, end_dt)

-- Truncation
DATE_TRUNC('month', event_ts)

-- Component extraction
EXTRACT(YEAR FROM event_ts)
DATE_PART('dow', event_ts)

Text operations:

-- Case-insensitive matching
col ILIKE '%term%'
REGEXP_INSTR(col, 'pattern') > 0

-- Splitting and concatenation
SPLIT_PART(str, delimiter, part_num)
LISTAGG(col, ', ') WITHIN GROUP (ORDER BY col)

Tuning guidance:

  • Assign distribution keys (DISTKEY) so joined tables are co-located on the same nodes
  • Define sort keys (SORTKEY) on columns commonly used for filtering
  • Review query plans with EXPLAIN
  • Watch for DS_BCAST and DS_DIST in plans (signs of costly cross-node shuffles)
  • Run ANALYZE and VACUUM on a regular cadence
  • Use late-binding views for flexibility when schemas evolve

Databricks SQL

Date and time functions:

-- Current moments
CURRENT_DATE(), CURRENT_TIMESTAMP()

-- Shifting dates
DATE_ADD(some_date, 7)
DATEDIFF(end_dt, start_dt)
ADD_MONTHS(some_date, 1)

-- Truncation
DATE_TRUNC('MONTH', event_ts)
TRUNC(date_col, 'MM')

-- Component extraction
YEAR(event_ts), MONTH(event_ts)
DAYOFWEEK(event_ts)

Delta Lake capabilities:

-- Query historical snapshots
SELECT * FROM tbl TIMESTAMP AS OF '2024-01-15'
SELECT * FROM tbl VERSION AS OF 42

-- Inspect table history
DESCRIBE HISTORY tbl

-- Upsert with MERGE
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Tuning guidance:

  • Run OPTIMIZE with ZORDER on frequently queried columns
  • Enable the Photon engine for compute-heavy workloads
  • Apply CACHE TABLE on datasets that are read repeatedly
  • Partition by low-cardinality date columns for efficient pruning

Reusable Analytical Patterns

Window Function Recipes

-- Assign sequential position within groups
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)
RANK() OVER (PARTITION BY category ORDER BY revenue DESC)
DENSE_RANK() OVER (ORDER BY score DESC)

-- Cumulative sums and rolling averages
SUM(amount) OVER (ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_total
AVG(amount) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg

-- Accessing adjacent rows
LAG(val, 1) OVER (PARTITION BY entity ORDER BY dt) AS prior_val
LEAD(val, 1) OVER (PARTITION BY entity ORDER BY dt) AS next_val

-- Boundary values within a partition
FIRST_VALUE(status) OVER (PARTITION BY user_id ORDER BY event_ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
LAST_VALUE(status) OVER (PARTITION BY user_id ORDER BY event_ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

-- Proportional share
revenue / SUM(revenue) OVER () AS share_of_total
revenue / SUM(revenue) OVER (PARTITION BY category) AS share_within_category

Structured Queries with CTEs

WITH
-- Stage 1: Narrow down to the target population
eligible_users AS (
    SELECT user_id, signup_date, tier
    FROM users
    WHERE signup_date >= DATE '2024-01-01'
      AND account_status = 'active'
),

-- Stage 2: Derive per-user measures
per_user AS (
    SELECT
        u.user_id,
        u.tier,
        COUNT(DISTINCT e.session_id) AS sessions,
        SUM(e.revenue) AS revenue
    FROM eligible_users u
    LEFT JOIN events e ON u.user_id = e.user_id
    GROUP BY u.user_id, u.tier
),

-- Stage 3: Roll up to tier-level summary
tier_summary AS (
    SELECT
        tier,
        COUNT(*) AS users,
        AVG(sessions) AS avg_sessions,
        SUM(revenue) AS total_revenue
    FROM per_user
    GROUP BY tier
)

SELECT * FROM tier_summary ORDER BY total_revenue DESC;

Cohort Retention Tracking

WITH signup_cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('month', first_seen) AS cohort
    FROM users
),
monthly_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', activity_date) AS active_month
    FROM user_events
)
SELECT
    s.cohort,
    COUNT(DISTINCT s.user_id) AS cohort_size,
    COUNT(DISTINCT CASE
        WHEN a.active_month = s.cohort THEN a.user_id
    END) AS m0,
    COUNT(DISTINCT CASE
        WHEN a.active_month = s.cohort + INTERVAL '1 month' THEN a.user_id
    END) AS m1,
    COUNT(DISTINCT CASE
        WHEN a.active_month = s.cohort + INTERVAL '3 months' THEN a.user_id
    END) AS m3
FROM signup_cohorts s
LEFT JOIN monthly_activity a ON s.user_id = a.user_id
GROUP BY s.cohort
ORDER BY s.cohort;

Conversion Funnel Measurement

WITH step_flags AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_name = 'page_view' THEN 1 ELSE 0 END) AS saw_page,
        MAX(CASE WHEN event_name = 'signup_start' THEN 1 ELSE 0 END) AS began_signup,
        MAX(CASE WHEN event_name = 'signup_complete' THEN 1 ELSE 0 END) AS finished_signup,
        MAX(CASE WHEN event_name = 'first_purchase' THEN 1 ELSE 0 END) AS made_purchase
    FROM events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    COUNT(*) AS total_users,
    SUM(saw_page) AS viewers,
    SUM(began_signup) AS signup_starts,
    SUM(finished_signup) AS signup_completions,
    SUM(made_purchase) AS purchasers,
    ROUND(100.0 * SUM(began_signup) / NULLIF(SUM(saw_page), 0), 1) AS view_to_start_rate,
    ROUND(100.0 * SUM(finished_signup) / NULLIF(SUM(began_signup), 0), 1) AS start_to_finish_rate,
    ROUND(100.0 * SUM(made_purchase) / NULLIF(SUM(finished_signup), 0), 1) AS finish_to_purchase_rate
FROM step_flags;

Removing Duplicate Records

-- Retain only the latest version of each entity
WITH ordered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity_id
            ORDER BY modified_at DESC
        ) AS seq
    FROM raw_table
)
SELECT * FROM ordered WHERE seq = 1;

Troubleshooting Query Failures

When a query produces an error, work through these checks:

  1. Syntax issues: Verify dialect compatibility (e.g., BigQuery lacks ILIKE; only BigQuery has SAFE_DIVIDE)
  2. Missing columns: Confirm column names against the actual schema; watch for case sensitivity with quoted identifiers in PostgreSQL
  3. Type conflicts: Add explicit casts when comparing mismatched types (CAST(col AS DATE) or col::DATE)
  4. Divide-by-zero: Wrap denominators with NULLIF(denominator, 0) or use platform-specific safe division
  5. Column ambiguity: Alias every table in JOINs and always prefix column references
  6. GROUP BY violations: Every selected column that is not inside an aggregate must appear in GROUP BY (BigQuery permits alias references as an exception)