Databricks Observability Skill

[!WARNING] DEPRECATED — to be removed in databricks-pack@2.0.0.

This v1 skill is replaced in the v2 rebuild. Migrate to: databricks-streaming-guardian. See the pack README → Migration: v1 → v2 for the full map and rationale.

Databricks Observability

Overview

Monitor Databricks jobs, clusters, SQL warehouses, and costs using system tables in the system catalog. System tables provide queryable observability data: system.lakeflow (job runs), system.billing (costs), system.query (SQL history), system.access (audit logs), and system.compute (cluster metrics). Data updates throughout the day, not real-time.

Prerequisites

Databricks Premium or Enterprise with Unity Catalog enabled
Access to system.billing, system.lakeflow, system.query, and system.access schemas
SQL warehouse for running monitoring queries

Instructions

Step 1: Job Health Monitoring

-- Job success/failure over last 24 hours
SELECT
    COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS succeeded,
    COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
    COUNT(CASE WHEN result_state = 'TIMED_OUT' THEN 1 END) AS timed_out,
    ROUND(100.0 * COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) / COUNT(*), 1) AS success_rate_pct,
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_duration_min
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;

-- Failed jobs with error details
SELECT job_id, run_name, result_state, start_time, end_time,
       TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration_min,
       error_message
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY start_time DESC;

Step 2: Cluster Utilization and Costs

-- DBU consumption by cluster (last 7 days)
SELECT usage_metadata.cluster_id,
       COALESCE(usage_metadata.cluster_name, 'unnamed') AS cluster_name,
       sku_name,
       SUM(usage_quantity) AS total_dbus,
       ROUND(SUM(usage_quantity * p.pricing.default), 2) AS cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p ON u.sku_name = p.sku_name
WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id, cluster_name, u.sku_name
ORDER BY cost_usd DESC
LIMIT 20;

Step 3: SQL Warehouse Performance

-- Slow queries (>30s) on SQL warehouses
SELECT warehouse_id, statement_id, executed_by,
       ROUND(total_duration_ms / 1000, 1) AS duration_sec,
       rows_produced,
       ROUND(bytes_scanned / 1048576, 1) AS scanned_mb,
       LEFT(statement_text, 200) AS query_preview
FROM system.query.history
WHERE total_duration_ms > 30000
  AND start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY total_duration_ms DESC
LIMIT 50;

-- Warehouse queue times (right-sizing indicator)
SELECT warehouse_id, warehouse_name,
       COUNT(*) AS query_count,
       ROUND(AVG(total_duration_ms) / 1000, 1) AS avg_sec,
       ROUND(MAX(queue_duration_ms) / 1000, 1) AS max_queue_sec
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY warehouse_id, warehouse_name;

Step 4: Cost-per-Job Analysis

SELECT j.name AS job_name,
       COUNT(DISTINCT r.run_id) AS run_count,
       ROUND(AVG(TIMESTAMPDIFF(MINUTE, r.start_time, r.end_time)), 1) AS avg_min,
       ROUND(SUM(b.usage_quantity), 1) AS total_dbus,
       ROUND(SUM(b.usage_quantity * p.pricing.default), 2) AS total_cost_usd
FROM system.lakeflow.job_run_timeline r
JOIN system.lakeflow.jobs j ON r.job_id = j.job_id
LEFT JOIN system.billing.usage b
    ON r.run_id = b.usage_metadata.job_run_id
LEFT JOIN system.billing.list_prices p ON b.sku_name = p.sku_name
WHERE r.start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY j.name
ORDER BY total_cost_usd DESC
LIMIT 15;

Step 5: SQL Alerts for Automated Notifications

-- Create as SQL Alert: trigger when failure_count > 3
-- Schedule: every 15 minutes
-- Notification destination: Slack/email

SELECT COUNT(*) AS failure_count
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND start_time > current_timestamp() - INTERVAL 1 HOUR;

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create SQL alert programmatically
alert = w.alerts.create(
    name="Hourly Job Failure Alert",
    query_id="<saved-query-id>",
    options={"column": "failure_count", "op": ">", "value": "3"},
    rearm=900,  # re-alert after 15 min if still triggered
)

Step 6: Export Metrics to External Systems

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Export cluster state metrics for Prometheus/Datadog
for cluster in w.clusters.list():
    if cluster.state.value == "RUNNING":
        print(f"databricks_cluster_workers{{name=\"{cluster.cluster_name}\"}} "
              f"{cluster.num_workers}")
        print(f"databricks_cluster_running{{name=\"{cluster.cluster_name}\"}} 1")

# Export job success rate for Grafana
runs = list(w.jobs.list_runs(limit=100, completed_only=True))
success = sum(1 for r in runs if r.state.result_state and r.state.result_state.value == "SUCCESS")
print(f"databricks_job_success_rate {success / len(runs):.2f}")

Step 7: Audit Log Monitoring

-- Security: who accessed what in the last 7 days
SELECT event_time, user_identity.email, action_name,
       request_params, response.status_code
FROM system.access.audit
WHERE service_name IN ('unityCatalog', 'jobs', 'clusters')
  AND event_date >= current_date() - 7
  AND action_name NOT IN ('getStatus', 'list')  -- exclude noisy reads
ORDER BY event_time DESC
LIMIT 100;

Output

Job health dashboard (success rate, duration, failures)
Cluster cost breakdown by team and SKU
SQL warehouse performance report (slow queries, queue times)
Per-job cost analysis
Automated SQL alerts with Slack/email notifications
External metric export for Prometheus/Grafana

Error Handling

| Issue | Cause | Solution | |-------|-------|----------| | System tables empty | Unity Catalog not enabled | Enable in Account Console > Settings | | TABLE_OR_VIEW_NOT_FOUND | Schema not accessible | Request admin to grant SELECT ON system.billing | | Billing data delayed | System table refresh lag (up to 24h) | Use for trends and alerts, not real-time | | Query history missing | Serverless queries not tracked | Use classic SQL warehouse or check retention |

Examples

Daily Standup Dashboard

-- Single query for daily pipeline health
SELECT
    'Last 24h' AS period,
    COUNT(*) AS total_runs,
    COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS ok,
    COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_min
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;

Agent Skills: Databricks Observability

Install this agent skill to your local

Skill Files