Agent Skills: Observability & Site Reliability Engineering

Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.

Install this agent skill locally:

pnpm dlx add-skill https://github.com/majiayu000/claude-arsenal/tree/HEAD/skills/observability-sre

Observability & Site Reliability Engineering

Core Principles

  • Three Pillars — Metrics, Logs, and Traces provide holistic visibility
  • Observability-First — Build systems that explain their own behavior
  • SLO-Driven — Define reliability targets that matter to users
  • Proactive Detection — Find issues before customers do
  • Blameless Culture — Learn from failures without blame
  • Automate Toil — Reduce repetitive operational work
  • Continuous Improvement — Each incident makes systems more resilient
  • Full-Stack Visibility — Monitor from infrastructure to business metrics

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only

Alert on user-facing symptoms, not internal infrastructure metrics.

# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70  # percent
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80  # percent
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"

Low Cardinality Labels

Loki/Prometheus labels must have low cardinality: keep to fewer than 10 labels, each with a small, bounded set of values.

# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

// High-cardinality data goes in the log body:
logger.info({
  user_id: "usr_123",      // In JSON body, not a label
  order_id: "ord_456",
}, "Order processed");
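
The rule above can be enforced in code before a log line is shipped. The following is a minimal sketch, assuming a hypothetical allow-list `ALLOWED_LABELS` and a hypothetical helper `split_labels`; neither comes from Loki or Prometheus.

```python
# Hypothetical allow-list of low-cardinality label names; adjust to your own scheme.
ALLOWED_LABELS = {"namespace", "app", "level", "method"}


def split_labels(fields: dict) -> tuple[dict, dict]:
    """Return (labels, body): allow-listed keys become labels, the rest stay in the log body."""
    labels = {k: v for k, v in fields.items() if k in ALLOWED_LABELS}
    body = {k: v for k, v in fields.items() if k not in ALLOWED_LABELS}
    return labels, body


labels, body = split_labels({
    "namespace": "production",
    "app": "api-server",
    "level": "info",
    "user_id": "usr_123",    # high cardinality: ends up in the body
    "order_id": "ord_456",   # high cardinality: ends up in the body
})
```

An allow-list fails safe: a new field added by a developer lands in the body unless someone deliberately promotes it to a label.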

SLO-Based Error Budgets

Every service must have defined SLOs with error budget tracking.

# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"

Trace Context in Logs

All logs must include trace_id for correlation with distributed traces.

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
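
The same rule can be sketched for a Python service using only the standard library. This is an illustration under assumptions: `current_trace` is a hypothetical context variable standing in for whatever the tracing library exposes, and `TraceContextFilter`, `JsonFormatter`, and `build_logger` are names invented for this example.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace context; a real service would
# read these IDs from its tracing library instead.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy trace_id and span_id from the context variable onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id")
        record.span_id = ctx.get("span_id")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "msg": record.getMessage(),
        })


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger whose handler enriches and JSON-formats every record."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Attaching the filter to the handler rather than to each call site means no individual log statement can forget the trace context.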

Quick Reference

When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)


Observability Architecture

Layered Prometheus Setup

# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'

Recording Rules for Performance

# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
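
To make the P95 rule less of a black box, here is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It approximates the idea behind `histogram_quantile` for classic buckets; it is not Prometheus's implementation and ignores several of its edge cases. The bucket values are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the last finite bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Illustrative data: 100 requests, cumulative counts per latency bucket (seconds).
p95 = histogram_quantile(0.95, [(0.1, 50), (0.2, 90), (0.5, 100), (math.inf, 100)])
```

Because the result is interpolated inside a bucket, its accuracy depends on bucket boundaries; placing a boundary at the SLO threshold (for example 0.2 s) keeps the estimate meaningful where it matters.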

Resource Optimization

# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Doubling the interval from 15s to 30s roughly halves scrape load

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit sample retention
# (typically set via command-line flags rather than in prometheus.yml)
# --storage.tsdb.retention.time=15d   # Keep only 15 days locally
# --storage.tsdb.retention.size=50GB  # Or cap at 50GB
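
One detail of the `metric_relabel_configs` drop rule is easy to miss: Prometheus anchors relabel regexes at both ends, so `go_.*` matches names that start with `go_`, not names that merely contain it. The sketch below mimics that with Python's `re.fullmatch`. Python's regex engine is not RE2, so treat it as an approximation; the metric names are examples only.

```python
import re

# Same pattern as the drop rule in the config above.
DROP_REGEX = re.compile(r"go_.*|process_.*")


def is_dropped(metric_name: str) -> bool:
    """True if the metric would be dropped; fullmatch mirrors the anchored match."""
    return DROP_REGEX.fullmatch(metric_name) is not None


names = [
    "go_goroutines",
    "process_cpu_seconds_total",
    "http_requests_total",
    "cargo_shipments_total",  # contains "go_" but does not start with it
]
kept = [n for n in names if not is_dropped(n)]
```

Testing a drop pattern against a list of real metric names before deploying it is a cheap way to avoid silently discarding a metric that dashboards depend on.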

Distributed Tracing with OpenTelemetry

Auto-Instrumentation Setup

// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();

Manual Instrumentation for Business Logic

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        try {
          const result = await stripe.charges.create({ amount, currency: 'usd' });
          childSpan.setAttribute('stripe.charge_id', result.id);
          childSpan.setStatus({ code: SpanStatusCode.OK });
          return result;
        } finally {
          // End the child span even if the charge call throws
          childSpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
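
The decision logic of the tail-sampling policies above can be sketched as follows: a trace is kept if any policy matches. This is an illustration only; `TraceSummary`, `in_sample`, and `keep_trace` are hypothetical names, and the hash-based sampling here is not the Collector's actual algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def in_sample(trace_id: str, percentage: float) -> bool:
    """Deterministically map a trace ID into [0, 100) and compare with the percentage."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
    return bucket < percentage


def keep_trace(t: TraceSummary, slow_ms: float = 1000, normal_pct: float = 5) -> bool:
    """Keep the trace if ANY policy matches: error, slow, or sampled baseline."""
    return t.has_error or t.duration_ms > slow_ms or in_sample(t.trace_id, normal_pct)


errored = keep_trace(TraceSummary("trace-a", has_error=True, duration_ms=12))
slow = keep_trace(TraceSummary("trace-b", has_error=False, duration_ms=1500))
```

Hashing the trace ID instead of calling a random number generator makes the baseline decision repeatable, so every collector instance that sees the same trace reaches the same verdict.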

Context Propagation

// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
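
For consumers that have to read the carrier by hand, the W3C `traceparent` header has a fixed layout: version, 32 hex characters of trace ID, 16 hex characters of parent span ID, and 2 hex characters of flags. The sketch below builds and parses version `00` headers only; `build_traceparent` and `parse_traceparent` are hypothetical helper names, and in practice the OpenTelemetry propagators do this work.

```python
import re
from typing import Optional

# Version 00 layout: 00-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
parsed = parse_traceparent(header)
```

Returning `None` for malformed input lets the consumer start a fresh trace instead of failing the message.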

Structured Logging Best Practices

JSON Logging Format

// Use a structured logging library
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}

Log Levels

// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');          // Development
logger.info({ event }, 'Normal operation');            // Production default
logger.warn({ issue }, 'Warning condition');           // Potential issues
logger.error({ error, context }, 'Error occurred');    // Errors
logger.fatal({ critical }, 'Fatal error');             // Process crash

Grafana Loki Configuration

# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
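      # Parse JSON logs
      - json:
          expressions:
            level: level
      # Promote only low-cardinality fields to labels.
      # trace_id is unique per request, so it stays in the log body;
      # query it with LogQL: {app="api"} | json | trace_id="..."
      - labels:
          level: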

Loki Best Practices

  • Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
  • High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
  • LogQL for Filtering — Use {app="api"} | json | user_id="123"
  • Retention Policy — Keep recent logs longer, compress old logs

# LogQL query examples
{namespace="production", app="api"} |= "error"  # Text search

{app="api"} | json | level="error" | line_format "{{.msg}}"  # JSON parsing

rate({app="api"}[5m])  # Log rate per second

sum by (level) (count_over_time({namespace="production"}[1h]))  # Count by level

SLO/SLI/SLA Management

Definitions

  • SLI (Service Level Indicator) — Quantifiable measurement of service behavior

    • Examples: Request latency, error rate, availability, throughput
  • SLO (Service Level Objective) — Target value/range for an SLI

    • Examples: 99.9% availability, P95 latency < 200ms
  • SLA (Service Level Agreement) — Formal commitment with consequences

    • Examples: "99.9% uptime or 10% credit"

The Four Golden Signals

# Google SRE's key metrics for any service

1. Latency
   SLI: P95 request latency
   SLO: 95% of requests complete in < 200ms

2. Traffic
   SLI: Requests per second
   SLO: Handle 10,000 req/s peak load

3. Errors
   SLI: Error rate (5xx / total)
   SLO: < 0.1% error rate

4. Saturation
   SLI: Resource utilization (CPU, memory, disk)
   SLO: CPU < 70%, Memory < 80%

Error Budget

# Error budget = 1 - SLO
SLO = 99.9%  # "three nines"
Error_Budget = 100% - 99.9% = 0.1%

# Monthly calculation (30 days)
Total_Minutes = 30 * 24 * 60 = 43,200 minutes
Allowed_Downtime = 43,200 * 0.001 = 43.2 minutes

# If you've had 20 minutes downtime this month:
Budget_Remaining = 43.2 - 20 = 23.2 minutes
Budget_Consumed = 20 / 43.2 = 46.3%

# Policy: If budget > 90% consumed, freeze deployments
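
The arithmetic above can be wrapped in a small helper. This is a minimal sketch; `error_budget_minutes` and `budget_report` are hypothetical names, and the 90% freeze threshold is the policy stated above.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * days * 24 * 60


def budget_report(slo: float, downtime_minutes: float, days: int = 30) -> dict:
    """Summarize how much of the error budget has been used."""
    allowed = error_budget_minutes(slo, days)
    consumed = downtime_minutes / allowed
    return {
        "allowed_minutes": allowed,
        "remaining_minutes": allowed - downtime_minutes,
        "consumed_ratio": consumed,
        # Policy from the section above: freeze once more than 90% is consumed.
        "freeze_deployments": consumed > 0.90,
    }


report = budget_report(slo=0.999, downtime_minutes=20)
```

With the numbers from the example, roughly 46% of the budget is consumed, so deployments stay open.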

SLO Implementation with Prometheus

# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  # Alerting on SLO burn rate
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            slo:api_availability:ratio < 0.999  # Below 99.9% SLO
            and
            slo:api_availability:30d > 0.999    # But 30-day average still OK
          )
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
          description: "Current availability {{ $value }} is below the 99.9% SLO."
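
A burn rate expresses how fast the budget is being spent: the observed error ratio divided by the ratio the SLO allows. A value of 1 spends the budget exactly over the SLO window. The sketch below uses a threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 h), a figure commonly used for paging alerts. The two-window check and the function names are assumptions for illustration, not part of the rules above.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1h) shows the burn is significant;
    the short window (for example 5m) shows it is still happening.
    """
    return (burn_rate(long_window_errors, slo) > threshold
            and burn_rate(short_window_errors, slo) > threshold)


on_budget = burn_rate(error_ratio=0.001, slo=0.999)
page_now = should_page(long_window_errors=0.02, short_window_errors=0.02, slo=0.999)
recovered = should_page(long_window_errors=0.02, short_window_errors=0.0005, slo=0.999)
```

Requiring the short window as well stops the alert from lingering after the problem has already been mitigated.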

Incident Response

Incident Severity Levels

| Level | Impact | Response Time | Examples |
|-------|--------|---------------|----------|
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |

Incident Response Roles (IMAG Framework)

Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership

Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC

Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates

Incident Workflow

1. Detection (Alert fires or user reports)
   ↓
2. Triage (Assess severity, assign IC)
   ↓
3. Response (Assemble team, create war room)
   ↓
4. Mitigation (Stop the bleeding, restore service)
   ↓
5. Resolution (Fix root cause)
   ↓
6. Postmortem (Blameless review, action items)
   ↓
7. Follow-up (Implement improvements)

On-Call Best Practices

  • Rotation — 1-week shifts, balanced across timezones
  • Escalation — Primary → Secondary → Manager (15 min each)
  • Playbooks — Step-by-step debugging guides for common issues
  • Runbooks — Automated remediation scripts
  • Handoff — 15-min sync at rotation change
  • Compensation — On-call pay or comp time
  • Health — No more than 2 incidents/night target
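
The escalation policy in the list above (Primary, then Secondary, then Manager, 15 minutes each) can be expressed as a small lookup. This is a sketch with hypothetical names; paging tools such as PagerDuty or Opsgenie implement this through their own configuration rather than code like this.

```python
ESCALATION_CHAIN = ["primary", "secondary", "manager"]
STEP_MINUTES = 15  # each level gets 15 minutes to acknowledge


def current_escalation_target(minutes_unacknowledged: float) -> str:
    """Return who should be paged after the alert has gone unacknowledged this long."""
    step = int(minutes_unacknowledged // STEP_MINUTES)
    # Stay on the last level once the chain is exhausted.
    return ESCALATION_CHAIN[min(step, len(ESCALATION_CHAIN) - 1)]


first = current_escalation_target(0)
second = current_escalation_target(15)
last = current_escalation_target(120)
```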

Alert Fatigue Prevention

# Symptoms vs Causes alerting
# Alert on WHAT users experience, not WHY it's broken

# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200  # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70  # percent; internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting
# Alert when error budget burn rate is too high

Blameless Postmortems

Core Principles

  • Assume Good Intentions — Everyone did their best with available information
  • Focus on Systems — Identify gaps in process/tooling, not people
  • Psychological Safety — No punishment for honest mistakes
  • Learning Culture — Incidents are opportunities to improve
  • Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template

# Incident Postmortem: [Title]

**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson

## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods

Follow-up

  • Review postmortem in team meeting within 1 week
  • Track action items to completion (not optional!)
  • Share learnings across teams
  • Update runbooks and playbooks
  • Celebrate successful incident response

Chaos Engineering

Principles

  1. Define Steady State — Normal system behavior (e.g., 99.9% success rate)
  2. Hypothesize — Predict system will remain stable under failure
  3. Inject Failures — Simulate real-world events
  4. Disprove Hypothesis — Look for deviations from steady state
  5. Learn and Improve — Fix weaknesses, increase resilience

Failure Types

Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts

Chaos Mesh Example

# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours (Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource)

---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"

Best Practices

  • Start Small — Non-production first, then canary production
  • Collect Baselines — Know normal metrics before experiments
  • Define Success — Clear criteria for what "stable" means
  • Monitor Everything — Watch metrics, logs, traces during tests
  • Automate Rollback — Stop experiment if SLOs violated
  • Game Days — Scheduled chaos exercises with full team
  • Blameless Reviews — Treat chaos failures like production incidents
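
The "Define Success" and "Automate Rollback" practices above come down to comparing live measurements against a declared steady state and stopping when it is violated. The sketch below shows one way to express that check. `SteadyState`, `violations`, and `should_abort` are hypothetical names, and the thresholds reuse the SLO figures from earlier in this document.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    min_success_rate: float   # e.g. 0.999
    max_p95_latency_s: float  # e.g. 0.200


def violations(success_rate: float, p95_latency_s: float, steady: SteadyState) -> list[str]:
    """List every way the observed metrics deviate from the steady state."""
    problems = []
    if success_rate < steady.min_success_rate:
        problems.append(
            f"success rate {success_rate:.4f} below {steady.min_success_rate:.4f}")
    if p95_latency_s > steady.max_p95_latency_s:
        problems.append(
            f"p95 latency {p95_latency_s:.3f}s above {steady.max_p95_latency_s:.3f}s")
    return problems


def should_abort(success_rate: float, p95_latency_s: float, steady: SteadyState) -> bool:
    """Abort the experiment as soon as any steady-state condition is violated."""
    return bool(violations(success_rate, p95_latency_s, steady))


steady = SteadyState(min_success_rate=0.999, max_p95_latency_s=0.200)
healthy = should_abort(0.9995, 0.150, steady)
degraded = should_abort(0.9900, 0.150, steady)
```

Returning the list of violations, not just a boolean, gives the post-experiment review a record of which condition tripped the abort.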

AIOps and AI in Observability

2025 Trends

  • Anomaly Detection — AI spots unusual patterns in metrics/logs
  • Root Cause Analysis — Correlate failures across services automatically
  • Predictive Alerting — Predict failures before they happen
  • Auto-Remediation — AI suggests or applies fixes autonomously
  • Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
  • AI Observability — Monitor AI model drift, hallucinations, token usage

AI-Driven Platforms (2025)

Dynatrace Davis AI:
  - Vendor-reported: auto-detects 73% of incidents before customer impact
  - Vendor-reported: reduces alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations

Implementing AI Observability

# Monitor AI model performance
import time

from openai import AsyncOpenAI
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()

        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        # The async client is required for `await` to work here
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response

Grafana Dashboards

3-3-3 Rule

  • 3 rows of panels per dashboard
  • 3 panels per row
  • 3 key metrics per panel

Avoid "dashboard sprawl" — Each dashboard should answer ONE question.

Dashboard Categories

RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)
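
To make the RED dashboard concrete, here is a sketch that computes Rate, Errors, and Duration from a batch of request records. In production these numbers come from PromQL over histograms, not from raw records; the `Request` type, the nearest-rank percentile, and the sample data are all assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_s: float


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests: list[Request], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for requests seen within one window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p50_s": None, "p95_s": None, "p99_s": None}
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


# Ten illustrative requests over a 5-second window, one of them a server error.
sample_durations = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
sample = [Request(200, d) for d in sample_durations[:-1]] + [Request(500, sample_durations[-1])]
red = red_metrics(sample, window_s=5)
```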

Panel Best Practices

{
  "title": "API Request Rate",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}

Checklist

## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics

## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)

## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined

## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process

## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready

## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises

See Also