Observability & Site Reliability Engineering Skill

Core Principles

Three Pillars — Metrics, Logs, and Traces provide holistic visibility
Observability-First — Build systems that explain their own behavior
SLO-Driven — Define reliability targets that matter to users
Proactive Detection — Find issues before customers do
Blameless Culture — Learn from failures without blame
Automate Toil — Reduce repetitive operational work
Continuous Improvement — Each incident makes systems more resilient
Full-Stack Visibility — Monitor from infrastructure to business metrics

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only

Alert on user-facing symptoms, not internal infrastructure metrics.

# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70%
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80%
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"

Low Cardinality Labels

Loki/Prometheus labels must have low cardinality: use fewer than 10 labels per stream or series, and give each label a small, bounded set of values.

# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

// High cardinality data goes in the log body (JavaScript):
logger.info({
  user_id: "usr_123",      // In JSON body, not label
  order_id: "ord_456",
}, "Order processed");
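
The rule above can be enforced mechanically with an allowlist. The following is a minimal sketch, not part of any Loki or Prometheus client: `ALLOWED_LABELS` and `splitFields` are hypothetical names, and the allowlist contents simply mirror the example labels shown above.

```typescript
// Hypothetical allowlist; adjust to the labels your deployment actually indexes.
const ALLOWED_LABELS: ReadonlySet<string> = new Set(["namespace", "app", "level", "method"]);

type Fields = Record<string, string>;

interface SplitResult {
  labels: Fields; // indexed by Loki/Prometheus, must stay low cardinality
  body: Fields;   // stored in the JSON log line, any cardinality
}

// Route each field to the label set or the log body based on the allowlist.
function splitFields(fields: Fields, allowed: ReadonlySet<string> = ALLOWED_LABELS): SplitResult {
  const result: SplitResult = { labels: {}, body: {} };
  for (const [key, value] of Object.entries(fields)) {
    if (allowed.has(key)) {
      result.labels[key] = value;
    } else {
      result.body[key] = value;
    }
  }
  return result;
}

const split = splitFields({
  app: "api-server",
  level: "error",
  user_id: "usr_123",
  order_id: "ord_456",
});
// split.labels -> { app: "api-server", level: "error" }
// split.body   -> { user_id: "usr_123", order_id: "ord_456" }
```

An allowlist fails safe: a field nobody thought about ends up in the body rather than silently creating a new label dimension.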

SLO-Based Error Budgets

Every service must have defined SLOs with error budget tracking.

# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
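
The budget arithmetic in the comment above (0.1% of a 30-day month is 43.2 minutes) can be checked with a short sketch. The function names are illustrative, and the 30-day window and the 0.5% observed error ratio are assumptions chosen for the example.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed downtime in minutes for a given SLO target over a window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * MINUTES_PER_DAY;
}

// Burn rate: how fast the budget is being spent relative to plan.
// 1.0 means the budget lasts exactly the window; above 1.0 it runs out early.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

const budget = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const burn = burnRate(0.005, 0.999);          // about 5: a 30-day budget lasts roughly 6 days
```

A burn rate is often easier to alert on than a raw ratio, because the same threshold keeps its meaning when the SLO target changes.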

Trace Context in Logs

All logs must include trace_id for correlation with distributed traces.

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
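
To illustrate why the shared `trace_id` matters, here is a minimal sketch that pulls every log record for one trace out of newline-delimited JSON. `logsForTrace` and the sample lines are hypothetical; in practice this filtering would normally be done by the log backend's query language.

```typescript
interface LogRecord {
  trace_id?: string;
  span_id?: string;
  msg: string;
}

// Parse newline-delimited JSON logs and keep the records for one trace.
// Lines that are not valid JSON objects are skipped rather than treated as errors.
function logsForTrace(ndjson: string, traceId: string): LogRecord[] {
  const matches: LogRecord[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const record = JSON.parse(line) as LogRecord;
      if (record.trace_id === traceId) matches.push(record);
    } catch {
      // Not usable JSON (for example a stack trace line); ignore it.
    }
  }
  return matches;
}

const sampleLogs = [
  '{"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}',
  '{"trace_id":"zzz999","span_id":"aaa111","msg":"Cache miss"}',
  '{"msg":"Log line without trace context"}',
].join("\n");

const related = logsForTrace(sampleLogs, "abc123");
// related.length -> 1, related[0].msg -> "Payment processed"
```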

Quick Reference

When to Use What

| Scenario | Tool/Pattern | Reason |
|----------|--------------|--------|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

The Three Pillars

| Pillar | What | When | Tools |
|--------|------|------|-------|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

Observability Architecture

Layered Prometheus Setup

# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'

Recording Rules for Performance

# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
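
The P95 rule above relies on `histogram_quantile`, which estimates a quantile from cumulative buckets instead of raw observations. The sketch below is a simplified illustration of that idea (linear interpolation inside the bucket that contains the target rank); it is not Prometheus's implementation, and `Bucket`, `estimateQuantile`, and the sample counts are assumptions made for the example.

```typescript
interface Bucket {
  le: number;    // upper bound ("le" label); Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

// Simplified sketch of quantile estimation from cumulative histogram buckets.
// Assumes buckets are sorted by le, counts are non-decreasing, and the last
// bucket is +Inf. Edge cases handled by the real function are omitted.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2 || q < 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return buckets[i].le;

  // Linear interpolation inside the bucket that contains the rank.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.2, count: 90 },
  { le: 0.5, count: 98 },
  { le: Infinity, count: 100 },
];

const p95 = estimateQuantile(0.95, latencyBuckets); // about 0.3875 seconds
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a P95 that lands in a wide bucket such as 0.2 to 0.5 seconds can be off by a large fraction of that width.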

Resource Optimization

# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Doubling the interval from 15s to 30s roughly halves sample ingestion

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

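# Limit sample retention. Retention is conventionally set with command-line
# flags on the Prometheus binary rather than in prometheus.yml:
#   --storage.tsdb.retention.time=15d   # Keep only 15 days locally
#   --storage.tsdb.retention.size=50GB  # Or cap local storage at 50GB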
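
The effect of the scrape interval on ingestion load is simple arithmetic, sketched below. The series count is a hypothetical figure, and the model assumes every active series produces one sample per scrape.

```typescript
// Approximate ingestion rate: every active series yields one sample per scrape.
function samplesPerSecond(activeSeries: number, scrapeIntervalSeconds: number): number {
  return activeSeries / scrapeIntervalSeconds;
}

const activeSeries = 1500000; // hypothetical fleet-wide series count

const at15s = samplesPerSecond(activeSeries, 15); // 100000 samples/s
const at30s = samplesPerSecond(activeSeries, 30); // 50000 samples/s
const reduction = 1 - at30s / at15s;              // 0.5, i.e. half the ingestion load
```

The trade-off is resolution: with a 30s interval, a `rate(...[1m])` window sees only about two samples, so short range selectors may need to be widened.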

Distributed Tracing with OpenTelemetry

Auto-Instrumentation Setup

// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();

Manual Instrumentation for Business Logic

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');
// Assumes `stripe` is an initialized Stripe SDK client defined elsewhere.

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call; end it even if the call throws
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        try {
          const result = await stripe.charges.create({ amount, currency: 'usd' });
          childSpan.setAttribute('stripe.charge_id', result.id);
          childSpan.setStatus({ code: SpanStatusCode.OK });
          return result;
        } finally {
          childSpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      // Caught values are `unknown` in strict TypeScript; narrow before use
      const err = error instanceof Error ? error : new Error(String(error));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
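
The three tail-sampling policies above amount to "keep the trace if any policy matches". The sketch below illustrates that decision logic only; it is not the Collector's algorithm. `TraceSummary`, `traceBucket`, and `shouldKeep` are hypothetical names, and deriving the baseline sample from the first 8 hex digits of the trace ID is an assumption chosen to keep the decision deterministic.

```typescript
interface TraceSummary {
  traceId: string;                            // 32 lowercase hex characters
  durationMs: number;                         // end-to-end duration of the trace
  spanStatuses: Array<"OK" | "ERROR" | "UNSET">;
}

// Map a trace ID to a stable bucket in [0, 100) using its first 8 hex digits,
// so every evaluation of the same trace reaches the same decision.
function traceBucket(traceId: string): number {
  return (parseInt(traceId.slice(0, 8), 16) / 0x100000000) * 100;
}

// Keep the trace if any policy matches: error, slow, or baseline sample.
function shouldKeep(trace: TraceSummary, thresholdMs = 1000, baselinePercent = 5): boolean {
  const hasError = trace.spanStatuses.includes("ERROR");
  const isSlow = trace.durationMs > thresholdMs;
  return hasError || isSlow || traceBucket(trace.traceId) < baselinePercent;
}

const failing: TraceSummary = {
  traceId: "ffffffff000000000000000000000001",
  durationMs: 20,
  spanStatuses: ["OK", "ERROR"],
};
const slow: TraceSummary = {
  traceId: "ffffffff000000000000000000000002",
  durationMs: 1500,
  spanStatuses: ["OK"],
};
const normalDropped: TraceSummary = {
  traceId: "ffffffff000000000000000000000003",
  durationMs: 20,
  spanStatuses: ["OK"],
};
const normalKept: TraceSummary = {
  traceId: "00000000000000000000000000000004",
  durationMs: 20,
  spanStatuses: ["OK"],
};
// shouldKeep(failing) and shouldKeep(slow) are true regardless of trace ID;
// the two normal traces differ only in which bucket their ID falls into.
```

Tail sampling needs the whole trace before deciding, which is why it runs in the Collector rather than in each service.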

Context Propagation

// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
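
On the consuming side, the carrier holds a `traceparent` value in the format shown above. OpenTelemetry's `propagation.extract` normally handles this; the following is only a sketch of what the version `00` header contains, using a hand-written parser. `parseTraceparent` and `TraceParent` are hypothetical names, and the sample value is the example from the W3C Trace Context specification.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Version "00" layout: 2 hex - 32 hex - 16 hex - 2 hex, all lowercase.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a traceparent header value; returns null when it is malformed.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // The spec treats version ff and all-zero IDs as invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

const parsed = parseTraceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
// parsed?.traceId -> "0af7651916cd43dd8448eb211c80319c", parsed?.sampled -> true
```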

Structured Logging Best Practices

JSON Logging Format

// Use structured logging library
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}

Log Levels

// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');          // Development
logger.info({ event }, 'Normal operation');            // Production default
logger.warn({ issue }, 'Warning condition');           // Potential issues
logger.error({ error, context }, 'Error occurred');    // Errors
logger.fatal({ critical }, 'Fatal error');             // Process crash
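
Level filtering works by comparing numeric severities. The sketch below uses the numeric values of pino's default levels; `isEnabled` is a hypothetical helper written for illustration, not a pino API.

```typescript
// Numeric severities as used by pino's default levels.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof LEVELS;

// A message is emitted when its severity is at or above the configured level.
function isEnabled(configured: Level, message: Level): boolean {
  return LEVELS[message] >= LEVELS[configured];
}

const emitted = (Object.keys(LEVELS) as Level[]).filter((l) => isEnabled("info", l));
// emitted -> ["info", "warn", "error", "fatal"]
```

With `LOG_LEVEL=info` in production, `trace` and `debug` calls are dropped before serialization, so leaving them in the code costs very little.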

Grafana Loki Configuration

# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
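    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
      # Promote only low-cardinality fields to labels.
      # trace_id is unique per request, so it stays in the log body;
      # query it with: {app="api"} | json | trace_id="..."
      - labels:
          level: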

Loki Best Practices

Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
LogQL for Filtering — Use {app="api"} | json | user_id="123"
Retention Policy — Keep recent logs longer, compress old logs

# LogQL query examples
{namespace="production", app="api"} |= "error"  # Text search

{app="api"} | json | level="error" | line_format "{{.msg}}"  # JSON parsing

rate({app="api"}[5m])  # Log rate per second

sum by (level) (count_over_time({namespace="production"}[1h]))  # Count by level

Extended Reference

Detailed material starting at ## SLO/SLI/SLA Management has been moved to reference/extended.md to keep this skill concise. Load that reference when the task requires the moved examples, command catalogs, checklists, platform details, or implementation templates.
