Agent Skills: Prometheus Monitoring and Alerting

|

UncategorizedID: cosmix/claude-loom/prometheus

Install this agent skill to your local

pnpm dlx add-skill https://github.com/cosmix/loom/tree/HEAD/skills/prometheus

Skill Files

Browse the full folder contents for prometheus.

Download Skill

Loading file tree…

skills/prometheus/SKILL.md

Skill Metadata

Name
prometheus
Description
|

Prometheus Monitoring and Alerting

Overview

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. Built for multi-dimensional time-series data with flexible querying via PromQL.

Architecture Components

  • Prometheus Server: Core component that scrapes and stores time-series data with local TSDB
  • Alertmanager: Handles alerts, deduplication, grouping, routing, and notifications to receivers
  • Pushgateway: Allows ephemeral jobs to push metrics (use sparingly - prefer pull model)
  • Exporters: Convert metrics from third-party systems to Prometheus format (node, blackbox, etc.)
  • Client Libraries: Instrument application code (Go, Java, Python, Rust, etc.)
  • Prometheus Operator: Kubernetes-native deployment and management via CRDs
  • Remote Storage: Long-term storage via Thanos, Cortex, Mimir for multi-cluster federation

Data Model

  • Metrics: Time-series data identified by metric name and key-value labels
  • Format: metric_name{label1="value1", label2="value2"} sample_value timestamp
  • Metric Types:
    • Counter: Monotonically increasing value (requests, errors) - use rate() or increase() for querying
    • Gauge: Value that can go up/down (temperature, memory usage, queue length)
    • Histogram: Observations in configurable buckets (latency, request size) - exposes _bucket, _sum, _count
    • Summary: Similar to histogram but calculates quantiles client-side - use histograms for aggregation

Setup and Configuration

Basic Prometheus Server Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add service name label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node Exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node-exporter:9100"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Route alerts to appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true

    # Database alerts to DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]

    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Suppress warning alerts if critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

  # Suppress instance alerts if entire service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database Alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true

Best Practices

Metric Naming Conventions

Follow these naming patterns for consistency:

# Format: <namespace>_<subsystem>_<metric>_<unit>

# Counters (always use _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

# Gauges
memory_usage_bytes
active_connections
queue_size

# Histograms (use _bucket, _sum, _count suffixes automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds

# Use consistent base units
- seconds for duration (not milliseconds)
- bytes for size (not kilobytes)
- ratio for percentages (0.0-1.0, not 0-100)

Label Cardinality Management

DO

# Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

# Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}

DON'T

# Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

# Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}

Guidelines

  • Keep label values to < 10 per label (ideally)
  • Total unique time-series per metric should be < 10,000
  • Use recording rules to pre-aggregate high-cardinality metrics
  • Avoid labels with unbounded values (IDs, timestamps, user input)

Recording Rules for Performance

Use recording rules to pre-compute expensive queries:

# rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate request rates
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Pre-calculate error rates
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)

      # Pre-calculate error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m

      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

Alert Design (Symptoms vs Causes)

Alert on symptoms (user-facing impact), not causes

# alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # GOOD: Alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users experiencing slow page loads"

      # GOOD: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999))  # 14.4x burn rate for 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "At current rate, monthly error budget will be exhausted in {{ $value | humanizeDuration }}"

Cause-based alerts (use for debugging, not paging)

# alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower severity for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning # Not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand disk"

Alert Best Practices

  1. For duration: Use for clause to avoid flapping
  2. Meaningful annotations: Include summary, description, runbook URL, impact
  3. Proper severity levels: critical (page immediately), warning (ticket), info (log)
  4. Actionable alerts: Every alert should require human action
  5. Include context: Add labels for team ownership, service, environment

PromQL Query Patterns

PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalar, string literals, selectors, operators, functions, and aggregation.

Selectors and Matchers

# Instant vector selector (latest sample for each time-series)
http_requests_total

# Filter by label values
http_requests_total{method="GET", status="200"}

# Regex matching (=~) and negative regex (!~)
http_requests_total{status=~"5.."}  # 5xx errors
http_requests_total{endpoint!~"/admin.*"}  # exclude admin endpoints

# Label absence/presence
http_requests_total{job="api", status=""}  # empty label
http_requests_total{job="api", status!=""}  # non-empty label

# Range vector selector (samples over time)
http_requests_total[5m]  # last 5 minutes of samples

Rate Calculations

# Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])

# Sum by service
sum(rate(http_requests_total[5m])) by (service)

# Increase over time window (total count) - for alerts/dashboards showing total
increase(http_requests_total[1h])

# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])

Error Ratios

# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Histogram Queries

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)

Aggregation Operations

# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

# Average CPU usage
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Maximum value
max(http_request_duration_seconds) by (service)

# Minimum value
min(node_filesystem_avail_bytes) by (instance)

# Count number of instances
count(up == 1) by (job)

# Standard deviation
stddev(http_request_duration_seconds) by (service)

Advanced Queries

# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

# Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)

# Predict disk full time (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

# Absent metric detection
absent(up{job="critical-service"})

Complex Aggregations

# Calculate Apdex score (Application Performance Index)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) * 0.5
)
/
sum(rate(http_request_duration_seconds_count[5m]))

# Multi-window multi-burn-rate SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)

Binary Operators and Vector Matching

# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
  / on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
  / on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)

Time Functions and Offsets

# Compare with previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Day-over-day comparison
http_requests_total - http_requests_total offset 1d

# Time-based filtering
http_requests_total and hour() >= 9 and hour() < 17  # business hours
day_of_week() == 0 or day_of_week() == 6  # weekends

# Timestamp functions
time() - process_start_time_seconds  # uptime in seconds

Service Discovery

Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

Static Configuration

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1

File-based Service Discovery

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

# targets/webservers.json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]

Kubernetes Service Discovery

scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Extract custom scrape path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Extract custom port from annotation
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporters)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics

Consul Service Discovery

scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags

EC2 Service Discovery

scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type

DNS Service Discovery

scrape_configs:
  - job_name: "dns-srv-records"
    dns_sd_configs:
      - names:
          - "_prometheus._tcp.example.com"
        type: "SRV"
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance

Relabeling Actions Reference

| Action | Description | Use Case | | ----------- | -------------------------------------------------- | ---------------------------------- | | keep | Keep targets where regex matches source labels | Filter targets by annotation/label | | drop | Drop targets where regex matches source labels | Exclude specific targets | | replace | Replace target label with value from source labels | Extract custom labels/paths/ports | | labelmap | Map source label names to target labels via regex | Copy all Kubernetes labels | | labeldrop | Drop labels matching regex | Remove internal metadata labels | | labelkeep | Keep only labels matching regex | Reduce cardinality | | hashmod | Set target label to hash of source labels modulo N | Sharding/routing |

High Availability and Scalability

Prometheus High Availability Setup

# Deploy multiple identical Prometheus instances scraping same targets
# Use external labels to distinguish instances
global:
  external_labels:
    replica: prometheus-1 # Change to prometheus-2, etc.
    cluster: production

# Alertmanager will deduplicate alerts from multiple Prometheus instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093

Alertmanager Clustering

# alertmanager.yml - HA cluster configuration
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "cluster"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK"
        channel: "#alerts"

# Start Alertmanager cluster members
# alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094

Federation for Hierarchical Monitoring

# Global Prometheus federating from regional instances
scrape_configs:
  - job_name: "federate"
    scrape_interval: 15s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        # Pull aggregated metrics only
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}' # Recording rules
        - "up"
    static_configs:
      - targets:
          - "prometheus-us-east-1:9090"
          - "prometheus-us-west-2:9090"
          - "prometheus-eu-west-1:9090"
        labels:
          region: "us-east-1"

Remote Storage for Long-term Retention

# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 100ms
    write_relabel_configs:
      # Drop high-cardinality metrics before remote write
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

# Prometheus remote read from long-term storage
remote_read:
  - url: "http://thanos-query:9090/api/v1/read"
    read_recent: true

Thanos Architecture for Global View

# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901

# Thanos Compactor - downsample and compact blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d

Horizontal Sharding with Hashmod

# Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep

  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep

  # shard-2 similar pattern...

Kubernetes Integration

ServiceMonitor for Prometheus Operator

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select services to monitor
  selector:
    matchLabels:
      app: myapp

  # Define namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging

  # Endpoint configuration
  endpoints:
    - port: metrics # Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace

      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # Drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"

  # Optional: TLS configuration
  # tlsConfig:
  #   insecureSkipVerify: true
  #   ca:
  #     secret:
  #       name: prometheus-tls
  #       key: ca.crt

PodMonitor for Direct Pod Scraping

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select pods to monitor
  selector:
    matchLabels:
      app: myapp

  # Namespace selection
  namespaceSelector:
    matchNames:
      - production

  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s

      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node

PrometheusRule for Alerts and Recording Rules

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"

        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"

    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)

        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )

Prometheus Custom Resource

# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0

  # Service account for Kubernetes API access
  serviceAccountName: prometheus

  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus

  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules

  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m

  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd

  # Retention
  retention: 30d
  retentionSize: 45GB

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web

  # External labels
  externalLabels:
    cluster: production
    region: us-east-1

  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

  # Enable admin API for management operations
  enableAdminAPI: false

  # Additional scrape configs (from Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml

Application Instrumentation Examples

Go Application

// main.go
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter for total requests
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Histogram for request duration
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )

    // Gauge for active connections
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Summary for response sizes
    responseSizeBytes = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_response_size_bytes",
            Help:       "HTTP response size in bytes",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"endpoint"},
    )
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap response writer to capture status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        handler(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, endpoint,
            http.StatusText(wrapped.statusCode)).Inc()
    }
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}

func main() {
    // Register handlers
    http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
    http.Handle("/metrics", promhttp.Handler())

    // Start server
    http.ListenAndServe(":8080", nil)
}

Python Application (Flask)

# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

active_requests = Gauge(
    'active_requests',
    'Number of active requests'
)

# Middleware for instrumentation
@app.before_request
def before_request():
    active_requests.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    active_requests.dec()

    duration = time.time() - request.start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown'
    ).observe(duration)

    request_count.labels(
        method=request.method,
        endpoint=request.endpoint or 'unknown',
        status=response.status_code
    ).inc()

    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Production Deployment Checklist

  • [ ] Set appropriate retention period (balance storage vs history needs)
  • [ ] Configure persistent storage with adequate size
  • [ ] Enable high availability (multiple Prometheus replicas or federation)
  • [ ] Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
  • [ ] Configure service discovery for dynamic environments
  • [ ] Implement recording rules for frequently-used queries
  • [ ] Create symptom-based alerts with proper annotations
  • [ ] Set up Alertmanager with appropriate routing and receivers
  • [ ] Configure inhibition rules to reduce alert noise
  • [ ] Add runbook URLs to all critical alerts
  • [ ] Implement proper label hygiene (avoid high cardinality)
  • [ ] Monitor Prometheus itself (meta-monitoring)
  • [ ] Set up authentication and authorization
  • [ ] Enable TLS for scrape targets and remote storage
  • [ ] Configure rate limiting for queries
  • [ ] Test alert and recording rule validity (promtool check rules)
  • [ ] Implement backup and disaster recovery procedures
  • [ ] Document metric naming conventions for the team
  • [ ] Create dashboards in Grafana for common queries
  • [ ] Set up log aggregation alongside metrics (Loki)

Troubleshooting Commands

# Check Prometheus configuration syntax
promtool check config prometheus.yml

# Check rules file syntax
promtool check rules alerts/*.yml

# Test PromQL queries
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl http://localhost:9090/api/v1/targets

# Query current metric values
curl 'http://localhost:9090/api/v1/query?query=up'

# Check service discovery
curl http://localhost:9090/api/v1/targets/metadata

# View TSDB stats
curl http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo

Quick Reference

Common PromQL Patterns

# Request rate per second
rate(http_requests_total[5m])

# Error ratio percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU utilization (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))

# Service uptime in days
(time() - process_start_time_seconds) / 86400

# Request rate growth compared to 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

Alert Rule Patterns

# High error rate (symptom-based)
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
  runbook: "https://runbooks.example.com/high-error-rate"

# High latency P95
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
  ) > 1
for: 5m
labels:
  severity: warning

# Service down
alert: ServiceDown
expr: up{job="critical-service"} == 0
for: 2m
labels:
  severity: critical

# Disk space low (cause-based, warning only)
alert: DiskSpaceLow
expr: |
  node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
  severity: warning

# Pod crash looping
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
  severity: warning

Recording Rule Naming Convention

# Format: level:metric:operations
# level = aggregation level (job, instance, cluster)
# metric = base metric name
# operations = transformations applied (rate5m, sum, ratio)

groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)

Metric Naming Best Practices

| Pattern | Good Example | Bad Example | | ---------------- | ------------------------------- | -------------------------------------- | | Counter suffix | http_requests_total | http_requests | | Base units | http_request_duration_seconds | http_request_duration_ms | | Ratio range | cache_hit_ratio (0.0-1.0) | cache_hit_percentage (0-100) | | Byte units | response_size_bytes | response_size_kb | | Namespace prefix | myapp_http_requests_total | http_requests_total | | Label naming | {method="GET", status="200"} | {httpMethod="GET", statusCode="200"} |

Label Cardinality Guidelines

| Cardinality | Examples | Recommendation | | --------------- | ---------------------------------------- | ----------------------- | | Low (<10) | HTTP method, status code, environment | Safe for all labels | | Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation | | High (100-1000) | Container ID, hostname | Use only when necessary | | Unbounded | User ID, IP address, timestamp, URL path | Never use as label |

Kubernetes Annotation-based Scraping

# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
    prometheus.io/scheme: "http"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 8080
          name: metrics

Alertmanager Routing Patterns

route:
  receiver: default
  group_by: ["alertname", "cluster"]
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true # Also send to default

    # Team-based routing
    - match:
        team: database
      receiver: dba-team
      group_by: ["alertname", "instance"]

    # Environment-based routing
    - match:
        env: development
      receiver: slack-dev
      repeat_interval: 4h

    # Time-based routing (office hours only)
    - match:
        severity: warning
      receiver: email
      active_time_intervals:
        - business-hours

time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: "09:00"
            end_time: "17:00"
        weekdays: ["monday:friday"]

Additional Resources