Agent Skills: Grafana and LGTM Stack Skill


ID: cosmix/claude-loom/grafana

Install this agent skill locally:

pnpm dlx add-skill https://github.com/cosmix/loom/tree/HEAD/skills/grafana

Skill Files


skills/grafana/SKILL.md

Skill Metadata

Name: grafana
Description: Grafana and LGTM Stack Skill

Overview

The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:

  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)

This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill

Primary Use Cases

  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill

  • senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
  • software-engineer: Application dashboards, service metrics visualization

LGTM Stack Components

Loki - Log Aggregation

Architecture - Loki

Horizontally scalable log aggregation inspired by Prometheus

  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

Key Concepts - Loki

  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters
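
A minimal LogQL sketch tying these concepts together (label and field names are illustrative):

# Stream selector uses indexed, low-cardinality labels; high-cardinality
# detail (level, user id, duration) stays in the log line and is
# extracted at query time by a parser
{namespace="production", app="api"} | logfmt | level="error"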

Grafana - Visualization

Features

  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing

Architecture - Tempo

Scalable distributed tracing backend

  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible

Mimir - Metrics Storage

Architecture - Mimir

Horizontally scalable long-term Prometheus storage

  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible
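
As a sketch, an existing Prometheus can ship samples to Mimir with a standard remote_write block (the URL path below matches Mimir's push endpoint; the tenant header is only an assumption for multi-tenant deployments):

# prometheus.yml (fragment)
remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      X-Scope-OrgID: tenant-1 # only needed when multi-tenancy is enabled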

Dashboard Design and Best Practices

Dashboard Organization Principles

  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (complemented by the RED and USE methods)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
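
For example, a Golden-Signals row for a service dashboard might start from queries like these (metric names follow the Prometheus conventions used elsewhere in this skill):

# Traffic
sum(rate(http_requests_total{app="$app"}[$__rate_interval]))

# Errors
sum(rate(http_requests_total{app="$app", status=~"5.."}[$__rate_interval]))

# Latency (P95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{app="$app"}[$__rate_interval])) by (le))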

Panel Types and When to Use Them

| Panel Type          | Use Case               | Best For                                          |
| ------------------- | ---------------------- | ------------------------------------------------- |
| Time Series / Graph | Trends over time       | Request rates, latency, resource usage            |
| Stat                | Single metric value    | Error rates, current values, percentages          |
| Gauge               | Progress toward limit  | CPU usage, memory, disk space                     |
| Bar Gauge           | Comparative values     | Top N items, distribution                         |
| Table               | Structured data        | Service lists, error details, resource inventory  |
| Pie Chart           | Proportions            | Traffic distribution, error breakdown             |
| Heatmap             | Distribution over time | Latency percentiles, request patterns             |
| Logs                | Log streams            | Error investigation, debugging                    |
| Traces              | Distributed tracing    | Performance analysis, dependency mapping          |

Panel Configuration Best Practices

Titles and Descriptions

  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
    • Good: "P95 Latency (seconds) by Endpoint"
    • Bad: "Latency"

Legends and Labels

  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

Axes and Units

  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

Thresholds and Colors

  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
    • Green: Normal operation
    • Yellow: Warning (action may be needed)
    • Red: Critical (immediate attention required)
  • Examples:
    • Error rate: 0% (green), 1% (yellow), 5% (red)
    • P95 latency: <1s (green), 1-3s (yellow), >3s (red)
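
The error-rate example above maps onto panel JSON roughly as follows (a sketch of the fieldConfig threshold steps, using the 1%/5% values as fractions of 1):

"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.01 },
        { "color": "red", "value": 0.05 }
      ]
    }
  }
}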

Links and Drilldowns

  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels
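
A data link on a panel field can carry the series labels into an Explore query; a sketch (the Loki data source name is an assumption):

"links": [
  {
    "title": "Logs for ${__field.labels.app}",
    "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"${__field.labels.app}\\\"}\"}]}"
  }
]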

Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

Variable Types

| Type       | Purpose                     | Example                         |
| ---------- | --------------------------- | ------------------------------- |
| Query      | Populate from data source   | Namespaces, services, pods      |
| Custom     | Static list of options      | Environments (prod/staging/dev) |
| Interval   | Time interval selection     | Auto-adjusted query intervals   |
| Datasource | Switch between data sources | Multiple Prometheus instances   |
| Constant   | Hidden values for queries   | Cluster name, region            |
| Text box   | Free-form input             | Custom filters                  |

Common Variable Patterns

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}

Variable Usage in Queries

Variables are referenced with $variable_name or ${variable_name} syntax:

# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in legend (aggregate by the label used in the legend;
# rate() alone cannot take a by clause)
sum(rate(http_requests_total{app="$app"}[5m])) by (method)
# Legend format: "{{method}}"

# Using interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])

Advanced Variable Techniques

Regex filtering:

{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}

All option with custom value:

{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}

Dependent variables (variable chain):

  1. $datasource (datasource type)
  2. $cluster (query: depends on datasource)
  3. $namespace (query: depends on cluster)
  4. $app (query: depends on namespace)
  5. $pod (query: depends on app)
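
Each level of the chain is typically a label_values() query referencing the variable above it; a sketch using kube_pod_info (the kube-state-metrics metric already used in this skill):

cluster:   label_values(kube_pod_info, cluster)
namespace: label_values(kube_pod_info{cluster="$cluster"}, namespace)
app:       label_values(kube_pod_info{cluster="$cluster", namespace=~"$namespace"}, app)
pod:       label_values(kube_pod_info{cluster="$cluster", namespace=~"$namespace", app=~"$app"}, pod)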

Annotations

Annotations display events as vertical markers on time series panels:

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}

Dashboard Performance Optimization

Query Optimization

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available

Panel Performance

  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards
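
A current-state Stat panel, for instance, can use an instant query and a capped data-point budget; a minimal panel fragment (sketch):

{
  "type": "stat",
  "title": "Pods Up",
  "maxDataPoints": 100,
  "targets": [
    { "expr": "sum(up{app=\"$app\"})", "instant": true, "refId": "A" }
  ]
}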

Dashboard as Code and Provisioning

Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

Provisioning Provider Configuration

File: /etc/grafana/provisioning/dashboards/dashboards.yaml

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: "application"
    orgId: 1
    folder: "Applications"
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: "infrastructure"
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure

Dashboard JSON Structure

Complete dashboard JSON with metadata and provisioning:

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}

Kubernetes ConfigMap Provisioning

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }

Grafana Operator (CRD)

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }

Data Source Provisioning

Loki Data Source

File: /etc/grafana/provisioning/datasources/loki.yaml

apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false

Tempo Data Source

File: /etc/grafana/provisioning/datasources/tempo.yaml

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false

Mimir/Prometheus Data Source

File: /etc/grafana/provisioning/datasources/mimir.yaml

apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false

Alerting

Alert Rule Configuration

Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

Prometheus/Mimir Alert Rule

File: /etc/grafana/provisioning/alerting/rules.yaml

apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning

Loki Alert Rule

apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true

Contact Points and Notification Policies

File: /etc/grafana/provisioning/alerting/contactpoints.yaml

apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h

LogQL Query Patterns

Basic Log Queries

Stream Selection

# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}

Line Filters

# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"

Parsing and Extraction

JSON Parsing

# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on extracted field
{app="api"} | json | level="error"

# Nested JSON (the json parser flattens nested keys with underscores)
{app="api"} | json | line_format "{{.response_status}}"

Logfmt Parsing

# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"

Pattern Parsing

# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`

Aggregations and Metrics

Count Queries

# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))

Extracted Metrics

# Average duration
avg_over_time({app="api"}
  | logfmt
  | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95, {app="api"}
  | regexp `duration=(?P<duration>[0-9.]+)ms`
  | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"}
      | json
      | level="error" [1h]
    )
  )
)

TraceQL Query Patterns

Basic Trace Queries

# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }

Advanced TraceQL

# Direct parent-child relationship (child operator >)
{ .service.name = "frontend" }
  > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans at any depth (descendant operator >>)
{ .service.name = "api" }
  >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries anywhere beneath the API service
# (status is an intrinsic; its values ok/error/unset are unquoted)
{ .service.name = "api" }
  >> { .db.system = "postgresql" && status = error }
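
TraceQL also supports pipelined aggregates over the matched spans, useful for "more than N occurrences" style questions; a couple of sketches:

# Traces where the average span duration exceeds 500ms
{ .service.name = "api" } | avg(duration) > 500ms

# Traces containing more than 3 failed backend calls
{ .service.name = "backend" && .http.status_code = 500 } | count() > 3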

Complete Dashboard Examples

Application Observability Dashboard

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 8
        },
        "yaxes": [
          {
            "format": "percentunit",
            "max": 1,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.01],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 8
        }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}

LGTM Stack Configuration

Loki Configuration

File: loki.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

Tempo Configuration

File: tempo.yaml

server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:8080/api/v1/push
        send_exemplars: true
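
Note: on recent Tempo versions the metrics-generator processors must also be enabled per tenant via overrides before span metrics and service graphs are produced; a sketch (the exact overrides format varies by Tempo version):

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]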

Production Best Practices

Performance Optimization

Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use unwrap instead of parsing when possible
  • Cache query results with query frontend
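
Applying the first three points to a single LogQL query (a sketch): narrow by indexed labels first, drop lines with a cheap filter next, run the parser last, and unwrap rather than re-parsing for the metric.

quantile_over_time(0.95,
  {namespace="production", app="api"} |= "duration"
    | logfmt
    | unwrap duration [5m]) by (endpoint)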

Dashboard Performance

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use $__interval for adaptive sampling

Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

Authentication

  • Enable auth (auth_enabled: true in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation
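
With auth_enabled: true, every request must carry a tenant ID; in Grafana this is usually provisioned as a custom header on the data source (a sketch using Grafana's indexed header fields):

datasources:
  - name: Loki-TeamA
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: team-a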

Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress

Troubleshooting

Common Issues

  1. High Cardinality: Too many unique label combinations
     • Solution: Reduce label dimensions, use log parsing instead
  2. Query Timeouts: Complex queries on large datasets
     • Solution: Reduce time range, use aggregations, add query limits
  3. Storage Growth: Unbounded retention
     • Solution: Configure retention policies, enable compaction
  4. Missing Traces: Incomplete trace data
     • Solution: Check sampling rates, verify instrumentation
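
For the cardinality issue, a quick way to find offenders in Prometheus/Mimir (an expensive query; run it over a short range):

# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))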

Resources