Agent Skills: Grafana and LGTM Stack Skill


ID: cosmix/claude-loom/grafana

Install this agent skill locally:

pnpm dlx add-skill https://github.com/cosmix/loom/tree/HEAD/skills/grafana

Skill Files


skills/grafana/SKILL.md

Skill Metadata

Name: grafana
Description: Grafana and LGTM Stack Skill

Overview

The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:

  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)

This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill

Primary Use Cases

  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill

  • senior-software-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture (use infrastructure skills for deployment)
  • software-engineer: Application dashboards, service metrics visualization

LGTM Stack Components

Loki - Log Aggregation

Architecture - Loki

Horizontally scalable log aggregation inspired by Prometheus

  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

Key Concepts - Loki

  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters
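
A minimal LogQL sketch tying these concepts together (label and field names are illustrative):

# Stream selector uses indexed, low-cardinality labels; high-cardinality
# detail (level, user id, duration) stays in the log line and is
# extracted at query time by a parser
{namespace="production", app="api"} | logfmt | level="error"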

Grafana - Visualization

Features

  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing

Architecture - Tempo

Scalable distributed tracing backend

  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible

Mimir - Metrics Storage

Architecture - Mimir

Horizontally scalable long-term Prometheus storage

  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible
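
As a sketch, an existing Prometheus can ship samples to Mimir with a standard remote_write block (the URL path below matches Mimir's push endpoint; the tenant header is only an assumption for multi-tenant deployments):

# prometheus.yml (fragment)
remote_write:
  - url: http://mimir:8080/api/v1/push
    headers:
      X-Scope-OrgID: tenant-1 # only needed when multi-tenancy is enabled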

Dashboard Design and Best Practices

Dashboard Organization Principles

  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (complemented by the RED and USE methods)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
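
For example, a Golden-Signals row for a service dashboard might start from queries like these (metric names follow the Prometheus conventions used elsewhere in this skill):

# Traffic
sum(rate(http_requests_total{app="$app"}[$__rate_interval]))

# Errors
sum(rate(http_requests_total{app="$app", status=~"5.."}[$__rate_interval]))

# Latency (P95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{app="$app"}[$__rate_interval])) by (le))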

Panel Types and When to Use Them

| Panel Type          | Use Case               | Best For                                          |
| ------------------- | ---------------------- | ------------------------------------------------- |
| Time Series / Graph | Trends over time       | Request rates, latency, resource usage            |
| Stat                | Single metric value    | Error rates, current values, percentages          |
| Gauge               | Progress toward limit  | CPU usage, memory, disk space                     |
| Bar Gauge           | Comparative values     | Top N items, distribution                         |
| Table               | Structured data        | Service lists, error details, resource inventory  |
| Pie Chart           | Proportions            | Traffic distribution, error breakdown             |
| Heatmap             | Distribution over time | Latency percentiles, request patterns             |
| Logs                | Log streams            | Error investigation, debugging                    |
| Traces              | Distributed tracing    | Performance analysis, dependency mapping          |

Panel Configuration Best Practices

Titles and Descriptions

  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
    • Good: "P95 Latency (seconds) by Endpoint"
    • Bad: "Latency"

Legends and Labels

  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

Axes and Units

  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

Thresholds and Colors

  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
    • Green: Normal operation
    • Yellow: Warning (action may be needed)
    • Red: Critical (immediate attention required)
  • Examples:
    • Error rate: 0% (green), 1% (yellow), 5% (red)
    • P95 latency: <1s (green), 1-3s (yellow), >3s (red)
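
The error-rate example above maps onto panel JSON roughly as follows (a sketch of the fieldConfig threshold steps, using the 1%/5% values as fractions of 1):

"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.01 },
        { "color": "red", "value": 0.05 }
      ]
    }
  }
}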

Links and Drilldowns

  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels
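
A data link on a panel field can carry the series labels into an Explore query; a sketch (the Loki data source name is an assumption):

"links": [
  {
    "title": "Logs for ${__field.labels.app}",
    "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"${__field.labels.app}\\\"}\"}]}"
  }
]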

Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

Variable Types

| Type       | Purpose                     | Example                         |
| ---------- | --------------------------- | ------------------------------- |
| Query      | Populate from data source   | Namespaces, services, pods      |
| Custom     | Static list of options      | Environments (prod/staging/dev) |
| Interval   | Time interval selection     | Auto-adjusted query intervals   |
| Datasource | Switch between data sources | Multiple Prometheus instances   |
| Constant   | Hidden values for queries   | Cluster name, region            |
| Text box   | Free-form input             | Custom filters                  |

Common Variable Patterns

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}

Variable Usage in Queries

Variables are referenced with $variable_name or ${variable_name} syntax:

# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in legend (aggregate by the label used in the legend;
# rate() alone cannot take a by clause)
sum(rate(http_requests_total{app="$app"}[5m])) by (method)
# Legend format: "{{method}}"

# Using interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])

Advanced Variable Techniques

Regex filtering:

{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}

All option with custom value:

{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}

Dependent variables (variable chain):

  1. $datasource (datasource type)
  2. $cluster (query: depends on datasource)
  3. $namespace (query: depends on cluster)
  4. $app (query: depends on namespace)
  5. $pod (query: depends on app)
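
Each level of the chain is typically a label_values() query referencing the variable above it; a sketch using kube_pod_info (the kube-state-metrics metric already used in this skill):

cluster:   label_values(kube_pod_info, cluster)
namespace: label_values(kube_pod_info{cluster="$cluster"}, namespace)
app:       label_values(kube_pod_info{cluster="$cluster", namespace=~"$namespace"}, app)
pod:       label_values(kube_pod_info{cluster="$cluster", namespace=~"$namespace", app=~"$app"}, pod)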

Annotations

Annotations display events as vertical markers on time series panels:

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}

Dashboard Performance Optimization

Query Optimization

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available

Panel Performance

  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards
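
A current-state Stat panel, for instance, can use an instant query and a capped data-point budget; a minimal panel fragment (sketch):

{
  "type": "stat",
  "title": "Pods Up",
  "maxDataPoints": 100,
  "targets": [
    { "expr": "sum(up{app=\"$app\"})", "instant": true, "refId": "A" }
  ]
}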

Dashboard as Code and Provisioning

Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

Provisioning Provider Configuration

File: /etc/grafana/provisioning/dashboards/dashboards.yaml

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: "application"
    orgId: 1
    folder: "Applications"
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: "infrastructure"
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure

Dashboard JSON Structure

Complete dashboard JSON with metadata and provisioning:

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}

Kubernetes ConfigMap Provisioning

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }

Grafana Operator (CRD)

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }

Data Source Provisioning

Loki Data Source

File: /etc/grafana/provisioning/datasources/loki.yaml

apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false

Tempo Data Source

File: /etc/grafana/provisioning/datasources/tempo.yaml

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false

Mimir/Prometheus Data Source

File: /etc/grafana/provisioning/datasources/mimir.yaml

apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false

Alerting

Alert Rule Configuration

Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

Prometheus/Mimir Alert Rule

File: /etc/grafana/provisioning/alerting/rules.yaml

apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }} (threshold: 0.05)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning

Loki Alert Rule

apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true

Contact Points and Notification Policies

File: /etc/grafana/provisioning/alerting/contactpoints.yaml

apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h

LogQL Query Patterns

Basic Log Queries

Stream Selection

# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}

Line Filters

# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"

Parsing and Extraction

JSON Parsing

# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on extracted field
{app="api"} | json | level="error"

# Nested JSON (the json parser flattens nested keys with underscores)
{app="api"} | json | line_format "{{.response_status}}"

Logfmt Parsing

# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"

Pattern Parsing

# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`

Aggregations and Metrics

Count Queries

# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))

Extracted Metrics

# Average duration
avg_over_time({app="api"}
  | logfmt
  | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95, {app="api"}
  | regexp `duration=(?P<duration>[0-9.]+)ms`
  | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"}
      | json
      | level="error" [1h]
    )
  )
)

TraceQL Query Patterns

Basic Trace Queries

# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }

Advanced TraceQL

# Direct parent-child relationship (child operator >)
{ .service.name = "frontend" }
  > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans at any depth (descendant operator >>)
{ .service.name = "api" }
  >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries anywhere beneath the API service
# (status is an intrinsic; its values ok/error/unset are unquoted)
{ .service.name = "api" }
  >> { .db.system = "postgresql" && status = error }
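
TraceQL also supports pipelined aggregates over the matched spans, useful for "more than N occurrences" style questions; a couple of sketches:

# Traces where the average span duration exceeds 500ms
{ .service.name = "api" } | avg(duration) > 500ms

# Traces containing more than 3 failed backend calls
{ .service.name = "backend" && .http.status_code = 500 } | count() > 3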

Complete Dashboard Examples

Application Observability Dashboard

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 8
        },
        "yaxes": [
          {
            "format": "percentunit",
            "max": 1,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.01],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 8
        }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}

LGTM Stack Configuration

Loki Configuration

File: loki.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

Tempo Configuration

File: tempo.yaml

server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:8080/api/v1/push
        send_exemplars: true
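
Note: on recent Tempo versions the metrics-generator processors must also be enabled per tenant via overrides before span metrics and service graphs are produced; a sketch (the exact overrides format varies by Tempo version):

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]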

Production Best Practices

Performance Optimization

Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use unwrap instead of parsing when possible
  • Cache query results with query frontend
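
Applying the first three points to a single LogQL query (a sketch): narrow by indexed labels first, drop lines with a cheap filter next, run the parser last, and unwrap rather than re-parsing for the metric.

quantile_over_time(0.95,
  {namespace="production", app="api"} |= "duration"
    | logfmt
    | unwrap duration [5m]) by (endpoint)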

Dashboard Performance

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use $__interval for adaptive sampling

Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

Authentication

  • Enable auth (auth_enabled: true in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation
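
With auth_enabled: true, every request must carry a tenant ID; in Grafana this is usually provisioned as a custom header on the data source (a sketch using Grafana's indexed header fields):

datasources:
  - name: Loki-TeamA
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: team-a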

Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress

Troubleshooting

Common Issues

  1. High Cardinality: Too many unique label combinations
     • Solution: Reduce label dimensions, use log parsing instead
  2. Query Timeouts: Complex queries on large datasets
     • Solution: Reduce time range, use aggregations, add query limits
  3. Storage Growth: Unbounded retention
     • Solution: Configure retention policies, enable compaction
  4. Missing Traces: Incomplete trace data
     • Solution: Check sampling rates, verify instrumentation
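
For the cardinality issue, a quick way to find offenders in Prometheus/Mimir (an expensive query; run it over a short range):

# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))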

Resources