Agent Skills: Monitoring & Observability

Monitoring and observability patterns for Prometheus metrics, Grafana dashboards, Langfuse v4 LLM tracing (as_type, score_current_span, should_export_span, LangfuseMedia), and drift detection. Use when adding logging, metrics, distributed tracing, LLM cost tracking, or quality drift monitoring.

Skill ID: yonatangross/orchestkit/monitoring-observability

Install this agent skill into your local environment:

```sh
pnpm dlx add-skill https://github.com/yonatangross/orchestkit/tree/HEAD/src/skills/monitoring-observability
```

Skill Files

Browse the full folder contents for monitoring-observability.

src/skills/monitoring-observability/SKILL.md

Skill Metadata

Name: monitoring-observability

Description: Monitoring and observability patterns for Prometheus metrics, Grafana dashboards, Langfuse v4 LLM tracing (as_type, score_current_span, should_export_span, LangfuseMedia), and drift detection. Use when adding logging, metrics, distributed tracing, LLM cost tracking, or quality drift monitoring.

Monitoring & Observability

Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| Infrastructure Monitoring | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| LLM Observability | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| Drift Detection | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| Silent Failures | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |

Total: 12 rules across 4 categories

Quick Start

```python
# Prometheus metrics with RED method
from prometheus_client import Counter, Histogram

http_requests = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
http_duration = Histogram('http_request_duration_seconds', 'Request latency',
                          buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])
```

```python
# Langfuse v4 LLM tracing — semantic as_type + inline scoring
from langfuse import observe, get_client

@observe(as_type="generation", name="analyze_content")
async def analyze_content(content: str):
    get_client().update_current_trace(
        user_id="user_123", session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    result = await llm.generate(content)  # your LLM client
    get_client().score_current_span(name="response_quality", value=0.85)
    return result
```

```python
# PSI drift detection
import numpy as np

psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:  # PSI >= 0.25 is the conventional "significant drift" threshold
    alert("Significant quality drift detected!")
```
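The `calculate_psi()` helper used in the quick start is left undefined; a minimal sketch with quantile binning over numpy might look like this (the function body is an illustration, not the skill's canonical implementation):

```python
import numpy as np

def calculate_psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current score
    distribution. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    >= 0.25 significant drift."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges from baseline quantiles, so each bin holds ~1/n_bins of baseline
    edges = np.percentile(baseline, np.linspace(0, 100, n_bins + 1))
    # Clamp current values into the baseline range so nothing falls outside the bins
    current = np.clip(current, edges[0], edges[-1])
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_frac = np.histogram(current, bins=edges)[0] / len(current)
    base_frac = np.clip(base_frac, eps, None)  # avoid log(0) on empty bins
    curr_frac = np.clip(curr_frac, eps, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))
```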

Infrastructure Monitoring

Prometheus metrics, Grafana dashboards, and alerting for application health.

| Rule | File | Key Pattern |
|------|------|-------------|
| Prometheus Metrics | rules/monitoring-prometheus.md | RED method, counters, histograms, cardinality |
| Grafana Dashboards | rules/monitoring-grafana.md | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | rules/monitoring-alerting.md | Severity levels, grouping, escalation, fatigue prevention |
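As one concrete shape for the health-check pattern referenced above, a plain-Python dependency probe aggregator could look like this (the `health_check` helper and its report shape are hypothetical, not taken from the rule files):

```python
import time

def health_check(checks):
    """Run named dependency probes and aggregate into a health report.
    `checks` maps a dependency name to a callable returning True when healthy;
    a probe that raises is counted as unhealthy rather than crashing the endpoint."""
    results, healthy = {}, True
    for name, probe in checks.items():
        start = time.perf_counter()
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        results[name] = {
            "healthy": ok,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }
        healthy = healthy and ok
    return {"status": "ok" if healthy else "degraded", "checks": results}
```

The per-probe latency doubles as a cheap Golden Signals sample for the dependencies themselves.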

LLM Observability

Langfuse-based tracing, cost tracking, and evaluation for LLM applications.

| Rule | File | Key Pattern |
|------|------|-------------|
| Langfuse Traces | rules/llm-langfuse-traces.md | @observe decorator, OTEL spans, agent graphs |
| Cost Tracking | rules/llm-cost-tracking.md | Token usage, spend alerts, Metrics API v2 |
| Eval Scoring | rules/llm-eval-scoring.md | Custom scores, evaluator tracing, quality monitoring |
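The token-usage and spend-alert pattern can be sketched as a small accumulator; the price table below uses made-up placeholder numbers (real per-token prices vary by model and provider), and in practice the token counts would come from Langfuse generation usage data:

```python
# Illustrative per-million-token prices (input, output) — placeholders, not real pricing
PRICES = {"gpt-4o": (2.50, 10.00), "claude-sonnet": (3.00, 15.00)}

def usage_cost(model, input_tokens, output_tokens):
    """USD cost of one generation, given token counts and a price table."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

class SpendTracker:
    """Accumulate spend and flag when a daily budget is exceeded."""
    def __init__(self, daily_budget_usd):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def record(self, model, input_tokens, output_tokens):
        self.spent += usage_cost(model, input_tokens, output_tokens)
        return self.spent <= self.budget  # False once the budget is blown
```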

Drift Detection

Statistical and quality drift detection for production LLM systems.

| Rule | File | Key Pattern |
|------|------|-------------|
| Statistical Drift | rules/drift-statistical.md | PSI, KS test, KL divergence, EWMA |
| Quality Drift | rules/drift-quality.md | Score regression, baseline comparison, canary prompts |
| Drift Alerting | rules/drift-alerting.md | Dynamic thresholds, correlation, anti-patterns |
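The EWMA approach mentioned in the statistical-drift row can be illustrated with a streaming detector that flags points far from the running mean (a sketch; the class name and the 3-sigma default are illustrative choices, not prescribed by the rule file):

```python
class EwmaDrift:
    """Exponentially weighted moving average with a k-sigma control limit.
    update(x) returns True when x deviates from the running mean by more
    than k estimated standard deviations."""
    def __init__(self, alpha=0.1, k=3.0):
        self.alpha, self.k = alpha, k
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:        # first observation seeds the baseline
            self.mean = x
            return False
        resid = x - self.mean
        # Incremental exponentially weighted variance (West's recurrence)
        self.var = (1 - self.alpha) * (self.var + self.alpha * resid * resid)
        self.mean += self.alpha * resid
        sigma = self.var ** 0.5
        return sigma > 0 and abs(resid) > self.k * sigma
```

Because the mean and variance decay exponentially, the detector adapts to slow seasonal shifts while still catching abrupt quality drops.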

Silent Failures

Detection and alerting for silent failures in LLM agents.

| Rule | File | Key Pattern |
|------|------|-------------|
| Tool Skipping | rules/silent-tool-skipping.md | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | rules/silent-degraded-quality.md | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | rules/silent-alerting.md | Loop detection, token spikes, escalation workflow |
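The expected-vs-actual tool-call check reduces to a set difference over trace spans. In this sketch the span dicts (with `name` and `type` keys) are an assumed shape for spans exported from Langfuse, not a documented schema:

```python
def detect_skipped_tools(expected, trace_spans):
    """Return the tools the agent was expected to call but never did.
    `expected` is a list of tool names; `trace_spans` is assumed to be a
    list of span dicts carrying 'name' and 'type' keys."""
    called = {s["name"] for s in trace_spans if s.get("type") == "tool"}
    return sorted(set(expected) - called)
```

A non-empty result for a route that should always search before answering is exactly the kind of silent failure that never shows up as an exception.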

Key Decisions

| Decision | Recommendation | Rationale |
|----------|----------------|-----------|
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | @observe(as_type=...) + score_current_span() | v4: semantic types, inline scoring, span filtering |
| Langfuse APIs | Observations API v2 + Metrics API v2 | v4 (Mar 2026): faster querying, aggregations at scale |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |
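The dynamic-threshold decision (95th percentile over static) amounts to deriving the alert cutoff from recent history; a minimal sketch, with a hypothetical `floor` parameter to stop quiet periods from producing absurdly low thresholds:

```python
import numpy as np

def dynamic_threshold(history, percentile=95.0, floor=None):
    """Alert threshold computed from recent observations rather than a
    static constant; re-run periodically over a sliding window."""
    t = float(np.percentile(history, percentile))
    return max(t, floor) if floor is not None else t
```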

Detailed Documentation

| Resource | Description |
|----------|-------------|
| ${CLAUDE_SKILL_DIR}/references/ | Logging, metrics, tracing, Langfuse, drift analysis guides |
| ${CLAUDE_SKILL_DIR}/checklists/ | Implementation checklists for monitoring and Langfuse setup |
| ${CLAUDE_SKILL_DIR}/examples/ | Real-world monitoring dashboard and trace examples |
| ${CLAUDE_SKILL_DIR}/scripts/ | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |

Related Skills

  • defense-in-depth - Layer 8 observability as part of security architecture
  • devops-deployment - Observability integration with CI/CD and Kubernetes
  • resilience-patterns - Monitoring circuit breakers and failure scenarios
  • llm-evaluation - Evaluation patterns that integrate with Langfuse scoring
  • caching - Caching strategies that reduce costs tracked by Langfuse