Agent Skills: Observability Alert Manager

Configure Grafana alerts for Claude Code anomalies and thresholds. Use when setting up monitoring alerts for sessions, errors, context usage, or subagents.

Category: Uncategorized
ID: adaptationio/skrillz/observability-alert-manager

Install this agent skill locally:

pnpm dlx add-skill https://github.com/adaptationio/Skrillz/tree/HEAD/.claude/skills/observability-alert-manager

.claude/skills/observability-alert-manager/SKILL.md

Skill Metadata

Name: observability-alert-manager
Description: Configure Grafana alerts for Claude Code anomalies and thresholds. Use when setting up monitoring alerts for sessions, errors, context usage, or subagents.

Observability Alert Manager

Configure and manage Grafana alerts for Claude Code monitoring using enhanced telemetry.

Data Source

Primary: {job="claude_code_enhanced"} in Loki

Operations

create-alert

Define new alert rule. Parameters: name, query (LogQL), threshold, duration, severity, notification.

list-alerts

Show all configured alerts and their status.

test-alert

Simulate alert conditions.

delete-alert

Remove alert rule.
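
The create-alert parameters listed above can be pictured as one small record. The following is a minimal sketch, assuming a hypothetical `AlertRule` dataclass that is not part of the skill's scripts; the field names mirror the parameter list, and the allowed severities are the three levels this document defines.

```python
from dataclasses import dataclass, asdict

# Severity levels named in this document.
SEVERITIES = ("critical", "warning", "info")


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical container for the create-alert parameters (illustration only)."""
    name: str
    query: str         # LogQL expression
    threshold: float   # value the query result is compared against
    duration: str      # how long the condition must hold, e.g. "5m"
    severity: str
    notification: str  # e.g. "slack", "email"

    def __post_init__(self):
        # Reject severities the document does not define.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


rule = AlertRule(
    name="High Error Rate",
    query='count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h])',
    threshold=5,
    duration="5m",
    severity="warning",
    notification="slack",
)
print(asdict(rule)["severity"])  # warning
```

Validating the severity up front is a design choice of this sketch; the actual scripts may accept other values.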

Pre-built Alert Templates

Session Alerts

  1. Long Session Duration: Session >1 hour

    {job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600
    
  2. High Turn Count: Session >50 turns

    {job="claude_code_enhanced", event_type="session_end"} | json | turn_count > 50
    
  3. Session Error Spike: >5 errors in session

    {job="claude_code_enhanced", event_type="session_end"} | json | error_count > 5
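
To make the three session filters concrete, here is a rough sketch of the same comparisons applied to one JSON log line in plain Python. The sample payload is made up for illustration; real `session_end` events may carry more fields, and the helper name `breached_fields` is hypothetical.

```python
import json

# Thresholds copied from the three session templates above.
SESSION_THRESHOLDS = {
    "duration_seconds": 3600,
    "turn_count": 50,
    "error_count": 5,
}


def breached_fields(log_line):
    """Return the session_end fields in a JSON log line that exceed their threshold."""
    event = json.loads(log_line)
    return [
        field
        for field, limit in SESSION_THRESHOLDS.items()
        if event.get(field, 0) > limit
    ]


# Made-up sample payload (assumption, not real telemetry).
sample = json.dumps({"duration_seconds": 4200, "turn_count": 12, "error_count": 7})
print(breached_fields(sample))  # ['duration_seconds', 'error_count']
```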
    

Error Alerts

  1. High Error Rate: >5 errors/hour

    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5
    
  2. Specific Tool Failures: Bash errors

    count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error", tool="Bash"} [1h]) > 3
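
The error templates count matching log lines inside a trailing one-hour window and compare the count with a threshold. The sketch below approximates that idea in plain Python over a list of timestamps; it is an illustration of the windowed count, not Loki's implementation, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta


def count_over_time(timestamps, now, window=timedelta(hours=1)):
    """Count events whose timestamp falls in the window (now - window, now]."""
    start = now - window
    return sum(1 for ts in timestamps if start < ts <= now)


now = datetime(2024, 1, 1, 12, 0, 0)
# Invented error timestamps: six inside the last hour, one 75 minutes old.
error_times = [now - timedelta(minutes=m) for m in (2, 10, 25, 40, 55, 59, 75)]

errors_last_hour = count_over_time(error_times, now)
print(errors_last_hour, errors_last_hour > 5)  # 6 True
```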
    

Context Alerts

  1. High Context Usage: >80% context window

    {job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80
    
  2. Auto Compaction Triggered: Context full

    {job="claude_code_enhanced", event_type="context_compact", trigger="auto"}
    

Subagent Alerts

  1. Excessive Subagent Spawning: >10 subagents/session

    {job="claude_code_enhanced", event_type="session_end"} | json | subagents_spawned > 10
    

Activity Alerts

  1. Telemetry Staleness: No data for >10 minutes

    absent_over_time({job="claude_code_enhanced"} [10m])
    
  2. Unusual Activity Spike: >100 tool calls/hour

    count_over_time({job="claude_code_enhanced", event_type="tool_call"} [1h]) > 100
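
The staleness template fires when nothing has arrived for ten minutes. As a minimal sketch of that condition outside Loki, assuming the caller tracks the timestamp of the most recent event (the helper name `is_stale` is hypothetical):

```python
from datetime import datetime, timedelta


def is_stale(last_event_at, now, max_gap=timedelta(minutes=10)):
    """True when no event was seen within max_gap, or no event was ever seen."""
    return last_event_at is None or (now - last_event_at) > max_gap


now = datetime(2024, 1, 1, 12, 0)
print(is_stale(now - timedelta(minutes=3), now))   # False
print(is_stale(now - timedelta(minutes=11), now))  # True
print(is_stale(None, now))                         # True
```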
    

Prompt Pattern Alerts

  1. Debugging Session Spike: Many debugging prompts

    count_over_time({job="claude_code_enhanced", event_type="user_prompt", pattern="debugging"} [1h]) > 10
    

Example Alert Configurations

Create High Error Rate Alert

create-alert \
  --name "High Error Rate" \
  --query 'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5' \
  --severity warning \
  --notification slack

Create Context Usage Alert

create-alert \
  --name "High Context Usage" \
  --query '{job="claude_code_enhanced", event_type="context_utilization"} | json | context_percentage > 80' \
  --severity info \
  --notification email

Create Session Duration Alert

create-alert \
  --name "Long Session Warning" \
  --query '{job="claude_code_enhanced", event_type="session_end"} | json | duration_seconds > 3600' \
  --severity info \
  --notification dashboard
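
When these invocations are generated from a script, quoting the LogQL query is the error-prone part. The sketch below builds the same argument list in Python and lets `shlex.join` handle the quoting; the helper `create_alert_command` is hypothetical, and only the flags shown in the examples above are used.

```python
import shlex


def create_alert_command(name, query, severity, notification):
    """Build the create-alert invocation shown above as an argv list."""
    return [
        "create-alert",
        "--name", name,
        "--query", query,
        "--severity", severity,
        "--notification", notification,
    ]


argv = create_alert_command(
    name="High Error Rate",
    query='count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5',
    severity="warning",
    notification="slack",
)

# shlex.join produces a single shell-safe command line from the argv list.
print(shlex.join(argv))
```

Passing `argv` directly to `subprocess.run` avoids the shell entirely, which sidesteps the quoting question altogether.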

Grafana Alert Setup

Via Grafana UI

  1. Navigate to Alerting → Alert rules
  2. Create new rule with Loki data source
  3. Enter LogQL query from templates above
  4. Configure conditions and notifications

Via API

curl -X POST http://localhost:3000/api/ruler/grafana/api/v1/rules/claude-code \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "claude-code-alerts",
    "rules": [
      {
        "alert": "HighErrorRate",
        "expr": "count_over_time({job=\"claude_code_enhanced\", event_type=\"tool_result\", status=\"error\"} [1h]) > 5",
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": "High error rate detected"}
      }
    ]
  }'
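
The same request can be assembled in Python, which avoids hand-escaping the quotes inside the JSON body. This is a sketch under assumptions: the URL, credentials, and payload shape are copied from the curl example, the query is the High Error Rate template, and whether a given Grafana version accepts this payload is not verified here. The request is only built, not sent.

```python
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # host from the curl example
NAMESPACE = "claude-code"              # namespace segment from the curl example


def build_rule_request(expr, user="admin", password="admin"):
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "name": "claude-code-alerts",
        "rules": [
            {
                "alert": "HighErrorRate",
                "expr": expr,
                "for": "5m",
                "labels": {"severity": "warning"},
                "annotations": {"summary": "High error rate detected"},
            }
        ],
    }
    # Equivalent of curl's -u user:password (HTTP basic auth).
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/ruler/grafana/api/v1/rules/{NAMESPACE}",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = build_rule_request(
    'count_over_time({job="claude_code_enhanced", event_type="tool_result", status="error"} [1h]) > 5'
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

Letting `json.dumps` serialize the payload means the LogQL expression can be written with ordinary quotes.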

Notification Channels

  • Slack: Webhook integration
  • Email: SMTP configuration
  • PagerDuty: Incident management
  • Dashboard: On-screen annotations

Alert Severity Levels

| Level | Use Case |
|-------|----------|
| critical | Immediate action required |
| warning | Needs attention soon |
| info | Informational, no action needed |

Scripts

  • scripts/create-alert.sh - Create new alert
  • scripts/list-alerts.sh - List all alerts
  • scripts/test-alerts.sh - Test alert conditions
  • scripts/import-alert-templates.sh - Import all pre-built templates