SLO Alerting

Define SLIs, set SLO targets, alert on burn rate (not raw error rate).

Concepts

| Term | Definition | Example |
|------|------------|---------|
| SLI | Quantitative measure | % successful requests |
| SLO | Target for SLI | 99.9% success |
| Error Budget | Allowed failure | 0.1% = 43 min/month |
| Burn Rate | Budget consumption speed | 10x = exhausted in 3 days |
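The arithmetic behind the Error Budget and Burn Rate rows can be sketched as follows. This is a minimal illustration that assumes a 30-day SLO window; the function names `error_budget_minutes` and `burn_rate` are hypothetical and chosen only for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total failure the SLO allows over the window (assumed 30 days)."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_ratio / (1.0 - slo_target)


# 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
print(f"{error_budget_minutes(0.999):.1f} min")

# A sustained 1% error ratio against a 99.9% SLO burns budget about 10x too fast.
print(f"{burn_rate(0.01, 0.999):.1f}x")
```

A burn rate of 1x means the budget lasts exactly the full window; anything above 1x exhausts it early.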

Common SLIs

Availability: successful_requests / total_requests
Latency:      requests_under_threshold / total_requests
Error Rate:   error_requests / total_requests
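As an illustration, the three ratios above could be computed from raw request counters like this. The `RequestCounts` type and its field names are assumptions made for the sketch, as is the choice to report a perfect SLI when there was no traffic.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters collected over one measurement window."""
    total: int
    successful: int
    under_threshold: int  # requests that finished under the latency threshold


def availability(c: RequestCounts) -> float:
    # Assumption: no traffic counts as fully available.
    return c.successful / c.total if c.total else 1.0


def latency_sli(c: RequestCounts) -> float:
    return c.under_threshold / c.total if c.total else 1.0


def error_rate(c: RequestCounts) -> float:
    return (c.total - c.successful) / c.total if c.total else 0.0


counts = RequestCounts(total=10_000, successful=9_990, under_threshold=9_500)
print(availability(counts), latency_sli(counts), error_rate(counts))
```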

Burn Rate Alerting

Alert on how fast you're consuming budget, not raw error rate:

| Alert Level | Burn Rate | Time to Exhaust (30-day budget) |
|-------------|-----------|---------------------------------|
| Page (critical) | 14.4x | ~2 days |
| Page (warning) | 6x | 5 days |
| Ticket (medium) | 3x | 10 days |
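The "Time to Exhaust" column follows directly from the burn rate, assuming a 30-day SLO window. The sketch below reproduces it and also shows what share of the budget each burn rate consumes within a given number of hours; the function names are hypothetical.

```python
SLO_WINDOW_DAYS = 30  # assumption: 30-day SLO window


def days_to_exhaust(burn_rate: float) -> float:
    """Days until the whole budget is gone if the burn rate is sustained."""
    return SLO_WINDOW_DAYS / burn_rate


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the window's budget spent after `hours` at this burn rate."""
    return burn_rate * hours / (SLO_WINDOW_DAYS * 24)


for level, rate in [("critical", 14.4), ("warning", 6.0), ("ticket", 3.0)]:
    print(f"{level}: {rate}x -> {days_to_exhaust(rate):.1f} days, "
          f"{budget_consumed(rate, 1):.1%} of budget per hour")
```

Under this assumption, 14.4x sustained for one hour spends about 2% of the monthly budget, which is why it is a reasonable paging threshold.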

Multi-Window Strategy

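Pair a long window with a short one to balance speed and noise. The long window confirms the burn is significant; the short window confirms it is still happening, so the alert stops firing soon after recovery: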

# Critical: fast burn (14.4x over both the 1h and 5m windows)
# rate_<window> is the error ratio over that window; budget is 1 - SLO target (e.g. 0.001)
- alert: HighBurnRate_Critical
  expr: (rate_1h / budget > 14.4) and (rate_5m / budget > 14.4)
  labels:
    severity: critical

# Warning: slower burn (6x over both the 6h and 30m windows)
- alert: HighBurnRate_Warning
  expr: (rate_6h / budget > 6) and (rate_30m / budget > 6)
  labels:
    severity: warning
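The same two-window logic can be expressed outside the alerting system, for example to unit-test thresholds. This is a sketch only: `classify_burn` and its `ratios` dictionary keyed by window name are hypothetical, and the error ratios are assumed to be computed elsewhere.

```python
from typing import Optional

# (severity, long window, short window, burn-rate threshold), checked in order
RULES = [
    ("critical", "1h", "5m", 14.4),
    ("warning", "6h", "30m", 6.0),
]


def classify_burn(ratios: dict[str, float], budget: float) -> Optional[str]:
    """Return the first severity whose long AND short windows both exceed the threshold."""
    for severity, long_w, short_w, threshold in RULES:
        if ratios[long_w] / budget > threshold and ratios[short_w] / budget > threshold:
            return severity
    return None


budget = 0.001  # 99.9% SLO

# Ongoing incident: both fast windows are burning at 20x.
ongoing = {"1h": 0.02, "5m": 0.02, "6h": 0.01, "30m": 0.02}
print(classify_burn(ongoing, budget))

# Recovered: the long windows still look bad, but the short windows are clean.
recovered = {"1h": 0.02, "5m": 0.0002, "6h": 0.01, "30m": 0.0002}
print(classify_burn(recovered, budget))
```

Requiring both windows is what lets the recovered case fall silent even though the 1h average is still elevated.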

Dashboard Essentials

Current burn rate
Error budget remaining (%)
Time until exhaustion at current rate
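The second and third panels can be derived from request counters and the current burn rate. The sketch below assumes a request-based budget and a 30-day window; both function names are hypothetical.

```python
import math


def budget_remaining_pct(total_requests: int, failed_requests: int,
                         slo_target: float) -> float:
    """Percent of the error budget left, based on requests seen so far in the window."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 100.0  # assumption: no traffic means nothing has been spent
    return max(0.0, 100.0 * (1.0 - failed_requests / allowed_failures))


def hours_until_exhaustion(remaining_fraction: float, burn_rate: float,
                           window_days: int = 30) -> float:
    """Hours until the remaining budget is gone at the current burn rate."""
    if burn_rate <= 0:
        return math.inf  # not burning budget at all
    return remaining_fraction * window_days * 24 / burn_rate


# 400 failures out of 1,000,000 requests against a 99.9% SLO
print(f"{budget_remaining_pct(1_000_000, 400, 0.999):.1f}% remaining")

# 60% of the budget left while burning at 6x
print(f"{hours_until_exhaustion(0.60, 6.0):.1f} h until exhaustion")
```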

Anti-Patterns

Too many SLOs → Define one SLO per user journey, not one per endpoint
Alerting on raw error rate → Noisy, and ignores how much budget remains
No budget visualization → Teams cannot see burn rate or remaining budget

References

references/methodology/sli-slo-framework.md
