SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Do not use this skill when
- The task is unrelated to slo implementation
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open
resources/implementation-playbook.md.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
Use this skill when
- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals
SLI/SLO/SLA Hierarchy
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
Defining SLIs
Common SLI Types
1. Availability SLI
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
2. Latency SLI
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
3. Durability SLI
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
Reference: See references/slo-definitions.md
Setting SLO Targets
Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year | |-------|----------------|---------------| | 99% | 7.2 hours | 3.65 days | | 99.9% | 43.2 minutes | 8.76 hours | | 99.95%| 21.6 minutes | 4.38 hours | | 99.99%| 4.32 minutes | 52.56 minutes |
Choose Appropriate SLOs
Consider:
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
Example SLOs:
slos:
- name: api_availability
target: 99.9
window: 28d
sli: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
- name: api_latency_p95
target: 99
window: 28d
sli: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
Error Budget Calculation
Error Budget Formula
Error Budget = 1 - SLO Target
Example:
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
Error Budget Policy
error_budget_policy:
- remaining_budget: 100%
action: Normal development velocity
- remaining_budget: 50%
action: Consider postponing risky changes
- remaining_budget: 10%
action: Freeze non-critical changes
- remaining_budget: 0%
action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
SLO Implementation
Prometheus Recording Rules
# SLI Recording Rules
groups:
- name: sli_rules
interval: 30s
rules:
# Availability SLI
- record: sli:http_availability:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency SLI (requests < 500ms)
- record: sli:http_latency:ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
- name: slo_rules
interval: 5m
rules:
# SLO compliance (1 = meeting SLO, 0 = violating)
- record: slo:http_availability:compliance
expr: sli:http_availability:ratio >= bool 0.999
- record: slo:http_latency:compliance
expr: sli:http_latency:ratio >= bool 0.99
# Error budget remaining (percentage)
- record: slo:http_availability:error_budget_remaining
expr: |
(sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
# Error budget burn rate
- record: slo:http_availability:burn_rate_5m
expr: |
(1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)) / (1 - 0.999)
SLO Alerting Rules
groups:
- name: slo_alerts
interval: 1m
rules:
# Fast burn: 14.4x rate, 1 hour window
# Consumes 2% error budget in 1 hour
- alert: SLOErrorBudgetBurnFast
expr: |
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Slow burn: 6x rate, 6 hour window
# Consumes 5% error budget in 6 hours
- alert: SLOErrorBudgetBurnSlow
expr: |
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
for: 15m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
description: "Error budget burning at {{ $value }}x rate"
# Error budget exhausted
- alert: SLOErrorBudgetExhausted
expr: slo:http_availability:error_budget_remaining < 0
for: 5m
labels:
severity: critical
annotations:
summary: "SLO error budget exhausted"
description: "Error budget remaining: {{ $value }}%"
SLO Dashboard
Grafana Dashboard Structure:
┌────────────────────────────────────┐
│ SLO Compliance (Current) │
│ ✓ 99.95% (Target: 99.9%) │
├────────────────────────────────────┤
│ Error Budget Remaining: 65% │
│ ████████░░ 65% │
├────────────────────────────────────┤
│ SLI Trend (28 days) │
│ [Time series graph] │
├────────────────────────────────────┤
│ Burn Rate Analysis │
│ [Burn rate by time window] │
└────────────────────────────────────┘
Example Queries:
# Current SLO compliance
sli:http_availability:ratio * 100
# Error budget remaining
slo:http_availability:error_budget_remaining
# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)
Multi-Window Burn Rate Alerts
# Combination of short and long windows reduces false positives
rules:
- alert: SLOBurnRateHigh
expr: |
(
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
)
or
(
slo:http_availability:burn_rate_6h > 6
and
slo:http_availability:burn_rate_30m > 6
)
labels:
severity: critical
SLO Review Process
Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
Best Practices
- Start with user-facing services
- Use multiple SLIs (availability, latency, etc.)
- Set achievable SLOs (don't aim for 100%)
- Implement multi-window alerts to reduce noise
- Track error budget consistently
- Review SLOs regularly
- Document SLO decisions
- Align with business goals
- Automate SLO reporting
- Use SLOs for prioritization
Reference Files
assets/slo-template.md- SLO definition templatereferences/slo-definitions.md- SLO definition patternsreferences/error-budget.md- Error budget calculations
Related Skills
prometheus-configuration- For metric collectiongrafana-dashboards- For SLO visualization