# Runbook Creator

Templates and best practices for creating effective operational runbooks.

## When to Use This Skill

- Creating runbooks for new services
- Documenting incident response procedures
- Writing operational playbooks
- Standardizing on-call documentation
- Automating common procedures

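## Runbook Principles

- **Actionable**: Every step is a command to run or a concrete action to take
- **Testable**: Every step has been run and verified to work
- **Current**: Updated whenever the underlying systems change
- **Accessible**: Reachable during incidents (not only from behind a VPN)
- **Linked**: Referenced from the alerts it addresses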

## Standard Runbook Template

Copy and customize this template:

````markdown
# [Service Name] - [Issue Type]

## Overview
Brief description of what this runbook addresses.

**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]

## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3

## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]

## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]

## Diagnostic Steps

### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running
```

**Expected Output**: [What you should see]

**If Different**: [What to do]

### Step 2: [Gather Information]
```bash
# Command to run
kubectl logs deployment/my-service -n production --tail=100
```

**Look For**: [What to look for in output]

## Resolution Steps

### Option A: [Quick Fix - e.g., Restart]

**Use when**: [conditions]

```bash
# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production

# Step 2: Verify pods are coming up
kubectl get pods -n production -w
```

**Verification**: [How to confirm fix worked]

### Option B: [Rollback]

**Use when**: [conditions]

```bash
# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production

# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production
```

**Verification**: [How to confirm fix worked]

## Verification

How to confirm the issue is resolved:
- [ ] Error rate returned to normal
- [ ] Latency within SLO
- [ ] No related alerts firing
- [ ] User-facing functionality working

## Escalation

If this runbook doesn't resolve the issue:

1. **First**: Contact [Team/Person] via [Slack/Phone]
2. **Then**: Page [Escalation contact]
3. **Finally**: [Further escalation path]

## Related Resources

## Revision History

| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | Name | Initial version |
````
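The template's `**Last Updated**` field makes the "Current" principle checkable by a script. The following is a minimal Python sketch, assuming runbooks are Markdown files that keep the template's `**Last Updated**: YYYY-MM-DD` line. The function names and the 180-day threshold are illustrative assumptions, not part of the template.

```python
import re
from datetime import date
from typing import Optional

# Matches the template line: **Last Updated**: YYYY-MM-DD
LAST_UPDATED = re.compile(
    r"^\*\*Last Updated\*\*:\s*(\d{4})-(\d{2})-(\d{2})\s*$", re.MULTILINE
)


def runbook_age_days(text: str, today: date) -> Optional[int]:
    """Return days since the Last Updated date, or None if no valid date is filled in."""
    match = LAST_UPDATED.search(text)
    if match is None:
        return None  # field missing, or still the YYYY-MM-DD placeholder
    year, month, day = (int(part) for part in match.groups())
    try:
        updated = date(year, month, day)
    except ValueError:
        return None  # digits present but not a real calendar date
    return (today - updated).days


def is_stale(text: str, today: date, max_age_days: int = 180) -> bool:
    """Treat a runbook as stale if it has no date or is older than max_age_days.

    The 180-day default is an assumed threshold; pick one that fits your review cadence.
    """
    age = runbook_age_days(text, today)
    return age is None or age > max_age_days
```

A check like this could run in CI over the runbook directory, so that stale runbooks are flagged before an incident rather than during one.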


## Quick Runbook Templates

### Service Restart

````markdown
# [Service] - Restart Procedure

## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change

## Steps

1. **Notify team**

   Post in #incidents: "Restarting [service] due to [reason]"

2. **Restart service**
   ```bash
   kubectl rollout restart deployment/[service] -n [namespace]
   ```

3. **Monitor rollout**
   ```bash
   kubectl rollout status deployment/[service] -n [namespace]
   ```

4. **Verify health**
   ```bash
   kubectl get pods -n [namespace] | grep [service]
   # All pods should be Running, 1/1 Ready
   ```

5. **Check metrics**
   - Error rate: [dashboard link]
   - Latency: [dashboard link]

## Rollback

If restart makes things worse:
```bash
kubectl rollout undo deployment/[service] -n [namespace]
```
````


### Database Failover

````markdown
# [Database] - Failover Procedure

## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors

## Prerequisites
- Database admin access
- Verify replica is in sync

## Pre-Failover Checks

1. **Check replication status** (run on the primary)
   ```sql
   SELECT * FROM pg_stat_replication;
   ```
   Verify: `state = 'streaming'` and `replay_lag` is near zero

2. **Check replica health**
   ```bash
   pg_isready -h replica-host -p 5432
   ```

## Failover Steps

1. **Stop writes to primary** (if possible)
   ```sql
   -- Run on the primary; applies to new transactions after the reload
   ALTER SYSTEM SET default_transaction_read_only = on;
   SELECT pg_reload_conf();
   ```

2. **Promote replica**
   ```bash
   # Run on the replica host
   pg_ctl promote -D /var/lib/postgresql/data
   ```

3. **Update connection strings**
   - Update DNS/load balancer to point to new primary
   - Or update application config

4. **Verify applications reconnected** (run on the new primary)
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
   ```

## Post-Failover
- [ ] Monitor error rates
- [ ] Set up new replica from old primary
- [ ] Update documentation
````
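The pre-failover check "state = 'streaming', lag is minimal" becomes easier to act on under pressure when "minimal" is an explicit threshold. The Python sketch below assumes the rows from `pg_stat_replication` have already been fetched into dictionaries with `application_name`, `state`, and a `lag_bytes` value (for example computed in SQL with `pg_wal_lsn_diff(sent_lsn, replay_lsn)`). The function name, the dictionary shape, and the 16 MiB default are illustrative assumptions to tune per system.

```python
from typing import Any, Iterable, Mapping


def replica_ready_for_promotion(
    rows: Iterable[Mapping[str, Any]],
    application_name: str,
    max_lag_bytes: int = 16 * 1024 * 1024,  # assumed threshold, not a PostgreSQL default
) -> bool:
    """Return True only if the named replica is streaming and within the lag budget.

    `rows` is assumed to hold one mapping per row of pg_stat_replication,
    with an extra `lag_bytes` column computed by the caller.
    """
    for row in rows:
        if row.get("application_name") != application_name:
            continue
        lag = row.get("lag_bytes")
        # A NULL lag means the replica has not reported a position yet.
        return bool(
            row.get("state") == "streaming"
            and lag is not None
            and lag <= max_lag_bytes
        )
    # The replica is not connected to the primary at all.
    return False
```

Returning `False` when the replica is missing from the view is deliberate: an absent replica is the least safe promotion target.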


### Cache Clear

````markdown
# [Service] - Cache Clear Procedure

## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration

## Impact Assessment
- Cache clear will cause temporary latency spike
- Database load will increase temporarily

## Steps

1. **Notify team**

   Post in #incidents: "Clearing [cache] cache due to [reason]"

2. **Clear cache**

   **Redis - All keys**:
   ```bash
   # Removes every key in every database on this Redis instance
   redis-cli -h [host] FLUSHALL
   ```

   **Redis - Specific pattern**:
   ```bash
   # Pass -h to both invocations so DEL runs against the same host as the scan
   redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL
   ```

   **Application cache**:
   ```bash
   curl -X POST http://[service]/admin/cache/clear
   ```

3. **Monitor**
   - Watch cache hit rate recover
   - Monitor database load
   - Check latency

## Verification
- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing
````
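When the pattern-based clear needs to run from application code or a maintenance script rather than a shell pipeline, the same scan-then-delete approach can be batched. This is a sketch in Python, assuming a client object that exposes `scan_iter(match=..., count=...)` and `delete(*names)`, as redis-py's `redis.Redis` does. The function name `delete_by_pattern` and the batch size are illustrative assumptions.

```python
from typing import Any, Iterable, List, Protocol


class ScanDeleteClient(Protocol):
    """The two methods this sketch relies on; redis-py's redis.Redis provides both."""

    def scan_iter(self, match: str, count: int) -> Iterable[Any]: ...

    def delete(self, *names: Any) -> int: ...


def delete_by_pattern(
    client: ScanDeleteClient, pattern: str, batch_size: int = 500
) -> int:
    """Delete keys matching `pattern` in batches and return how many were removed.

    Scanning incrementally avoids blocking the server the way KEYS would,
    and batching keeps each DEL call to a bounded size.
    """
    deleted = 0
    batch: List[Any] = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        # Flush the final partial batch.
        deleted += client.delete(*batch)
    return deleted
```

With redis-py installed, a call would look roughly like `delete_by_pattern(redis.Redis(host="[host]"), "user:*")`. The returned count is worth posting in the incident channel alongside the notification.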


## Runbook Checklist

Before publishing a runbook, verify:

Runbook Quality Checklist:

- [ ] Title clearly describes the issue/procedure
- [ ] Symptoms section helps identify when to use
- [ ] All commands are copy-pasteable
- [ ] Expected output documented for each command
- [ ] Verification steps confirm success
- [ ] Escalation path is clear
- [ ] Links to dashboards work
- [ ] Tested by someone other than author
- [ ] Linked from relevant alerts
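Some of these checks are structural and can run automatically before a human review. The sketch below is a rough Python example, assuming runbooks follow the `##` section headings of the Standard Runbook Template. The required-section list and the function name are assumptions for illustration, and items such as "tested by someone other than author" still need a person.

```python
import re
from typing import List

# Section headings taken from the Standard Runbook Template; adjust to your own template.
REQUIRED_SECTIONS = [
    "Overview",
    "Symptoms",
    "Impact",
    "Prerequisites",
    "Diagnostic Steps",
    "Resolution Steps",
    "Verification",
    "Escalation",
]

# Matches level-2 headings only ("## Title"), not "#" or "###".
HEADING = re.compile(r"^##\s+(.+?)\s*$", re.MULTILINE)


def missing_sections(text: str) -> List[str]:
    """Return required sections that do not appear as a level-2 heading, in template order."""
    present = {match.group(1) for match in HEADING.finditer(text)}
    return [section for section in REQUIRED_SECTIONS if section not in present]
```

An empty list means the structure is complete. It does not mean the content is good, so this complements the checklist rather than replacing it.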


## Automation Integration

### Runbook with Automation Hooks

````markdown
# [Service] - Automated Recovery

## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)

## Manual Steps (if auto-recovery fails)

### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
```

### Manual intervention
[Steps here]
````


### Script-Backed Runbook

````markdown
# [Service] - Diagnostic Script

## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]
```

This script checks:
- Pod status
- Recent logs
- Resource usage
- Dependency health

## Interpreting Results

| Result | Meaning | Action |
|--------|---------|--------|
| HEALTHY | All checks pass | No action needed |
| DEGRADED | Some issues | Follow specific recommendations |
| CRITICAL | Major issues | Escalate immediately |
````
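How a diagnostic script arrives at one of the three results in the table is left to whoever writes it. One possible aggregation rule is sketched below in Python, under the assumption that each individual check reports `ok`, `warn`, or `fail`. The check names, the per-check statuses, and the rule itself are hypothetical and say nothing about how `diagnose-service.sh` actually works.

```python
from typing import Mapping

VALID_STATUSES = {"ok", "warn", "fail"}


def overall_result(checks: Mapping[str, str]) -> str:
    """Collapse per-check statuses into HEALTHY, DEGRADED, or CRITICAL.

    Assumed rule: any failing check is CRITICAL, any warning is DEGRADED,
    otherwise HEALTHY. An unknown status is an error, not a silent pass.
    """
    statuses = set(checks.values())
    unknown = statuses - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown check status: {sorted(unknown)}")
    if "fail" in statuses:
        return "CRITICAL"
    if "warn" in statuses:
        return "DEGRADED"
    return "HEALTHY"


if __name__ == "__main__":
    # Hypothetical check names mirroring the list in the template.
    example = {
        "pod_status": "ok",
        "recent_logs": "warn",
        "resource_usage": "ok",
        "dependency_health": "ok",
    }
    print(overall_result(example))
```

Taking the worst status as the overall result keeps the table's "Action" column unambiguous: a single failing dependency is enough to escalate.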


## Common Runbook Categories

Every service should have runbooks for:

Essential Runbooks:

- [ ] Service restart
- [ ] Rollback deployment
- [ ] Scale up/down
- [ ] Clear cache
- [ ] Database failover (if applicable)
- [ ] Dependency failure response
- [ ] High error rate investigation
- [ ] High latency investigation


## Additional Resources

- [Example Runbooks](references/example-runbooks.md)
- [Runbook Automation](references/automation.md)
