Agent Skills: Runbook Creator

Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks, writing operational documentation, playbook creation, or documenting procedures for on-call teams.

Install this agent skill locally:

pnpm dlx add-skill https://github.com/nik-kale/sre-skills/tree/HEAD/skills/runbook-creator

skills/runbook-creator/SKILL.md

# Runbook Creator

Templates and best practices for creating effective operational runbooks.

## When to Use This Skill

  • Creating runbooks for new services
  • Documenting incident response procedures
  • Writing operational playbooks
  • Standardizing on-call documentation
  • Automating common procedures

## Runbook Principles

  1. Actionable: Every step should be executable
  2. Testable: Verify each step works
  3. Current: Update when systems change
  4. Accessible: Available during incidents (not behind VPN-only)
  5. Linked: Referenced from alerts
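The "Current" principle can be checked mechanically, at least in part. The following is a minimal sketch, assuming each runbook carries a `**Last Updated**: YYYY-MM-DD` line as in the standard template in this document. The function names `last_updated` and `is_stale` are hypothetical, and the cutoff date is supplied by the caller.

```bash
# Print the date from the first "**Last Updated**: YYYY-MM-DD" line of a runbook.
last_updated() {
  sed -n 's/^\*\*Last Updated\*\*: *\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$1" | head -n 1
}

# is_stale <date> <cutoff>
# Succeeds if <date> is strictly older than <cutoff>, or if <date> is empty.
# ISO dates compare correctly as integers once the dashes are removed.
is_stale() {
  d=$(printf '%s' "$1" | sed 's/-//g')
  c=$(printf '%s' "$2" | sed 's/-//g')
  [ -z "$d" ] || [ "$d" -lt "$c" ]
}

# Example (assumes GNU date for the "90 days ago" arithmetic):
#   cutoff=$(date -d '90 days ago' +%F)
#   is_stale "$(last_updated runbook.md)" "$cutoff" && echo "runbook.md needs review"
```

A runbook with no parseable date is treated as stale, so a missing field shows up in the same review as an old one.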

## Standard Runbook Template

Copy and customize this template:

````markdown
# [Service Name] - [Issue Type]

## Overview
Brief description of what this runbook addresses.

**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]

## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3

## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]

## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]

## Diagnostic Steps

### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running
```

**Expected Output**: [What you should see]
**If Different**: [What to do]

### Step 2: [Gather Information]
```bash
# Command to run
kubectl logs deployment/my-service -n production --tail=100
```

**Look For**: [What to look for in output]

## Resolution Steps

### Option A: [Quick Fix - e.g., Restart]

**Use when**: [conditions]

```bash
# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production

# Step 2: Verify pods are coming up
kubectl get pods -n production -w
```

**Verification**: [How to confirm fix worked]

### Option B: [Rollback]

**Use when**: [conditions]

```bash
# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production

# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production
```

**Verification**: [How to confirm fix worked]

## Verification

How to confirm the issue is resolved:

- [ ] Error rate returned to normal
- [ ] Latency within SLO
- [ ] No related alerts firing
- [ ] User-facing functionality working

## Escalation

If this runbook doesn't resolve the issue:

1. **First**: Contact [Team/Person] via [Slack/Phone]
2. **Then**: Page [Escalation contact]
3. **Finally**: [Further escalation path]

## Related Resources

## Revision History

| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | Name | Initial version |
````


## Quick Runbook Templates

### Service Restart

````markdown
# [Service] - Restart Procedure

## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change

## Steps

1. **Notify team**

   Post in #incidents: "Restarting [service] due to [reason]"

2. **Restart service**
   ```bash
   kubectl rollout restart deployment/[service] -n [namespace]
   ```

3. **Monitor rollout**
   ```bash
   kubectl rollout status deployment/[service] -n [namespace]
   ```

4. **Verify health**
   ```bash
   kubectl get pods -n [namespace] | grep [service]
   # All pods should be Running, 1/1 Ready
   ```

5. **Check metrics**
   - Error rate: [dashboard link]
   - Latency: [dashboard link]

## Rollback

If restart makes things worse:

```bash
kubectl rollout undo deployment/[service] -n [namespace]
```
````
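The restart, monitor, and rollback steps above can be chained so that a stalled rollout is undone without waiting for a human to notice. This is a minimal sketch, not part of the skill itself: the function name `restart_with_rollback`, the `KUBECTL` override variable, and the 120s default timeout are all assumptions made for illustration.

```bash
# KUBECTL can be overridden (for example with a stub) when testing.
KUBECTL="${KUBECTL:-kubectl}"

# restart_with_rollback <deployment> <namespace> [timeout]
# Restarts the deployment, waits for the rollout to finish,
# and runs "rollout undo" if it does not complete in time.
restart_with_rollback() {
  deploy="$1"; ns="$2"; timeout="${3:-120s}"

  $KUBECTL rollout restart "deployment/$deploy" -n "$ns" || return 1

  if $KUBECTL rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "OK: $deploy rolled out"
    return 0
  fi

  echo "FAILED: rollout did not complete within $timeout; rolling back" >&2
  $KUBECTL rollout undo "deployment/$deploy" -n "$ns"
  return 1
}

# Example:
#   restart_with_rollback my-service production 180s
```

The function returns non-zero whenever a rollback was attempted, so a calling script or pager hook can still flag the incident for follow-up.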

### Database Failover

````markdown
# [Database] - Failover Procedure

## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors

## Prerequisites
- Database admin access
- Verify replica is in sync

## Pre-Failover Checks

1. **Check replication status** (run on the primary)
   ```sql
   SELECT * FROM pg_stat_replication;
   ```
   Verify: state = 'streaming', lag is minimal

2. **Check replica health**
   ```bash
   pg_isready -h replica-host -p 5432
   ```

## Failover Steps

1. **Stop writes to primary** (if possible)
   ```sql
   ALTER SYSTEM SET default_transaction_read_only = on;
   SELECT pg_reload_conf();
   ```

2. **Promote replica** (run on the replica host)
   ```bash
   pg_ctl promote -D /var/lib/postgresql/data
   ```

3. **Update connection strings**
   - Update DNS/load balancer to point to new primary
   - Or update application config

4. **Verify applications reconnected** (run on the new primary)
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
   ```

## Post-Failover

- [ ] Monitor error rates
- [ ] Set up new replica from old primary
- [ ] Update documentation
````
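"Lag is minimal" in the pre-failover check is easier to act on with a number attached. The sketch below assumes PostgreSQL 10 or later (for `pg_current_wal_lsn()` and the `replay_lsn` column) and connection settings supplied through the usual `PG*` environment variables. The function name `replica_ready`, the `PSQL` override, and the 1 MiB default threshold are hypothetical choices, not values from the source.

```bash
# PSQL can be overridden (for example with a stub) when testing.
PSQL="${PSQL:-psql}"
MAX_LAG_BYTES="${MAX_LAG_BYTES:-1048576}"   # assumed threshold: 1 MiB

# replica_ready <primary-host>
# Succeeds only if at least one replica is attached and every replica
# is in state "streaming" with replay lag at or below MAX_LAG_BYTES.
replica_ready() {
  rows=$($PSQL -h "$1" -At -c "SELECT state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;") || return 1
  if [ -z "$rows" ]; then
    echo "no replicas attached" >&2
    return 1
  fi
  # psql -At prints one "state|lag_bytes" row per replica.
  printf '%s\n' "$rows" | awk -F'|' -v max="$MAX_LAG_BYTES" '
    $1 != "streaming" || $2 + 0 > max { bad = 1; print "not ready: " $0 }
    END { exit bad }'
}

# Example:
#   replica_ready primary-host && echo "safe to proceed with failover"
```

The right threshold depends on write volume and how much data loss is tolerable, so treat the default as a placeholder to tune per database.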

### Cache Clear

````markdown
# [Service] - Cache Clear Procedure

## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration

## Impact Assessment
- Cache clear will cause temporary latency spike
- Database load will increase temporarily

## Steps

1. **Notify team**

   Post in #incidents: "Clearing [cache] cache due to [reason]"

2. **Clear cache**

   **Redis - All keys**:
   ```bash
   redis-cli -h [host] FLUSHALL
   ```

   **Redis - Specific pattern**:
   ```bash
   redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL
   ```

   **Application cache**:
   ```bash
   curl -X POST http://[service]/admin/cache/clear
   ```

3. **Monitor**
   - Watch cache hit rate recover
   - Monitor database load
   - Check latency

## Verification

- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing
````
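Because a pattern delete is hard to undo, it can help to see how many keys match before removing anything. The following is a sketch under stated assumptions: `clear_keys` and the `REDIS_CLI` override are hypothetical names, and `UNLINK` (a non-blocking variant of `DEL`) requires Redis 4.0 or later.

```bash
# REDIS_CLI can be overridden (for example with a stub) when testing.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# clear_keys <host> <pattern> [apply]
# Without "apply" as the third argument this is a dry run that only counts matches.
clear_keys() {
  host="$1"; pattern="$2"; mode="${3:-dry-run}"

  keys=$($REDIS_CLI -h "$host" --scan --pattern "$pattern") || return 1
  if [ -z "$keys" ]; then
    echo "0 keys match $pattern"
    return 0
  fi
  count=$(printf '%s\n' "$keys" | wc -l | tr -d ' ')

  if [ "$mode" != "apply" ]; then
    echo "dry run: $count keys match $pattern"
    return 0
  fi

  # One UNLINK per key: slower than batching, but safe for keys containing spaces.
  printf '%s\n' "$keys" | while IFS= read -r key; do
    $REDIS_CLI -h "$host" UNLINK "$key" >/dev/null || exit 1
  done || return 1
  echo "deleted $count keys matching $pattern"
}

# Example:
#   clear_keys cache-host 'user:*'         # report only
#   clear_keys cache-host 'user:*' apply   # actually delete
```

The sketch holds the full key list in memory, which is fine for targeted patterns but not for patterns that match most of a large keyspace.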

## Runbook Checklist

Before publishing a runbook, verify:

Runbook Quality Checklist:

  • [ ] Title clearly describes the issue/procedure
  • [ ] Symptoms section helps identify when to use
  • [ ] All commands are copy-pasteable
  • [ ] Expected output documented for each command
  • [ ] Verification steps confirm success
  • [ ] Escalation path is clear
  • [ ] Links to dashboards work
  • [ ] Tested by someone other than author
  • [ ] Linked from relevant alerts
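Some of these checks are structural and can run in CI before a human review. The sketch below assumes runbooks follow the standard template in this document, so the section names and the `**Related Alerts**` field come from there. The function name `lint_runbook` is hypothetical.

```bash
# lint_runbook <file>
# Prints each required section or field that is missing.
# Returns non-zero if anything is missing.
lint_runbook() {
  file="$1"; missing=0

  for section in "Symptoms" "Diagnostic Steps" "Resolution Steps" "Verification" "Escalation"; do
    if ! grep -q "^## $section" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done

  # The field must exist and have a non-empty value.
  if ! grep -q '^\*\*Related Alerts\*\*: *[^ ]' "$file"; then
    echo "missing field: Related Alerts"
    missing=1
  fi

  return "$missing"
}

# Example:
#   for f in runbooks/*.md; do lint_runbook "$f" || echo "^^ in $f"; done
```

A lint like this only covers structure. Items such as "tested by someone other than author" and "links to dashboards work" still need a person or a separate link checker.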

## Automation Integration

### Runbook with Automation Hooks

````markdown
# [Service] - Automated Recovery

## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)

## Manual Steps (if auto-recovery fails)

### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
```

### Manual intervention

[Steps here]
````


### Script-Backed Runbook

````markdown
# [Service] - Diagnostic Script

## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]
```

This script checks:

- Pod status
- Recent logs
- Resource usage
- Dependency health

## Interpreting Results

| Result | Meaning | Action |
|--------|---------|--------|
| HEALTHY | All checks pass | No action needed |
| DEGRADED | Some issues | Follow specific recommendations |
| CRITICAL | Major issues | Escalate immediately |
````
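The source does not show the contents of `diagnose-service.sh`, so the following is only a sketch of how such a script might map individual checks onto the three result levels. The `classify` and `run_checks` functions, the `critical`/`warning` severity labels, and the example commands are all assumptions.

```bash
# classify <failed_checks> <critical_failures>
# Maps check results onto the HEALTHY / DEGRADED / CRITICAL levels.
classify() {
  if [ "$2" -gt 0 ]; then echo "CRITICAL"
  elif [ "$1" -gt 0 ]; then echo "DEGRADED"
  else echo "HEALTHY"
  fi
}

# run_checks reads lines of "<severity> <command...>" from stdin,
# runs each command, reports failures on stderr, and prints the overall result.
run_checks() {
  failed=0; critical=0
  while read -r severity cmd; do
    [ -n "$severity" ] || continue
    if ! sh -c "$cmd" </dev/null >/dev/null 2>&1; then
      failed=$((failed + 1))
      if [ "$severity" = "critical" ]; then
        critical=$((critical + 1))
      fi
      echo "FAIL [$severity] $cmd" >&2
    fi
  done
  classify "$failed" "$critical"
}

# Example check list (commands are illustrative):
#   run_checks <<'EOF'
#   critical kubectl rollout status deployment/my-service -n production --timeout=10s
#   warning  kubectl top pods -n production
#   EOF
```

Keeping the check list as data means a new check is one added line, and the result levels stay consistent across services.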


## Common Runbook Categories

Every service should have runbooks for:

Essential Runbooks:

  • [ ] Service restart
  • [ ] Rollback deployment
  • [ ] Scale up/down
  • [ ] Clear cache
  • [ ] Database failover (if applicable)
  • [ ] Dependency failure response
  • [ ] High error rate investigation
  • [ ] High latency investigation
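Coverage of this list can be checked per service if runbooks live in a predictable place. The sketch below assumes one directory per service with one Markdown file per runbook. The file names are hypothetical and should be adapted to local naming. Database failover is left out because it applies only to some services.

```bash
# missing_runbooks <dir>
# Prints the file name of each essential runbook that is absent from <dir>.
# Returns non-zero if any are missing.
missing_runbooks() {
  dir="$1"; rc=0
  for name in service-restart rollback-deployment scale clear-cache \
              dependency-failure high-error-rate high-latency; do
    if [ ! -f "$dir/$name.md" ]; then
      echo "$name.md"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   missing_runbooks runbooks/my-service || echo "my-service is missing runbooks"
```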

## Additional Resources

- [Example Runbooks](references/example-runbooks.md)
- [Runbook Automation](references/automation.md)