Anthropic Incident Runbook Skill

Anthropic Incident Runbook

Severity Classification

| Severity | Condition | Response Time | |----------|-----------|---------------| | P1 | API returning 500/529 for all requests | Immediate | | P2 | Rate limiting (429) or high latency (>10s p99) | 15 minutes | | P3 | Intermittent errors (<5% error rate) | 1 hour | | P4 | Degraded quality (not errors) | Next business day |

Immediate Triage (First 5 Minutes)

# 1. Check Anthropic status page
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d['status']['indicator'], '-', d['status']['description'])"

# 2. Test API connectivity
curl -s -w "\nHTTP %{http_code} | Time: %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}'

# 3. Check rate limit headers
curl -s -D - https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' \
  2>/dev/null | grep -i "ratelimit\|retry-after\|request-id"

Decision Tree

API returning errors?
├── 401/403 → Key issue → Check ANTHROPIC_API_KEY is set and valid
├── 429 → Rate limited → Check headers, reduce traffic, wait for retry-after
├── 500 → Server error → Check status.anthropic.com, retry with backoff
├── 529 → Overloaded → Temporary, retry after 30-60s
└── Timeouts → Network or long generation → Increase timeout, check max_tokens

Mitigation Actions

Rate Limiting (429)

# Immediate: reduce traffic
# 1. Enable circuit breaker
# 2. Queue non-critical requests
# 3. Switch to Message Batches for bulk work
# 4. Reduce max_tokens to shorten generation time

API Outage (500/529)

# Graceful degradation
def get_response_with_fallback(prompt: str) -> str:
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return msg.content[0].text
    except (anthropic.InternalServerError, anthropic.APIStatusError):
        return "Our AI assistant is temporarily unavailable. Please try again shortly."

Key Compromise

# 1. Immediately revoke key at console.anthropic.com
# 2. Generate new key
# 3. Deploy new key to all environments
# 4. Audit recent usage for unauthorized calls
# 5. File incident report

Postmortem Template

## Incident: [Title]
- **Duration:** [start] to [end]
- **Severity:** P[1-4]
- **Impact:** [what users experienced]
- **Root Cause:** [what went wrong]
- **Detection:** [how we found out]
- **Mitigation:** [what we did to fix it]
- **Request IDs:** [from debug logs]
- **Action Items:**
  - [ ] [preventive measure 1]
  - [ ] [preventive measure 2]

Error Handling

| Symptom | Likely Cause | Quick Fix | |---------|-------------|-----------| | All requests fail 401 | Key rotated/expired | Check Console for active keys | | Sudden 429 spike | Traffic burst or tier change | Check rate limit headers | | Slow responses (>10s) | Large max_tokens or complex prompt | Reduce max_tokens, use Haiku | | Intermittent 500s | Upstream API issue | Check status.anthropic.com |

Resources

Next Steps

For data compliance, see anth-data-handling.

Agent Skills: Anthropic Incident Runbook

Install this agent skill to your local

Skill Files