Agent Skills: Anthropic Incident Runbook


Uncategorized

ID: jeremylongshore/claude-code-plugins-plus-skills/clade-incident-runbook

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/claude-pack/skills/clade-incident-runbook


plugins/saas-packs/claude-pack/skills/clade-incident-runbook/SKILL.md

Skill Metadata

Name
clade-incident-runbook
Description
Respond to Anthropic API incidents: outages, degraded performance, …

Anthropic Incident Runbook

Overview

Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.

Step 1: Confirm the Issue

# Check Anthropic status
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"Status: {d['status']['description']} ({d['status']['indicator']})\")"

# Test API directly
curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'
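The two checks above produce a status indicator and an HTTP code, and it can help to combine them into a single yes/no answer. The following is a minimal sketch, not part of the original skill. It assumes the status page follows the Statuspage v2 convention for `indicator` values (`none`, `minor`, `major`, `critical`) and that curl's `000` (no response) is passed in as `0`. The names `interpretStatus`, `probeFailed`, and `incidentConfirmed` are hypothetical.

```typescript
// Fields read from the parsed body of status.json (only what the check above prints).
interface StatusPayload {
  status: { indicator: string; description: string };
}

type StatusVerdict = "operational" | "degraded" | "outage" | "unknown";

// Map the status page indicator onto a coarse verdict.
// Assumption: indicator values follow the Statuspage v2 convention.
function interpretStatus(payload: StatusPayload): StatusVerdict {
  switch (payload.status.indicator) {
    case "none":
      return "operational";
    case "minor":
      return "degraded";
    case "major":
    case "critical":
      return "outage";
    default:
      return "unknown";
  }
}

// Treat the direct API test as failed on no response (0), auth errors, or any 5xx.
// Assumption: these thresholds mirror the symptoms listed in this runbook.
function probeFailed(httpStatus: number): boolean {
  return httpStatus === 0 || httpStatus === 401 || httpStatus === 403 || httpStatus >= 500;
}

// An incident is confirmed if either signal reports trouble.
function incidentConfirmed(verdict: StatusVerdict, httpStatus: number): boolean {
  return probeFailed(httpStatus) || verdict === "degraded" || verdict === "outage";
}
```

Checking both signals matters because the status page can lag behind a real outage, and a failing probe can also point to a problem on your side (for example, a revoked key) while the status page stays green.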

Step 2: Classify Severity

| Symptom | Severity | Action |
|---------|----------|--------|
| 529 overloaded (intermittent) | Low | SDK auto-retries handle this |
| 529 overloaded (sustained 5+ min) | Medium | Switch to fallback model |
| 401/403 on all requests | High | API key issue; check console |
| All requests timing out | High | Check status page, activate fallback |
| Status page shows incident | Varies | Follow status page updates |
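The severity table can also be encoded as a small function so that alerting code applies the same rules every time. This is a sketch under stated assumptions: the `Symptoms` shape and the `classifySeverity` name are hypothetical, `httpStatus: null` stands for requests that time out, and the final catch-all row is an addition for inputs the table does not cover.

```typescript
type Severity = "Low" | "Medium" | "High" | "Varies";

// Hypothetical summary of what monitoring observed during the incident window.
interface Symptoms {
  httpStatus: number | null; // null means requests are timing out
  affectsAllRequests: boolean;
  sustainedMinutes: number;
  statusPageIncident: boolean;
}

interface Classification {
  severity: Severity;
  action: string;
}

// Apply the rows of the severity table, checking the High rows first.
function classifySeverity(s: Symptoms): Classification {
  if (s.httpStatus === null && s.affectsAllRequests) {
    return { severity: "High", action: "Check status page, activate fallback" };
  }
  if ((s.httpStatus === 401 || s.httpStatus === 403) && s.affectsAllRequests) {
    return { severity: "High", action: "API key issue; check console" };
  }
  if (s.httpStatus === 529) {
    return s.sustainedMinutes >= 5
      ? { severity: "Medium", action: "Switch to fallback model" }
      : { severity: "Low", action: "SDK auto-retries handle this" };
  }
  if (s.statusPageIncident) {
    return { severity: "Varies", action: "Follow status page updates" };
  }
  // Assumption: anything outside the table is left for manual triage.
  return { severity: "Varies", action: "No matching row; investigate manually" };
}
```

The High rows are checked first so that a total outage is never downgraded just because some responses also happen to be 529s.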

Step 3: Activate Fallback

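import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// Try the requested model first; on a 529 (overloaded) or 500, retry once on a smaller model.
async function callWithFallback(params: Anthropic.MessageCreateParams) {
  try {
    return await client.messages.create(params);
  } catch (err) {
    if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
      // Downgrade one tier: opus -> sonnet, sonnet -> haiku
      if (params.model.includes('opus')) {
        return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
      }
      if (params.model.includes('sonnet')) {
        return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
      }
    }
    // Not a fallback-eligible error, or no smaller model left to try
    throw err;
  }
}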
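The downgrade rules in the function above can be kept as data rather than as nested `if` statements, which makes the order easier to review and test without calling the API. The sketch below is one possible arrangement, not the skill's own implementation. `FALLBACKS`, `nextFallbackModel`, and `fallbackChain` are hypothetical names; the model IDs are the two used in Step 3. Unlike the single-hop function above, `fallbackChain` lists every model that could be tried in order.

```typescript
// Family substring -> model to try next. IDs are taken from Step 3.
const FALLBACKS: ReadonlyArray<{ family: string; fallbackModel: string }> = [
  { family: "opus", fallbackModel: "claude-sonnet-4-20250514" },
  { family: "sonnet", fallbackModel: "claude-haiku-4-5-20251001" },
];

// Same status codes the Step 3 function treats as fallback-eligible.
function isFallbackEligible(status: number | undefined): boolean {
  return status === 529 || status === 500;
}

// Return the next smaller model, or undefined when there is nothing left to try.
function nextFallbackModel(model: string): string | undefined {
  return FALLBACKS.find((f) => model.includes(f.family))?.fallbackModel;
}

// Full ordered list of models to attempt, starting with the requested one.
function fallbackChain(model: string): string[] {
  const chain = [model];
  let next = nextFallbackModel(model);
  // The includes() guard stops the loop if the table ever contains a cycle.
  while (next !== undefined && !chain.includes(next)) {
    chain.push(next);
    next = nextFallbackModel(next);
  }
  return chain;
}
```

A caller could iterate over `fallbackChain(params.model)` and stop at the first model that succeeds, rethrowing as soon as an error is not fallback-eligible.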

Step 4: Communicate

  • Update your status page if the incident is user-facing
  • Note: Anthropic incidents typically resolve in 15-60 minutes

Step 5: Post-Incident

  • Check your error logs for the incident window
  • Calculate impact (failed requests, user impact)
  • Verify all systems recovered
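The impact calculation in the checklist above can be sketched as a pass over request logs for the incident window. This is an illustrative example only: the `RequestLog` shape and the `summarizeImpact` name are hypothetical, timestamps are assumed to be epoch milliseconds, and a request counts as failed when it got no response (`0`) or a 5xx status.

```typescript
// Hypothetical log record; adapt the fields to whatever your logging emits.
interface RequestLog {
  timestamp: number; // epoch milliseconds
  status: number; // HTTP status, 0 when the request timed out
  userId: string;
}

interface ImpactSummary {
  totalRequests: number;
  failedRequests: number;
  failureRate: number; // 0..1, or 0 when the window has no requests
  affectedUsers: number;
}

// Count failures and distinct affected users inside [windowStart, windowEnd].
function summarizeImpact(logs: RequestLog[], windowStart: number, windowEnd: number): ImpactSummary {
  const inWindow = logs.filter((l) => l.timestamp >= windowStart && l.timestamp <= windowEnd);
  // Assumption: only missing responses and 5xx codes are attributed to the incident.
  const failed = inWindow.filter((l) => l.status === 0 || l.status >= 500);
  const affectedUsers = new Set(failed.map((l) => l.userId)).size;
  return {
    totalRequests: inWindow.length,
    failedRequests: failed.length,
    failureRate: inWindow.length === 0 ? 0 : failed.length / inWindow.length,
    affectedUsers,
  };
}
```

Reporting distinct users alongside the raw failure count helps separate one noisy client retrying in a loop from an outage that touched many customers.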

Output

  • Incident confirmed via status page and direct API test
  • Severity classified (Low/Medium/High) based on symptoms
  • Fallback activated if needed (downgrade model or queue requests)
  • Impact assessed and documented post-incident

Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| API Error | Check error type and status code | See clade-common-errors |

Examples

See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.

Resources

Next Steps

See clade-reliability-patterns for building resilient integrations.

Prerequisites

  • Production Claude integration deployed
  • Fallback model configuration in place (see clade-reliability-patterns)
  • Monitoring/alerting configured (see clade-observability)

Instructions

Step 1: Review the patterns above

Each section contains production-ready code examples. Copy and adapt them to your use case.

Step 2: Apply to your codebase

Integrate the patterns that match your requirements. Test each change individually.

Step 3: Verify

Run your test suite to confirm the integration works correctly.