Incident Response
Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.
When to Use
- Production incident in progress (outage, degradation, data loss)
- Designing circuit breakers, bulkheads, or fallback strategies
- Conducting or planning chaos engineering exercises
- Writing or reviewing postmortem documents
- Establishing on-call procedures and escalation paths
Avoid when:
- The issue is a development-time bug with no production impact
- Designing general system architecture (use system-design instead)
Quick Reference
| Topic | Load reference |
| --- | --- |
| Triage Framework | skills/incident-response/references/triage-framework.md |
| Postmortem Patterns | skills/incident-response/references/postmortem-patterns.md |
Incident Response Workflow
Phase 1: Detect
- Alert fires or user report received
- Confirm the issue is real (not a false positive)
- Identify affected services and user impact scope
Phase 2: Triage
- Classify severity (P0-P3)
- Assign incident commander
- Open communication channel (war room, Slack channel)
- Begin status page updates
Phase 3: Contain
- Stop the bleeding: rollback, feature flag, traffic shift
- Prevent cascade: circuit breakers, load shedding, bulkhead isolation
- Communicate: stakeholder updates every 15 minutes for P0/P1
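The "prevent cascade" step above can be sketched as code. Below is a minimal circuit-breaker sketch, assuming a simple consecutive-failure policy with a cooldown (the class name, thresholds, and error type are illustrative, not from any particular library):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    fail fast while open, and allow one trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of loading a sick backend
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: fall through and allow one trial (half-open)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Any success closes the circuit and resets the failure count
            self.failures = 0
            self.opened_at = None
            return result
```

The point during an incident is the fail-fast branch: callers stop piling retries onto a degraded dependency, which is what turns a local failure into a cascade.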
Phase 4: Resolve
- Implement fix (minimal viable fix first)
- Validate in staging if time permits
- Deploy with monitoring and rollback plan ready
- Confirm recovery with metrics returning to baseline
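"Metrics returning to baseline" is worth making concrete, since a single good data point after a deploy is not recovery. A hedged sketch, assuming you can sample one key metric (e.g. p99 latency) and have a pre-incident baseline; the window size and tolerance are illustrative defaults:

```python
def recovery_confirmed(samples, baseline, tolerance=0.10, window=5):
    """Treat the incident as recovered only when the last `window`
    samples of a key metric sit within `tolerance` of the baseline."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough post-fix data yet; keep watching
    return all(abs(s - baseline) / baseline <= tolerance for s in recent)
```

Requiring a sustained window guards against declaring victory on a transient dip while the underlying problem is still recurring.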
Phase 5: Postmortem
- Document timeline within 48 hours
- Conduct blameless review with all participants
- Identify root cause and contributing factors
- Assign action items with owners and deadlines
- Update runbooks and alerting based on lessons learned
Severity Framework
| Level | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| P0 | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| P1 | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| P2 | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| P3 | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |
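Tooling (paging rules, status-page automation) often needs the severity framework in machine-readable form. A minimal sketch encoding the table above; the dict and helper names are hypothetical, and the P0/P1 war-room rule mirrors the triage and containment phases in this document:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    response_time: str

# The severity table above, encoded for use from automation
SEVERITIES = {
    "P0": Severity("P0", "immediate (< 5 min)"),
    "P1": Severity("P1", "< 30 min"),
    "P2": Severity("P2", "< 4 hours"),
    "P3": Severity("P3", "next business day"),
}

def requires_war_room(level: str) -> bool:
    """P0/P1 incidents get an incident commander, a war room,
    and stakeholder updates every 15 minutes."""
    return level in ("P0", "P1")
```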
Output
- Incident timeline and severity classification
- Containment actions taken
- Postmortem document with action items
- Updated runbooks and alerting rules
Common Mistakes
- Skipping severity classification and treating everything as P0
- Making changes without a rollback plan
- Forgetting to communicate status to stakeholders
- Writing postmortems that assign blame instead of identifying systemic issues
- Not following up on postmortem action items