Postmortem Standards
<metadata>- Scope: Technical/engineering incident postmortem templates, methodologies, and best practices
- Load if: Conducting incident postmortems, writing postmortem reports, establishing postmortem processes, incident response workflows, post-incident analysis
- Prerequisites: None (standalone guideline)
Postmortems are structured reviews after incidents to understand what happened, why it happened, and how to prevent recurrence. Principles: blameless culture (systems, not people), learning focus, timely execution (48-72 hours), actionable outcomes (action items with timelines).
</context>Core Principles
<required>- MUST maintain blameless culture - focus on systems, processes, and contributing factors, not individual fault
- MUST be conducted within 48-72 hours of incident resolution
- MUST include all key participants (incident commander, responders, affected teams)
- MUST result in specific, assigned action items with timelines
- MUST be shared widely within the organization for learning
- NEVER assign blame to individuals or teams
- NEVER skip postmortems for "minor" incidents - all incidents provide learning opportunities
- NEVER create action items without owners and timelines
- NEVER conduct postmortems without key participants present
Report Structure
Include these sections in order:
1. Incident Summary
Include: title, ID, date, duration (ISO 8601 time range), severity (P0/P1/P2), brief description (2-3 sentences), key metrics (downtime, affected users, error rates)
2. Impact Assessment
Include: customer impact (users, regions, services), business impact (revenue, SLA violations, reputation), technical impact (degradation, data loss, performance), duration
3. Timeline
Include: discovery time/method, key events chronologically (local timezone, ISO 8601), response actions, resolution time, post-resolution verification
YYYY-MM-DDTHH:MM:SS±HH:MM - Alert triggered: [Alert description]
YYYY-MM-DDTHH:MM:SS±HH:MM - On-call engineer paged, investigation started
YYYY-MM-DDTHH:MM:SS±HH:MM - Root cause identified: [Root cause description]
YYYY-MM-DDTHH:MM:SS±HH:MM - Mitigation applied: [Mitigation action]
YYYY-MM-DDTHH:MM:SS±HH:MM - Service restored, monitoring confirmed normal operation
4. Root Cause Analysis
Include: primary root cause, contributing factors (system design, process gaps, monitoring gaps, documentation gaps, training gaps, environmental factors), analysis methodology (Five Whys, fishbone diagram, timeline analysis), evidence/data
5. Resolution Steps
Include: immediate mitigation actions, long-term fixes, verification steps, rollback procedures (if applicable)
6. Action Items
Include: ID, description, owner (individual or team), priority (P0/P1/P2 or High/Medium/Low), target completion date, success criteria
Tracking: Use structured lists, issue trackers, or project management tools.
7. Lessons Learned
Include: what went well, what could be improved, process improvements, tooling improvements, knowledge gaps
8. Communication Plan
Include: internal notifications, customer communications (if applicable), status page updates, post-incident review meetings, documentation updates
Root Cause Analysis Methodologies
Five Whys Technique
Ask "why" five times to drill down to root cause:
- Why did the service fail? → [Immediate cause]
- Why [immediate cause]? → [Underlying cause]
- Why [underlying cause]? → [Deeper cause]
- Why wasn't this caught? → [Detection gap]
- Why [detection gap]? → [Root cause]
Fishbone Diagram (Ishikawa)
Categorize contributing factors:
- People: training, knowledge, communication
- Process: procedures, workflows, documentation
- Technology: tools, systems, infrastructure
- Environment: external factors, dependencies
Timeline Analysis
Identify: trigger events, cascade failures, response delays, resolution bottlenecks
Best Practices
When to Conduct
<required>- MUST conduct for all P0/P1 incidents (critical/high severity)
- SHOULD conduct for P2 incidents (medium severity) if they reveal systemic issues
- SHOULD conduct for recurring incidents even if individually low severity
- SHOULD conduct for incidents with customer impact
Participants
Required: Incident commander, primary responders, on-call engineers involved, team leads from affected systems, product/engineering managers (if customer impact)
Optional: SRE/DevOps team members, security team (if security-related), customer support (if customer impact), executive stakeholders (for high-severity incidents)
Timeline for Completion
- 24 hours: Initial incident summary, impact assessment, basic timeline reconstruction
- 48-72 hours: Complete postmortem document, root cause analysis, initial action items identified
- 1-2 weeks: Action items assigned and prioritized, follow-up review meeting scheduled, documentation updates completed
Sharing and Documentation
<required>- MUST publish postmortem in accessible location (wiki, documentation system)
- MUST share with all engineering teams
- MUST include in team retrospectives and learning sessions
- MUST update runbooks and documentation based on learnings
- MUST track action items to completion
Blameless Language
Core principle: Focus on systems, not people. Incidents are system failures; blame prevents learning.
Guidelines: Use "we" not "they". Focus on "what" and "why" not "who".
<forbidden>"[Person] deployed broken code" - assigns blame
</forbidden> <examples>"The deployment process allowed code with a connection leak to reach production" - describes system gap
</examples>ACTION (Recency Zone)
<required>When conducting postmortems:
- Schedule within 48-72 hours of incident resolution
- Include all key participants (incident commander, responders, affected teams)
- Follow 8-section structure (Summary → Impact → Timeline → Root Cause → Resolution → Action Items → Lessons → Communication)
- Use Five Whys or fishbone diagram for root cause analysis
- Assign owners and timelines to all action items
- Share widely for organizational learning
@smith-clarity/SKILL.md- Root cause analysis techniques (Five Whys, fishbone)@smith-validation/SKILL.md- Hypothesis testing