error-recovery
Use when encountering failures: assess severity, preserve evidence, execute the rollback decision tree, and verify post-recovery state.
lessons-learned
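A knowledge base that systematically manages the lessons and best practices extracted from incidents, and shares and applies them as team-wide knowledge. A core Skill for continuous learning and quality improvement.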
crisis_persistence_eval
>
Root Cause Analysis Methodology
This skill should be used when the user asks to "perform root cause analysis", "investigate production issue", "analyze incident", "find root cause", "debug production error", "trace the cause", or mentions investigating production problems, alerts, or outages. Provides systematic RCA methodology and investigation workflows.
postmortem
Use when analyzing failures, outages, incidents, or negative outcomes, conducting blameless postmortems, documenting root causes with 5 Whys or fishbone diagrams, identifying corrective actions with owners and timelines, learning from near-misses, establishing prevention strategies, or when user mentions postmortem, incident review, failure analysis, RCA, lessons learned, or after-action review.
sre-engineer
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning. Keywords: SRE, site reliability, SLO, SLI, error budget, incident management, chaos engineering.
alert-management
Implement comprehensive alert management with PagerDuty, escalation policies, and incident coordination. Use when setting up alerting systems, managing on-call schedules, or coordinating incident response.