Skill: Debug Master (v1.1.0)
Executive Summary
The debug-master is a high-level specialist dedicated to the health, reliability, and observability of complex, distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an Orchestrated Investigation using AI-assisted tracing, predictive anomaly detection, and automated remediation loops. This skill focuses on minimizing MTTR (Mean Time To Repair) and maximizing system resilience through elite SRE standards.
Table of Contents
- Incident Resolution Protocol
- The "Do Not" List (Anti-Patterns)
- Distributed Tracing (OpenTelemetry)
- Autonomous Remediation (Agentic Loop)
- Predictive Observability
- Fullstack Troubleshooting Layers
- Reference Library
Incident Resolution Protocol
Every incident follows the Elite SRE Loop:
- Evidence Collection: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
- Impact Analysis: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
- Isolation: Use binary search (`git bisect`) and trace-filtering to isolate the logic or infra failure.
- Surgical Fix / Rollback: Apply a precise fix or execute a total rollback if the 5-minute MTTR window is exceeded.
- Post-Mortem: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
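The Isolation step relies on binary search. As a minimal sketch of the idea that `git bisect` automates against a real repository, the function below finds the first failing commit in an ordered list. The names `first_bad_commit` and `is_bad` are hypothetical, and the sketch assumes the oldest commit is known good, the newest is known bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumptions (hypothetical setup, not taken from the skill itself):
    commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and badness never reverts once introduced.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good index, hi: known bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Usage: pretend the regression landed in commit "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad_commit(history, lambda c: history.index(c) >= 3)
print(culprit)  # d4
```

Each probe halves the search range, so a history of N commits needs roughly log2(N) test runs rather than N.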
The "Do Not" List (Anti-Patterns)
| Anti-Pattern | Why it fails in 2026 | Modern Alternative |
| :--- | :--- | :--- |
| "Guess and Check" | Extremely slow and dangerous. | Use Distributed Tracing. |
| Ignoring Warnings | Leads to "Alert Fatigue" and outages. | Use Dynamic SLO Tracking. |
| Manual Log Scraping | Inefficient for large datasets. | Use AI-Assisted Querying (o3). |
| Hotfixing Production | Bypasses CI/CD and causes drift. | Fix in Feature Branch + Deploy. |
| Disabling RLS/Security | Huge security risk for a "quick fix." | Fix the Capability Scope. |
Distributed Tracing (OpenTelemetry)
We use OTel as our source of truth.
- Standard Spans: Every operation must have a traceable span ID.
- Adaptive Sampling: Keep 100% of error traces and 1% of healthy traffic.
- Context Propagation: Mandatory headers for cross-service calls.
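The adaptive sampling rule above can be sketched as a keep/drop decision per trace. This is an illustration in plain Python, not OpenTelemetry's sampler API; the function name `should_keep_trace` and the two rate constants are assumptions. Because the rule depends on whether a trace contains an error, the decision has to be made after the trace completes (tail-based sampling), for example in a collector.

```python
ERROR_SAMPLE_RATE = 1.0     # keep every trace that contains an error
HEALTHY_SAMPLE_RATE = 0.01  # keep roughly 1% of healthy traces


def should_keep_trace(trace_id: str, has_error: bool) -> bool:
    """Decide whether to keep a finished trace.

    trace_id is assumed to be a hex string (32 hex characters in W3C Trace
    Context). The decision is derived from the ID itself, so every service
    that evaluates the same trace reaches the same answer.
    """
    rate = ERROR_SAMPLE_RATE if has_error else HEALTHY_SAMPLE_RATE
    if rate >= 1.0:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1] and compare to the rate.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate


print(should_keep_trace("f" * 32, has_error=True))   # True
print(should_keep_trace("f" * 32, has_error=False))  # False
```

Deriving the decision from the trace ID, rather than calling a random number generator, keeps sampling consistent across services without any coordination.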
See References: Distributed Tracing for setup.
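For context propagation, the W3C Trace Context `traceparent` header is the format OpenTelemetry propagates by default. The sketch below builds and parses that header by hand to show what travels between services; in practice an OTel SDK propagator would do this. The helper names `inject` and `extract` are hypothetical, only version `00` is handled, and header keys are assumed to be lowercase.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the traceparent header for an outgoing cross-service call."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers.

    Returns None when the header is missing or malformed, in which case
    the receiving service would start a new trace.
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid in the spec
    return trace_id, parent_id, bool(int(flags, 16) & 1)


outgoing: Dict[str, str] = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True)
print(outgoing["traceparent"])
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```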
Autonomous Remediation
In 2026, AI agents handle the triage.
- Detection: Automatic anomaly triggers.
- Remediation: Agents execute safe actions (scale up, cache clear).
- HITL Gate: Humans approve destructive actions.
See References: Agentic Response for patterns.
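One way to picture the HITL gate is an allowlist: actions named as safe run immediately, and everything else waits for a human decision. This is a minimal sketch under that assumption. `Action`, `SAFE_ACTIONS`, `run_remediation`, and the two callbacks are hypothetical names, and treating every non-allowlisted action as destructive is a design choice of the sketch rather than something the skill specifies.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: only the two safe actions named in the text are auto-approved.
SAFE_ACTIONS = {"scale_up", "cache_clear"}


@dataclass(frozen=True)
class Action:
    name: str    # e.g. "scale_up", "drop_table"
    target: str  # the service or resource the action applies to


def run_remediation(
    action: Action,
    execute: Callable[[Action], None],
    ask_human: Callable[[Action], bool],
) -> str:
    """Run safe actions directly; route anything else through a human."""
    if action.name in SAFE_ACTIONS:
        execute(action)
        return "executed"
    if ask_human(action):  # blocks until a human approves or rejects
        execute(action)
        return "executed_after_approval"
    return "rejected"


# Usage with stand-in callbacks: the approver rejects everything.
log = []
print(run_remediation(Action("scale_up", "checkout"), log.append, lambda a: False))
# executed
print(run_remediation(Action("drop_table", "orders"), log.append, lambda a: False))
# rejected
```

Defaulting unknown actions to the approval path means a newly added agent capability cannot run unattended until someone explicitly lists it as safe.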
Predictive Observability
Identify failures before they occur.
- Anomaly Detection: Spotting memory leaks or CPU creep.
- Chaos Engineering: Running agentic "stress tests" weekly.
- Dynamic SLOs: Thresholds that adjust based on business importance.
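Spotting a memory leak or CPU creep before it causes an outage comes down to detecting a sustained upward trend. As a rough sketch, the code below fits a least-squares line to evenly spaced samples and estimates how many more samples remain until a limit is reached. The function names and the sample values are illustrative assumptions; a production detector would also need to handle noise, seasonality, and restarts.

```python
from typing import Optional, Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many more samples until the trend reaches limit.

    Returns None when there is no upward trend to extrapolate.
    """
    slope = slope_per_sample(values)
    if slope <= 0:
        return None
    remaining = max(limit - values[-1], 0.0)
    return remaining / slope


# Usage: memory in MiB sampled at a fixed interval, with a 200 MiB limit.
memory_mib = [100, 110, 120, 130, 140]
print(slope_per_sample(memory_mib))          # 10.0
print(samples_until_limit(memory_mib, 200))  # 6.0
```

Multiplying the result by the sampling interval gives a time-to-exhaustion estimate that can drive an alert well before the limit is hit.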
Reference Library
Detailed deep-dives into SRE excellence:
- Distributed Tracing (OTel): Standardizing your observability.
- Agentic Incident Response: The autonomous remediation loop.
- Predictive Observability: Hardening systems for the future.
- Fullstack Troubleshooting: Layers of defense.
Updated: January 22, 2026 - 18:30