🕵️‍♂️ Skill: Debug Master (v1.1.0)

Executive Summary

The debug-master skill is a specialist focused on the health, reliability, and observability of complex distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an orchestrated investigation that uses AI-assisted tracing, predictive anomaly detection, and automated remediation loops. The skill aims to minimize MTTR (Mean Time To Repair) and maximize system resilience by applying established SRE practices.

🛠️ Incident Resolution Protocol

Every incident follows the Elite SRE Loop:

Evidence Collection: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
Impact Analysis: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
Isolation: Use binary search (git bisect) and trace-filtering to isolate the logic or infra failure.
Surgical Fix / Rollback: Apply a precise fix, or execute a total rollback if the incident is still unresolved when the 5-minute MTTR window expires.
Post-Mortem: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
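
The Isolation and Surgical Fix / Rollback steps can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `first_bad_commit` mirrors the binary search that git bisect performs, and `next_action` encodes the 5-minute window. Both function names, the `is_bad` callback, and the returned action labels are hypothetical.

```python
from typing import Callable, Sequence

MTTR_WINDOW_SECONDS = 5 * 60  # the 5-minute window named in the protocol


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, the newest one is known
    to be bad, and every commit after the first bad one is also bad
    (the same assumption git bisect relies on).
    """
    if not commits:
        raise ValueError("commits must not be empty")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the first bad commit is at mid or earlier
        else:
            lo = mid + 1    # the first bad commit is after mid
    return commits[lo]


def next_action(elapsed_seconds: float, fix_ready: bool) -> str:
    """Choose between a surgical fix and a total rollback (hypothetical labels)."""
    if elapsed_seconds > MTTR_WINDOW_SECONDS:
        return "rollback"
    return "apply_fix" if fix_ready else "keep_investigating"
```

With `n` candidate commits the search runs `is_bad` roughly `log2(n)` times, which is what makes bisecting practical inside a short incident window.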

🚫 The "Do Not" List (Anti-Patterns)

🕸️ Distributed Tracing (OpenTelemetry)

We use OTel as our source of truth.

Standard Spans: Every operation must have a traceable span ID.
Adaptive Sampling: Keep 100% of error traces and 1% of healthy traffic.
Context Propagation: Mandatory headers for cross-service calls.
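
The sampling policy above can be illustrated with a small, dependency-free sketch. It assumes the keep/drop decision is made after the outcome of a trace is known (tail-style sampling), since an error status is not available when a trace starts. The function name `keep_trace` and the hashing scheme are assumptions for illustration, not part of the OpenTelemetry API.

```python
import hashlib

HEALTHY_SAMPLE_RATE = 0.01  # 1% of healthy traffic, per the policy above


def keep_trace(trace_id: str, has_error: bool, rate: float = HEALTHY_SAMPLE_RATE) -> bool:
    """Keep every error trace and a deterministic fraction of healthy ones."""
    if has_error:
        return True
    # Hash the trace ID so every service makes the same decision for the
    # same trace, without coordinating with the others.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)
```

Deriving the decision from the trace ID rather than a random number keeps traces whole: a healthy trace is either kept by every service it touches or dropped by all of them.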

See References: Distributed Tracing for setup.
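
As a rough illustration of context propagation, the sketch below builds and forwards a W3C Trace Context `traceparent` header, the format used by OpenTelemetry's default propagator. The source does not say which headers are mandatory, so treating `traceparent` as the required header is an assumption, and the helper names `new_traceparent` and `child_headers` are hypothetical.

```python
import re
import secrets
from typing import Dict

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers for an outgoing call: same trace ID, fresh span ID.

    Expects header names already lowercased. A missing, malformed, or
    all-zero header starts a new trace instead of failing the request.
    """
    match = _TRACEPARENT.match(incoming.get("traceparent", ""))
    if match is None or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return {"traceparent": new_traceparent()}
    trace_id, _parent_span_id, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

In a real service the OpenTelemetry SDK's propagators handle this automatically; the sketch only shows what travels on the wire between services.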

🤖 Autonomous Remediation

In 2026, AI agents handle the triage.

Detection: Automatic anomaly triggers.
Remediation: Agents execute safe actions (scale up, cache clear).
HITL Gate: Humans approve destructive actions.
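
The human-in-the-loop gate can be sketched as an allow-list check. This is a minimal illustration under stated assumptions: the action names, the `Decision` type, and the `remediate` function are hypothetical, and treating every action outside the allow-list as destructive is a conservative design choice made for the sketch, not something the source specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Safe actions named in the list above; anything else is treated as destructive.
SAFE_ACTIONS = {"scale_up", "cache_clear"}


@dataclass(frozen=True)
class Decision:
    action: str
    executed: bool
    reason: str


def remediate(
    action: str,
    run: Callable[[str], None],
    approved_by: Optional[str] = None,
) -> Decision:
    """Run safe actions immediately; hold everything else for a human."""
    if action in SAFE_ACTIONS:
        run(action)
        return Decision(action, True, "safe action, auto-executed")
    if approved_by:
        run(action)
        return Decision(action, True, f"approved by {approved_by}")
    return Decision(action, False, "awaiting human approval")
```

For example, `remediate("drop_table", run)` returns a `Decision` with `executed=False` until a caller passes `approved_by`, while `remediate("scale_up", run)` executes right away.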

See References: Agentic Response for patterns.

📈 Predictive Observability

Identify failures before they occur.

Anomaly Detection: Spot memory leaks or CPU creep before they cause an outage.
Chaos Engineering: Run agentic "stress tests" weekly.
Dynamic SLOs: Thresholds that adjust based on business importance.
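
One simple way to spot a memory leak or CPU creep is to fit a straight line to recent samples and alert on a sustained upward slope. The sketch below uses an ordinary least-squares fit; it is only one possible technique, and the function names and the default threshold of 1 MB per sample are illustrative assumptions.

```python
from typing import Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def looks_like_leak(memory_mb: Sequence[float], min_growth_mb_per_sample: float = 1.0) -> bool:
    """Flag a steady upward trend in evenly spaced memory readings."""
    return slope_per_sample(memory_mb) >= min_growth_mb_per_sample
```

A fitted slope is less sensitive to a single noisy reading than comparing the first and last samples, which matters when garbage collection makes memory usage sawtooth.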

📖 Reference Library

Detailed deep-dives into SRE excellence:

Distributed Tracing (OTel): Standardizing your observability.
Agentic Incident Response: The autonomous remediation loop.
Predictive Observability: Hardening systems for the future.
Fullstack Troubleshooting: Layers of defense.

Updated: January 22, 2026 - 18:30
