Agent Skills: INVESTIGATE

|

UncategorizedID: phrazzld/claude-config/investigate

Install this agent skill to your local

pnpm dlx add-skill https://github.com/phrazzld/claude-config/tree/HEAD/skills/investigate

Skill Files

Browse the full folder contents for investigate.

Download Skill

Loading file tree…

skills/investigate/SKILL.md

Skill Metadata

Name
investigate
Description
|

INVESTIGATE

You're a senior SRE investigating a production incident.

The user's bug report: $ARGUMENTS

The Codex First-Draft Pattern

Codex does investigation. You review and verify.

codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings." \
  --output-last-message /tmp/codex-investigation.md 2>/dev/null

Then review Codex's findings. Don't investigate yourself first.

Multi-Hypothesis Mode (Agent Teams)

When >2 plausible root causes and single Codex investigation would anchor on one:

  1. Create agent team with 3-5 investigators
  2. Each teammate gets one hypothesis to prove/disprove
  3. Teammates challenge each other's findings via messages
  4. Lead synthesizes consensus root cause into incident doc

Use when: ambiguous stack trace, multiple services, flaky failures. Don't use when: obvious single cause, config issue, simple regression.

Investigation Protocol

Rule #1: Config Before Code

External service issues are usually config, not code. Check in this order:

  1. Env vars present? npx convex env list --prod | grep <SERVICE> or vercel env ls
  2. Env vars valid? No trailing whitespace, correct format (sk_, whsec_)
  3. Endpoints reachable? curl -I -X POST <webhook_url>
  4. Then examine code

Rule #2: Demand Observable Proof

Before declaring "fixed", show:

  • Log entry that proves the fix worked
  • Metric that changed (e.g., subscription status, webhook delivery)
  • Database state that confirms resolution

Mark investigation as UNVERIFIED until observables confirm. Never trust "should work" — demand proof.

Mission

Create a live investigation document (INCIDENT-{timestamp}.md) and systematically find root cause.

Bounded Shell Output (MANDATORY)

  • Never dump full production logs blindly
  • Start with counts and latest slices (tail -n 200)
  • For large artifacts, use ~/.claude/scripts/safe-read.sh
  • Add hard bounds (--limit, per_page, timeout) to external commands

Your Toolkit

  • Observability: sentry-cli, npx convex, vercel, whatever this project has
  • Git: Recent deploys, changes, bisect
  • Gemini CLI: Web-grounded research, hypothesis generation, similar incident lookup
  • Thinktank: Multi-model validation when you need a second opinion on hypotheses
  • Config: Check env vars and configs early - missing config is often the root cause

The Work Log

Update INCIDENT-{timestamp}.md as you go:

  • Timeline: What happened when (UTC)
  • Evidence: Logs, metrics, configs checked
  • Hypotheses: What you think is wrong, ranked by likelihood
  • Actions: What you tried, what you learned
  • Root cause: When you find it
  • Fix: What you did to resolve it

Root Cause Discipline

For each hypothesis, explicitly categorize:

  • ROOT: Fixing this removes the fundamental cause
  • SYMPTOM: Fixing this masks an underlying issue

Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:

"What's the underlying architectural issue this symptom reveals?"

Post-fix question: "If we revert this change in 6 months, does the problem return?"

Investigation Philosophy

  • Config before code: Check env vars and configs before diving into code
  • Hypothesize explicitly: Write down what you think is wrong before testing
  • Binary search: Narrow the problem space with each experiment
  • Document as you go: The work log is for handoff, postmortem, and learning

When Done

  • Root cause documented
  • Fix applied (or proposed if too risky)
  • Postmortem section completed (what went wrong, lessons, follow-ups)
  • Consider if the pattern is worth codifying (regression test, agent update, etc.)

Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.

Visual Deliverable

After completing the core workflow, generate a visual HTML summary:

  1. Read ~/.claude/skills/visualize/prompts/investigate-timeline.md
  2. Read the template(s) referenced in the prompt
  3. Read ~/.claude/skills/visualize/references/css-patterns.md
  4. Generate self-contained HTML capturing this session's output
  5. Write to ~/.agent/diagrams/investigate-{incident}-{date}.html
  6. Open in browser: open ~/.agent/diagrams/investigate-{incident}-{date}.html
  7. Tell the user the file path

Skip visual output if:

  • The session was trivial (single finding, quick fix)
  • The user explicitly opts out (--no-visual)
  • No browser available (SSH session)