Error Recovery Skill

A pattern for handling subagent failures gracefully by matching each failure to an appropriate recovery strategy.

When to Load This Skill

You are spawning subagents that may fail
A subagent returned an error or unexpected output
You need to decide whether to retry, escalate, or abort

Failure Categories

| Category | Symptoms | Strategy |
|----------|----------|----------|
| Transient | Timeout, malformed output, parsing error | Simple Retry |
| Context Gap | "I don't have enough information", unclear task | Context Enhancement |
| Complexity | Partial completion, scope creep, tangents | Scope Reduction |
| Boundary/Contract | status: blocked, boundary_violation, contract_change | Escalation |
| Fatal | Repeated failures (3+), fundamental misunderstanding | Abort with Report |

Retry Strategies

Strategy 1: Simple Retry

Use for transient failures: resend the same prompt, up to 3 attempts.

# Track attempts
attempts: 0
max_attempts: 3

# On failure
IF attempts < max_attempts:
  attempts += 1
  Task(same_subagent_type, same_model, same_prompt)
ELSE:
  Mark as FAILED, move on

Use when:

Output was malformed or truncated
Timeout occurred
Agent returned empty/null response
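
The retry loop for Strategy 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the skill's actual implementation: `run_task` is a hypothetical callable standing in for the `Task(...)` call, and it is assumed to raise `TimeoutError` on a timeout and to return `None` or an empty string for an empty response.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 3  # mirrors the "up to 3 attempts" limit


def simple_retry(
    run_task: Callable[[], Optional[str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[Optional[str], int, Optional[str]]:
    """Re-run the same task until it yields output or attempts run out.

    Returns (output, attempts_used, last_error). On success last_error is
    None; on failure output is None and the caller marks the task FAILED.
    """
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            output = run_task()  # hypothetical stand-in for Task(...)
        except TimeoutError as exc:
            last_error = str(exc)
            continue
        if output:  # None or "" counts as a transient failure
            return output, attempt, None
        last_error = "empty response"
    return None, max_attempts, last_error
```

The returned `last_error` can be stored with the task so a later strategy has the failure details available.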

Strategy 2: Context Enhancement

Add more information to help the agent succeed.

Task(
  subagent_type: "implementer",
  model: "sonnet",
  prompt: |
    ## PREVIOUS ATTEMPT FAILED

    Error: {error_message}
    Output received: {partial_output}

    ## ADDITIONAL CONTEXT

    Here is more information that may help:
    - Related file: @{additional_file_path}
    - Pattern to follow: {example_pattern}
    - Specific guidance: {clarification}

    ## ORIGINAL TASK

    {original_task_description}

    Output to: {output_path}
)

Use when:

Agent said "I don't understand" or "unclear requirements"
Agent made incorrect assumptions
Agent asked questions in output

Context to add:

Related code files the agent might need
Similar implementations as examples
Explicit clarification of ambiguous points
Error message from previous attempt
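
Assembling the enhanced prompt can be automated. The helper below is a sketch that reproduces the template's three sections; the function name and parameters are hypothetical, and it assumes that optional context items are simply omitted when they are not available.

```python
from typing import Sequence


def build_enhanced_prompt(
    original_task: str,
    output_path: str,
    error_message: str,
    partial_output: str = "",
    related_files: Sequence[str] = (),
    example_pattern: str = "",
    clarification: str = "",
) -> str:
    """Build a retry prompt: failure details, extra context, original task."""
    lines = ["## PREVIOUS ATTEMPT FAILED", "", f"Error: {error_message}"]
    if partial_output:
        lines.append(f"Output received: {partial_output}")

    # Collect only the context items that were actually supplied.
    context = [f"- Related file: @{path}" for path in related_files]
    if example_pattern:
        context.append(f"- Pattern to follow: {example_pattern}")
    if clarification:
        context.append(f"- Specific guidance: {clarification}")
    if context:
        lines += ["", "## ADDITIONAL CONTEXT", "",
                  "Here is more information that may help:", *context]

    lines += ["", "## ORIGINAL TASK", "", original_task,
              "", f"Output to: {output_path}"]
    return "\n".join(lines)
```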

Strategy 3: Scope Reduction

Break the failing task into smaller, more manageable pieces.

# Original task failed
Task: "Implement full authentication system"

# Split into subtasks
Task(implementer, "Implement password hashing utility")
Task(implementer, "Implement session token generation")
Task(implementer, "Implement login endpoint")
Task(implementer, "Implement logout endpoint")

Use when:

Agent completed partial work then failed
Task description was too broad
Agent went off on tangents
Output shows confusion about scope

Splitting guidelines:

Each subtask should be independently completable
Each subtask should have clear boundaries
Subtasks can run in parallel if no dependencies
Recombine outputs after all subtasks complete
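
One way to apply the parallelism guideline is to group subtasks into "waves", where every subtask in a wave has all of its dependencies satisfied by earlier waves. The sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The subtask names and the dependency edges in the usage example are assumptions made for illustration; the skill itself does not specify them.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group subtasks into waves that can each run in parallel.

    `dependencies` maps a subtask to the set of subtasks it depends on.
    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)  # unlock subtasks that depended on this wave
    return waves


# Assumed dependencies for the authentication example.
auth_subtasks = {
    "password_hashing": set(),
    "session_token": set(),
    "login_endpoint": {"password_hashing", "session_token"},
    "logout_endpoint": {"session_token"},
}
waves = plan_waves(auth_subtasks)
```

With these assumed edges, the two utilities form the first wave and the two endpoints form the second.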

Strategy 4: Escalation

Route the task to a specialized agent for resolution.

# For boundary violations
Task(
  subagent_type: "contract-resolver",
  model: "sonnet",
  prompt: |
    A task is blocked due to boundary/contract issues.

    Blocked task output: memory/tasks/{task_id}/output.json
    Blocked reason: {blocked_reason}
    Current contracts: {contract_paths}

    Analyze impact and provide resolution.
    Output to: memory/contracts/resolution_{task_id}.json
)

Escalation paths:

| Failure Type | Escalate To | Action |
|--------------|-------------|--------|
| blocked_reason: boundary_violation | contract-resolver | Expand boundaries or redesign |
| blocked_reason: contract_change | contract-resolver | Modify contract, re-verify dependents |
| blocked_reason: dependency_issue | executor (self) | Re-check dependency status |
| Repeated implementation failures | architect | Reconsider design approach |
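
The escalation paths can be encoded as a lookup table. This is a sketch: the agent names and actions come from the escalation paths listed here, while the function name, the `repeated_failures` flag, and the choice to raise on an unknown reason are assumptions.

```python
from typing import Optional, Tuple

# blocked_reason -> (agent to escalate to, action to request)
ESCALATION_PATHS = {
    "boundary_violation": ("contract-resolver", "Expand boundaries or redesign"),
    "contract_change": ("contract-resolver", "Modify contract, re-verify dependents"),
    "dependency_issue": ("executor", "Re-check dependency status"),
}
REPEATED_FAILURE_PATH = ("architect", "Reconsider design approach")


def escalation_target(
    blocked_reason: Optional[str],
    repeated_failures: bool = False,
) -> Tuple[str, str]:
    """Return (agent, action) for a blocked or repeatedly failing task."""
    if repeated_failures:
        return REPEATED_FAILURE_PATH
    if blocked_reason in ESCALATION_PATHS:
        return ESCALATION_PATHS[blocked_reason]
    # Assumption: an unrecognized reason is surfaced rather than guessed at.
    raise ValueError(f"no escalation path for {blocked_reason!r}")
```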

Strategy 5: Abort with Report

When recovery is not possible, fail gracefully.

{
  "tasks": [
    {
      "id": "{task_id}",
      "status": "failed",
      "failure_reason": "{specific reason}",
      "attempts_made": 3,
      "recovery_attempted": [
        {"strategy": "simple_retry", "result": "same_error"},
        {"strategy": "context_enhancement", "result": "different_error"},
        {"strategy": "scope_reduction", "result": "subtasks_also_failed"}
      ],
      "recommendation": "Task may need architectural redesign"
    }
  ]
}

Use when:

3+ retry attempts failed
Different strategies all failed
Fundamental misunderstanding of requirements
Task is actually impossible given constraints
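
A failure report in the shape used by Strategy 5 can be generated from the recovery history. The helper below is a sketch; the function name and its parameters are hypothetical, and only the output field names are taken from the report format.

```python
import json
from typing import Dict, List, Tuple


def build_failure_report(
    task_id: str,
    failure_reason: str,
    attempts_made: int,
    recovery_attempted: List[Tuple[str, str]],
    recommendation: str,
) -> Dict:
    """Build the abort report; recovery_attempted holds (strategy, result) pairs."""
    return {
        "tasks": [
            {
                "id": task_id,
                "status": "failed",
                "failure_reason": failure_reason,
                "attempts_made": attempts_made,
                "recovery_attempted": [
                    {"strategy": strategy, "result": result}
                    for strategy, result in recovery_attempted
                ],
                "recommendation": recommendation,
            }
        ]
    }


report = build_failure_report(
    "task-001",
    "Timeout after 120s",
    3,
    [("simple_retry", "same_error")],
    "Task may need architectural redesign",
)
report_json = json.dumps(report, indent=2)  # ready to write to the state file
```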

Decision Tree

On Subagent Failure:
│
├─ Is output malformed/empty/timeout?
│  └─ YES → Strategy 1: Simple Retry (up to 3x)
│
├─ Did agent say "unclear" or ask questions?
│  └─ YES → Strategy 2: Context Enhancement
│
├─ Did agent complete partial work?
│  └─ YES → Strategy 3: Scope Reduction
│
├─ Is status "blocked" with boundary/contract reason?
│  └─ YES → Strategy 4: Escalation to contract-resolver
│
├─ Have we tried 3+ strategies already?
│  └─ YES → Strategy 5: Abort with Report
│
└─ Unknown error
   └─ Try Strategy 2 first, then escalate
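
The decision tree can be approximated as a single function. This is a sketch with several assumptions: the `result` dictionary and its keys (`status`, `output`, `timed_out`, `malformed`, `partial`) are hypothetical, and detecting "unclear" or a question by substring match is a crude heuristic. The sketch also checks the explicit signals (strategy budget, blocked status) before the output heuristics so that the loop always terminates; that ordering is a choice made here and differs from the order in which the tree lists its branches.

```python
from typing import Any, Dict


def choose_strategy(result: Dict[str, Any], strategies_tried: int) -> str:
    """Pick a recovery strategy for a failed subagent result."""
    # Explicit signals first (ordering is an assumption of this sketch).
    if strategies_tried >= 3:
        return "abort_with_report"
    if result.get("status") == "blocked":
        return "escalation"

    output = (result.get("output") or "").strip()
    if result.get("timed_out") or result.get("malformed") or not output:
        return "simple_retry"
    if "unclear" in output.lower() or "?" in output:
        return "context_enhancement"
    if result.get("partial"):
        return "scope_reduction"

    # Unknown error: try Strategy 2 first; the caller escalates if it fails.
    return "context_enhancement"
```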

Retry State Tracking

Track retry attempts in the execution state file:

{
  "tasks": [
    {
      "id": "task-001",
      "status": "running",
      "attempts": 2,
      "last_error": "Timeout after 120s",
      "retry_strategy": "simple_retry"
    },
    {
      "id": "task-002",
      "status": "running",
      "attempts": 1,
      "last_error": "Needs access to src/config/db.ts",
      "retry_strategy": "context_enhancement",
      "context_added": ["src/config/db.ts", "src/types/config.ts"]
    }
  ]
}
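
Updating this state after each failure could look like the sketch below. The field names match the state file; the function name is hypothetical, and the state is assumed to be already loaded into a dictionary (for example with `json.load`).

```python
from typing import Any, Dict, List, Optional


def record_failure(
    state: Dict[str, Any],
    task_id: str,
    error: str,
    strategy: str,
    context_added: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Record one failed attempt for task_id and return the updated entry."""
    for task in state["tasks"]:
        if task["id"] == task_id:
            task["attempts"] = task.get("attempts", 0) + 1
            task["last_error"] = error
            task["retry_strategy"] = strategy
            if context_added:
                existing = task.setdefault("context_added", [])
                # Keep earlier context and skip paths already recorded.
                existing.extend(p for p in context_added if p not in existing)
            return task
    raise KeyError(f"unknown task id: {task_id}")


state = {"tasks": [{"id": "task-001", "status": "running", "attempts": 1}]}
entry = record_failure(state, "task-001", "Timeout after 120s", "simple_retry")
```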

Integration with Executor Loop

# Enhanced execution loop
WHILE tasks remain incomplete:
  1. Read state file
  2. Find ready tasks
  3. Spawn ready tasks
  4. Check completed tasks:
     FOR each completed task:
       IF status == pre_complete:
         spawn verifier
       ELIF status == blocked:
         apply Strategy 4 (Escalation)
       ELIF status == failed:
         determine_failure_category()
         apply_appropriate_strategy()
         update_retry_state()
  5. Update state file
  6. IF all verified: EXIT
  7. IF all failed with no recovery: EXIT with failure report
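
Steps 4, 6, and 7 of the loop reduce to a status dispatch plus two exit checks. The sketch below is one possible reading: the action names, the `verified` status value, and the per-task `recoverable` flag are assumptions introduced for illustration.

```python
from typing import Any, Dict, List, Optional


def next_action(task: Dict[str, Any]) -> str:
    """Step 4: decide what the executor does with a finished task."""
    status = task.get("status")
    if status == "pre_complete":
        return "spawn_verifier"
    if status == "blocked":
        return "escalate"  # Strategy 4
    if status == "failed":
        return "recover"  # categorize the failure, then apply a strategy
    return "none"


def exit_condition(tasks: List[Dict[str, Any]]) -> Optional[str]:
    """Steps 6 and 7: return an exit reason, or None to keep looping."""
    if tasks and all(t.get("status") == "verified" for t in tasks):
        return "success"
    if tasks and all(
        t.get("status") == "failed" and not t.get("recoverable", True)
        for t in tasks
    ):
        return "failure_report"
    return None
```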

Principles

Fail fast, recover smart - Don't retry blindly; analyze the failure first
Preserve partial work - If agent completed 50%, don't discard it
Escalate early - Boundary/contract issues need resolver, not retries
Track everything - Log all attempts for reflection phase
Know when to quit - 3 failed strategies = abort, don't loop forever
