Resilience Analysis Skill

Resilience Analysis

Assesses error handling and isolation boundaries.

Process

Trace error propagation — Map exception flow from tools to agent
Identify isolation — Sandbox mechanisms for dangerous operations
Catalog recovery — Retry logic, fallbacks, circuit breakers
Assess boundaries — What crashes propagate vs. are contained

Error Propagation Analysis

Questions to Answer

Does a tool exception terminate the agent?
Are LLM API errors retried automatically?
Is parsing failure (malformed output) recoverable?
What happens when state updates fail?

Propagation Patterns

Crash Propagation (Dangerous)

def run_tool(self, tool, args):
    return tool.execute(args)  # Exception bubbles up

Exception Wrapping

def run_tool(self, tool, args):
    try:
        return tool.execute(args)
    except Exception as e:
        raise ToolExecutionError(tool.name, e) from e

Error Containment

def run_tool(self, tool, args):
    try:
        return ToolResult(success=True, output=tool.execute(args))
    except Exception as e:
        return ToolResult(success=False, error=str(e))

Propagation Map Template

User Input
    ↓
┌─────────────────────────────────────────┐
│ Agent Loop                              │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ LLM Call                            │ │
│ │ • APIError → [Retry 3x / Propagate] │ │
│ │ • RateLimit → [Backoff / Propagate] │ │
│ │ • Timeout → [Retry / Propagate]     │ │
│ └─────────────────────────────────────┘ │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ Output Parsing                      │ │
│ │ • ParseError → [Retry / Contained]  │ │
│ │ • ValidationError → [Contained]     │ │
│ └─────────────────────────────────────┘ │
│   ↓                                     │
│ ┌─────────────────────────────────────┐ │
│ │ Tool Execution                      │ │
│ │ • ToolError → [Feedback to LLM]     │ │
│ │ • Timeout → [Kill / Continue]       │ │
│ │ • SecurityError → [Propagate]       │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘

Sandboxing Mechanisms

Code Execution Isolation

| Mechanism | Safety Level | Performance | Complexity | |-----------|-------------|-------------|------------| | None | ⚠️ Dangerous | Fast | None | | RestrictedPython | Medium | Fast | Low | | AST Validation | Low | Fast | Medium | | Subprocess | Medium | Overhead | Low | | Docker/Container | High | High overhead | Medium | | gVisor/Firecracker | Very High | Medium overhead | High |

Detection Patterns

No Sandboxing

exec(user_code)  # Direct execution
eval(expression)  # Direct eval
subprocess.run(cmd, shell=True)  # Shell injection risk

Basic Sandboxing

# RestrictedPython
from RestrictedPython import compile_restricted
code = compile_restricted(user_code, '<string>', 'exec')

# AST validation
tree = ast.parse(user_code)
if has_dangerous_nodes(tree):
    raise SecurityError()

Process Isolation

# Subprocess with limits
result = subprocess.run(
    ['python', '-c', user_code],
    timeout=30,
    capture_output=True,
    user='nobody'  # Drop privileges
)

Container Isolation

import docker
client = docker.from_env()
container = client.containers.run(
    'python:3.11-slim',
    command=['python', '-c', user_code],
    mem_limit='256m',
    network_disabled=True,
    remove=True
)

Recovery Patterns

Retry Logic

# Simple retry
@retry(max_attempts=3, backoff=exponential)
def call_llm(self, prompt):
    return self.client.generate(prompt)

# Retry with error feedback
def call_with_retry(self, prompt, max_retries=3):
    errors = []
    for i in range(max_retries):
        try:
            return self.llm.generate(prompt)
        except ParseError as e:
            errors.append(str(e))
            prompt = f"{prompt}\n\nPrevious errors: {errors}"
    raise MaxRetriesExceeded(errors)

Fallback Mechanisms

def generate(self, prompt):
    try:
        return self.primary_llm.generate(prompt)
    except APIError:
        return self.fallback_llm.generate(prompt)

Circuit Breaker

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.state = 'closed'
        self.last_failure = None
    
    def call(self, func, *args):
        if self.state == 'open':
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = 'half-open'
            else:
                raise CircuitOpen()
        
        try:
            result = func(*args)
            self.failures = 0
            self.state = 'closed'
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'
            raise

Output Template

## Resilience Analysis: [Framework Name]

### Error Propagation Map

| Error Source | Error Type | Handling | Propagates? |
|--------------|-----------|----------|-------------|
| LLM API | RateLimitError | Retry 3x with backoff | No |
| LLM API | APIError | Retry 1x | Yes |
| Parser | ParseError | Feed back to LLM | No |
| Tool | Exception | Wrap and feed to LLM | No |
| Tool | Timeout | Kill process | No |
| State | ValidationError | Propagate | Yes |

### Sandboxing Assessment
- **Code Execution**: [Mechanism or None]
- **File System**: [Isolated/Restricted/Open]
- **Network**: [Blocked/Filtered/Open]
- **Resource Limits**: [Memory/CPU/Time limits]

### Recovery Mechanisms

| Pattern | Implementation | Location |
|---------|---------------|----------|
| Retry | Exponential backoff, 3 attempts | llm.py:L45 |
| Fallback | Secondary model | agent.py:L120 |
| Circuit Breaker | None | - |

### Risk Assessment
- **Critical Gaps**: [List any missing protections]
- **Production Ready**: [Yes/No/Needs work]

Integration

Prerequisite: codebase-mapping to identify execution code
Feeds into: antipattern-catalog for error handling issues
Related: execution-engine-analysis for async error handling

Agent Skills: Resilience Analysis

Install this agent skill to your local

Skill Files