Resilience Analysis
Assesses error handling and isolation boundaries.
Process
- Trace error propagation — Map exception flow from tools to agent
- Identify isolation — Sandbox mechanisms for dangerous operations
- Catalog recovery — Retry logic, fallbacks, circuit breakers
- Assess boundaries — What crashes propagate vs. are contained
Error Propagation Analysis
Questions to Answer
- Does a tool exception terminate the agent?
- Are LLM API errors retried automatically?
- Is parsing failure (malformed output) recoverable?
- What happens when state updates fail?
Propagation Patterns
Crash Propagation (Dangerous)
def run_tool(self, tool, args):
    return tool.execute(args)  # Exception bubbles up
Exception Wrapping
def run_tool(self, tool, args):
    try:
        return tool.execute(args)
    except Exception as e:
        raise ToolExecutionError(tool.name, e) from e
Error Containment
def run_tool(self, tool, args):
    try:
        return ToolResult(success=True, output=tool.execute(args))
    except Exception as e:
        return ToolResult(success=False, error=str(e))
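The practical difference between these patterns is what the agent loop receives: a crash, a typed exception, or a value it can hand back to the model. The following is a minimal sketch of the containment case. `ToolResult` is not defined in the source, so the dataclass here is an assumed shape, and `FlakyTool` and `to_observation` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    """Assumed result envelope; field names mirror the containment snippet."""
    success: bool
    output: Any = None
    error: Optional[str] = None


class FlakyTool:
    """Hypothetical tool that always fails, used to exercise the error path."""
    name = "flaky"

    def execute(self, args):
        raise ValueError(f"bad args: {args!r}")


def run_tool(tool, args) -> ToolResult:
    # Containment: the exception is converted to data and never leaves this function.
    try:
        return ToolResult(success=True, output=tool.execute(args))
    except Exception as e:
        return ToolResult(success=False, error=f"{type(e).__name__}: {e}")


def to_observation(tool_name: str, result: ToolResult) -> str:
    # Text an agent loop could append to the conversation for the next LLM turn.
    if result.success:
        return f"[{tool_name}] {result.output}"
    return f"[{tool_name}] failed: {result.error}"


result = run_tool(FlakyTool(), {"x": 1})
observation = to_observation(FlakyTool.name, result)
```

Note that `except Exception` deliberately leaves `KeyboardInterrupt` and `SystemExit` uncaught, so containment at this layer does not block shutdown signals.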
Propagation Map Template
                User Input
                     ↓
┌─────────────────────────────────────────┐
│               Agent Loop                │
│                    ↓                    │
│ ┌─────────────────────────────────────┐ │
│ │ LLM Call                            │ │
│ │ • APIError → [Retry 3x / Propagate] │ │
│ │ • RateLimit → [Backoff / Propagate] │ │
│ │ • Timeout → [Retry / Propagate]     │ │
│ └─────────────────────────────────────┘ │
│                    ↓                    │
│ ┌─────────────────────────────────────┐ │
│ │ Output Parsing                      │ │
│ │ • ParseError → [Retry / Contained]  │ │
│ │ • ValidationError → [Contained]     │ │
│ └─────────────────────────────────────┘ │
│                    ↓                    │
│ ┌─────────────────────────────────────┐ │
│ │ Tool Execution                      │ │
│ │ • ToolError → [Feedback to LLM]     │ │
│ │ • Timeout → [Kill / Continue]       │ │
│ │ • SecurityError → [Propagate]       │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
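Once the bracketed choices in the map have been filled in for a given framework, the result can be recorded as data and checked against observed behavior. The sketch below shows one possible encoding. The exception classes, the `Action` enum, and the specific choices in `POLICY` are all assumptions made for illustration and are not taken from any particular framework.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    BACKOFF = "backoff"
    FEEDBACK = "feedback to LLM"
    PROPAGATE = "propagate"


# Hypothetical exception types standing in for a framework's own classes.
class APIError(Exception): ...
class RateLimitError(APIError): ...
class ParseError(Exception): ...
class ToolError(Exception): ...
class SecurityError(Exception): ...


# Ordered from most to least specific, because RateLimitError subclasses APIError.
POLICY = [
    (RateLimitError, Action.BACKOFF),
    (APIError, Action.RETRY),
    (ParseError, Action.RETRY),
    (ToolError, Action.FEEDBACK),
    (SecurityError, Action.PROPAGATE),
]


def classify(exc: BaseException) -> Action:
    """Return the first matching action; unknown errors propagate by default."""
    for exc_type, action in POLICY:
        if isinstance(exc, exc_type):
            return action
    return Action.PROPAGATE
```

Defaulting unknown errors to `PROPAGATE` is a design choice in this sketch: it makes unmapped failure modes visible instead of silently swallowing them.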
Sandboxing Mechanisms
Code Execution Isolation
| Mechanism | Safety Level | Performance | Complexity |
|-----------|-------------|-------------|------------|
| None | ⚠️ Dangerous | Fast | None |
| RestrictedPython | Medium | Fast | Low |
| AST Validation | Low | Fast | Medium |
| Subprocess | Medium | Overhead | Low |
| Docker/Container | High | High overhead | Medium |
| gVisor/Firecracker | Very High | Medium overhead | High |
Detection Patterns
No Sandboxing
exec(user_code) # Direct execution
eval(expression) # Direct eval
subprocess.run(cmd, shell=True) # Shell injection risk
Basic Sandboxing
# RestrictedPython
from RestrictedPython import compile_restricted
code = compile_restricted(user_code, '<string>', 'exec')

# AST validation
import ast
tree = ast.parse(user_code)
if has_dangerous_nodes(tree):  # framework-specific check
    raise SecurityError()
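The `has_dangerous_nodes` helper in the AST pattern is not defined in the source. A minimal sketch of what such a check might look like follows; the deny-list contents and the three rules are assumptions chosen for illustration. Deny-lists of this kind are easy to bypass, which is consistent with the "Low" safety rating for AST validation in the table above.

```python
import ast

# Assumed deny-list for illustration; a real policy would be project-specific.
BANNED_CALLS = {"exec", "eval", "compile", "open", "__import__", "getattr"}


def has_dangerous_nodes(tree: ast.AST) -> bool:
    """Return True if the tree imports modules, calls a banned builtin by name,
    or touches a dunder attribute (a common sandbox-escape route)."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return True
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                return True
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return True
    return False


safe = has_dangerous_nodes(ast.parse("total = sum(x * 2 for x in range(10))"))
unsafe_import = has_dangerous_nodes(ast.parse("import os\nos.remove('f')"))
unsafe_dunder = has_dangerous_nodes(ast.parse("().__class__.__bases__"))
```

When reviewing a framework, the useful question is which of these rules its validator omits, since each omission is a candidate escape path.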
Process Isolation
# Subprocess with limits
result = subprocess.run(
    ['python', '-c', user_code],
    timeout=30,
    capture_output=True,
    user='nobody'  # Drop privileges (POSIX only, Python 3.9+; parent must be allowed to switch users)
)
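Process isolation only contains failures if the caller also handles the timeout and the non-zero exit. The sketch below, using only the standard library, reports all three outcomes as data. The function name `run_isolated` and the returned dictionary shape are hypothetical. It does not restrict filesystem or network access, so it addresses crashes and hangs, not malicious code.

```python
import subprocess
import sys


def run_isolated(user_code: str, timeout_s: float = 5.0) -> dict:
    """Run code in a child interpreter and report the outcome as data."""
    try:
        proc = subprocess.run(
            # -I: isolated mode, ignores PYTHON* env vars and the user site directory
            [sys.executable, "-I", "-c", user_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # run() kills the child and raises on expiry
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


ok = run_isolated("print(2 + 2)")
crashed = run_isolated("raise RuntimeError('boom')")
hung = run_isolated("while True: pass", timeout_s=1.0)
```

An uncaught exception in the child shows up as `status == "error"` with the traceback in `stderr`, while the parent process keeps running.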
Container Isolation
import docker
client = docker.from_env()
# Without detach=True, run() blocks and returns the container's logs
output = client.containers.run(
    'python:3.11-slim',
    command=['python', '-c', user_code],
    mem_limit='256m',
    network_disabled=True,
    remove=True
)
Recovery Patterns
Retry Logic
# Simple retry
@retry(max_attempts=3, backoff=exponential)
def call_llm(self, prompt):
    return self.client.generate(prompt)

# Retry with error feedback
def call_with_retry(self, prompt, max_retries=3):
    errors = []
    attempt_prompt = prompt
    for _ in range(max_retries):
        try:
            return self.llm.generate(attempt_prompt)
        except ParseError as e:
            errors.append(str(e))
            # Rebuild from the original prompt so earlier errors are not duplicated
            attempt_prompt = f"{prompt}\n\nPrevious errors: {errors}"
    raise MaxRetriesExceeded(errors)
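The `@retry` decorator in the simple case is shown without a definition. The following sketch is one way such a decorator could be written with exponential backoff. Its signature (`base_delay`, `retry_on`, `sleep`) is an assumption and differs from the `backoff=exponential` form above; the injectable `sleep` exists only so the example runs without waiting.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the last error propagate
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


# Usage: record delays instead of sleeping so the demo runs instantly.
delays = []
calls = {"n": 0}


@retry(max_attempts=3, base_delay=0.5, retry_on=(ConnectionError,), sleep=delays.append)
def call_llm(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return f"response to {prompt}"


answer = call_llm("hi")
```

Restricting `retry_on` to transient error types matters during analysis: a decorator that retries every `Exception` will also retry programming errors and validation failures that can never succeed.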
Fallback Mechanisms
def generate(self, prompt):
    try:
        return self.primary_llm.generate(prompt)
    except APIError:
        return self.fallback_llm.generate(prompt)
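In the two-model form above, a failure in the fallback propagates with no record of why the primary was skipped. A sketch of a generalized chain that keeps that record is shown below. `APIError` here is a stand-in class, and `AllProvidersFailed`, `generate_with_fallback`, `primary`, and `secondary` are hypothetical names introduced for the example.

```python
class APIError(Exception):
    """Hypothetical stand-in for a provider SDK's error type."""


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""
    def __init__(self, errors):
        super().__init__(f"{len(errors)} provider(s) failed")
        self.errors = errors


def generate_with_fallback(providers, prompt):
    """Try each (name, callable) in order; return (name, response) from the first success."""
    errors = []
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except APIError as e:
            errors.append((name, str(e)))  # remember why this provider was skipped
    raise AllProvidersFailed(errors)


def primary(prompt):
    raise APIError("503 from primary")


def secondary(prompt):
    return prompt.upper()


used, response = generate_with_fallback(
    [("primary", primary), ("secondary", secondary)], "hello"
)
```

Returning the provider name alongside the response lets callers log when degraded output was served, which is worth checking for when cataloging fallbacks.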
Circuit Breaker
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = 'closed'
        self.last_failure = None

    def call(self, func, *args):
        if self.state == 'open':
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = 'half-open'  # Let one trial call through
            else:
                raise CircuitOpen()
        try:
            result = func(*args)
            self.failures = 0
            self.state = 'closed'
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = 'open'
            raise
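Circuit breaker state transitions are hard to verify by reading code alone, because they depend on elapsed time. The variant below is a sketch that adds an injectable clock so the closed, open, and half-open transitions can be exercised without sleeping. The `clock` parameter, the `opened_at` attribute, and the `CircuitOpen` definition are assumptions added for this example.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpen()
            self.state = "half-open"  # allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call reopens immediately; otherwise wait for the threshold.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: now[0])


def failing():
    raise OSError("down")


for _ in range(2):
    try:
        breaker.call(failing)
    except OSError:
        pass
state_after_failures = breaker.state  # threshold reached

now[0] = 11.0  # advance the fake clock past reset_timeout
recovered = breaker.call(lambda: "ok")
state_after_recovery = breaker.state
```

Using a monotonic clock by default avoids spurious resets when the system wall clock is adjusted.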
Output Template
## Resilience Analysis: [Framework Name]
### Error Propagation Map
| Error Source | Error Type | Handling | Propagates? |
|--------------|-----------|----------|-------------|
| LLM API | RateLimitError | Retry 3x with backoff | No |
| LLM API | APIError | Retry 1x | Yes |
| Parser | ParseError | Feed back to LLM | No |
| Tool | Exception | Wrap and feed to LLM | No |
| Tool | Timeout | Kill process | No |
| State | ValidationError | Propagate | Yes |
### Sandboxing Assessment
- **Code Execution**: [Mechanism or None]
- **File System**: [Isolated/Restricted/Open]
- **Network**: [Blocked/Filtered/Open]
- **Resource Limits**: [Memory/CPU/Time limits]
### Recovery Mechanisms
| Pattern | Implementation | Location |
|---------|---------------|----------|
| Retry | Exponential backoff, 3 attempts | llm.py:L45 |
| Fallback | Secondary model | agent.py:L120 |
| Circuit Breaker | None | - |
### Risk Assessment
- **Critical Gaps**: [List any missing protections]
- **Production Ready**: [Yes/No/Needs work]
Integration
- Prerequisite: codebase-mapping to identify execution code
- Feeds into: antipattern-catalog for error handling issues
- Related: execution-engine-analysis for async error handling