Resilience Patterns Skill

Resilience Patterns

Core Principle

Resilience is a composition concern, not a business logic concern. Add retry/timeout at the workflow level, not inside functions.

Workflows
  -> step.retry() and step.withTimeout()
  -> (resilience here)

Business Functions
  -> fn(args, deps): Result<T, E>
  -> (no retry logic here)

Infrastructure
  -> pg, redis, http
  -> (just transport)

Required Behaviors

1. Retry at Workflow Level Only

NEVER add retry logic inside business functions:

// WRONG - Retry inside function
async function getUser(args, deps) {
  let attempts = 0;
  while (attempts < 3) {
    try {
      return await deps.db.findUser(args.userId);
    } catch { attempts++; }
  }
}

// CORRECT - Clean function, workflow handles retry
async function getUser(args, deps) {
  const user = await deps.db.findUser(args.userId);
  return user ? ok(user) : err('NOT_FOUND');
}

// Workflow adds resilience
const result = await workflow(async (step) => {
  const user = await step.retry(
    () => getUser({ userId }, deps),
    { attempts: 3, backoff: 'exponential' }
  );
  return user;
});

2. Never Double-Retry

Retry at ONE level only. Multiple layers create retry explosion:

3 (API) × 3 (Service) × 3 (DB Client) = 27 attempts!

This DDoS's your own infrastructure.

3. Only Retry Transient Errors

| Error Type | Retry? | Why | |-----------|--------|-----| | TIMEOUT | Yes | Transient | | CONNECTION_ERROR | Yes | Network hiccup | | RATE_LIMITED | Yes | Wait and retry | | NOT_FOUND | NO | Resource doesn't exist | | UNAUTHORIZED | NO | Credentials wrong | | VALIDATION_FAILED | NO | Input invalid |

const data = await step.retry(
  () => fetchFromApi(),
  {
    attempts: 3,
    retryOn: (error) => {
      const retryable = ['TIMEOUT', 'CONNECTION_ERROR', 'RATE_LIMITED'];
      return retryable.includes(error);
    },
  }
);

4. Never Retry Non-Idempotent Writes

// DANGEROUS - May double-charge
await step.retry(() => chargeCard(amount), { attempts: 3 });

// SAFE - Read is idempotent
await step.retry(() => getUser(userId), { attempts: 3 });

// SAFE - With idempotency key
await step.retry(
  () => chargeCard(amount, { idempotencyKey }),
  { attempts: 3 }
);

5. Always Set Timeouts

Never let operations hang indefinitely:

const data = await step.withTimeout(
  () => slowOperation(),
  { ms: 2000 }
);

6. Always Use Jitter

Prevents thundering herd when multiple instances retry:

// Without jitter - all instances retry at same time
// With jitter - spread out, infrastructure can recover

step.retry(() => fetchData(), {
  attempts: 3,
  backoff: 'exponential',
  jitter: true,  // ALWAYS enable in production
});

7. Combine Retry and Timeout

Each attempt gets its own timeout:

const data = await step.retry(
  () => fetchData(),
  {
    attempts: 3,
    timeout: { ms: 2000 },  // 2s per attempt
  }
);
// Total max time: 3 × 2s = 6s

Recommended Defaults

| Operation | Attempts | Backoff | Initial Delay | Timeout | |-----------|----------|---------|---------------|---------| | DB read | 3 | exponential | 50ms | 5s | | DB write | 1 | - | - | 10s | | HTTP API | 3 | exponential | 100ms | 30s | | Cache | 2 | fixed | 10ms | 500ms |

Full Example

import { createWorkflow } from '@jagreehal/workflow';

// Clean business function
async function getUser(args, deps): AsyncResult<User, 'NOT_FOUND' | 'DB_ERROR'> {
  try {
    const user = await deps.db.findUser(args.userId);
    return user ? ok(user) : err('NOT_FOUND');
  } catch {
    return err('DB_ERROR');
  }
}

// Workflow adds resilience
const loadUser = createWorkflow({ getUser });

const result = await loadUser(async (step) => {
  const user = await step.retry(
    () => getUser({ userId }, deps),
    {
      attempts: 3,
      backoff: 'exponential',
      initialDelay: 100,
      maxDelay: 2000,
      jitter: true,
      timeout: { ms: 5000 },
    }
  );
  return user;
});

8. Retrying Multi-Step Operations

Sometimes you need to retry a multi-step operation. Use step.retry() to wrap the entire sequence:

const syncUserToProvider = createWorkflow({ findUser, syncUser, markSynced });

const result = await syncUserToProvider(async (step) => {
  // Retry the whole operation
  const user = await step.retry(
    async () => {
      const user = await step(() => findUser({ userId }, deps));
      await step(() => syncUser({ user }, deps));  // Must be idempotent!
      await step(() => markSynced({ userId }, deps));
      return user;
    },
    {
      attempts: 2,
      backoff: 'exponential',
    }
  );
  return user;
});

Important: The entire sequence must be idempotent. If syncUser is called twice, it should have the same effect as calling it once.

9. Circuit Breakers

When a service is down, stop hammering it. Circuit breakers prevent cascade failures:

// Circuit breaker states
// CLOSED: Normal operation, requests go through
// OPEN: Service down, fail fast without trying
// HALF_OPEN: Testing if service recovered

Circuit breakers are outside the scope of step.retry(), but consider libraries like opossum or cockatiel for production systems where dependencies fail frequently.

When to use circuit breakers:

External APIs that may be down for extended periods
Services with rate limits that trigger failures
Downstream dependencies in microservices

Don't use for:

Database calls (usually want retry instead)
Internal function calls

10. Handling Timeout Errors

Use helpers to detect and handle timeouts:

import { isStepTimeoutError, getStepTimeoutMeta } from '@jagreehal/workflow';

const result = await workflow(async (step) => {
  const data = await step.withTimeout(
    () => slowOperation(),
    { ms: 5000 }
  );
  return data;
});

if (!result.ok && isStepTimeoutError(result.error)) {
  const meta = getStepTimeoutMeta(result.error);
  deps.logger.warn('Operation timed out', {
    timeoutMs: meta?.timeoutMs,
    attempt: meta?.attempt,
  });
}

The Rules

| Failure Type | Where to Retry | |-------------|----------------| | Transport/network | Workflow level | | Idempotent reads | Workflow level | | Non-idempotent writes | NEVER (or with idempotency key) | | Multi-step operation | Workflow level (if idempotent) |

Retry at workflow level only
Never double-retry across layers
Only retry transient errors
Never retry non-idempotent writes without idempotency key
Always set timeouts
Always use jitter in production
Use circuit breakers for external services

Agent Skills: Resilience Patterns

Install this agent skill to your local

Skill Files