Agent Skills: Jepsen Testing

Jepsen-style correctness testing for distributed systems under faults (partitions, crashes, clock skew) using concurrent operation histories and formal checkers (linearizability/serializability and Elle-style anomalies). Use when designing, implementing, or running Jepsen tests, or interpreting histories/violations.

UncategorizedID: plurigrid/asi/jepsen-testing

Install this agent skill to your local

pnpm dlx add-skill https://github.com/plurigrid/asi/tree/HEAD/.claude/skills/jepsen-testing

Skill Files

Browse the full folder contents for jepsen-testing.

Download Skill

Loading file tree…

.claude/skills/jepsen-testing/SKILL.md

Skill Metadata

Name
jepsen-testing
Description
Jepsen-style correctness testing for distributed systems under faults (partitions, crashes, clock skew) using concurrent operation histories and formal checkers (linearizability/serializability and Elle-style anomalies). Use when designing, implementing, or running Jepsen tests, or interpreting histories/violations.

Jepsen Testing

Intake

  • Identify the system under test and the exact client surface (Redis, S3, Kafka, HTTP, gRPC).
  • Define what "acknowledged" means for each operation (what does the client treat as committed).
  • Write the claimed consistency guarantees as a checkable property (linearizable register, RYW, monotonic reads, serializable txns).
  • Specify the failure model to test (crash-stop, partitions, clock skew, disk stalls, restarts).
  • Decide whether the test must be multi-surface (write via Redis, read via S3) to validate cross-frontend coherence.

Workload Design

  • Prefer the smallest workload that can falsify the claim.
  • Use a mix of reads and writes that creates ambiguous interleavings.
  • Add a "witness" invariant that is easy to explain:
  • Lost acknowledged write.
  • Read sees a value that cannot be explained by any sequential execution respecting real-time order.
  • List-append: element lost/duplicated or observed order implies a cycle.

Checker Selection

  • Register or map semantics: use a linearizability checker.
  • Transactional / multi-object semantics: use Elle-style anomaly detection (write cycles, dirty reads, lost updates).
  • If linearizability is too strong for the product, explicitly select a weaker model and encode it (do not silently downgrade).

Fault (Nemesis) Selection

  • Partitions: majority/minority splits, bridge partitions, flapping partitions.
  • Process faults: kill and restart, node reboot, rolling restarts.
  • Time faults: clock offsets and jumps if the system relies on time.
  • Storage faults: fsync latency, I/O stalls, disk-full behavior (only if safe and reversible).

Run and Minimize

  • Start with a short, low-concurrency run until the harness is stable.
  • When a failure appears, minimize by reducing:
  • Keys, operation count, and concurrency.
  • Fault intensity and schedule complexity.
  • Preserve determinism (fixed seeds, fixed partition schedule) so a failing history can be reproduced.

Reporting

  • State the exact claim under test and the precise pass/fail property.
  • Include the workload, nemesis schedule, and a minimal failing history excerpt.
  • Distinguish availability failures (timeouts) from safety failures (incorrect ok results).