Jepsen Testing

Identify the system under test and the exact client surface (Redis, S3, Kafka, HTTP, gRPC).
Define what "acknowledged" means for each operation (that is, which client-visible response the client treats as committed).
Write the claimed consistency guarantees as checkable properties (linearizable register, read-your-writes, monotonic reads, serializable transactions).
Specify the failure model to test (crash-stop, partitions, clock skew, disk stalls, restarts).
Decide whether the test must be multi-surface (write via Redis, read via S3) to validate cross-frontend coherence.
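
One way to pin down what "acknowledged" means is to record every client call as an invoke/completion pair with one of three outcomes. The sketch below is illustrative only: `Op`, `run_op`, and `DefiniteFailure` are hypothetical names, and which errors count as definite failures depends on the client surface under test.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


class DefiniteFailure(Exception):
    """Hypothetical: raised when the client can prove the op did not take effect."""


@dataclass
class Op:
    process: int   # logical client id
    type: str      # "invoke", "ok", "fail", or "info"
    f: str         # operation name, e.g. "read" or "write"
    value: Any     # argument on invoke, result on completion
    time: int      # monotonic timestamp in nanoseconds


def run_op(history: List[Op], process: int, f: str, value: Any,
           call: Callable[[Any], Any]) -> Op:
    """Run one client call and append its invoke and completion to the history."""
    history.append(Op(process, "invoke", f, value, time.monotonic_ns()))
    try:
        outcome, result = "ok", call(value)   # acknowledged: treated as committed
    except DefiniteFailure:
        outcome, result = "fail", value       # known not to have happened
    except Exception:
        outcome, result = "info", value       # timeout etc.: outcome unknown
    done = Op(process, outcome, f, result, time.monotonic_ns())
    history.append(done)
    return done
```

Only "ok" completions count as acknowledged. Treating an unknown outcome as a failure would let the checker miss writes that actually took effect.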

Prefer the smallest workload that can falsify the claim.
Use a mix of reads and writes that creates ambiguous interleavings.
Add a "witness" invariant that is easy to explain:
Lost acknowledged write.
Read sees a value that cannot be explained by any sequential execution respecting real-time order.
List-append: an element is lost or duplicated, or the observed orders imply a dependency cycle.
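
The "lost acknowledged write" witness can be checked with a few set operations. This is a minimal sketch, assuming a workload that adds unique elements to a collection and then performs one final read; the function name and the dictionary-based history format are assumptions made for illustration.

```python
from collections import Counter


def check_set_history(history, final_read):
    """Compare acknowledged adds against the elements seen in a final read.

    history: list of dicts with keys "type" ("invoke"/"ok"/"fail"/"info"),
             "f" (operation name), and "value" (the element added).
    final_read: list of elements returned by the final read.
    """
    attempted = {op["value"] for op in history
                 if op["type"] == "invoke" and op["f"] == "add"}
    acked = {op["value"] for op in history
             if op["type"] == "ok" and op["f"] == "add"}
    counts = Counter(final_read)
    seen = set(counts)

    lost = sorted(acked - seen)              # acknowledged but missing
    unexpected = sorted(seen - attempted)    # present but never written
    duplicated = sorted(v for v, n in counts.items() if n > 1)
    recovered = sorted((seen & attempted) - acked)  # unacknowledged but present
    return {
        "lost": lost,
        "unexpected": unexpected,
        "duplicated": duplicated,
        "recovered": recovered,
        "valid": not (lost or unexpected or duplicated),
    }
```

Recovered elements are not violations: an add whose outcome was unknown is allowed to have taken effect.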

Register or map semantics: use a linearizability checker.
Transactional / multi-object semantics: use Elle-style anomaly detection (write cycles, dirty reads, lost updates).
If linearizability is too strong for the product, explicitly select a weaker model and encode it (do not silently downgrade).
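
For intuition about what a linearizability checker does with a single register, the brute-force search below tries every order of operations that respects real-time precedence. It is a teaching sketch, not a replacement for a production checker: the `Call` type and `is_linearizable` are hypothetical names, and the search is exponential in the worst case, so it only suits very small histories.

```python
import math
from functools import lru_cache
from typing import Any, NamedTuple


class Call(NamedTuple):
    f: str        # "read" or "write"
    value: Any    # value written, or value the read returned (must be hashable)
    start: float  # invocation time
    end: float    # completion time; math.inf if the outcome is unknown


def is_linearizable(calls, initial=None) -> bool:
    """Return True if some sequential order explains the register history."""
    calls = tuple(calls)

    @lru_cache(maxsize=None)
    def search(remaining: frozenset, state) -> bool:
        # Calls with unknown outcome may simply never have taken effect.
        if all(calls[i].end == math.inf for i in remaining):
            return True
        earliest_end = min(calls[i].end for i in remaining)
        for i in remaining:
            c = calls[i]
            if c.start > earliest_end:
                continue  # another pending call completed before this one began
            if c.f == "write":
                next_state = c.value
            elif c.value == state:
                next_state = state  # the read agrees with the current value
            else:
                continue            # the read cannot be placed here
            if search(remaining - {i}, next_state):
                return True
        return False

    return search(frozenset(range(len(calls))), initial)
```

For example, a read that returns the initial value after a write of 1 has completed is rejected, while the same read overlapping the write is accepted.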

Partitions: majority/minority splits, bridge partitions, flapping partitions.
Process faults: kill and restart, node reboot, rolling restarts.
Time faults: clock offsets and jumps if the system relies on time.
Storage faults: fsync latency, I/O stalls, disk-full behavior (only if safe and reversible).
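
The partition shapes above can be described as plain data before any network rules are touched. The sketch below computes, for each node, the set of peers whose traffic it should drop; the function names are hypothetical, and applying the result (for example with firewall rules) is left to the harness.

```python
def majority_minority(nodes):
    """Split nodes into a majority component and a minority component."""
    nodes = list(nodes)
    k = len(nodes) // 2 + 1
    return [nodes[:k], nodes[k:]]


def bridge(nodes):
    """Two components that cannot see each other, joined by one shared node."""
    nodes = list(nodes)
    mid = len(nodes) // 2
    shared = nodes[mid]
    return [nodes[:mid] + [shared], [shared] + nodes[mid + 1:]]


def drop_map(components):
    """Map each node to the set of nodes it should drop traffic from."""
    all_nodes = set().union(*components)
    result = {}
    for node in all_nodes:
        # A node can reach everything in every component it belongs to.
        reachable = set().union(*(c for c in components if node in c))
        result[node] = all_nodes - reachable
    return result


def flapping(components, period_s, cycles):
    """Alternate between applying and healing a partition every period_s seconds."""
    events = []
    for i in range(cycles):
        events.append((2 * i * period_s, "partition", components))
        events.append(((2 * i + 1) * period_s, "heal", None))
    return events
```

With five nodes, `drop_map(bridge(nodes))` leaves the middle node able to reach everyone while the two sides cannot reach each other.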

Start with a short, low-concurrency run until the harness is stable.
When a failure appears, minimize by reducing:
Keys, operation count, and concurrency.
Fault intensity and schedule complexity.
Preserve determinism (fixed seeds, fixed partition schedule) so a failing history can be reproduced.
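
A fixed seed only helps if every random choice in the schedule flows from it. The following sketch derives a fault schedule from a single seed using a private random generator; the function name, fault labels, and gap parameters are illustrative assumptions rather than part of any particular harness.

```python
import random


def nemesis_schedule(seed, duration_s,
                     faults=("partition", "kill", "clock-skew"),
                     min_gap_s=5.0, max_gap_s=15.0):
    """Build a reproducible list of (time_s, action, fault) events."""
    rng = random.Random(seed)  # private generator: no shared global state
    t = 0.0
    events = []
    while True:
        t += rng.uniform(min_gap_s, max_gap_s)   # quiet period before the fault
        if t >= duration_s:
            break
        fault = rng.choice(faults)
        events.append((round(t, 3), "start", fault))
        t += rng.uniform(min_gap_s, max_gap_s)   # how long the fault lasts
        events.append((round(min(t, duration_s), 3), "stop", fault))
    return events
```

Storing the seed alongside the failing history lets the same schedule be regenerated later. Timing inside the system under test can still vary between runs, so the schedule is reproducible even when the exact interleaving is not.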

State the exact claim under test and the precise pass/fail property.
Include the workload, nemesis schedule, and a minimal failing history excerpt.
Distinguish availability failures (timeouts and errors, where the operation's outcome may be unknown) from safety failures (incorrect results returned as ok).