Jepsen Testing
Intake
- Identify the system under test and the exact client surface (Redis, S3, Kafka, HTTP, gRPC).
- Define what "acknowledged" means for each operation (what does the client treat as committed).
- Write the claimed consistency guarantees as a checkable property (linearizable register, RYW, monotonic reads, serializable txns).
- Specify the failure model to test (crash-stop, partitions, clock skew, disk stalls, restarts).
- Decide whether the test must be multi-surface (write via Redis, read via S3) to validate cross-frontend coherence.
Workload Design
- Prefer the smallest workload that can falsify the claim.
- Use a mix of reads and writes that creates ambiguous interleavings.
- Add a "witness" invariant that is easy to explain:
- Lost acknowledged write.
- Read sees a value that cannot be explained by any sequential execution respecting real-time order.
- List-append: element lost/duplicated or observed order implies a cycle.
Checker Selection
- Register or map semantics: use a linearizability checker.
- Transactional / multi-object semantics: use Elle-style anomaly detection (write cycles, dirty reads, lost updates).
- If linearizability is too strong for the product, explicitly select a weaker model and encode it (do not silently downgrade).
Fault (Nemesis) Selection
- Partitions: majority/minority splits, bridge partitions, flapping partitions.
- Process faults: kill and restart, node reboot, rolling restarts.
- Time faults: clock offsets and jumps if the system relies on time.
- Storage faults: fsync latency, I/O stalls, disk-full behavior (only if safe and reversible).
Run and Minimize
- Start with a short, low-concurrency run until the harness is stable.
- When a failure appears, minimize by reducing:
- Keys, operation count, and concurrency.
- Fault intensity and schedule complexity.
- Preserve determinism (fixed seeds, fixed partition schedule) so a failing history can be reproduced.
Reporting
- State the exact claim under test and the precise pass/fail property.
- Include the workload, nemesis schedule, and a minimal failing history excerpt.
- Distinguish availability failures (timeouts) from safety failures (incorrect ok results).