Agent Skills: Mutation Testing

Validates test suite quality through mutation testing. Generates intelligent code mutations, runs tests to verify they catch the changes, and identifies gaps in test coverage. Use when evaluating test effectiveness, validating newly written tests, or improving test quality for mission-critical code.

Category: Uncategorized
ID: roasbeef/claude-files/mutation-testing

Install this agent skill locally:

pnpm dlx add-skill https://github.com/Roasbeef/claude-files/tree/HEAD/skills/mutation-testing

skills/mutation-testing/SKILL.md

Skill Metadata

Name: mutation-testing
Description: Validates Go test suite quality through mutation testing using go-gremlins/gremlins. Mutates production code, runs the test suite against each mutant, and reports which mutants the tests fail to kill, exposing weak assertions that line coverage cannot detect. Use when evaluating test effectiveness, validating newly written tests, or improving test quality for mission-critical code (consensus, channel state, payment flows, crypto). Triggers: "mutation test", "are these tests strong", "validate test quality", "/mutation-testing".

Mutation Testing

Mutation testing evaluates test quality by introducing small, deliberate bugs into production code (mutants) and checking whether the test suite fails. A test that passes on a mutant did not actually verify the behavior the mutant changed.

This skill is a thin orchestrator over go-gremlins/gremlins — a maintained Go mutation testing tool. The skill provides install, run, and analysis wrappers that produce machine-readable JSON for downstream tooling (notably the test-refine skill).

Why Mutation Testing

A test suite can hit 100% line coverage and still be useless: tests can execute code without asserting on its results, or assert only on fields irrelevant to the behavior under test. Mutation testing closes this gap by checking whether the test suite distinguishes the original code from a mutant. See references/coverage-pitfalls.md (in the test-refine skill) for the broader context.
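To make the gap concrete, here is a minimal Go sketch. The function `CanSpend`, the hand-written `canSpendMutant`, and both checks are hypothetical names invented for illustration; gremlins generates and runs mutants on its own, so nothing like `canSpendMutant` would appear in a real codebase.

```go
package main

import "fmt"

// CanSpend reports whether a balance covers an amount (hypothetical example).
func CanSpend(balance, amount int64) bool {
	return amount <= balance
}

// canSpendMutant is what a conditionals-boundary mutation would produce:
// "<=" becomes "<". Written by hand here purely for illustration.
func canSpendMutant(balance, amount int64) bool {
	return amount < balance
}

// weakCheck executes the code but asserts only on an input far from the boundary.
func weakCheck(f func(int64, int64) bool) bool {
	return f(100, 10)
}

// strongCheck also pins the boundary: spending the exact balance is allowed,
// spending one unit more is not.
func strongCheck(f func(int64, int64) bool) bool {
	return f(100, 10) && f(100, 100) && !f(100, 101)
}

func main() {
	fmt.Println("weak check, original:  ", weakCheck(CanSpend))
	fmt.Println("weak check, mutant:    ", weakCheck(canSpendMutant)) // still true: the mutant lives
	fmt.Println("strong check, original:", strongCheck(CanSpend))
	fmt.Println("strong check, mutant:  ", strongCheck(canSpendMutant)) // false: the mutant is killed
}
```

Both checks give the same line coverage of `CanSpend`; only the second one distinguishes the original from the mutant.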

When to Use

  • After generating tests with test-forge or by hand — verify they have real assertions.
  • Before merging consensus / payment / crypto code — quality gate on critical paths.
  • During code review — surface weak tests in the diff.
  • As a signal source for test-refine — survivors map to weak-assertion findings.

Target efficacy (gremlins terminology: test_efficacy = killed / (killed + lived)):

| Code class | Target |
|---|---|
| Mission-critical (consensus, wallet, channel, crypto) | 90%+ |
| Core business logic | 80–90% |
| General code | 70–80% |
| Trivial/glue code | run only if cheap |

Workflow

1. Install gremlins (once)

~/.claude/skills/mutation-testing/scripts/install-gremlins.sh

The script pins to a known-good version (override with GREMLINS_VERSION=...). Requires go on PATH and $(go env GOPATH)/bin on PATH.

2. Run mutations

# Default: cwd, JSON to .reviews/mutations/<slug>.json
~/.claude/skills/mutation-testing/scripts/unleash.sh

# Targeted package
~/.claude/skills/mutation-testing/scripts/unleash.sh \
    --pkg ./internal/wallet \
    --output .reviews/mutations/wallet.json

# With integration tests and a config file
~/.claude/skills/mutation-testing/scripts/unleash.sh \
    --pkg ./internal/channel \
    --integration \
    --config .gremlins.yaml \
    --silent

3. Analyze survivors

~/.claude/skills/mutation-testing/scripts/analyze-survivors.sh \
    --input .reviews/mutations/wallet.json \
    --output .reviews/mutations/wallet.md

Produces a markdown report with: efficacy/coverage summary, survivors ranked by file (consensus/channel/wallet paths bubble to the top), and mutator-type breakdown.

Gremlins JSON Schema

gremlins unleash --output <file> emits a single JSON document:

{
  "go_module": "github.com/example/foo",
  "test_efficacy": 91.11,
  "mutations_coverage": 90.00,
  "mutants_total": 100,
  "mutants_killed": 82,
  "mutants_lived": 8,
  "mutants_not_viable": 2,
  "mutants_not_covered": 10,
  "elapsed_time": 123.456,
  "files": [
    {
      "file_name": "wallet.go",
      "mutations": [
        { "line": 42, "column": 8, "type": "CONDITIONALS_NEGATION", "status": "KILLED" }
      ]
    }
  ]
}

Mutation status values:

| Status | Meaning | Action |
|---|---|---|
| KILLED | Test suite caught the mutation | Good; no action |
| LIVED | Tests passed despite mutation | Survivor; strengthen tests |
| NOT COVERED | Mutation in code no test exercises | Add a test for that path |
| TIMED OUT | Tests timed out (implicit kill) | Investigate (might be perf bug) |
| NOT VIABLE | Mutation produced uncompilable code | Excluded from score |
| RUNNABLE | Dry-run only; would be tested | (only in --dry-run) |

Key metrics:

  • test_efficacy = killed / (killed + lived) — quality of assertions on covered code.
  • mutations_coverage = (killed + lived) / (killed + lived + not_covered) — how much code is exercised at all.

A high mutations_coverage with low test_efficacy means tests run code without verifying its behavior — the classic "100% line coverage, 0% real testing" failure mode.
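The two formulas can be worked through in a few lines of Go. This is a sketch of the arithmetic as stated above, not gremlins' own implementation; the `Counts` type is hypothetical, and returning 0 for an empty denominator is an assumption made here to avoid dividing by zero.

```go
package main

import "fmt"

// Counts holds per-status mutant totals from a run (hypothetical type).
type Counts struct {
	Killed, Lived, NotCovered int
}

// Efficacy returns killed / (killed + lived) as a percentage.
// It returns 0 when no mutant was covered (an assumption of this sketch).
func (c Counts) Efficacy() float64 {
	covered := c.Killed + c.Lived
	if covered == 0 {
		return 0
	}
	return 100 * float64(c.Killed) / float64(covered)
}

// Coverage returns (killed + lived) / (killed + lived + not_covered) as a percentage.
func (c Counts) Coverage() float64 {
	covered := c.Killed + c.Lived
	total := covered + c.NotCovered
	if total == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(total)
}

func main() {
	// Every mutant is reached by some test, but almost none are detected.
	hollow := Counts{Killed: 10, Lived: 90, NotCovered: 0}
	fmt.Printf("hollow: test_efficacy=%.2f mutations_coverage=%.2f\n",
		hollow.Efficacy(), hollow.Coverage()) // 10.00 and 100.00

	// Half the code is untested, but what is tested is asserted on well.
	narrow := Counts{Killed: 45, Lived: 5, NotCovered: 50}
	fmt.Printf("narrow: test_efficacy=%.2f mutations_coverage=%.2f\n",
		narrow.Efficacy(), narrow.Coverage()) // 90.00 and 50.00
}
```

The `hollow` case is the failure mode described above; the `narrow` case calls for new tests rather than stronger assertions.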

Configuration

Gremlins is configured via .gremlins.yaml (or --config <path>). Mutators ship default-on for safe operators and default-off for aggressive ones.

Default-on mutators (enabled unless turned off in config):

  • arithmetic-base: + - * / %
  • conditionals-boundary: < <= > >=
  • conditionals-negation: == !=, boolean conditions
  • increment-decrement: ++ --
  • invert-negatives: -x → +x

Default-off mutators — enable for critical packages:

  • invert-assignments: swaps += -= *= /= etc.
  • invert-bitwise: swaps & | ^
  • invert-bwassign: swaps &= |= ^=
  • invert-logical: && ↔ || (security-critical: catches auth bypass mutations)
  • invert-loopctrl: break ↔ continue
  • remove-self-assignments: drops x = x op y updates
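To illustrate why invert-logical matters for authorization code, consider the following Go sketch. `Authorized`, the hand-written `authorizedMutant`, and the truth table are hypothetical and exist only for this example; gremlins applies the `&&` to `||` swap itself.

```go
package main

import "fmt"

// Authorized requires both a valid token and the admin role (hypothetical).
func Authorized(validToken, isAdmin bool) bool {
	return validToken && isAdmin
}

// authorizedMutant is the invert-logical mutant: "&&" replaced by "||".
func authorizedMutant(validToken, isAdmin bool) bool {
	return validToken || isAdmin
}

// truthTable lists every input combination with the expected result.
var truthTable = []struct{ token, admin, want bool }{
	{false, false, false},
	{false, true, false},
	{true, false, false},
	{true, true, true},
}

// passes reports whether f matches the whole truth table.
func passes(f func(bool, bool) bool) bool {
	for _, tc := range truthTable {
		if f(tc.token, tc.admin) != tc.want {
			return false
		}
	}
	return true
}

func main() {
	// Checking only the all-true and all-false rows cannot tell the two apart.
	fmt.Println("agree on (true, true):  ", Authorized(true, true) == authorizedMutant(true, true))
	fmt.Println("agree on (false, false):", Authorized(false, false) == authorizedMutant(false, false))

	// The mixed rows are what kill the mutant.
	fmt.Println("original passes table:", passes(Authorized))
	fmt.Println("mutant passes table:  ", passes(authorizedMutant))
}
```

A suite that only tests the happy path and the fully unauthenticated path leaves this mutant alive, which is exactly an auth bypass that no test would notice.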

Recommended config for consensus/wallet/payment code:

silent: false
unleash:
  workers: 0          # use all CPUs
  test-cpu: 0         # no per-test CPU pinning
  threshold:
    efficacy: 90      # fail if below 90%
    mutant-coverage: 85
mutants:
  arithmetic-base:        { enabled: true }
  conditionals-boundary:  { enabled: true }
  conditionals-negation:  { enabled: true }
  increment-decrement:    { enabled: true }
  invert-negatives:       { enabled: true }
  invert-assignments:     { enabled: true }
  invert-bitwise:         { enabled: true }
  invert-bwassign:        { enabled: true }
  invert-logical:         { enabled: true }   # critical for && / || in auth
  invert-loopctrl:        { enabled: true }
  remove-self-assignments:{ enabled: true }

See gremlins.dev configuration docs for the full schema.

Threshold Gating (CI)

For CI, use --silent and set thresholds in config or via env vars:

gremlins unleash --silent --output mutations.json ./...
# Exit nonzero if efficacy < threshold.

The unleash.threshold.efficacy and unleash.threshold.mutant-coverage keys cause gremlins to exit nonzero when the run falls below the configured percentages — wire this into your PR check.

Integration with Other Skills

test-refine

The test-refine skill consumes gremlins JSON to identify weak-assertion zones (smell S12: mutation-survivor). When invoked with --use-mutations, it calls unleash.sh and cross-references LIVED mutants with the AST smell scan.

test-forge

After test-forge generates tests, run mutation testing to validate them. LIVED mutants are direct evidence of weak assertions in the generated tests.

code-review

Include the test_efficacy delta in PR review: a drop of more than 5 percentage points on covered code is a strong signal that test quality is weakening.
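A delta check of this kind could be sketched in Go as follows. The `CheckRegression` helper and the `maxDrop` constant are hypothetical, and the 5% figure is read here as an absolute drop in percentage points between the base and head runs, which is an assumption about the intended meaning.

```go
package main

import "fmt"

// maxDrop is the tolerated drop in test_efficacy, in percentage points.
// Assumption: the 5% figure is an absolute difference, not a relative one.
const maxDrop = 5.0

// CheckRegression compares test_efficacy from the base and head reports and
// returns an error when the head run is more than maxDrop points lower.
func CheckRegression(base, head float64) error {
	if drop := base - head; drop > maxDrop {
		return fmt.Errorf("test_efficacy dropped %.2f points (%.2f -> %.2f)", drop, base, head)
	}
	return nil
}

func main() {
	fmt.Println(CheckRegression(91.0, 84.0)) // drop of 7 points: reported
	fmt.Println(CheckRegression(91.0, 88.0)) // drop of 3 points: <nil>
	fmt.Println(CheckRegression(80.0, 90.0)) // improvement: <nil>
}
```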

Interpreting Results

High efficacy (≥90%): Tests have strong assertions. Focus remaining work on NOT COVERED mutants (uncovered code paths).

Medium (75–90%): Tests cover main paths. Survivors usually indicate boundary or error-path gaps.

Low (<75%): Significant gaps — tests likely run code without checking outputs. Pair with test-refine to identify the specific smells.

Mutator breakdown tells you the kind of weakness:

  • conditionals-boundary LIVED → missing edge tests at thresholds.
  • invert-logical LIVED → missing truth-table coverage for &&/||.
  • arithmetic-base LIVED → tests don't verify calculation results.
  • remove-self-assignments LIVED → state mutations not asserted.

Equivalent Mutants

Some LIVED mutants are semantically equivalent to the original — no test could kill them. Common cases:

  • Mutated value immediately overwritten before being read.
  • Mutation in unreachable code.
  • Operator swap in associative/commutative context with no observable difference.

When you identify an equivalent mutant, document it (e.g., a comment near the mutation site, or a project-level EQUIVALENT_MUTANTS.md) so reviewers don't waste time on it. Gremlins doesn't filter equivalents automatically.

Gremlins Limitations

From the upstream README: gremlins targets smallish Go modules (microservices). On very large modules, runs can take hours. Mitigations:

  • Per-package runs via --pkg ./internal/wallet. Don't pass ./... on a 500k-LOC monorepo.
  • Skip generated code by using build tags or running on hand-written packages only.
  • Use --workers to bound parallelism if memory is tight.
  • Use --dry-run first to preview the mutation count and skip if it's too large.

Further Reading