Test Improvement
Improve tests by making them behavioral, lean, and useful.
Role-gated action
Detect your capability from your tools, not from prose:
- Write-capable role (engineer): apply the test changes and run the verification command.
- Read-only role (reviewer): identify weak, missing, or brittle tests and emit the changes using the Proposed Changes contract defined under Verify and report. Apply nothing and run nothing: a reviewer has no edit or Bash tools, so review mode is its natural fit.
Language detection and references
Detect the language from the file extensions in scope and load the matching reference for language-specific test patterns and tooling:
- Go → references/go.md
- Python → references/python.md
- TypeScript → references/typescript.md
- Web → references/web.md
Mixed languages: load each matching reference. Unknown language: use the generic principles below only.
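The detection rule above can be sketched in Python. This is a minimal sketch, not a prescribed implementation. The extension-to-language mapping is an assumption: in particular, which extensions count as Web, and treating .tsx as TypeScript. Only the reference paths come from the list. references_for is a hypothetical helper name.

```python
from pathlib import Path

# Hypothetical mapping: the extensions are assumptions; the reference
# paths are the ones listed in the text.
REFERENCES = {
    ".go": "references/go.md",
    ".py": "references/python.md",
    ".ts": "references/typescript.md",
    ".tsx": "references/typescript.md",
    ".html": "references/web.md",
    ".css": "references/web.md",
    ".js": "references/web.md",
}


def references_for(paths: list[str]) -> list[str]:
    """Return every matching reference once, in first-seen order.

    Mixed languages yield several references; an empty list means the
    language is unknown and only the generic principles apply.
    """
    found: list[str] = []
    for path in paths:
        ref = REFERENCES.get(Path(path).suffix.lower())
        if ref is not None and ref not in found:
            found.append(ref)
    return found
```

For example, references_for(["api/user.go", "web/app.ts"]) returns ["references/go.md", "references/typescript.md"], while references_for(["Makefile"]) returns an empty list.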
Modes
- review → identify weak, duplicate, brittle, or missing tests
- refactor → consolidate into table-driven / parametrized / test.each tests and remove waste
- coverage → add tests for uncovered business behavior
- tdd → red-green-refactor loop for a feature or bug fix
- full → review + refactor + coverage
Testing principles
- Test behavior through public interfaces, not implementation details.
- The module interface is the test surface.
- Mock only system boundaries: external APIs, network, time, randomness, filesystem, subprocesses.
- Do not mock your own internal collaborators just to make tests easy.
- Prefer integration-style tests when they give a clear, stable signal.
- Delete old shallow tests once deeper interface tests cover the behavior.
- No pointless tests for getters, constructors, default props, or generated glue.
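Here is a minimal sketch of mocking only a system boundary. It assumes a hypothetical is_trial_expired function whose only external dependency is the clock. Time is injected as a callable, so the tests swap the boundary for a fixed clock. They never patch, or assert on, the internal _trial_end helper. The tests are plain pytest-style test_ functions with assert, so pytest can collect them, but they also run without pytest.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

TRIAL_LENGTH = timedelta(days=14)


def _trial_end(started_at: datetime) -> datetime:
    # Internal collaborator: covered through is_trial_expired, never mocked.
    return started_at + TRIAL_LENGTH


def is_trial_expired(
    started_at: datetime,
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> bool:
    """Public interface; the clock is the only system boundary."""
    return now() >= _trial_end(started_at)


START = datetime(2024, 1, 1, tzinfo=timezone.utc)


def fixed_clock(moment: datetime) -> Callable[[], datetime]:
    # Stand-in for the time boundary: always returns the same instant.
    return lambda: moment


def test_trial_still_active_one_day_before_end():
    assert not is_trial_expired(START, now=fixed_clock(START + timedelta(days=13)))


def test_trial_expires_exactly_at_end():
    assert is_trial_expired(START, now=fixed_clock(START + timedelta(days=14)))
```

Injecting the clock keeps production calls unchanged, since the default reads the real time. Tests stay deterministic without touching how the trial end is computed.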
TDD workflow
Use this for test-first or red-green-refactor work.
- Confirm the public interface and first behavior.
- Write one failing test for one behavior.
- Run it and watch it fail for the expected reason.
- Implement the smallest code that passes.
- Run the narrow test.
- Repeat one vertical slice at a time.
- Refactor only when green.
Do not write all tests first. Writing many failing tests up front (bulk RED) produces tests for imagined behavior, coupled to a guessed implementation.
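Two vertical slices of this loop, sketched in Python for a hypothetical parse_price_cents function. A single file can only show the end state, so the comments mark the order: each test was written and seen failing before the code that satisfies it was added. Using Decimal to avoid float rounding is a choice of this sketch, not a requirement of the workflow.

```python
from decimal import Decimal


# Slice 1, RED first: one behavior through the public function.
def test_parses_amount_to_cents():
    assert parse_price_cents("12.50") == 1250


# Slice 2, written only after slice 1 was green.
def test_rejects_negative_amount():
    try:
        parse_price_cents("-1.00")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative price")


def parse_price_cents(text: str) -> int:
    amount = Decimal(text)
    # Guard added in slice 2, after test_rejects_negative_amount failed.
    if amount < 0:
        raise ValueError("price must not be negative")
    # Slice 1: the smallest conversion that made the first test pass.
    return int(amount * 100)
```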
Review workflow
Explore the existing test suite. The engineer runs the commands below to gather coverage. The reviewer (no Bash) works from the test files in scope plus any coverage output the caller supplies; if that output is missing, ask for it rather than running the commands:
```bash
# Go
go test -coverprofile=/tmp/cc-cov.out ./... && go tool cover -func=/tmp/cc-cov.out
# Python
pytest --cov=. --cov-report=term-missing
# TypeScript
bun test --coverage
```
Look for:
- tests coupled to private helpers or call counts
- tests that should be table-driven / parametrized / test.each
- duplicate scenarios
- weak mocks hiding real behavior
- missing success, error, and edge cases on business logic
- no usable seam for testing real behavior
Preferred consolidation patterns
For refactoring brittle private-helper tests, state the public behavior surface first. Example: create_user(payload) is the primary test surface; _normalize_user_payload() is not. Replace duplicate helper tests and internal call-count assertions with behavior checks through the public API. Mock only system boundaries. Delete shallow duplicates once the public behavior tests cover them.
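The create_user refactor described above can be sketched as follows. The bodies of create_user and _normalize_user_payload, and the User type, are hypothetical; only the two function names come from the example. One table of behavior cases exercises the public function, replacing direct helper tests and call-count assertions. The table is looped over in plain Python so the block runs without pytest. In a pytest suite, the same rows would become @pytest.mark.parametrize entries built with pytest.param(..., id=...), which also reports each failing row separately instead of stopping at the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    email: str
    name: str


def _normalize_user_payload(payload: dict) -> dict:
    # Private helper: covered through create_user, not tested directly.
    return {
        "email": payload["email"].strip().lower(),
        "name": " ".join(payload.get("name", "").split()),
    }


def create_user(payload: dict) -> User:
    """Public test surface (hypothetical body)."""
    data = _normalize_user_payload(payload)
    if "@" not in data["email"]:
        raise ValueError("invalid email")
    return User(email=data["email"], name=data["name"])


# One behavior table replaces separate helper tests and call-count checks.
CASES = [
    ("trims and lowercases email", {"email": "  Ann@Example.COM "}, User("ann@example.com", "")),
    ("collapses whitespace in name", {"email": "a@b.c", "name": " Ann   Lee "}, User("a@b.c", "Ann Lee")),
]


def test_create_user_normalizes_input():
    for case_name, payload, expected in CASES:
        assert create_user(payload) == expected, case_name


def test_create_user_rejects_invalid_email():
    try:
        create_user({"email": "not-an-email"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for an invalid email")
```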
- Go: table-driven tests with t.Run(tc.name, ...)
- Python: @pytest.mark.parametrize with pytest.param()
- TypeScript: it.each([{ input, expected, name }])
Extract helpers only after 3+ repetitions and only when the helper improves readability.
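A sketch of a helper that has earned its place: a hypothetical make_order test-data builder, extracted once the same order setup appeared in three tests. order_total and the order shape are assumptions made for illustration. Each test passes only the fields its behavior depends on. That readability gain is what justifies the helper.

```python
def order_total(order: dict) -> int:
    """Hypothetical code under test: total in cents after a percentage discount."""
    subtotal = sum(item["price_cents"] * item["qty"] for item in order["items"])
    return subtotal - subtotal * order.get("discount_pct", 0) // 100


def make_order(**overrides) -> dict:
    # Extracted after the same setup repeated in 3+ tests; callers override
    # only the fields that matter to the behavior under test.
    order = {"items": [{"price_cents": 1000, "qty": 1}], "discount_pct": 0}
    order.update(overrides)
    return order


def test_total_without_discount():
    assert order_total(make_order()) == 1000


def test_total_with_discount():
    assert order_total(make_order(discount_pct=10)) == 900


def test_total_multiplies_price_by_quantity():
    items = [{"price_cents": 250, "qty": 4}]
    assert order_total(make_order(items=items)) == 1000
```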
Verify and report
After applying the changes, the engineer runs the project's verification command and names it in the report. The reviewer names the command in the Proposed Changes rationale but does not run it (no Bash). Examples:
```bash
go test ./...
pytest -v
bun test
```
For Python, mention pytest or the project-specific equivalent explicitly. For refactor plans in Python projects, include pytest -v or the repository's configured uv run pytest command by name instead of only saying "run tests." For other stacks, name the equivalent test command instead of saying only "tests passed."
Engineer (applied the changes):
```text
TEST IMPROVEMENT COMPLETE
=========================
Mode: review | refactor | coverage | tdd | full
Tests changed: N
Waste removed: N
Coverage: before → after (if measured)
Key improvements:
- file:line — change
Verification:
- <command> — pass/fail
```
Reviewer (identify only: emit the changes as a proposal and apply nothing):
```markdown
## Proposed Changes
### Change 1: <brief description>
File: `path/to/test_file`
Action: CREATE | MODIFY | DELETE
Code:
<complete test code, in the file's language>
Rationale: <weak/missing/brittle test this addresses>
```
If no tests or framework exist, report that and ask before creating a new testing stack.