Test-Driven Development (TDD)
Strict Red-Green-Refactor workflow for robust, self-documenting, production-ready code.
Quick Navigation
| Situation | Go To |
|-----------|-------|
| New to this codebase | Step 1: Explore Environment |
| Know the framework, starting work | Step 2: Select Mode |
| Need the core loop reference | Step 3: Core TDD Loop |
| Complex edge cases to cover | Property-Based Testing |
| Tests are flaky/unreliable | Flaky Test Management |
| Need isolated test environment | Hermetic Testing |
| Measuring test quality | Mutation Testing |
The Three Rules (Robert C. Martin)
- No Production Code without a failing test
- Write Only Enough Test to Fail (compilation errors count)
- Write Only Enough Code to Pass (no optimizations yet)
The Loop: 🔴 RED (write failing test) → 🟢 GREEN (minimal code to pass) → 🔵 REFACTOR (clean up) → Repeat
Step 1: Explore Test Environment
Do NOT assume anything. Explore the codebase first.
Checklist:
- [ ] Search for test files: glob("**/*.test.*"), glob("**/*.spec.*"), glob("**/test_*.py")
- [ ] Check package.json scripts, Makefile, or CI workflows
- [ ] Look for config: vitest.config.*, jest.config.*, pytest.ini, Cargo.toml
Framework Detection:
| Language | Config Files | Test Command |
|----------|--------------|--------------|
| Node.js | package.json, vitest.config.* | npm test, bun test |
| Python | pyproject.toml, pytest.ini | pytest |
| Go | go.mod, *_test.go | go test ./... |
| Rust | Cargo.toml | cargo test |
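The detection table can be turned into a small lookup. This is a minimal sketch, assuming the marker files and commands listed above; `detect_test_command` and `find_test_files` are hypothetical helper names, and the result is only a hint (a pyproject.toml, for example, does not guarantee that pytest is the runner), so still check the project's scripts and CI workflows.

```python
from pathlib import Path

# Marker files and commands mirror the Framework Detection table.
# The first match wins, so more specific markers are listed first.
FRAMEWORK_MARKERS = [
    ("vitest.config.*", "npm test"),
    ("jest.config.*", "npm test"),
    ("package.json", "npm test"),
    ("pytest.ini", "pytest"),
    ("pyproject.toml", "pytest"),
    ("go.mod", "go test ./..."),
    ("Cargo.toml", "cargo test"),
]


def detect_test_command(root):
    """Return a likely test command for the project at `root`, or None."""
    root = Path(root)
    for pattern, command in FRAMEWORK_MARKERS:
        # glob() handles both exact file names and wildcard patterns.
        if any(root.glob(pattern)):
            return command
    return None


def find_test_files(root):
    """List files matching the test-file patterns from the checklist."""
    root = Path(root)
    found = set()
    for pattern in ("**/*.test.*", "**/*.spec.*", "**/test_*.py"):
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)
```

Returning None when nothing matches keeps the "do NOT assume anything" rule intact: the caller has to keep exploring rather than fall back to a guessed command.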
Step 2: Select Mode
| Mode | When | First Action |
|------|------|--------------|
| New Feature | Adding functionality | Read existing module tests, confirm green baseline |
| Bug Fix | Reproducing issue | Write failing reproduction test FIRST |
| Refactor | Cleaning code | Ensure ≥80% coverage on target code |
| Legacy | No tests exist | Add characterization tests before changing |
Tie-breaker: If coverage <20% or tests absent → use Legacy Mode first.
Mode: New Feature
- Read existing tests for the module
- Run tests to confirm green baseline
- Enter Core Loop for new behavior
- Commits: test(module): add test for X → feat(module): implement X
Mode: Bug Fix
- Write failing reproduction test (MUST fail before fix)
- Confirm failure is assertion error, not syntax error
- Write minimal fix
- Run full test suite
- Commits: test: add failing test for bug #123 → fix: description (#123)
Mode: Refactor
- Run coverage on the specific function you'll refactor
- If coverage <80% → add characterization tests first
- Refactor in small steps (ONE change → run tests → repeat)
- Never change behavior during refactor
Mode: Legacy Code
- Find Seams - insertion points for tests (Sensing Seams, Separation Seams)
- Break Dependencies - use Sprout Method or Wrap Method
- Add characterization tests (capture current behavior)
- Build safety net: happy path + error cases + boundaries
- Then apply TDD for your changes
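A characterization test records what the code does today, quirks included, so later changes have a safety net. The sketch below assumes a hypothetical `legacy_format_price` function; the expected strings were obtained by running it, not from a specification.

```python
import unittest


def legacy_format_price(cents):
    # Hypothetical legacy function with undocumented quirks.
    if cents < 0:
        return "-$" + str(abs(cents) / 100)
    return "$" + str(cents / 100)


class CharacterizeFormatPrice(unittest.TestCase):
    """Pin down current behavior before changing anything."""

    def test_current_behavior(self):
        # Values recorded from the running code, happy path and boundaries.
        observed = {
            0: "$0.0",        # quirk: only one decimal place
            100: "$1.0",
            1999: "$19.99",
            -250: "-$2.5",    # quirk: sign placed before the currency symbol
        }
        for cents, expected in observed.items():
            with self.subTest(cents=cents):
                self.assertEqual(legacy_format_price(cents), expected)
```

If a quirk such as "$1.0" turns out to be a bug, fix it later through the Bug Fix mode with its own failing test rather than editing the characterization test first.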
→ See references/examples.md for full code examples of each mode.
Step 3: The Core TDD Loop
Before Starting: Scenario List
List all behaviors to cover:
- [ ] Happy path cases
- [ ] Edge cases and boundaries
- [ ] Error/failure cases
- [ ] Pessimism: 3 ways this could fail (network, null, invalid state)
🔴 RED Phase
- Write ONE test (single behavior or edge case)
- Use AAA: Arrange β Act β Assert
- Run test, verify it FAILS for expected reason
Checks:
- Is failure an assertion error? (Not SyntaxError/ModuleNotFoundError)
- Can I explain why this should fail?
- If test passes immediately → STOP. Test is broken or feature exists.
🟢 GREEN Phase
- Write minimal code to pass
- Do NOT implement "perfect" solution
- Verify test passes
Checks:
- Is this the simplest solution?
- Can I delete any of this code and still pass?
🔵 REFACTOR Phase
- Look for duplication, unclear names, magic values
- Clean up without changing behavior
- Verify tests still pass
Repeat
Select next scenario, return to RED.
Triangulation: If implementation is too specific (hardcoded), write another test with different inputs to force generalization.
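A small illustration of triangulation, using a hypothetical `add` function in two versions. The first test alone cannot tell a hardcoded constant from a real implementation; the second example can.

```python
def add_v1(a, b):
    # Passes the first test, but only because the answer is hardcoded.
    return 5


def add_v2(a, b):
    # Generalized once a second example made the constant untenable.
    return a + b


def check_examples(add, examples):
    """Return the examples an implementation gets wrong."""
    return [(a, b, expected) for a, b, expected in examples
            if add(a, b) != expected]


first_test = [(2, 3, 5)]
triangulated = first_test + [(10, 4, 14)]
```

`check_examples(add_v1, first_test)` is empty, so the hardcoded version is a legitimate GREEN; adding `(10, 4, 14)` makes it fail and forces `add_v2`.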
Stop Conditions
| Signal | Response |
|--------|----------|
| Test passes immediately | Check assertions, verify feature isn't already built |
| Test fails for wrong reason | Fix setup/imports first |
| Flaky test | STOP. Fix non-determinism immediately |
| Slow feedback (>5s) | Optimize or mock external calls |
| Coverage decreased | Add tests for uncovered paths |
Test Distribution: The Testing Trophy
The Testing Trophy (Kent C. Dodds) reflects modern testing reality: integration tests give the best confidence-to-effort ratio.
     _____________
    /   System    \        ← Few, slow, high confidence; brittle (E2E)
   /_______________\
  /                 \
 /    Integration    \     ← Real interactions between units - **BEST ROI** (Integration)
 \                   /
  \_________________/
    \    Unit     /        ← Fast & cheap but test in isolation (Unit)
     \___________/
     /  Static   \         ← Typecheck, linting - typos/types (Static)
    /_____________\
Layer Breakdown
| Layer | What | Tools | When |
|-------|------|-------|------|
| Static | Type errors, syntax, linting | TypeScript, ESLint | Always on, catches 50%+ of bugs for free |
| Unit | Pure functions, algorithms, utilities | vitest, jest, pytest | Isolated logic with no dependencies |
| Integration | Components + hooks + services together | Testing Library, MSW, Testcontainers | Real user flows, real(ish) data |
| E2E | Full app in browser | Playwright, Cypress | Critical paths only (login, checkout) |
Why Integration Tests Win
Unit tests prove code works in isolation. Integration tests prove code works together.
| Concern | Unit Test | Integration Test |
|---------|-----------|------------------|
| Component renders | ✅ | ✅ |
| Component + hook works | ❌ | ✅ |
| Component + API works | ❌ | ✅ |
| User flow works | ❌ | ✅ |
| Catches real bugs | Sometimes | Usually |
The insight: Most bugs live in the seams between modules, not inside pure functions. Integration tests catch seam bugs; unit tests don't.
Practical Guidance
- Start with integration tests - Test the way users use your code
- Drop to unit tests for complex algorithms or edge cases
- Use E2E sparingly - Slow, flaky, expensive to maintain
- Let static analysis do the heavy lifting - TypeScript catches more bugs than most unit tests
- Prefer fakes over mocks - Fakes have real behavior; mocks just return canned data
- SMURF quality: Sustainable, Maintainable, Useful, Resilient, Fast
Anti-Patterns
| Pattern | Problem | Fix |
|---------|---------|-----|
| Mirror Blindness | Same agent writes test AND code | State test intent before GREEN |
| Happy Path Bias | Only success scenarios | Include errors in Scenario List |
| Refactoring While Red | Changing structure with failing tests | Get to GREEN first |
| The Mockery | Over-mocking hides bugs | Prefer fakes or real implementations |
| Coverage Theater | Tests without meaningful assertions | Assert behavior, not lines |
| Multi-Test Step | Multiple tests before implementing | One test at a time |
| Verification Trap 🤖 | AI tests what the code does, not what it should do | State intent in plain language; separate agent review |
| Test Exploitation 🤖 | LLMs exploit weak assertions or overload operators | Use PBT alongside examples; strict equality |
| Assertion Omission 🤖 | Missing edge cases (null, undefined, boundaries) | Scenario list with errors; test.each |
| Hallucinated Mock 🤖 | AI generates fake mocks without proper setup | Testcontainers for integration; real Fakes for unit |
Critical: Verify tests by (1) running them, (2) having a separate agent review them, and (3) never trusting generated tests blindly.
Advanced Techniques
Use these techniques at specific points in your workflow:
| Technique | Use During | Purpose |
|-----------|------------|---------|
| Test Doubles | 🔴 RED phase | Isolate dependencies when writing tests |
| Property-Based Testing | 🔴 RED phase | Cover edge cases for complex logic |
| Contract Testing | 🔴 RED phase | Define API expectations between services |
| Snapshot Testing | 🔴 RED phase | Capture UI/response structure |
| Hermetic Testing | 🔵 Setup | Ensure test isolation and determinism |
| Mutation Testing | ✅ After GREEN | Validate test suite effectiveness |
| Coverage Analysis | ✅ After GREEN | Find untested code paths |
| Flaky Test Management | 🔧 Maintenance | Fix unreliable tests blocking CI |
Test Doubles (Use: Writing Tests with Dependencies)
When: Your code depends on something slow, unreliable, or complex (DB, API, filesystem).
| Type | Purpose | When |
|------|---------|------|
| Stub | Returns canned answers | Need specific return values |
| Mock | Verifies interactions | Need to verify calls made |
| Fake | Simplified implementation | Need real behavior without cost |
| Spy | Records calls | Need to observe without changing |
Decision: Dependency slow/unreliable? → Fake (complex) or Stub (simple). Need to verify calls? → Mock/Spy. Otherwise → real implementation.
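The difference between a fake, a stub, and a mock is easier to see side by side. A minimal sketch using `unittest.mock.Mock` from the standard library; `FakeUserRepo`, `register`, and `send_welcome` are hypothetical names invented for the illustration.

```python
from unittest.mock import Mock


class FakeUserRepo:
    """Fake: a working in-memory implementation of a hypothetical repo."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def find(self, user_id):
        return self._users.get(user_id)


def register(repo, mailer, user_id, name):
    """Hypothetical code under test: store the user, then send a welcome."""
    if repo.find(user_id) is not None:
        raise ValueError("user already exists")
    repo.save(user_id, name)
    mailer.send_welcome(name)


def test_register_with_fake_and_mock():
    repo = FakeUserRepo()            # fake: real behavior, no database
    mailer = Mock()                  # mock: lets us verify the interaction
    register(repo, mailer, 1, "Ada")
    assert repo.find(1) == "Ada"
    mailer.send_welcome.assert_called_once_with("Ada")


def test_duplicate_rejected_with_stub():
    repo = Mock()
    repo.find.return_value = "Ada"   # stub: canned answer only
    mailer = Mock()
    try:
        register(repo, mailer, 1, "Ada")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    mailer.send_welcome.assert_not_called()
```

The fake keeps state between `save` and `find`, so the first test exercises real behavior; the stub in the second test would pass even if `save` were broken, which is why stubs suit narrow cases only.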
→ See references/examples.md → Test Double Examples
Hermetic Testing (Use: Test Environment Setup)
When: Setting up test infrastructure. Tests must be isolated and deterministic.
Principles:
- Isolation: Unique temp directories/state per test
- Reset: Clean up in setUp/tearDown
- Determinism: No time-based logic or shared mutable state
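The three principles can be sketched with the standard library alone. This assumes a hypothetical `write_report` function that receives its timestamp as an argument instead of reading the system clock.

```python
import shutil
import tempfile
import unittest
from datetime import datetime, timezone
from pathlib import Path


def write_report(directory, now):
    """Hypothetical code under test: the clock is passed in, not read globally."""
    path = Path(directory) / f"report-{now:%Y%m%d}.txt"
    path.write_text(f"generated at {now.isoformat()}\n")
    return path


class ReportTest(unittest.TestCase):
    def setUp(self):
        # Isolation: every test gets its own fresh directory.
        self.workdir = tempfile.mkdtemp()
        # Determinism: a fixed timestamp instead of datetime.now().
        self.now = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)

    def tearDown(self):
        # Reset: remove everything the test created.
        shutil.rmtree(self.workdir)

    def test_report_name_uses_injected_date(self):
        path = write_report(self.workdir, self.now)
        self.assertEqual(path.name, "report-20240102.txt")
        self.assertEqual(
            path.read_text(), "generated at 2024-01-02T03:04:05+00:00\n"
        )
```

Passing `now` in as a parameter is one design choice among several; patching the clock or using a fake-timer library are alternatives when the signature cannot change.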
Database Strategies:
| Strategy | Speed | Fidelity | Use When |
|----------|-------|----------|----------|
| In-memory (SQLite) | Fast | Low | Unit tests, simple queries |
| Testcontainers | Medium | High | Integration tests |
| Transactional Rollback | Fast | High | Tests sharing schema (80x faster than TRUNCATE) |
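The transactional rollback strategy can be sketched as follows. In-memory SQLite is used here only so the example runs without a server; the `users` table is hypothetical, and with a real database the same pattern applies to the shared connection or session.

```python
import sqlite3
import unittest


class UserTableTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Schema is created once and shared by every test in the class.
        cls.conn = sqlite3.connect(":memory:")
        cls.conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        cls.conn.commit()

    @classmethod
    def tearDownClass(cls):
        cls.conn.close()

    def tearDown(self):
        # Undo whatever the test wrote instead of truncating tables.
        self.conn.rollback()

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_insert_one(self):
        self.conn.execute("INSERT INTO users (name) VALUES ('Ada')")
        self.assertEqual(self.count(), 1)

    def test_table_starts_empty(self):
        # Passes regardless of test order because inserts are rolled back.
        self.assertEqual(self.count(), 0)
```

The sketch relies on the sqlite3 module's default behavior of opening a transaction before an INSERT; code under test that commits on its own would defeat the rollback and needs a different strategy.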
→ See references/examples.md → Hermetic Testing Examples
Property-Based Testing (Use: Writing Tests for Complex Logic)
When: Writing tests for algorithms, state machines, serialization, or code with many edge cases.
Tools: fast-check (JS/TS), Hypothesis (Python), proptest (Rust)
Properties to Test:
- Commutativity: f(a, b) == f(b, a)
- Associativity: f(f(a, b), c) == f(a, f(b, c))
- Identity: f(a, identity) == a
- Round-trip: decode(encode(x)) == x
- Metamorphic: If input changes by X, output changes by Y (useful when you don't know expected output)
How: Replace multiple example-based tests with one property test that generates random inputs.
Critical: Always log the seed on failure. Without it, you cannot reproduce the failing case.
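To show the round-trip property and seed logging without a third-party dependency, here is a hand-rolled sketch using only `random`. It is not how fast-check, Hypothesis, or proptest work internally (they add smarter generation and shrinking); `encode`, `decode`, and `check_round_trip` are hypothetical names.

```python
import random


def encode(text):
    """Hypothetical run-length encoder: 'aaab' -> [('a', 3), ('b', 1)]."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def decode(runs):
    return "".join(ch * count for ch, count in runs)


def check_round_trip(seed, cases=200):
    """Property: decode(encode(x)) == x for randomly generated strings."""
    rng = random.Random(seed)
    for _ in range(cases):
        length = rng.randint(0, 20)
        # A tiny alphabet makes repeated characters (the hard case) likely.
        text = "".join(rng.choice("ab ") for _ in range(length))
        if decode(encode(text)) != text:
            # Report the seed so the exact failing case can be replayed.
            raise AssertionError(f"round-trip failed: seed={seed} input={text!r}")
```

Calling `check_round_trip(seed)` again with the seed from a failure message regenerates the same sequence of inputs, which is what makes the failure reproducible.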
→ See references/examples.md → Property-Based Testing Examples
Mutation Testing (Use: Validating Test Quality)
When: After tests pass, to verify they actually catch bugs. Use for critical code (auth, payments) or before major refactors.
Tools: Stryker (JS/TS), PIT (Java), mutmut (Python)
How: Tool mutates your code (e.g., changes > to >=). If tests still pass → your tests are weak.
Interpretation:
- >80% mutation score = good test suite
- Survived mutants = tests don't catch those changes → add tests for these
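What "survived" means can be shown by hand, without a mutation tool. In this sketch the mutant is written out manually (a real tool such as Stryker, PIT, or mutmut generates it); `is_adult` and both suites are hypothetical.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # The kind of change a mutation tool makes: >= became >.
    return age > 18


def weak_suite(fn):
    """Passes for both versions: no case sits on the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary cases, which distinguish >= from >."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def mutant_survives(suite):
    # A mutant survives when the suite passes on the original AND the mutant.
    return suite(is_adult) and suite(is_adult_mutant)
```

`mutant_survives(weak_suite)` is true and `mutant_survives(strong_suite)` is false: the only difference is the test at age 18, which is the test a surviving mutant tells you to add.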
Equivalent Mutant Problem: Some mutants change syntax but not behavior (e.g., i < 10 → i != 10 in a loop where i only increments). These can't be killed, so a 100% score is often impossible. Focus on surviving mutants in critical paths, not chasing perfect scores.
When NOT to use: Tool-generated code (OpenAPI clients, Protobuf stubs, ORM models), simple DTOs/getters, legacy code with slow tests, or CI pipelines that must finish in <5 minutes. Use --incremental --since main for PR-focused runs. Note: This does NOT mean you should skip mutation testing on code you (the agent) wrote; always validate your own work.
→ See references/examples.md → Mutation Testing Examples
Flaky Test Management (Use: CI/CD Maintenance)
When: Tests fail intermittently, blocking CI or eroding trust in the test suite.
Root Causes:
| Cause | Fix |
|-------|-----|
| Timing (setTimeout, races) | Fake timers, await properly |
| Shared state | Isolate per test |
| Randomness | Seed or mock |
| Network | Use MSW or fakes |
| Order dependency | Make tests independent |
| Parallel transaction conflicts | Isolate DB connections per worker |
How: Detect (--repeat 10) → Quarantine (separate suite) → Fix root cause → Restore
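The detect step and the "Randomness → seed" fix can be sketched together. `pass_rate` is a hypothetical stand-in for a runner's repeat option, and a generator shared across runs stands in for unseeded global randomness so that the example itself stays reproducible.

```python
import random


def draw_is_small(rng):
    """Hypothetical test body whose outcome depends on a random draw."""
    return rng.random() < 0.7


def pass_rate(check, make_rng, runs=10):
    """Detect: repeat the check; a rate strictly between 0 and 1 means flaky."""
    passes = sum(1 for _ in range(runs) if check(make_rng()))
    return passes / runs


# One generator shared across runs: each run sees a different draw,
# so the result varies from run to run.
shared_rng = random.Random(0)
flaky_rate = pass_rate(draw_is_small, lambda: shared_rng)

# Fix: seed a fresh generator per run so every run sees the same draw.
stable_rate = pass_rate(draw_is_small, lambda: random.Random(42))
```

A seeded test is deterministic, not necessarily correct: if the fixed draw hides a real bug, pair the seed with a property test that varies inputs deliberately.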
Quarantine Rules:
- Issue-linked: Every quarantined test MUST link to a tracking issue. Prevents "quarantine-and-forget."
- Mute, don't skip: Prefer muting (runs but doesn't fail build) over skipping. You still collect failure data.
- Reintroduction criteria: Test must pass N consecutive runs (e.g., 100) on main before leaving quarantine.
→ See references/examples.md → Flaky Test Examples
Contract Testing (Use: Writing Tests for Service Boundaries)
When: Writing tests for code that calls or exposes APIs. Prevents integration breakage.
How (Pact): Consumer defines expected interactions → Contract published → Provider verifies → CI fails if contract broken.
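The consumer-driven flow can be illustrated without Pact itself. The sketch below is not the Pact API or file format: the contract is a plain dictionary with hypothetical field names, and `verify_contract` plays the role of provider verification.

```python
# Contract published by the consumer: the request it sends and the parts
# of the response it relies on. Structure is hypothetical, not Pact's.
CONTRACT = {
    "request": {"method": "GET", "path": "/users/1"},
    "response": {"status": 200, "body_types": {"id": int, "name": str}},
}


def provider_handler(method, path):
    """Hypothetical provider implementation under verification."""
    if method == "GET" and path == "/users/1":
        return 200, {"id": 1, "name": "Ada", "email": "ada@example.com"}
    return 404, {}


def verify_contract(contract, handler):
    """Replay the consumer's request and list every broken expectation."""
    request, expected = contract["request"], contract["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"status {status} != {expected['status']}")
    for field, field_type in expected["body_types"].items():
        if field not in body:
            problems.append(f"missing field {field!r}")
        elif not isinstance(body[field], field_type):
            problems.append(f"field {field!r} is not {field_type.__name__}")
    # Extra provider fields are allowed: consumers only pin what they use.
    return problems
```

An empty list means the provider still satisfies the consumer; in CI, a non-empty list is what would fail the provider's build.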
→ See references/examples.md → Contract Testing Examples
Coverage Analysis (Use: Finding Gaps After Tests Pass)
When: After writing tests, to find untested code paths. NOT a goal in itself.
| Metric | Measures | Threshold |
|--------|----------|-----------|
| Line | Lines executed | 70-80% |
| Branch | Decision paths | 60-70% |
| Mutation | Test effectiveness | >80% |
Risk-Based Prioritization: P0 (auth, payments) → P1 (core logic) → P2 (helpers) → P3 (config)
Warning: High coverage ≠ good tests. Tests must assert meaningful behavior.
Snapshot Testing (Use: Writing Tests for UI/Output Structure)
When: Writing tests for UI components, API responses, or error message formats.
Appropriate: UI structure, API response shapes, error formats. Avoid: Behavior testing, dynamic content, entire pages.
How: Capture output once, verify it doesn't change unexpectedly. Always review diffs carefully.
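The capture-then-compare cycle can be sketched in a few lines. `matches_snapshot` is a hypothetical helper, not the API of jest, vitest, or any pytest plugin; it stores snapshots as JSON files in a directory the caller chooses.

```python
import json
from pathlib import Path


def matches_snapshot(name, value, snapshot_dir):
    """Compare `value` with the stored snapshot, creating it on first run."""
    path = Path(snapshot_dir) / f"{name}.json"
    # sort_keys keeps the serialized form stable across runs.
    serialized = json.dumps(value, indent=2, sort_keys=True)
    if not path.exists():
        path.write_text(serialized)   # first run: record, to be reviewed
        return True
    return path.read_text() == serialized
```

Because the first run always passes, a newly written snapshot proves nothing until someone has read it; that review is the step the "review diffs carefully" rule protects.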
→ See references/examples.md → Snapshot Testing Examples
Integration with Other Skills
| Task | Skill | Usage |
|------|-------|-------|
| Committing | git-commit | test: for RED, feat: for GREEN |
| Code Quality | code-quality | Run during REFACTOR phase |
| Documentation | docs-check | Check if behavior changes need docs |
References
Foundational:
- Three Rules of TDD - Robert C. Martin
- Test Pyramid - Martin Fowler
- Testing Trophy - Kent C. Dodds
- Working Effectively with Legacy Code - Michael Feathers
Tools: Testcontainers | fast-check | Stryker | MSW | Pact