Test-Driven Development (TDD) Skill

Test-Driven Development (TDD)

Strict Red-Green-Refactor workflow for robust, self-documenting, production-ready code.

Quick Navigation

| Situation | Go To | |-----------|-------| | New to this codebase | Step 1: Explore Environment | | Know the framework, starting work | Step 2: Select Mode | | Need the core loop reference | Step 3: Core TDD Loop | | Complex edge cases to cover | Property-Based Testing | | Tests are flaky/unreliable | Flaky Test Management | | Need isolated test environment | Hermetic Testing | | Measuring test quality | Mutation Testing |

The Three Rules (Robert C. Martin)

No Production Code without a failing test
Write Only Enough Test to Fail (compilation errors count)
Write Only Enough Code to Pass (no optimizations yet)

The Loop: 🔴 RED (write failing test) → 🟢 GREEN (minimal code to pass) → 🔵 REFACTOR (clean up) → Repeat

Step 1: Explore Test Environment

Do NOT assume anything. Explore the codebase first.

Checklist:

[ ] Search for test files: glob("**/*.test.*"), glob("**/*.spec.*"), glob("**/test_*.py")
[ ] Check package.json scripts, Makefile, or CI workflows
[ ] Look for config: vitest.config.*, jest.config.*, pytest.ini, Cargo.toml

Framework Detection:

| Language | Config Files | Test Command | |----------|--------------|--------------| | Node.js | package.json, vitest.config.* | npm test, bun test | | Python | pyproject.toml, pytest.ini | pytest | | Go | go.mod, *_test.go | go test ./... | | Rust | Cargo.toml | cargo test |

Step 2: Select Mode

| Mode | When | First Action | |------|------|--------------| | New Feature | Adding functionality | Read existing module tests, confirm green baseline | | Bug Fix | Reproducing issue | Write failing reproduction test FIRST | | Refactor | Cleaning code | Ensure ≥80% coverage on target code | | Legacy | No tests exist | Add characterization tests before changing |

Tie-breaker: If coverage <20% or tests absent → use Legacy Mode first.

Mode: New Feature

Read existing tests for the module
Run tests to confirm green baseline
Enter Core Loop for new behavior
Commits: test(module): add test for X → feat(module): implement X

Mode: Bug Fix

Write failing reproduction test (MUST fail before fix)
Confirm failure is assertion error, not syntax error
Write minimal fix
Run full test suite
Commits: test: add failing test for bug #123 → fix: description (#123)

Mode: Refactor

Run coverage on the specific function you'll refactor
If coverage <80% → add characterization tests first
Refactor in small steps (ONE change → run tests → repeat)
Never change behavior during refactor

Mode: Legacy Code

Find Seams - insertion points for tests (Sensing Seams, Separation Seams)
Break Dependencies - use Sprout Method or Wrap Method
Add characterization tests (capture current behavior)
Build safety net: happy path + error cases + boundaries
Then apply TDD for your changes

→ See references/examples.md for full code examples of each mode.

Step 3: The Core TDD Loop

Before Starting: Scenario List

List all behaviors to cover:

[ ] Happy path cases
[ ] Edge cases and boundaries
[ ] Error/failure cases
[ ] Pessimism: 3 ways this could fail (network, null, invalid state)

🔴 RED Phase

Write ONE test (single behavior or edge case)
Use AAA: Arrange → Act → Assert
Run test, verify it FAILS for expected reason

Checks:

Is failure an assertion error? (Not SyntaxError/ModuleNotFoundError)
Can I explain why this should fail?
If test passes immediately → STOP. Test is broken or feature exists.

🟢 GREEN Phase

Write minimal code to pass
Do NOT implement "perfect" solution
Verify test passes

Checks:

Is this the simplest solution?
Can I delete any of this code and still pass?

🔵 REFACTOR Phase

Look for duplication, unclear names, magic values
Clean up without changing behavior
Verify tests still pass

Repeat

Select next scenario, return to RED.

Triangulation: If implementation is too specific (hardcoded), write another test with different inputs to force generalization.

Stop Conditions

| Signal | Response | |--------|----------| | Test passes immediately | Check assertions, verify feature isn't already built | | Test fails for wrong reason | Fix setup/imports first | | Flaky test | STOP. Fix non-determinism immediately | | Slow feedback (>5s) | Optimize or mock external calls | | Coverage decreased | Add tests for uncovered paths |

Test Distribution: The Testing Trophy

The Testing Trophy (Kent C. Dodds) reflects modern testing reality: integration tests give the best confidence-to-effort ratio.

          _____________
         /   System    \      ← Few, slow, high confidence; brittle (E2E)
        /_______________\
       /                 \
      /    Integration    \   ← Real interactions between units — **BEST ROI** (Integration)
      \                   /
       \_________________/
         \    Unit     /      ← Fast & cheap but test in isolation (Unit) 
          \___________/
          /   Static  \       ← Typecheck, linting — typos/types (Static)
         /_____________\

Layer Breakdown

| Layer | What | Tools | When | |-------|------|-------|------| | Static | Type errors, syntax, linting | TypeScript, ESLint | Always on, catches 50%+ of bugs for free | | Unit | Pure functions, algorithms, utilities | vitest, jest, pytest | Isolated logic with no dependencies | | Integration | Components + hooks + services together | Testing Library, MSW, Testcontainers | Real user flows, real(ish) data | | E2E | Full app in browser | Playwright, Cypress | Critical paths only (login, checkout) |

Why Integration Tests Win

Unit tests prove code works in isolation. Integration tests prove code works together.

| Concern | Unit Test | Integration Test | |---------|-----------|------------------| | Component renders | ✅ | ✅ | | Component + hook works | ❌ | ✅ | | Component + API works | ❌ | ✅ | | User flow works | ❌ | ✅ | | Catches real bugs | Sometimes | Usually |

The insight: Most bugs live in the seams between modules, not inside pure functions. Integration tests catch seam bugs; unit tests don't.

Practical Guidance

Start with integration tests - Test the way users use your code
Drop to unit tests for complex algorithms or edge cases
Use E2E sparingly - Slow, flaky, expensive to maintain
Let static analysis do the heavy lifting - TypeScript catches more bugs than most unit tests
Prefer fakes over mocks - Fakes have real behavior; mocks just return canned data
SMURF quality: Sustainable, Maintainable, Useful, Resilient, Fast

Anti-Patterns

| Pattern | Problem | Fix | |---------|---------|-----| | Mirror Blindness | Same agent writes test AND code | State test intent before GREEN | | Happy Path Bias | Only success scenarios | Include errors in Scenario List | | Refactoring While Red | Changing structure with failing tests | Get to GREEN first | | The Mockery | Over-mocking hides bugs | Prefer fakes or real implementations | | Coverage Theater | Tests without meaningful assertions | Assert behavior, not lines | | Multi-Test Step | Multiple tests before implementing | One test at a time | | Verification Trap 🤖 | AI tests what code does not what it should do | State intent in plain language; separate agent review | | Test Exploitation 🤖 | LLMs exploit weak assertions or overload operators | Use PBT alongside examples; strict equality | | Assertion Omission 🤖 | Missing edge cases (null, undefined, boundaries) | Scenario list with errors; test.each | | Hallucinated Mock 🤖 | AI generates fake mocks without proper setup | Testcontainers for integration; real Fakes for unit |

Critical: Verify tests by (1) running them, (2) having separate agent review, (3) never trusting generated tests blindly.

Advanced Techniques

Use these techniques at specific points in your workflow:

| Technique | Use During | Purpose | |-----------|------------|---------| | Test Doubles | 🔴 RED phase | Isolate dependencies when writing tests | | Property-Based Testing | 🔴 RED phase | Cover edge cases for complex logic | | Contract Testing | 🔴 RED phase | Define API expectations between services | | Snapshot Testing | 🔴 RED phase | Capture UI/response structure | | Hermetic Testing | 🔵 Setup | Ensure test isolation and determinism | | Mutation Testing | ✅ After GREEN | Validate test suite effectiveness | | Coverage Analysis | ✅ After GREEN | Find untested code paths | | Flaky Test Management | 🔧 Maintenance | Fix unreliable tests blocking CI |

Test Doubles (Use: Writing Tests with Dependencies)

When: Your code depends on something slow, unreliable, or complex (DB, API, filesystem).

| Type | Purpose | When | |------|---------|------| | Stub | Returns canned answers | Need specific return values | | Mock | Verifies interactions | Need to verify calls made | | Fake | Simplified implementation | Need real behavior without cost | | Spy | Records calls | Need to observe without changing |

Decision: Dependency slow/unreliable? → Fake (complex) or Stub (simple). Need to verify calls? → Mock/Spy. Otherwise → real implementation.

→ See references/examples.md → Test Double Examples

Hermetic Testing (Use: Test Environment Setup)

When: Setting up test infrastructure. Tests must be isolated and deterministic.

Principles:

Isolation: Unique temp directories/state per test
Reset: Clean up in setUp/tearDown
Determinism: No time-based logic or shared mutable state

Database Strategies:

| Strategy | Speed | Fidelity | Use When | |----------|-------|----------|----------| | In-memory (SQLite) | Fast | Low | Unit tests, simple queries | | Testcontainers | Medium | High | Integration tests | | Transactional Rollback | Fast | High | Tests sharing schema (80x faster than TRUNCATE) |

→ See references/examples.md → Hermetic Testing Examples

Property-Based Testing (Use: Writing Tests for Complex Logic)

When: Writing tests for algorithms, state machines, serialization, or code with many edge cases.

Tools: fast-check (JS/TS), Hypothesis (Python), proptest (Rust)

Properties to Test:

Commutativity: f(a, b) == f(b, a)
Associativity: f(f(a, b), c) == f(a, f(b, c))
Identity: f(a, identity) == a
Round-trip: decode(encode(x)) == x
Metamorphic: If input changes by X, output changes by Y (useful when you don't know expected output)

How: Replace multiple example-based tests with one property test that generates random inputs.

Critical: Always log the seed on failure. Without it, you cannot reproduce the failing case.

→ See references/examples.md → Property-Based Testing Examples

Mutation Testing (Use: Validating Test Quality)

When: After tests pass, to verify they actually catch bugs. Use for critical code (auth, payments) or before major refactors.

Tools: Stryker (JS/TS), PIT (Java), mutmut (Python)

How: Tool mutates your code (e.g., changes > to >=). If tests still pass → your tests are weak.

Interpretation:

>80% mutation score = good test suite
Survived mutants = tests don't catch those changes → add tests for these

Equivalent Mutant Problem: Some mutants change syntax but not behavior (e.g., i < 10 → i != 10 in a loop where i only increments). These can't be killed—100% score is often impossible. Focus on surviving mutants in critical paths, not chasing perfect scores.

When NOT to use: Tool-generated code (OpenAPI clients, Protobuf stubs, ORM models), simple DTOs/getters, legacy code with slow tests, or CI pipelines that must finish in <5 minutes. Use --incremental --since main for PR-focused runs. Note: This does NOT mean skip mutation testing on code you (the agent) wrote—always validate your own work.

→ See references/examples.md → Mutation Testing Examples

Flaky Test Management (Use: CI/CD Maintenance)

When: Tests fail intermittently, blocking CI or eroding trust in the test suite.

Root Causes:

| Cause | Fix | |-------|-----| | Timing (setTimeout, races) | Fake timers, await properly | | Shared state | Isolate per test | | Randomness | Seed or mock | | Network | Use MSW or fakes | | Order dependency | Make tests independent | | Parallel transaction conflicts | Isolate DB connections per worker |

How: Detect (--repeat 10) → Quarantine (separate suite) → Fix root cause → Restore

Quarantine Rules:

Issue-linked: Every quarantined test MUST link to a tracking issue. Prevents "quarantine-and-forget."
Mute, don't skip: Prefer muting (runs but doesn't fail build) over skipping. You still collect failure data.
Reintroduction criteria: Test must pass N consecutive runs (e.g., 100) on main before leaving quarantine.

→ See references/examples.md → Flaky Test Examples

Contract Testing (Use: Writing Tests for Service Boundaries)

When: Writing tests for code that calls or exposes APIs. Prevents integration breakage.

How (Pact): Consumer defines expected interactions → Contract published → Provider verifies → CI fails if contract broken.

→ See references/examples.md → Contract Testing Examples

Coverage Analysis (Use: Finding Gaps After Tests Pass)

When: After writing tests, to find untested code paths. NOT a goal in itself.

| Metric | Measures | Threshold | |--------|----------|-----------| | Line | Lines executed | 70-80% | | Branch | Decision paths | 60-70% | | Mutation | Test effectiveness | >80% |

Risk-Based Prioritization: P0 (auth, payments) → P1 (core logic) → P2 (helpers) → P3 (config)

Warning: High coverage ≠ good tests. Tests must assert meaningful behavior.

Snapshot Testing (Use: Writing Tests for UI/Output Structure)

When: Writing tests for UI components, API responses, or error message formats.

Appropriate: UI structure, API response shapes, error formats. Avoid: Behavior testing, dynamic content, entire pages.

How: Capture output once, verify it doesn't change unexpectedly. Always review diffs carefully.

→ See references/examples.md → Snapshot Testing Examples

Integration with Other Skills

| Task | Skill | Usage | |------|-------|-------| | Committing | git-commit | test: for RED, feat: for GREEN | | Code Quality | code-quality | Run during REFACTOR phase | | Documentation | docs-check | Check if behavior changes need docs |

References

Foundational:

Three Rules of TDD - Robert C. Martin
Test Pyramid - Martin Fowler
Testing Trophy - Kent C. Dodds
Working Effectively with Legacy Code - Michael Feathers

Tools: Testcontainers | fast-check | Stryker | MSW | Pact

Agent Skills: Test-Driven Development (TDD)

Install this agent skill to your local

Skill Files