Agent Skills: Test-Driven Development

Test-Driven Development protocol. RED-GREEN-REFACTOR cycle with mandatory test-first approach.

ID: edwinhu/workflows/dev-tdd

Install this agent skill locally:

pnpm dlx add-skill https://github.com/edwinhu/workflows/tree/HEAD/lib/skills/dev-tdd

Skill Files

Browse the full folder contents for dev-tdd.


lib/skills/dev-tdd/SKILL.md

Skill Metadata

**Name:** dev-tdd
**Description:** This skill should be used when the user asks to "implement using TDD", "test-driven development", "RED-GREEN-REFACTOR", or "write failing test first". Enforces test-first approach with RED-GREEN-REFACTOR cycle and execution-based verification.

Contents

# Test-Driven Development

<EXTREMELY-IMPORTANT> ## Task Reframing: What Your Job Actually Is

Your job is NOT to implement features. Your job is to write tests that prove features work.

Reframe every task:

  • ❌ "Implement user login"

  • ✅ "Write a test that proves user login works. Then make it pass."

  • ❌ "Fix the icon rendering bug"

  • ✅ "Write a test that fails when icons render wrong. Then fix it."

The test IS your deliverable. The implementation just makes the test pass. </EXTREMELY-IMPORTANT>

<EXTREMELY-IMPORTANT> ## File-Based Logging (MANDATORY)

ALL CODE MUST USE FILE-BASED LOGGING.

Every application you write MUST redirect output to a log file:

  • CLI apps: ./app > /tmp/app.log 2>&1 &
  • GUI apps: ./app --log-file=/tmp/app.log 2>&1 &
  • Test runners: pytest -v > /tmp/test.log 2>&1

Why: Without log files, you have NO EVIDENCE of what happened. "I saw it in terminal" is not verification.

Read the full requirements: @references/logging-requirements.md </EXTREMELY-IMPORTANT>
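
When the application itself owns its logging (the `--log-file` case above), the same requirement applies inside the code. A minimal Python sketch, assuming the standard `logging` module and the `/tmp/app.log` path used above:

```python
import logging

# Write all application output to a log file so there is evidence to read later.
logging.basicConfig(
    filename="/tmp/app.log",          # same path the launch command expects
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logging.info("application started")   # lands in /tmp/app.log, not the terminal
```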

<EXTREMELY-IMPORTANT> ## The Execution Gate (MANDATORY)

NO E2E TESTS WITHOUT PASSING THE EXECUTION GATE FIRST.

Before running E2E tests or taking screenshots, you MUST complete all 6 gates in order:

GATE 1: BUILD
GATE 2: LAUNCH (with file-based logging)
GATE 3: WAIT
GATE 4: CHECK PROCESS
GATE 5: READ LOGS ← MANDATORY, CANNOT SKIP
GATE 6: VERIFY LOGS
THEN AND ONLY THEN: E2E tests/screenshots

Key enforcement:

  • If you catch yourself thinking "let me take a screenshot" → STOP, you skipped gates 1-6
  • If process is running → READ LOGS (GATE 5) before testing
  • Logs come BEFORE screenshots, not after

For GUI applications:

  • Screenshot WINDOW ONLY (not whole screen)
  • When testing a specific feature (e.g., toolbar icons), crop to THAT REGION only
  • Whole screen = false conclusions from other apps

Read the complete gate sequence: @references/execution-gates.md </EXTREMELY-IMPORTANT>
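
As a rough illustration of the order the gates impose, here is a minimal Python sketch; the build command, binary name, wait time, and log checks are assumptions, not part of the skill:

```python
import pathlib
import subprocess
import time

LOG = pathlib.Path("/tmp/app.log")

# GATE 1: BUILD
subprocess.run(["make", "build"], check=True)          # assumed build command

# GATE 2: LAUNCH with file-based logging
with LOG.open("w") as log:
    proc = subprocess.Popen(["./app"], stdout=log, stderr=subprocess.STDOUT)

# GATE 3: WAIT for startup
time.sleep(5)

# GATE 4: CHECK PROCESS is still running
assert proc.poll() is None, "process exited before testing began"

# GATE 5: READ LOGS (mandatory, cannot skip)
log_text = LOG.read_text()
print(log_text)

# GATE 6: VERIFY LOGS before any E2E test or screenshot
assert "ERROR" not in log_text, "errors in logs, fix before E2E"

# THEN AND ONLY THEN: run E2E tests / take screenshots
```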

<EXTREMELY-IMPORTANT> ## The Iron Law of TDD

YOU MUST WRITE THE FAILING TEST FIRST. YOU MUST SEE IT FAIL. This is not negotiable.

Before writing ANY implementation code:

  1. You write a test that will fail (because the feature doesn't exist yet)
  2. You run the test and SEE THE FAILURE OUTPUT (RED)
  3. You document in LEARNINGS.md: "RED: [test name] fails with [error message]"
  4. Only THEN you write implementation code
  5. You run the test again, SEE IT PASS (GREEN)
  6. You document: "GREEN: [test name] now passes"

The RED step is not optional. If the test hasn't failed, you haven't practiced TDD. </EXTREMELY-IMPORTANT>

## The TDD Cycle

RED → Write test → Run through GATES → See failure → Read logs → Document
GREEN → Minimal code → Run through GATES → See pass → Read logs → Document
REFACTOR → Clean up while staying green

### Step 1: RED - Write Failing Test

# Write the test FIRST
def test_user_can_login():
    result = login("user@example.com", "password123")
    assert result.success == True
    assert result.token is not None

Run the test through the execution gates:

# For unit tests, minimum gates are: EXECUTE + READ OUTPUT
pytest tests/test_auth.py::test_user_can_login -v 2>&1 | tee /tmp/test.log
# pytest: run specific test and see RED failure

# READ the output (MANDATORY)
cat /tmp/test.log

Output will show:

FAILED - NameError: name 'login' is not defined

Log to LEARNINGS.md:

## RED: test_user_can_login
- Test written
- Ran through gates (pytest executed, output read)
- Fails with: NameError: name 'login' is not defined
- Expected: function doesn't exist yet

### Step 2: GREEN - Minimal Implementation

Write the minimum code to make the test pass:

def login(email: str, password: str) -> LoginResult:
    # Minimal implementation
    return LoginResult(success=True, token="dummy-token")
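
Both the test and the implementation assume a `LoginResult` type that the skill never defines; a minimal sketch of one possible shape, purely for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoginResult:
    success: bool
    token: Optional[str]
```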

Run the test through gates again:

pytest tests/test_auth.py::test_user_can_login -v 2>&1 | tee /tmp/test.log
# pytest: run test again and see GREEN success

# READ the output (MANDATORY)
cat /tmp/test.log

Output will show:

PASSED

Log to LEARNINGS.md:

## GREEN: test_user_can_login
- Minimal login() implemented
- Ran through gates (pytest executed, output read)
- Test passes
- No errors in output
- Ready for refactor

### Step 3: REFACTOR - Improve While Green

Clean up the code while keeping tests passing:

def login(email: str, password: str) -> LoginResult:
    user = User.find_by_email(email)
    if user and user.check_password(password):
        return LoginResult(success=True, token=generate_token(user))
    return LoginResult(success=False, token=None)

Verify tests remain green after refactoring:

pytest tests/test_auth.py -v
# pytest: run all tests and verify GREEN after refactor

Output will show:

All tests PASSED
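
Note that the refactored version introduces a failure path the original test never exercises; under this skill, that path gets its own RED test before it is trusted. A hedged sketch (assertion values assumed):

```python
# Hypothetical follow-up RED test driving the failure branch added during refactor.
def test_login_fails_with_wrong_password():
    result = login("user@example.com", "wrong-password")
    assert result.success == False
    assert result.token is None
```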

## What Counts as a Test

<EXTREMELY-IMPORTANT> ### REAL Tests vs FAKE "Tests"

| REAL TEST (execute + verify) | FAKE "TEST" (NEVER ACCEPTABLE) |
|------------------------------|--------------------------------|
| pytest calls function, asserts return | grep for function exists |
| Playwright clicks button, checks DOM | ast-grep finds pattern |
| ydotool types input, screenshot verifies | Log says "success" |
| CLI invocation checks stdout | "Code looks correct" |
| API request verifies response body | "I'm confident it works" |

THE TEST MUST EXECUTE THE CODE AND VERIFY RUNTIME BEHAVIOR.

Grepping is NOT testing. Log reading is NOT testing. Code review is NOT testing. </EXTREMELY-IMPORTANT>

## Fake Tests That Look Like Tests (THE INSIDIOUS FAILURE)

<EXTREMELY-IMPORTANT> **A test can EXECUTE code and still be FAKE if it tests the wrong thing.**

This is MORE dangerous than no tests because it creates FALSE CONFIDENCE.

| LOOKS LIKE A TEST | WHY IT'S FAKE | REAL TEST MUST DO |
|-------------------|---------------|-------------------|
| Tests different protocol | Wrong code path | Use same protocol as production |
| Calls function directly | Skips user workflow | Simulate actual user action |
| Checks internal state | User doesn't see that | Verify user-visible output |
| Uses mock/stub for SUT | Defeats the purpose | Test actual behavior |
| Ignores specified skill | "I know better" | Use the specified testing skill |
| Changes assertion to pass | Hides bugs | Question if test is valid |
| Skips async when prod is async | Race conditions hidden | Match async behavior |

### The Iron Law of REAL Tests

If the test doesn't replicate what the user does, it's a FAKE test.

Before running any test, verify:

  1. Does test use SAME protocol as production?
  2. Does test follow EXACT user workflow?
  3. Does test verify what USER sees?
  4. Does test use the SPECIFIED testing skill?

If ANY answer is "no" → DELETE THE TEST. Write a REAL one.
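
As one concrete contrast, here is a sketch of a login test done both ways; Playwright is just one option from the table above, and the URL, selectors, and helper names are assumptions:

```python
from playwright.sync_api import sync_playwright

# FAKE (executes code, tests the wrong thing): calls a handler directly
# and inspects internal state the user never sees.
def test_login_fake():
    session = login_handler("user@example.com", "password123")   # hypothetical helper
    assert session.internal_flags["authenticated"] is True

# REAL: drives the same workflow the user follows and checks what the user sees.
def test_login_real():
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("http://localhost:8000/login")          # assumed URL
        page.fill("#email", "user@example.com")           # assumed selectors
        page.fill("#password", "password123")
        page.click("text=Log in")
        assert page.locator("text=Welcome").is_visible()
```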

### Fake Test Detection (Red Flags)

If you catch yourself doing these, STOP - you're writing a FAKE test:

| What You're Doing | Why It's Fake | Do Instead |
|-------------------|---------------|------------|
| Testing different protocol | Wrong code path | Use production protocol |
| Calling function instead of user action | Skipping user workflow | Simulate actual user action |
| Changing assertion to make test pass | Hiding bugs, not finding them | Question if test is valid |
| Ignoring the testing skill user specified | Arrogance: "I know better" | Use the specified skill |
| Testing internal state | User doesn't see that | Test user-visible output |
| Mocking the System Under Test | Defeats the purpose | Test actual behavior |
| Using sync when production is async | Race conditions hidden | Match async behavior |
| Testing unit when integration needed | Boundary bugs hidden | Test across boundaries |

### When Tests Fail, Question the Test First

If a test fails, don't immediately fix the assertion. Ask:

  1. Is this test testing the right thing?
  2. Is this test using the right protocol?
  3. Is this test replicating the user's workflow?
  4. Did I use the specified testing skill?

If any answer is "no" → The test is wrong, not the assertion. </EXTREMELY-IMPORTANT>

## Why Grepping is Not Testing

| Fake Approach | Why It's Worthless | What Happens |
|---------------|-------------------|--------------|
| grep "function_name" | Proves function exists, not that it works | Bug ships |
| ast-grep pattern | Proves structure matches, not behavior | Runtime crash |
| "Log says success" | Log was written, code might not run | Silent failure |
| "Code review passed" | Human opinion, not execution | Edge cases missed |
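
To make the difference concrete, a minimal sketch; the module path is an assumption:

```python
import subprocess

from myapp.auth import login           # assumed module path

# FAKE: proves the text "def login" exists in the source tree, nothing more.
grep = subprocess.run(["grep", "-r", "def login", "src/"], capture_output=True)
assert grep.returncode == 0            # still passes if login() crashes at runtime

# REAL: executes the function and verifies its runtime behavior.
assert login("user@example.com", "password123").success is True
```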

## Logging TDD Progress

Document every TDD cycle in .claude/LEARNINGS.md:

## TDD Cycle: [Feature/Test Name]

### RED
- **Test:** `test_feature_works()`
- **Command:**
  pytest tests/test_feature.py::test_feature_works -v
  # pytest: run test and observe RED failure
- **Output:** FAILED - AssertionError: expected True, got None
- **Expected:** Feature not implemented yet

### GREEN
- **Implementation:** Added feature_works() function
- **Command:**
  pytest tests/test_feature.py::test_feature_works -v
  # pytest: run test and verify GREEN success
- **Output:** PASSED

### REFACTOR
- Extracted helper function
- Added type hints
- **Verify tests still pass:**
  pytest tests/test_feature.py -v
  # pytest: run all tests and confirm GREEN after refactor

## Red Flags - Thoughts That Mean STOP

If you catch yourself thinking these thoughts—STOP. They're indicators you're about to skip TDD:

| Thought | Reality |
|---------|---------|
| "Write the test after" | You're about to do verification, not TDD. You MUST test FIRST. |
| "This is too simple for TDD" | Your simple code benefits MOST from TDD. |
| "Just fix this quickly" | Your speed isn't the goal. Your correctness is. |
| "Know the test will fail" | You knowing isn't the same as you seeing it fail. You MUST RUN it, see RED. |
| "Grep confirms it exists" | Your existence check ≠ working code. You MUST execute the code. |
| "Already have the code" | You MUST DELETE IT. You write test first, then reimplement. |
| "Test passed on first run" | Suspicious. Did you actually see RED first? |

### Red Flags - Fake Test Indicators

If you catch yourself thinking these thoughts—STOP. They're indicators you're writing a FAKE test:

| Thought | Reality |
|---------|---------|
| "This protocol is easier to test" | But production uses different protocol. Test that. |
| "I can call the function directly" | User doesn't do that. Simulate user action. |
| "I'll mock this dependency" | Then you're not testing real behavior. |
| "Internal state is more direct" | User doesn't see internal state. Test what user sees. |
| "I know a better way" | User specified a skill. Use it. |
| "Let me fix this assertion" | Maybe the test is wrong. Question it first. |
| "User workflow is complex" | That's what user does. Replicate it. |
| "This shortcut tests the same thing" | No it doesn't. Follow exact workflow. |
| "Async is hard to test" | Production is async. Test it async. |
| "Integration tests are slow" | Unit tests hide boundary bugs. Test integration. |

**If your test doesn't replicate what the user does, it's a FAKE test.**

**If your test doesn't fail first, you haven't practiced TDD.**

## Delete & Restart

<EXTREMELY-IMPORTANT>
**Wrote implementation code before test? You MUST DELETE IT. No exceptions.**

When you discover implementation code that wasn't driven by a test:
1. **DELETE** your implementation code
2. **WRITE** the test first
3. **RUN** it, **SEE RED**
4. **REWRITE** the implementation

"But it works" is not an excuse. "But it would waste your time" is not an excuse.

**Code you wrote without TDD is UNTRUSTED code. You delete it and do it right.**
</EXTREMELY-IMPORTANT>

## E2E Test Requirement

<EXTREMELY-IMPORTANT>
### The Iron Law of E2E in TDD

**USER-FACING FEATURES REQUIRE E2E TESTS IN ADDITION TO UNIT TESTS.**

TDD cycle for user-facing changes:

Unit TDD: RED → GREEN → REFACTOR
              ↓
E2E TDD:  RED → GREEN → REFACTOR


**Both cycles must complete. Unit GREEN does not mean DONE.**

### When E2E is Required

| Change Type | Unit Tests | E2E Required? |
|-------------|------------|---------------|
| Internal logic | Yes | No |
| API endpoint | Yes | Yes (test full request/response) |
| UI component | Yes | **Yes** (Playwright/automation) |
| CLI command | Yes | Yes (test actual invocation) |
| User workflow | Yes | **Yes** (simulate user actions) |
| Visual change | Yes | **Yes** (screenshot comparison) |
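
For the CLI row above, for example, an E2E test invokes the real binary the way a user would and checks stdout; a minimal sketch with an assumed command name:

```python
import subprocess

def test_cli_prints_version():
    # Run the actual command the user types, not an internal function.
    result = subprocess.run(["./myapp", "--version"], capture_output=True, text=True)
    assert result.returncode == 0
    assert "myapp" in result.stdout     # user-visible output
```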

### E2E TDD Cycle

1. **RED**: Write E2E test simulating user action
   - Run through ALL 6 GATES (BUILD → LAUNCH → WAIT → CHECK → READ LOGS → VERIFY LOGS)
   - Only after GATE 6: Run the E2E test
   - Observe the failure (feature doesn't exist)
   - Document: "E2E RED: [test] fails with [error]. All gates passed, logs clean."

2. **GREEN**: Implement to make E2E pass (unit tests already green)
   - Run through ALL 6 GATES again
   - Only after GATE 6: Run the E2E test
   - Verify the pass
   - Document: "E2E GREEN: [test] passes. All gates passed, logs clean."

3. **REFACTOR**: Ensure both unit and E2E stay green
   - Continue running through gates for each test run

### Delete & Restart (E2E)

**You shipped user-facing code without an E2E test? You MUST WRITE ONE NOW.**

Retroactive E2E is better than no E2E. But next time: You write E2E FIRST.
</EXTREMELY-IMPORTANT>

## Integration

This skill is invoked by:
- `dev-implement` - for TDD during implementation
- `dev-debug` - for regression tests during debugging

For testing tool options (Playwright, ydotool, etc.), see:

Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/dev-test/SKILL.md")