# Spec Tests: Intent-Based Testing for LLM Development

Spec tests are intent-based specifications that Claude evaluates as judge. They capture WHY something matters, making them cheat-proof for LLM-driven development.
## The TDD Flow

1. Plan → Define what you're building
2. Spec (red) → Write intent tests (they fail - no implementation yet; see the sketch below)
3. Implement → Build the feature
4. Spec (green) → Tests pass (Claude confirms intent is satisfied)
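A red-phase spec for a hypothetical login feature might look like this (the file name, target, and assertion wording are illustrative):

````markdown
---
target: src/login.py
---
# Login Tests

## Credential Checks

### Rejects Wrong Password
Users must never log in with an incorrect password; a silent pass here
is an account-takeover risk.

```
Given a registered user with password "hunter2"
When login() is called with password "wrong"
Then the attempt is rejected
```
````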
## Test File Format

````markdown
# Feature Name

## Test Group

### Test Case Name
Intent statement explaining WHY this test matters. What user need does it serve?
What breaks if this doesn't work?

```
Given [precondition]
When [action]
Then [expected outcome]
```
````
Structure: H2 = test group, H3 = test case, intent = required statement, code block = expected behavior.
Critical: the intent statement must appear immediately above the code block, between the H3 header and the assertion block. Section-level intent does not count; each test case needs its own WHY directly before its code block.

Each test must include a fenced code block. Missing code blocks fail with `[missing-assertion]`.
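For example, this hypothetical spec fails the placement rule: the WHY lives at the section level, so the test case itself carries no intent and the runner reports `[missing-intent]`:

````markdown
## Rate Limiting
Abuse protection matters because bots hammer this endpoint.

### Blocks Rapid Requests

```
Given 100 requests in one second
When the limiter evaluates them
Then requests beyond the limit are rejected
```
````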
## Test Location & Targets

Spec tests live in `specs/tests/` and declare their target(s) via frontmatter.

Single target:

```markdown
---
target: src/auth.py
---
# Authentication Tests
```
Multiple targets:

```markdown
---
target:
- src/auth.py
- src/session.py
---
# Authentication Flow
```
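In a multi-target spec, each test names the file it exercises in its first Given line. A sketch of one such test (the test name and assertions are made up; see `references/multi-target.md` for the full syntax):

````markdown
### Session Survives Re-Auth
Users lose unsaved work if re-authentication drops their session.

```
Given the src/session.py file
When re-authentication succeeds for a user with an active session
Then the existing session is preserved
```
````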
Directory structure (name files by feature/spec, not by target path):

```
specs/tests/
  authentication.md       ← target: [src/auth.py, src/session.py]
  intent-requirement.md   ← target: [SKILL.md]
  api-validation.md       ← target: [src/api/validate.py]
```

Frontmatter is required. A missing `target:` causes immediate failure with `[missing-target]`.
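So a spec that opens straight into content, like this hypothetical file, is rejected with `[missing-target]` before any test is evaluated:

```markdown
# Authentication Tests

## Login

### Valid Credentials
...
```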
## Running Tests

Copy the runner files to your project:

```bash
cp "${CLAUDE_PLUGIN_ROOT}/scripts/run_tests_claude.py" specs/tests/
cp "${CLAUDE_PLUGIN_ROOT}/scripts/judge_prompt.md" specs/tests/
```
Run tests:

```bash
python specs/tests/run_tests_claude.py specs/tests/authentication.md   # Single spec
python specs/tests/run_tests_claude.py specs/tests/                    # All specs
python specs/tests/run_tests_claude.py specs/tests/auth.md --test "Valid Credentials"   # Single test
```

Uses `claude -p` (your subscription, no API key needed).
Options:

| Flag | Purpose |
|------|---------|
| `--target FILE` | Override frontmatter target |
| `--model MODEL` | Claude model (default: sonnet) |
| `--test "Name"` | Run only named test |

Timeout: 60-300 seconds per test.
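Per test, the runner shells out to Claude in print mode. A minimal sketch of that call, assuming the `claude` CLI is on your PATH (the function name and prompt assembly are illustrative, not the runner's actual code):

```python
import subprocess

def judge_test(judge_prompt: str, test_block: str, model: str = "sonnet") -> str:
    """Ask Claude, via `claude -p` (print mode), to judge one spec test."""
    prompt = f"{judge_prompt}\n\n{test_block}"
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model],
        capture_output=True,
        text=True,
        timeout=300,  # upper bound of the 60-300s per-test window
    )
    return result.stdout.strip()
```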
## Why Intent Matters

LLMs can "game" tests by changing them instead of fixing the code.

Without intent (fails with `[missing-intent]`):

````markdown
### Completes Quickly

```
elapsed < 50ms
```
````

The LLM thinks: "50 seems arbitrary, change it to 100." The user gets a laggy editor.
With intent:

````markdown
### Completes Quickly
Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.

```
Given a keystroke event
When process_keystroke() is called
Then it completes in under 50ms
```
````

Claude-as-judge evaluates: does it satisfy the UX requirement? Relaxing the threshold → `[intent-violated]`.
Intent properties:

- Required: missing intent → `[missing-intent]` before evaluation
- Per-test: each test needs its own WHY above the code block
- Business-focused: why users/product care, not technical details
- Evaluative: catches "legal but wrong" solutions
## Reference Files

For detailed patterns, consult:

- `references/evaluation.md`: error codes, response format, strictness rules, alternative runners, template variables
- `references/multi-target.md`: writing tests for multiple targets, multi-file Given syntax
- `references/examples.md`: complete examples, porting tests across languages
- `references/meta-content.md`: testing prompt files and directive-like content
## Checklist

- [ ] Each test has an intent statement explaining WHY
- [ ] Intent is business/user focused
- [ ] Expected behavior is clear
- [ ] Each test includes a fenced assertion code block
- [ ] One behavior per test case
- [ ] Multi-target specs: each test starts with `Given the <target> file`
Missing intent = immediate failure. The runner rejects tests without intent statements before evaluating behavior.