Agent Skills: Spec Tests: Intent-Based Testing for LLM Development


ID: ianphil/my-skills/spec-tests


Skill Metadata

Name: spec-tests

Spec Tests: Intent-Based Testing for LLM Development

Spec tests are intent-based specifications that Claude evaluates as judge. They capture WHY something matters—making them cheat-proof for LLM-driven development.

The TDD Flow

1. Plan       → Define what you're building
2. Spec (red) → Write intent tests (they fail - no implementation yet)
3. Implement  → Build the feature
4. Spec (green) → Tests pass (Claude confirms intent is satisfied)

Test File Format

# Feature Name

## Test Group

### Test Case Name

Intent statement explaining WHY this test matters. What user need does it serve?
What breaks if this doesn't work?

\`\`\`
Given [precondition]
When [action]
Then [expected outcome]
\`\`\`

Structure: H2 = test group, H3 = test case, intent = required statement, code block = expected behavior.

Critical: Intent statement must appear immediately above the code block, between the H3 header and the assertion block. Section-level intent does not count—each test case needs its own WHY directly before its code block.

Each test must include a fenced code block. Missing code blocks fail with [missing-assertion].
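
For example, a test case written against this format might look like the following. The group, test name, and behavior are illustrative (the test name echoes the "Valid Credentials" example used later in this skill), not part of a real spec:

## Authentication

### Valid Credentials

Users must be able to sign in with a correct username and password.
If this breaks, nothing else in the product is reachable.

\`\`\`
Given a registered user with a correct username and password
When the credentials are submitted to the login flow
Then a session is created and the user is signed in
\`\`\`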


Test Location & Targets

Spec tests live in specs/tests/ and declare their target(s) via frontmatter.

Single target:

---
target: src/auth.py
---
# Authentication Tests

Multiple targets:

---
target:
  - src/auth.py
  - src/session.py
---
# Authentication Flow
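
In a multi-target spec, each test starts its Given clause by naming the file it exercises (see references/multi-target.md for the full multi-file syntax). A sketch using the targets above (the test name and behavior are illustrative):

### Session Created After Login

Logging in must establish a session; without one, every subsequent request bounces the user back to the login screen.

\`\`\`
Given the src/auth.py file
When a login with valid credentials succeeds
Then a new session is recorded by src/session.py
\`\`\`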

Directory structure — name files by feature/spec, not by target path:

specs/tests/
  authentication.md      ← target: [src/auth.py, src/session.py]
  intent-requirement.md  ← target: [SKILL.md]
  api-validation.md      ← target: [src/api/validate.py]

Frontmatter is required. Missing target: causes immediate failure with [missing-target].
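
Putting the pieces together, a minimal complete spec file might look like this (the target path and section headings are illustrative; the test itself is the keystroke example shown under "Why Intent Matters" below):

---
target: src/editor/keystroke.py
---
# Editor Responsiveness

## Keystroke Handling

### Completes Quickly

Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.

\`\`\`
Given a keystroke event
When process_keystroke() is called
Then it completes in under 50ms
\`\`\`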


Running Tests

Copy the runner files to your project:

cp "${CLAUDE_PLUGIN_ROOT}/scripts/run_tests_claude.py" specs/tests/
cp "${CLAUDE_PLUGIN_ROOT}/scripts/judge_prompt.md" specs/tests/

Run tests:

python specs/tests/run_tests_claude.py specs/tests/authentication.md  # Single spec
python specs/tests/run_tests_claude.py specs/tests/                   # All specs
python specs/tests/run_tests_claude.py specs/tests/auth.md --test "Valid Credentials"  # Single test

Uses claude -p (your subscription, no API key needed).

Options:

| Flag | Purpose |
|------|---------|
| --target FILE | Override frontmatter target |
| --model MODEL | Claude model (default: sonnet) |
| --test "Name" | Run only named test |

Timeout: 60-300 seconds per test.


Why Intent Matters

LLMs can "game" tests by changing them instead of fixing code.

Without intent (fails with [missing-intent]):

### Completes Quickly
\`\`\`
elapsed < 50ms
\`\`\`

The LLM thinks: "50 seems arbitrary, change it to 100." The user gets a laggy editor.

With intent:

### Completes Quickly

Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.

\`\`\`
Given a keystroke event
When process_keystroke() is called
Then it completes in under 50ms
\`\`\`

Claude-as-judge evaluates: Does it satisfy the UX requirement? Relaxing the threshold → [intent-violated].

Intent properties:

  • Required — Missing intent → [missing-intent] before evaluation
  • Per-test — Each test needs its own WHY above the code block
  • Business-focused — Why users/product care, not technical details
  • Evaluative — Catches "legal but wrong" solutions
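
As a quick illustration of the business-focused property, compare two intent statements for the same 50ms test (the weak one is invented for illustration; the strong one is the example above):

Technical, weak (restates the assertion):

The function must return in under 50ms.

Business-focused, strong (explains why the number matters and what breaks):

Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.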

Reference Files

For detailed patterns, consult:

  • references/evaluation.md — Error codes, response format, strictness rules, alternative runners, template variables
  • references/multi-target.md — Writing tests for multiple targets, multi-file Given syntax
  • references/examples.md — Complete examples, porting tests across languages
  • references/meta-content.md — Testing prompt files and directive-like content

Checklist

  • [ ] Each test has intent statement explaining WHY
  • [ ] Intent is business/user focused
  • [ ] Expected behavior is clear
  • [ ] Each test includes a fenced assertion code block
  • [ ] One behavior per test case
  • [ ] Multi-target specs: each test starts with Given the <target> file

Missing intent = immediate failure. The runner rejects tests without intent statements before evaluating behavior.