# Eval Creator
Turns promoted learnings into permanent eval cases and runs regression checks to verify that promoted rules still hold. This is the outer loop's regression-test step.
The blog says: "If a failure taught you something important, it should become a permanent test case. Otherwise the knowledge is still fragile."
## When to Use
- After harness-updater promotes a pattern — create an eval for it
- On cadence — run all evals to check for regression
- Before major releases — verify the harness is holding
- When a promoted rule seems to have stopped working — diagnose with targeted eval run
## Eval Directory Structure
```
.evals/
  EVAL_INDEX.md          # Index of all eval cases with status
  cases/
    eval-YYYYMMDD-001.md # Individual eval case
    eval-YYYYMMDD-002.md
    ...
```
## Creating an Eval Case

### Input
From harness-updater or manually:
- Pattern-Key of the promoted learning
- The rule that was added to CLAUDE.md / AGENTS.md
- What to test (the assertion)
- Verification method
### Eval Case Format
```markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: [from learning]
source: [LRN-YYYYMMDD-001, ERR-YYYYMMDD-003]
promoted-rule: "[the rule text in CLAUDE.md]"
promoted-to: CLAUDE.md
created: YYYY-MM-DD
last-run: YYYY-MM-DD
last-result: pass | fail | skip
---

## What This Tests

[One sentence: what failure this eval prevents from recurring]

## Precondition

[What must be true for this eval to be runnable]

- File X exists
- Project uses framework Y
- etc.

## Verification Method

[One of: grep-check, command-check, file-check, rule-check, skill-check, script-check]

### grep-check

Search for a pattern that should (or should not) exist:

    target: src/**/*.ts
    pattern: "hardcoded-secret-pattern"
    expect: not_found

### command-check

Run a command and check the exit code or output:

    command: npm run typecheck
    expect_exit: 0

### file-check

Verify a file or section exists:

    target: CLAUDE.md
    section: "## Verification"
    expect: exists

### rule-check

Verify a rule exists in an instruction file:

    target: CLAUDE.md
    contains: "[the promoted rule text or key phrase]"
    expect: found

## Expected Result

**Pass:** [What "good" looks like]
**Fail:** [What regression looks like]

## Recovery Action

If this eval fails:

1. [Specific step to diagnose]
2. [Specific step to fix]
3. Re-run this eval to verify
```
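As a concrete illustration, here is a hypothetical filled-in case for the `pnpm-not-npm` pattern; all IDs, dates, and rule text are placeholders, not shipped content:

```markdown
---
id: eval-20250101-002
pattern-key: pnpm-not-npm
source: [LRN-20250101-002]
promoted-rule: "Use pnpm, not npm, for all package operations in this repo"
promoted-to: CLAUDE.md
created: 2025-01-01
last-run: 2025-01-01
last-result: pass
---

## What This Tests

Prevents the "use pnpm in this repo" rule from silently dropping out of CLAUDE.md.

## Precondition

- CLAUDE.md exists at the repo root

## Verification Method

rule-check

    target: CLAUDE.md
    contains: "Use pnpm"
    expect: found

## Expected Result

**Pass:** CLAUDE.md still contains the pnpm rule.
**Fail:** The rule was removed or reworded beyond recognition.

## Recovery Action

If this eval fails:

1. Check git history for when the rule left CLAUDE.md
2. Re-add the rule (or update this eval if the rule was intentionally superseded)
3. Re-run this eval to verify
```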
## Running Evals

### Run All
Read `.evals/EVAL_INDEX.md`, iterate through all cases, and execute each verification method.
### Run by Pattern-Key
Filter to evals matching a specific pattern.
### Run by Area
Filter to evals whose source files match an area (frontend, backend, etc.).
### Execution
For each eval case:
- Check the precondition — if not met, mark the eval as `skip`
- Execute the verification method:
  - grep-check: use the Grep tool to search target files for the pattern
  - command-check: run the command via Bash; check exit code and/or output
  - file-check: use Read/Glob to verify the file or section exists
  - rule-check: read the target file and search for the expected content
  - skill-check: run `quick_validate.py` on a skill directory (see Skill Validation below)
  - script-check: run a custom mcp-script by name (see Custom Verification Methods)
- Compare the result to the expected value
- Update `last-run` and `last-result` in the eval case file
- Update `EVAL_INDEX.md` with the result
## Regression Report
```markdown
## Eval Run: YYYY-MM-DD

**Total:** N evals
**Passed:** N
**Failed:** N
**Skipped:** N

### Failures

#### eval-YYYYMMDD-001 — [pattern-key]

- **What regressed:** [description]
- **Expected:** [X]
- **Got:** [Y]
- **Recovery action:** [from eval case]

### Summary

[All green / N regressions need attention]
```
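Assembling the report header from run results is mechanical; a sketch with a hypothetical `render_report` helper, assuming results are collected as `{eval_id: result}`:

```python
from datetime import date

def render_report(results: dict[str, str]) -> str:
    """Render the regression report header from {eval_id: result} pairs."""
    counts = {r: sum(1 for v in results.values() if v == r) for r in ("pass", "fail", "skip")}
    lines = [
        f"## Eval Run: {date.today().isoformat()}",
        f"**Total:** {len(results)} evals",
        f"**Passed:** {counts['pass']}",
        f"**Failed:** {counts['fail']}",
        f"**Skipped:** {counts['skip']}",
    ]
    return "\n".join(lines)
```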
## Eval Index Format
`.evals/EVAL_INDEX.md`:
```markdown
# Eval Index

| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |
|----|-------------|--------------|----------|--------|---------|
| eval-YYYYMMDD-001 | auth-middleware-lock | Run migrations on test DB first | YYYY-MM-DD | pass | YYYY-MM-DD |
| eval-YYYYMMDD-002 | pnpm-not-npm | Use pnpm in this repo | YYYY-MM-DD | fail | YYYY-MM-DD |
```
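Updating a row after a run is simple line rewriting. A sketch, assuming the column order shown above; `update_index_row` is a hypothetical helper name:

```python
from datetime import date
from pathlib import Path

def update_index_row(index_path: str, eval_id: str, result: str) -> None:
    """Rewrite the Last Run and Result columns for one eval in EVAL_INDEX.md."""
    path = Path(index_path)
    lines = path.read_text().splitlines()
    for i, line in enumerate(lines):
        if line.startswith(f"| {eval_id} "):
            cells = [c.strip() for c in line.strip("|").split("|")]
            cells[3] = date.today().isoformat()  # Last Run column
            cells[4] = result                    # Result column
            lines[i] = "| " + " | ".join(cells) + " |"
    path.write_text("\n".join(lines) + "\n")
```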
## Integration

### Upstream
- harness-updater flags eval candidates after promoting a pattern
- learning-aggregator identifies patterns with clear pass/fail conditions
### Downstream
- Regression failures feed back into self-improvement as new error entries
- Persistent failures may indicate the promoted rule needs refinement → feed back to harness-updater
## Scheduled Use
For projects with a CI pipeline, eval-creator can run as a scheduled check:
- Weekly: run all evals
- Per-PR: run evals related to changed files
- Post-promotion: run the newly created eval immediately
## Custom Verification Methods (mcp-scripts) — Extension Point
These are examples of what's possible with gh-aw mcp-scripts, not features shipped with this plugin. Projects define their own scripts.
Beyond the four built-in methods (grep-check, command-check, file-check, rule-check), projects can define custom verification tools as mcp-scripts for complex assertions that the built-ins can't express.
Example — an eval that verifies a promoted auth rule is enforced:
```yaml
# In gh-aw workflow config
mcp-scripts:
  check-auth-middleware:
    lang: javascript
    description: "Verify all /admin routes have auth middleware"
    run: |
      const routes = require('./src/routes/admin');
      const unprotected = routes.filter(r => !r.auth);
      if (unprotected.length) {
        console.error('Unprotected admin routes:', unprotected.map(r => r.path));
        process.exit(1);
      }
```
Reference the script in an eval case as `verification_method: script-check` with the mcp-script name. This is an extension point — the built-in methods cover most cases, but mcp-scripts handle project-specific behavioral assertions.
## Persistence — Extension Point
repo-memory is a gh-aw feature for durable cross-environment storage. This plugin works without it — .evals/ in the working directory is the default.
By default, eval cases live in .evals/ in the working directory. For teams using repo-memory, eval cases can be stored on a memory branch — making them available across environments and surviving ephemeral workspace teardown.
## Skill Validation (skill-check)
The Anthropic /skill-creator skill includes two validation systems that eval-creator can use:
### Structural validation via `quick_validate.py`

The skill-check verification method runs the skill-creator's `quick_validate.py` script on a skill directory. It checks:
- SKILL.md exists with valid YAML frontmatter
- Only allowed frontmatter keys (`name`, `description`, `license`, `allowed-tools`, `metadata`, `compatibility`)
- Name is kebab-case, max 64 chars, no leading/trailing/consecutive hyphens
- Description has no angle brackets, max 1024 chars
- Compatibility field max 500 chars if present
Eval case example:
```markdown
---
id: eval-YYYYMMDD-NNN
pattern-key: skill-quality.verify-gate
verification_method: skill-check
target: skills/verify-gate
expect: valid
---

## What This Tests

Verify that the verify-gate skill passes structural validation after harness updates.
```
Execution: `python .claude/skills/skill-creator/scripts/quick_validate.py <target>`. Exit 0 = pass; exit 1 = fail.
### Behavioral validation via `run_eval.py`

For deeper validation, the skill-creator's `run_eval.py` tests whether a skill's description causes Claude to invoke it for given queries. This is useful when harness-updater modifies a skill's description or the outer loop creates a new skill — the eval verifies the skill still triggers correctly.
This requires Claude CLI access and is expensive. Use it for high-value skills only, not as a routine CI check.
### When to create skill-check evals
Two scenarios connect the outer loop to skill validation:
1. **Harness-updater modifies a skill:** When a promoted rule is inserted into a SKILL.md (rather than CLAUDE.md), create a `skill-check` eval to verify the skill remains structurally valid after the edit.
2. **Self-improvement identifies a skill gap:** When learning-aggregator classifies a pattern as `skill_gap` and recommends "create a new skill", the new skill should pass `quick_validate.py` before being committed. Create a `skill-check` eval for it that persists as a regression test.
This closes the loop: failure → learning → new/updated skill → eval verifies skill quality → regression prevents quality drift.
## What This Skill Does NOT Do
- Does not fix regressions (reports them for the agent or human to fix)
- Does not promote learnings (that's harness-updater)
- Does not analyze patterns (that's learning-aggregator)
- Does not replace project test suites — evals test the harness, not the code