Eval Harness Skill | Agent Skills

Eval Harness

A skill for building and running evaluation harnesses on AI features. Covers two eval types (capability, regression), three grader types (code, model, human), and the metrics that summarize them (pass@k, pass^k).

Evaluation-driven development (EDD)

Evals are defined before or alongside implementation, not after. Success criteria become explicit, measurable, and testable from day one.

Core principles:

Define success first — before implementing a feature, define what "working correctly" means through explicit evaluations.
Measure continuously — run evals throughout development, not just at the end.
Automate where possible — prefer automated graders for speed and consistency.
Human review for nuance — use human graders when quality judgments require context or subjectivity.
Track regressions — every capability added becomes a regression test.
Iterate on failures — failed evals provide specific, actionable feedback for improvement.

Benefits: clear success criteria (no ambiguity about what "done" means), early problem detection, confidence in changes, tests double as executable specifications, quantitative progress tracking.

The mindset shift:

Traditional: "Build it, then figure out if it works"
EDD:         "Define what 'works' means, then build to pass those evals"

Eval types (summary)

| Type | Purpose | When to use | |---|---|---| | Capability | Verify new functionality works | Adding features, improving functionality, testing edge cases | | Regression | Protect against unintended breakage | After code changes, before PR merge, after dependency updates |

Full structures and YAML examples → references/eval-types.md.

Grader types (summary)

| Type | Strength | When to use | |---|---|---| | Code (exact match, regex, function output) | Fast, deterministic, cheap | Any codifiable correctness check | | Model (LLM judge with rubric) | Handles semantic variation, tone, quality | Natural-language outputs, multiple valid answers | | Human (Likert, binary, ranking) | Catches what machines miss | Creative content, safety-critical judgments, calibrating auto-graders |

Full configs, implementation sketches, and rubric templates → references/graders.md.

Metrics (summary)

| Metric | Question answered | Use case | |---|---|---| | pass@1 | "Can it solve this at all?" | Single-shot capability | | pass@k | "Can it ever solve this?" | Best-of-k capability, retries cheap | | pass^k | "Can it always solve this?" | Consistency, reliability, production |

Formulas, implementations, and the picking-guide → references/metrics.md.

Eval workflow

Four phases. Run them in order when bringing up a new eval suite; rerun phase 3 every time you want a fresh measurement.

Phase 1 — Define

Create the evaluation specification before or alongside implementation.

eval_suite:
  name: "Feature X Evaluation Suite"
  version: "1.0.0"
  description: "Comprehensive evaluation for Feature X"

  config:
    parallel_execution: true
    max_workers: 4
    timeout_global: 3600
    retry_failed: true
    retry_count: 2

  evals:
    - $ref: "./evals/capability/cap-001.yaml"
    - $ref: "./evals/capability/cap-002.yaml"
    - $ref: "./evals/regression/reg-001.yaml"

  reporting:
    format: ["json", "markdown", "html"]
    output_dir: "./eval-results"
    include_details: true
    notify_on_failure: true

Checklist:

Define success criteria clearly
Identify capability evals needed
Identify regression evals to include
Choose appropriate grader types
Set realistic thresholds
Document edge cases

Phase 2 — Implement

Build the feature, using evals as guidance.

Implementation loop:
1. Write/modify code
2. Run relevant evals locally
3. Fix failures
4. Repeat until evals pass
5. Commit changes

Best practices:

Run evals frequently during development
Start with simple cases, add complexity
Use failing evals to guide implementation
Don't modify evals to pass (unless the actual requirements changed — and then update baselines explicitly)

Phase 3 — Evaluate

Run the full evaluation suite and collect results.

# Run all evals
eval-harness run --suite ./eval-suite.yaml

# Run specific category
eval-harness run --suite ./eval-suite.yaml --category capability

# Run with verbose output
eval-harness run --suite ./eval-suite.yaml --verbose

# Run with specific grader override
eval-harness run --suite ./eval-suite.yaml --grader-type model

Execution flow: load suite → validate eval definitions → initialize graders (code/model/human queues) → execute evals (parallel where possible) → collect results → aggregate scores and metrics → generate reports.

Phase 4 — Report

Generate comprehensive evaluation reports.

eval_report:
  metadata:
    suite_name: "Feature X Evaluation Suite"
    run_id: "eval-20240120-143052"
    timestamp: "2024-01-20T14:30:52Z"
    duration_seconds: 127
    environment:
      model: "claude-3-5-sonnet"
      version: "1.2.3"

  summary:
    total_evals: 25
    passed: 22
    failed: 3
    skipped: 0
    pass_rate: 0.88

    by_category:
      capability:   { total: 15, passed: 13, pass_rate: 0.867 }
      regression:   { total: 10, passed: 9,  pass_rate: 0.90 }

    by_grader:
      code:  { total: 18, passed: 17 }
      model: { total: 5,  passed: 4 }
      human: { total: 2,  passed: 1 }

  metrics:
    pass_at_1: 0.88
    pass_at_3: 0.96
    pass_hat_3: 0.68

  failures:
    - eval_id: "cap-003"
      name: "Complex nested extraction"
      reason: "Missed nested array handling"
    - eval_id: "cap-011"
      name: "Edge case: empty input"
      reason: "Threw exception instead of returning empty"
    - eval_id: "reg-007"
      name: "API response time regression"
      reason: "Response time 2.3s exceeds 2.0s threshold"

  recommendations:
    - "Improve nested data structure handling"
    - "Add empty input validation"
    - "Investigate response time regression"

Eval file format

Suites and evals can be authored as YAML (preferred) or JSON. Full structures, schemas, and directory layouts → references/file-format.md.

Integration with implement-phase

The eval harness integrates as an optional quality gate. Configure it in the plan's phase definition:

phase:
  name: "Implement user authentication"
  steps:
    - type: "implement"
      description: "Build authentication logic"

    - type: "test"
      description: "Run unit tests"

    - type: "eval"
      description: "Run capability evals"
      config:
        suite: "./evals/auth-capability.yaml"
        required_pass_rate: 0.9
        blocking: true                       # Fail phase if evals fail
        metrics:
          - "pass@1"
          - "pass^3"

Triggering from implement-phase:

eval_trigger:
  when: "after_tests_pass"
  suite: "./evals/feature-x.yaml"

  filters:
    categories: ["capability"]
    tags: ["critical", "feature-x"]

  thresholds:
    pass_at_1: 0.85
    pass_hat_3: 0.70

  on_failure:
    action: "block"
    notification: true
    report_path: "./eval-results/phase-{phase_id}.md"

  on_success:
    action: "continue"
    archive_results: true

In the phase completion report (auto-included when eval step ran):

### Eval Results
| Metric | Value | Threshold | Status |
|--------|-------|-----------|--------|
| pass@1 | 0.92 | 0.85 | PASS |
| pass^3 | 0.78 | 0.70 | PASS |

#### Capability Evals (12/13 passed)
- [x] cap-auth-001: Login flow
- [x] cap-auth-002: Token refresh
- [ ] cap-auth-003: Session timeout (FAILED)
  - Expected: Session expires after 30 min
  - Actual: Session persists indefinitely

#### Regression Evals (8/8 passed)
- [x] reg-auth-001: Password hashing unchanged
- [x] reg-auth-002: Token format stable

Quick reference

Grader selection guide

| Output type | Recommended grader | Rationale | |---|---|---| | Exact values | Code (exact match) | Fast, deterministic | | Formatted text | Code (regex) | Validates structure | | Code output | Code (function) | Tests correctness | | Natural language | Model | Handles variation | | Creative content | Human | Subjective judgment | | Safety-critical | Human + Model | Double verification |

Metric selection guide

| Scenario | Recommended metric | |---|---| | Code generation (any solution works) | pass@k | | Production system reliability | pass^k | | Single-shot capability | pass@1 | | Best-effort with retries | pass@5 or pass@10 | | Consistency requirements | pass^3 or pass^5 |

Common thresholds

| Context | pass@1 | pass@k | pass^k | |---|---|---|---| | Development | 0.70 | 0.90 | 0.50 | | Staging | 0.80 | 0.95 | 0.65 | | Production | 0.90 | 0.99 | 0.80 | | Critical | 0.95 | 0.999 | 0.90 |

References (loaded on demand)

references/eval-types.md — capability and regression eval structures with YAML examples
references/graders.md — code / model / human grader configs, prompt templates, rubric examples, calibration
references/metrics.md — pass@k and pass^k formulas, Python implementations, picking guide
references/file-format.md — full YAML/JSON schemas and recommended directory layout

Agent Skills: Eval Harness

Install this agent skill to your local

Skill Files