Agent Skills: Eval Harness Updater

Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.

Category: Uncategorized
ID: oimiragieo/agent-studio/eval-harness-updater

Install this agent skill to your local environment:

pnpm dlx add-skill https://github.com/oimiragieo/agent-studio/tree/HEAD/.claude/skills/eval-harness-updater

Skill Files

Browse the full folder contents for eval-harness-updater.


.claude/skills/eval-harness-updater/SKILL.md

Skill Metadata

Name
eval-harness-updater
Description
Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.

Eval Harness Updater

Refresh eval harnesses to keep live and fallback modes actionable in unstable environments.

Focus Areas

  • Prompt and parser drift
  • Timeout/partial-stream handling
  • SLO and regression gates
  • Dual-run fallback consistency
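The last focus area, dual-run fallback consistency, can be sketched as follows. This is a minimal illustration, not the harness's actual implementation: `live_parse`, `fallback_parse`, and `dual_run` are hypothetical names, assuming JSON eval output and a thread-based timeout.

```python
import concurrent.futures
import json

# Hypothetical parsers for illustration: a strict "live" JSON parser and a
# lenient fallback that salvages a truncated (partial-stream) payload.
def live_parse(raw: str) -> dict:
    return json.loads(raw)

def fallback_parse(raw: str) -> dict:
    # Trim to the last complete closing brace to recover a partial stream.
    end = raw.rfind("}")
    if end == -1:
        return {}
    try:
        return json.loads(raw[: end + 1])
    except json.JSONDecodeError:
        return {}

def dual_run(raw: str, timeout_s: float = 2.0) -> tuple[dict, bool]:
    """Run live parsing under a timeout; fall back on error or timeout.

    Returns (result, consistent), where `consistent` reports whether the
    fallback path produces the same result as the live path.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(live_parse, raw)
        try:
            live = future.result(timeout=timeout_s)
        except (concurrent.futures.TimeoutError, json.JSONDecodeError):
            return fallback_parse(raw), False
    return live, live == fallback_parse(raw)

result, consistent = dual_run('{"score": 0.9}')   # clean live path
partial, _ = dual_run('{"score": 0.9}garbage')     # partial-stream salvage
```

Running both paths on the same input and comparing results is what makes fallback drift visible before it reaches CI.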

Workflow

  1. Resolve harness path.
  2. Research test/eval best practices (Exa + arXiv — see Research Gate below).
  3. Add RED regressions for parsing and timeout edge cases.
  4. Patch minimal harness logic.
  5. Validate eval outputs and CI gates.
  6. Resolve companion artifact gaps (see Cross-Reference table below).
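Step 3 above (RED regressions for parsing and timeout edge cases) could look like this sketch. `parse_eval_output` is a hypothetical harness helper, not part of this repository; the point is the shape of the failing-first tests.

```python
import json

# Hypothetical harness helper under test: a tolerant parser that returns a
# sentinel verdict instead of raising on malformed or truncated output.
def parse_eval_output(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "unparseable", "raw": raw}

# RED regressions: write these to fail first, then patch the harness
# minimally until they pass (workflow steps 3-4).
def test_empty_output_is_flagged_not_crashed():
    assert parse_eval_output("")["verdict"] == "unparseable"

def test_partial_stream_is_flagged():
    # A stream cut off mid-object must not raise.
    assert parse_eval_output('{"verdict": "pa')["verdict"] == "unparseable"

def test_valid_output_passes_through():
    assert parse_eval_output('{"verdict": "pass"}')["verdict"] == "pass"
```

Keeping the edge-case tests separate from the happy path makes it obvious which behavior a later "simplification" would silently drop.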

Research Gate (Exa + arXiv — BOTH MANDATORY)

Before proposing harness changes, gather current best practices:

  1. Use Exa for implementation and ecosystem patterns:
    • mcp__Exa__web_search_exa({ query: 'LLM eval harness 2025 best practices' })
    • mcp__Exa__get_code_context_exa({ query: 'eval harness parser reliability timeout handling' })
  2. Search arXiv for academic research on evaluation methodology (mandatory):
    • Via Exa: mcp__Exa__web_search_exa({ query: 'site:arxiv.org LLM evaluation harness 2024 2025' })
    • Direct API: WebFetch({ url: 'https://arxiv.org/search/?query=LLM+evaluation+harness&searchtype=all&start=0' })
  3. Record decisions, constraints, and non-goals in memory learnings.

arXiv is mandatory (not fallback) when topic involves: LLM evaluation, agent evaluation, SLO gates, regression testing methodology, or parser reliability.

Cross-Reference: Creator Ecosystem

This skill is part of the Creator Ecosystem. When research uncovers gaps, trigger the appropriate companion creator:

| Gap Discovered | Required Artifact | Creator to Invoke | When |
| --- | --- | --- | --- |
| Domain knowledge needs a reusable skill | skill | Skill({ skill: 'skill-creator' }) | Gap is a full skill domain |
| Existing skill has incomplete coverage | skill update | Skill({ skill: 'skill-updater' }) | Close skill exists but incomplete |
| Capability needs a dedicated agent | agent | Skill({ skill: 'agent-creator' }) | Agent to own the capability |
| Existing agent needs capability update | agent update | Skill({ skill: 'agent-updater' }) | Close agent exists but incomplete |
| Domain needs code/project scaffolding | template | Skill({ skill: 'template-creator' }) | Reusable code patterns needed |
| Behavior needs pre/post execution guards | hook | Skill({ skill: 'hook-creator' }) | Enforcement behavior required |
| Process needs multi-phase orchestration | workflow | Skill({ skill: 'workflow-creator' }) | Multi-step coordination needed |
| Artifact needs structured I/O validation | schema | Skill({ skill: 'schema-creator' }) | JSON schema for artifact I/O |
| User interaction needs a slash command | command | Skill({ skill: 'command-creator' }) | User-facing shortcut needed |
| Repeated logic needs a reusable CLI tool | tool | Skill({ skill: 'tool-creator' }) | CLI utility needed |
| Narrow/single-artifact capability only | inline | Document within this artifact only | Too specific to generalize |

Iron Laws

  1. ALWAYS run the Exa + arXiv research gate before updating any eval harness — updating without current external knowledge produces stale evaluation criteria.
  2. NEVER remove existing evaluation criteria without replacing them with equivalent or better ones — reducing test coverage in an eval harness is a regression.
  3. ALWAYS cross-reference the creator ecosystem for gaps before declaring the harness complete — missing companion artifacts (skills, agents, schemas) leave the harness unable to test new capabilities.
  4. NEVER update an eval harness in isolation from the skill/agent it evaluates — harness and artifact must stay synchronized, or the harness tests the wrong behavior.
  5. ALWAYS preserve backward compatibility in eval scoring — changing scoring semantics without migrating historical baselines makes trend analysis impossible.

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Updating eval harness without research gate | Criteria based on outdated knowledge; misses recent evaluation methodology advances | Always run Exa + arXiv research before updating any eval criteria |
| Removing test cases to simplify the harness | Silently reduces coverage; regressions pass undetected | Only remove test cases when the behavior they tested has been deliberately removed |
| Harness and artifact in separate PRs | Harness tests wrong behavior the moment the artifact changes; immediate test drift | Always update harness and artifact in the same commit |
| Changing scoring scale mid-project | Historical baselines become incomparable; trend analysis breaks | Define the scoring scale once; create a migration if it must change |
| Declaring harness complete without companion check | Missing skills or schemas leave evaluation gaps | Always run the companion artifact check before marking the harness update complete |

Memory Protocol (MANDATORY)

Before starting: Read .claude/context/memory/learnings.md

After completing:

  • New evaluation pattern → .claude/context/memory/learnings.md
  • Evaluation gap found → .claude/context/memory/issues.md
  • Scoring decision made → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.
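The memory protocol above amounts to appending dated entries to the right file. A minimal sketch, assuming the `.claude/context/memory/` layout listed above and a hypothetical `record` helper:

```python
from datetime import date
from pathlib import Path

# Routing table mirroring the memory protocol above.
MEMORY = {
    "learning": Path(".claude/context/memory/learnings.md"),
    "issue": Path(".claude/context/memory/issues.md"),
    "decision": Path(".claude/context/memory/decisions.md"),
}

def record(kind: str, text: str) -> str:
    """Append a dated bullet to the matching memory file and return it."""
    path = MEMORY[kind]
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = f"- [{date.today().isoformat()}] {text}\n"
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry

entry = record("decision", "Kept 0-5 scoring scale; migration deferred.")
```

Appending immediately after each finding, rather than batching at the end, is what makes the protocol interruption-safe.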