Agent Skills: /evaluate:improve


ID: laurigates/claude-plugins/evaluate-improve

Install this agent skill locally:

pnpm dlx add-skill https://github.com/laurigates/claude-plugins/tree/HEAD/evaluate-plugin/skills/evaluate-improve

Skill Files


evaluate-plugin/skills/evaluate-improve/SKILL.md

Skill Metadata

Name: evaluate-improve
Description:

/evaluate:improve

Analyze evaluation results and suggest concrete improvements to a skill.

When to Use This Skill

| Use this skill when... | Use alternative when... |
|------------------------|-------------------------|
| Have eval results and want to improve the skill | Need to run evals first -> /evaluate:skill |
| Want to improve skill description for better triggering | Want to view raw results -> /evaluate:report |
| Iterating on a skill to increase pass rate | Want to file a bug -> /feedback:session |
| Optimizing skill instructions after benchmarking | Need structural fixes -> plugin-compliance-check.sh |

Parameters

Parse these from $ARGUMENTS:

| Parameter | Default | Description |
|-----------|---------|-------------|
| <plugin/skill-name> | required | Path as plugin-name/skill-name |
| --apply | false | Apply approved changes to SKILL.md |
| --description-only | false | Focus on description improvements only |
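
The skill doesn't prescribe how the parsing is implemented; a minimal Python sketch of the intended behavior (the helper name and parse strategy are illustrative, not part of the skill) might look like:

```python
import sys

def parse_arguments(arguments: str) -> dict:
    # Illustrative parser for the raw $ARGUMENTS string: one positional
    # plugin/skill path plus two boolean flags, defaulting as in the table above.
    params = {"target": None, "apply": False, "description_only": False}
    for token in arguments.split():
        if token == "--apply":
            params["apply"] = True
        elif token == "--description-only":
            params["description_only"] = True
        elif params["target"] is None:
            params["target"] = token  # expected as plugin-name/skill-name
    if params["target"] is None:
        sys.exit("usage: /evaluate:improve <plugin/skill-name> [--apply] [--description-only]")
    return params
```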

Execution

Step 1: Load eval results

Read the most recent benchmark from:

<plugin-name>/skills/<skill-name>/eval-results/benchmark.json

If no results exist, suggest running /evaluate:skill first and stop.

Also read the current SKILL.md to understand the skill.
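
A minimal sketch of this step, assuming the path above and a plain JSON benchmark file (the function name is an illustration, not an existing helper):

```python
import json
from pathlib import Path

def load_eval_inputs(plugin: str, skill: str) -> tuple[dict, str]:
    # Load the most recent benchmark; stop with a pointer to /evaluate:skill
    # if no results exist yet.
    benchmark_path = Path(plugin) / "skills" / skill / "eval-results" / "benchmark.json"
    if not benchmark_path.exists():
        raise SystemExit(f"No eval results at {benchmark_path}. Run /evaluate:skill first.")
    benchmark = json.loads(benchmark_path.read_text())
    # Also read the current SKILL.md so the analyzer sees the skill as written.
    skill_md = (Path(plugin) / "skills" / skill / "SKILL.md").read_text()
    return benchmark, skill_md
```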

Step 2: Analyze results

Delegate analysis to the eval-analyzer agent via Task:

Task subagent_type: eval-analyzer
Prompt: Analyze these evaluation results and identify improvement opportunities.
  Skill: <path to SKILL.md>
  Benchmark: <benchmark.json contents>
  Mode: comparison (if baseline data exists) or benchmark (otherwise)

The analyzer produces categorized suggestions:

  • instructions: Execution flow improvements
  • description: Better intent-matching text
  • examples: Missing or insufficient examples
  • error_handling: Missing edge cases
  • tools: Better tool configurations
  • structure: Organizational improvements
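
The analyzer's exact output schema isn't specified here; one plausible shape for a single suggestion, with field names inferred from how suggestions are presented in Step 4, would be:

```python
# Hypothetical record; every field name here is an assumption.
suggestion = {
    "category": "description",  # one of the six categories above
    "priority": "high",         # high | medium | low
    "summary": "Add 'conventional commit' as trigger phrase",
    "evidence": "Skill not selected when user says 'make a conventional commit'",
}
```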

Step 3: Filter suggestions

If --description-only is set, keep only suggestions in the description category.

Sort remaining suggestions by priority (high > medium > low).
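
In code, this step reduces to a few lines; the sketch below assumes the hypothetical suggestion records shown in Step 2:

```python
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

def filter_and_sort(suggestions: list[dict], description_only: bool) -> list[dict]:
    # Keep only description-category suggestions when --description-only is set,
    # then order the rest high > medium > low.
    if description_only:
        suggestions = [s for s in suggestions if s["category"] == "description"]
    return sorted(suggestions, key=lambda s: PRIORITY_RANK[s["priority"]])
```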

Step 4: Present suggestions

Present the categorized suggestions to the user:

## Improvement Suggestions: <plugin/skill-name>

Current pass rate: 72%

### High Priority

1. **[instructions]** Add explicit error handling for missing git config
   Evidence: eval-003 fails because the skill doesn't check for git user.name

2. **[description]** Add "conventional commit" as trigger phrase
   Evidence: Skill not selected when user says "make a conventional commit"

### Medium Priority

3. **[examples]** Add breaking change example to execution steps
   Evidence: eval-004 inconsistently handles breaking changes

### Low Priority

4. **[structure]** Move flag reference to Quick Reference table
   Evidence: Flags scattered across multiple sections

If --apply is NOT set, stop here.

Step 5: Apply changes (if --apply)

Use AskUserQuestion to let the user select which suggestions to apply:

Which improvements should I apply?
[x] Add error handling for missing git config
[x] Add trigger phrases to description
[ ] Add breaking change example
[ ] Restructure flag reference

For each approved suggestion:

  1. Read the current SKILL.md
  2. Apply the change using Edit
  3. Update the modified date in frontmatter (sketched below)
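
The date bump is the one mechanical edit; a minimal sketch, assuming the frontmatter carries a `modified:` key (the actual field name may differ):

```python
import re
from datetime import date

def bump_modified_date(skill_md: str) -> str:
    # Rewrite the first 'modified:' line in the YAML frontmatter to today's date.
    return re.sub(
        r"^modified:.*$",
        f"modified: {date.today().isoformat()}",
        skill_md,
        count=1,
        flags=re.MULTILINE,
    )
```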

After applying changes, update (or create) the history file at:

<plugin-name>/skills/<skill-name>/eval-results/history.json

Add a new iteration entry recording (see the sketch after this list):

  • Version number (increment from previous)
  • Timestamp
  • Pass rate from current benchmark
  • Summary of changes made
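
A sketch of that update, assuming the entry fields listed above (the precise history.json schema is an assumption):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_iteration(history_path: Path, pass_rate: float, changes: list[str]) -> None:
    # Create the history file on first use, then append a versioned entry.
    if history_path.exists():
        history = json.loads(history_path.read_text())
    else:
        history = {"iterations": []}
    last = history["iterations"][-1]["version"] if history["iterations"] else 0
    history["iterations"].append({
        "version": last + 1,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": pass_rate,
        "changes": changes,
    })
    history_path.write_text(json.dumps(history, indent=2) + "\n")
```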

Step 6: Suggest re-evaluation

After applying changes, suggest:

Changes applied. Run `/evaluate:skill <plugin/skill-name>` to measure improvement.

Agentic Optimizations

| Context | Command |
|---------|---------|
| Read benchmark | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq .summary |
| Read skill | cat <plugin>/skills/<skill>/SKILL.md |
| Read history | cat <plugin>/skills/<skill>/eval-results/history.json \| jq '.iterations[-1]' |
| Check pass rate | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq '.summary.with_skill.mean_pass_rate' |

Quick Reference

| Flag | Description |
|------|-------------|
| --apply | Apply approved changes to SKILL.md |
| --description-only | Focus on description improvements only |