Paths: File paths (shared/, references/) are relative to the skills repo root. To find the repo root, locate this SKILL.md directory and go up one level.
Benchmark Compare
Type: L3 Worker
Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, and manifest-driven. It measures activation, correctness, time, cost, and tokens.
Input / Output
| Direction | Content |
|-----------|----------|
| Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json |
| Output | Comparison report in skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md |
Prerequisites
- claude --version succeeds
- git succeeds
- mcp/hex-line-mcp/server.mjs exists
- mcp/hex-line-mcp/hook.mjs exists
- skills-catalog/ln-840-benchmark-compare/references/goals.md exists
- skills-catalog/ln-840-benchmark-compare/references/expectations.json exists
- skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json exists
Quick Run
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
[skills-catalog/ln-840-benchmark-compare/references/goals.md] \
[skills-catalog/ln-840-benchmark-compare/references/expectations.json]
The runner handles:
- syntax preflight
- SessionStart preflight
- scenario extraction from goals.md
- isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
Workflow
Phase 1: Define The Canonical Suite
Use one canonical pair owned by this skill:
- skills-catalog/ln-840-benchmark-compare/references/goals.md
- skills-catalog/ln-840-benchmark-compare/references/expectations.json
Rules:
- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor hex-line.
- Every scenario in goals.md must have a matching entry in expectations.json.
- expectations.json is the source of truth for correctness.
Supported expectation fields per scenario:
| Field | Meaning |
|-------|---------|
| id | Scenario identifier used in result filenames |
| expectedChangedFiles | Files that must change |
| forbiddenChangedFiles | Files that must not change |
| requiredDiffPatterns | Regex patterns required in the saved diff |
| forbiddenDiffPatterns | Regex patterns that must not appear in the diff |
| requiredResultPatterns | Regex patterns required in the final assistant result text |
| requiredCommands | Regex patterns that must match at least one Bash command |
| exactChangedFiles | If true, no extra changed files are allowed |
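To make the field table concrete, here is a minimal sketch of one scenario entry plus a shape check. The scenario id, file paths, and patterns are invented for illustration, and `checkExpectationShape` is a hypothetical helper, not part of the runner. The sketch also assumes every field except `id` is optional, which the table does not state.

```javascript
// Hypothetical scenario entry; every value below is made up for illustration.
const exampleExpectation = {
  id: "rename-helper",
  expectedChangedFiles: ["src/utils/format.js"],
  forbiddenChangedFiles: ["package.json"],
  requiredDiffPatterns: ["formatCurrency"],
  forbiddenDiffPatterns: ["console\\.log"],
  requiredResultPatterns: ["renamed"],
  requiredCommands: ["^npm test"],
  exactChangedFiles: true,
};

// Returns a list of problems; an empty list means the entry looks well formed.
function checkExpectationShape(entry) {
  const problems = [];
  if (typeof entry.id !== "string" || entry.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  const fileFields = ["expectedChangedFiles", "forbiddenChangedFiles"];
  const regexFields = [
    "requiredDiffPatterns",
    "forbiddenDiffPatterns",
    "requiredResultPatterns",
    "requiredCommands",
  ];
  for (const field of [...fileFields, ...regexFields]) {
    const value = entry[field];
    if (value === undefined) continue; // assumed optional in this sketch
    if (!Array.isArray(value) || value.some((v) => typeof v !== "string")) {
      problems.push(`${field} must be an array of strings`);
      continue;
    }
    if (regexFields.includes(field)) {
      for (const source of value) {
        try {
          new RegExp(source); // throws SyntaxError on an invalid pattern
        } catch {
          problems.push(`${field}: invalid regex ${source}`);
        }
      }
    }
  }
  if (
    entry.exactChangedFiles !== undefined &&
    typeof entry.exactChangedFiles !== "boolean"
  ) {
    problems.push("exactChangedFiles must be a boolean");
  }
  return problems;
}
```

Running a check like this before the benchmark starts would surface malformed regex patterns early instead of during result parsing.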
Phase 2: Preflight
The runner must pass:
- node --check server.mjs
- node --check hook.mjs
- node --check extract-scenarios.mjs
- node --check parse-results.mjs
- SessionStart smoke check from hook.mjs
If preflight fails, the benchmark is invalid and must stop before scenarios run.
Phase 3: Execute Per Scenario
For each ## scenario in goals.md:
- generate a standalone prompt file
- create two clean worktrees from the same commit
- run built-in Claude session
- run hex-line Claude session
- save .jsonl logs and .diff.txt artifacts
- remove both worktrees
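The first step, turning each `##` section of goals.md into a standalone prompt, could look roughly like the sketch below. This is not the actual extract-scenarios.mjs; `extractScenarios` and the `title`/`prompt` field names are assumptions made for illustration.

```javascript
// Splits markdown into scenarios, one per level-2 heading.
// Text before the first "## " heading is ignored; deeper headings
// ("###" and below) stay inside the scenario body.
function extractScenarios(markdown) {
  const scenarios = [];
  let current = null;
  for (const line of markdown.split(/\r?\n/)) {
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      if (current) scenarios.push(current);
      current = { title: heading[1], bodyLines: [] };
    } else if (current) {
      current.bodyLines.push(line);
    }
  }
  if (current) scenarios.push(current);
  return scenarios.map((s) => ({
    title: s.title,
    prompt: s.bodyLines.join("\n").trim(),
  }));
}
```

One limitation of this sketch: a line starting with `## ` inside a fenced code block would be treated as a new scenario, so a real extractor may need to track fences.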
Built-in session:
- no MCP
- hooks disabled
Hex-line session:
- resolved MCP config pointing to server.mjs
- outputStyle: "hex-line"
- PreToolUse hook through hook.mjs
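As a rough illustration of how the two sessions differ, the sketch below builds the configuration objects a runner might write out before launching each session. Both function names are hypothetical, and so are the `"hex-line"` server key, the `"*"` matcher, and the use of empty objects to represent "no MCP, hooks disabled". The real runner may wire these differently.

```javascript
// Builds the assumed config for the hex-line session.
// Paths are passed in already resolved; nothing here touches the file system.
function buildHexLineSessionConfig(serverPath, hookPath) {
  return {
    // Assumed MCP config payload; the server key "hex-line" is a guess.
    mcpConfig: {
      mcpServers: {
        "hex-line": { command: "node", args: [serverPath] },
      },
    },
    // Assumed settings payload; the "*" matcher is illustrative only.
    settings: {
      outputStyle: "hex-line",
      hooks: {
        PreToolUse: [
          {
            matcher: "*",
            hooks: [{ type: "command", command: `node ${hookPath}` }],
          },
        ],
      },
    },
  };
}

// Builds the assumed config for the built-in session: no MCP servers, no hooks.
function buildBuiltInSessionConfig() {
  return { mcpConfig: { mcpServers: {} }, settings: { hooks: {} } };
}
```

Keeping both builders side by side makes it easy to confirm that the only differences between the sessions are the MCP server, the output style, and the hook.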
Phase 4: Parse Results
parse-results.mjs evaluates each scenario for both sessions.
Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
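The pass criteria above can be sketched as a single pure function. This is not the real parse-results.mjs. The `run` object shape (`valid`, `completed`, `changedFiles`, `diff`, `resultText`, `bashCommands`) is an assumption, as is compiling each pattern with the multiline flag.

```javascript
// Evaluates one session's run against one scenario's expectations.
// Returns { pass, failures } where failures lists every unmet criterion.
function evaluateScenario(expectation, run) {
  const failures = [];
  if (!run.valid) failures.push("invalid run");
  if (!run.completed) failures.push("session did not complete");
  if (typeof run.diff !== "string") failures.push("diff artifact missing");

  const changed = new Set(run.changedFiles ?? []);
  for (const file of expectation.expectedChangedFiles ?? []) {
    if (!changed.has(file)) failures.push(`expected change missing: ${file}`);
  }
  for (const file of expectation.forbiddenChangedFiles ?? []) {
    if (changed.has(file)) failures.push(`forbidden file changed: ${file}`);
  }
  if (expectation.exactChangedFiles) {
    const allowed = new Set(expectation.expectedChangedFiles ?? []);
    for (const file of changed) {
      if (!allowed.has(file)) failures.push(`unexpected change: ${file}`);
    }
  }

  // Assumption: patterns are matched per line ("m" flag), not anchored to the whole text.
  const matches = (source, text) => new RegExp(source, "m").test(text);
  const diff = run.diff ?? "";
  const resultText = run.resultText ?? "";
  for (const p of expectation.requiredDiffPatterns ?? []) {
    if (!matches(p, diff)) failures.push(`required diff pattern missing: ${p}`);
  }
  for (const p of expectation.forbiddenDiffPatterns ?? []) {
    if (matches(p, diff)) failures.push(`forbidden diff pattern found: ${p}`);
  }
  for (const p of expectation.requiredResultPatterns ?? []) {
    if (!matches(p, resultText)) failures.push(`required result pattern missing: ${p}`);
  }
  for (const p of expectation.requiredCommands ?? []) {
    const ran = (run.bashCommands ?? []).some((cmd) => matches(p, cmd));
    if (!ran) failures.push(`required command not run: ${p}`);
  }
  return { pass: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first makes the report more useful when a scenario misses several criteria at once.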
Phase 5: Read The Report
The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity
Interpretation rules:
- invalid run means setup/adoption failure, not product performance
- scenario FAIL means correctness contract was not met
- activation is part of product quality for hex-line, not external noise
Report Contract
skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md must answer:
- Did each scenario complete correctly?
- Did hex-line activate cleanly without discovery drift?
- What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?
Do not treat raw time/cost as sufficient without scenario correctness.
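One way to enforce that rule is to compute metric deltas only when both sessions passed the scenario. The sketch below assumes a hypothetical per-session summary object with `pass`, `wallTimeMs`, `apiTimeMs`, `costUsd`, `outputTokens`, and `toolCalls` fields. These names are illustrative and may not match what the report generator uses.

```javascript
// Compares hex-line against built-in for one scenario.
// Deltas are withheld unless both sessions met the correctness contract.
function compareSessions(builtIn, hexLine) {
  if (!(builtIn.pass && hexLine.pass)) {
    return { comparable: false, deltas: null };
  }
  const metrics = ["wallTimeMs", "apiTimeMs", "costUsd", "outputTokens", "toolCalls"];
  const deltas = {};
  for (const metric of metrics) {
    const base = builtIn[metric];
    const absolute = hexLine[metric] - base;
    deltas[metric] = {
      absolute,
      // Percent change relative to the built-in session; null when the base is zero.
      percent: base === 0 ? null : (absolute / base) * 100,
    };
  }
  return { comparable: true, deltas };
}
```

A negative delta means hex-line used less of that resource than the built-in session for the same correct outcome.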
Known Pitfalls
| Pitfall | Solution |
|---------|----------|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into ToolSearch before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
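For the stale-worktree pitfall, the recovery sequence can be expressed as a list of git commands to run in order. `planWorktreeReset` is a hypothetical helper that only builds the argument lists. It assumes the caller runs them and tolerates a non-zero exit from the first command when no stale worktree exists.

```javascript
// Returns the git commands (as argv arrays) that clear a leftover worktree
// and recreate it at the given commit. Nothing is executed here.
function planWorktreeReset(worktreePath, commit) {
  return [
    // Drop the leftover worktree even if it has uncommitted changes.
    ["git", "worktree", "remove", "--force", worktreePath],
    // Clean up administrative entries for worktrees whose directories are gone.
    ["git", "worktree", "prune"],
    // Recreate the worktree in detached-HEAD state at the benchmark commit.
    ["git", "worktree", "add", "--detach", worktreePath, commit],
  ];
}
```

Using the same `commit` value for both the built-in and hex-line worktrees keeps the two sessions starting from identical state.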
Definition of Done
- [ ] goals.md defines the canonical balanced suite
- [ ] expectations.json fully describes scenario correctness
- [ ] Runner passes syntax and SessionStart preflight
- [ ] Each scenario runs in two clean worktrees from the same commit
- [ ] Parser evaluates activation and scenario correctness from logs plus diffs
- [ ] Final report is saved to skills-catalog/ln-840-benchmark-compare/results/
- [ ] Temporary worktrees are removed
Version: 2.0.0
Last Updated: 2026-03-24