Agent Skills: Benchmark Compare

Runs built-in vs hex-line benchmark with scenario manifests, activation checks, and diff-based correctness. Use when measuring hex-line MCP performance against built-in tools.

UncategorizedID: levnikolaevich/claude-code-skills/ln-840-benchmark-compare

Install this agent skill to your local

pnpm dlx add-skill https://github.com/levnikolaevich/claude-code-skills/tree/HEAD/plugins/optimization-suite/skills/ln-840-benchmark-compare

Skill Files

Browse the full folder contents for ln-840-benchmark-compare.

Download Skill

Loading file tree…

plugins/optimization-suite/skills/ln-840-benchmark-compare/SKILL.md

Skill Metadata

Name
ln-840-benchmark-compare
Description
"Use when benchmarking hex-line MCP against Claude built-in tools with scenario manifests, activation checks, and diff-based correctness."

Paths: File paths (references/) are relative to this skill directory.

Benchmark Compare

Type: L3 Worker Category: 8XX Optimization -> 840 Benchmark

Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.


Input / Output

| Direction | Content | |-----------|----------| | Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json | | Output | Comparison report in plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md plus machine-readable benchmark summary artifact |


Prerequisites

  • claude --version succeeds
  • git succeeds
  • mcp/hex-line-mcp/server.mjs exists
  • mcp/hex-line-mcp/hook.mjs exists
  • plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md exists
  • plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json exists
  • plugins/optimization-suite/skills/ln-840-benchmark-compare/references/mcp-bench.json exists

Quick Run

bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md] \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json]

Optional extra session profile:

EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh

Monitor Integration (Claude Code 2.1.98+)

MANDATORY READ: Load references/monitor_integration_pattern.md

Stream benchmark progress: Monitor(command="bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")

Fallback: Bash(run_in_background=true).

The runner handles:

  • syntax preflight
  • SessionStart preflight
  • scenario extraction from goals.md
  • isolated worktrees per scenario/session
  • per-scenario diffs
  • final comparison report

Current scope:

  • built-in Claude session
  • Claude plus hex-line
  • optional third Claude-compatible session profile through EXTRA_SESSION_* environment variables

External baseline note:

  • use the same goals.md and expectations.json
  • do not rewrite scenarios to fit the external tool
  • do not make "top tool" claims from the internal A/B alone
  • the optional third session profile is only valid when it can emit the same stream-json log shape and diff artifacts

Workflow

Phase 1: Define The Canonical Suite

Use one canonical pair owned by this skill:

  • plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md
  • plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json

Rules:

  • The suite must be a balanced mix of common engineering scenarios.
  • Do not design the suite to favor hex-line.
  • Every scenario in goals.md must have a matching entry in expectations.json.
  • expectations.json is the source of truth for correctness.
  • The same pair must be reused unchanged for any future external baseline.

Supported expectation fields per scenario:

| Field | Meaning | |-------|---------| | id | Scenario identifier used in result filenames | | expectedChangedFiles | Files that must change | | forbiddenChangedFiles | Files that must not change | | requiredDiffPatterns | Regex patterns required in the saved diff | | forbiddenDiffPatterns | Regex patterns that must not appear in the diff | | requiredResultPatterns | Regex patterns required in the final assistant result text | | requiredCommands | Regex patterns that must match at least one Bash command | | exactChangedFiles | If true, no extra changed files are allowed |

Phase 2: Preflight

The runner must pass:

  • node --check server.mjs
  • node --check hook.mjs
  • node --check extract-scenarios.mjs
  • node --check parse-results.mjs
  • SessionStart smoke check from hook.mjs

If preflight fails, the benchmark is invalid and must stop before scenarios run.

Phase 3: Execute Per Scenario

For each ## scenario in goals.md:

  1. generate a standalone prompt file
  2. create two clean worktrees from the same commit
  3. run built-in Claude session
  4. run hex-line Claude session
  5. save .jsonl logs and .diff.txt artifacts
  6. remove both worktrees

Built-in session:

  • no MCP
  • hooks disabled

Hex-line session:

  • resolved MCP config pointing to server.mjs
  • outputStyle: "hex-line"
  • PreToolUse hook through hook.mjs

Phase 4: Parse Results

parse-results.mjs evaluates each scenario for both sessions.

Scenario pass requires:

  • valid run
  • successful session completion
  • changed files match expectations
  • diff patterns match expectations
  • result text patterns match expectations
  • required commands were actually executed

Phase 5: Read The Report

The final report has these sections:

  • Scenario Outcomes
  • Activation
  • Time
  • Cost
  • Tokens
  • Tool Totals
  • Validity

Interpretation rules:

  • invalid run means setup/adoption failure, not product performance
  • scenario FAIL means correctness contract was not met
  • activation is part of product quality for hex-line, not external noise
  • this report is necessary for internal A/B evaluation, but not sufficient for best-alternative claims

Report Contract

plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md must answer:

  • Did each scenario complete correctly?
  • Did hex-line activate cleanly without discovery drift?
  • What changed in wall time, API time, cost, output tokens, and total tool calls?
  • Was the run valid?

Do not treat raw time/cost as sufficient without scenario correctness.

External Baseline Policy

  • This skill owns the canonical suite, not a universal leaderboard.
  • If maintainers compare hex-line against external alternatives, they must reuse the same goals.md, expectations.json, and diff-based evaluation rules.
  • External runs may use different harnesses, but they must preserve the same task text, starting commit, and correctness contract.
  • If an external tool cannot satisfy the contract format, record that as a harness limitation instead of rewriting the suite to accommodate it.
  • A report that only covers built-in Claude vs hex-line must say so explicitly.

Runtime Contract

MANDATORY READ: Load references/benchmark_worker_runtime_contract.md, references/coordinator_summary_contract.md

Runtime CLI:

node references/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node references/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default

Required state fields:

  • report_ready
  • summary_recorded
  • final_result
  • self_check_passed

Domain checkpoints:

  • PHASE_0_CONFIG
  • PHASE_1_PREFLIGHT
  • PHASE_2_LOAD_SUITE
  • PHASE_3_RUN_SCENARIOS
  • PHASE_4_PARSE_RESULTS
  • PHASE_5_WRITE_REPORT
  • PHASE_6_WRITE_SUMMARY
  • PHASE_7_SELF_CHECK

Guard rules:

  • do not advance without checkpointing the current phase
  • do not complete before benchmark-worker summary is recorded
  • do not complete before self-check passes

Runtime Coordination

  • Managed runs may pass deterministic runId and exact summaryArtifactPath.
  • Standalone runs are supported. If both are omitted, runtime creates a standalone run and writes the default summary artifact path for the benchmark-worker family.

Runtime Summary Artifact

MANDATORY READ: Load references/coordinator_summary_contract.md

Emit a benchmark-worker summary envelope after the comparison report is written.

Managed mode:

  • write to the exact summaryArtifactPath

Standalone mode:

  • write .hex-skills/runtime-artifacts/runs/{run_id}/benchmark-worker/ln-840-benchmark-compare--{identifier}.json

Recommended payload:

  • scenarios_total
  • scenarios_passed
  • scenarios_failed
  • activation_valid
  • validity_verdict
  • report_path
  • warnings
  • metrics

Known Pitfalls

| Pitfall | Solution | |---------|----------| | SessionStart not present in hex-line run | Fail preflight and stop | | Agent drifts into ToolSearch before hex-line use | Treat as activation problem and capture in report | | Worktree already exists from prior crash | Remove it before adding a new one | | Diff artifacts missing | Treat scenario correctness as failed | | Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking | | External comparison uses edited scenarios or relaxed expectations | Treat the comparison as invalid |


Definition of Done

  • [ ] goals.md defines the canonical balanced suite
  • [ ] expectations.json fully describes scenario correctness
  • [ ] Runner passes syntax and SessionStart preflight
  • [ ] Each scenario runs in two clean worktrees from the same commit
  • [ ] Parser evaluates activation and scenario correctness from logs plus diffs
  • [ ] Final report is saved to plugins/optimization-suite/skills/ln-840-benchmark-compare/results/
  • [ ] benchmark-worker summary artifact is written to the managed or standalone runtime path
  • [ ] Temporary worktrees are removed
  • [ ] Report states clearly whether it is internal A/B only or includes additional external baselines

Version: 2.0.0 Last Updated: 2026-03-24