Bare Eval — Isolated Evaluation Calls
Run claude -p --bare for fast, clean eval/grading without plugin overhead.
CC 2.1.81 required. The --bare flag skips hooks, LSP, plugin sync, and skill directory walks.
When to Use
- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted `-p` call that doesn't need plugins
When NOT to Use
- Testing skill routing (needs `--plugin-dir`)
- Testing agent orchestration (needs full plugin context)
- Interactive sessions
Prerequisites
```bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain auth is disabled)
export ANTHROPIC_API_KEY="sk-ant-..."

# Verify CC version
claude --version   # must be >= 2.1.81
```
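A quick smoke test before a full eval batch can confirm both pieces at once (a minimal sketch; the prompt and expected reply are illustrative, not part of the eval scripts):

```bash
# Fail fast if the key is missing, then confirm a bare one-turn call round-trips.
if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
  echo "ANTHROPIC_API_KEY is not set" >&2
  exit 1
fi

claude -p "Reply with the single word OK" --bare --max-turns 1 --output-format text
```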
Quick Reference
| Call Type | Command Pattern |
|-----------|----------------|
| Grading | claude -p "$prompt" --bare --max-turns 1 --output-format text |
| Trigger | claude -p "$prompt" --bare --json-schema "$schema" --output-format json |
| Streaming grade | claude -p "$prompt" --bare --max-turns 1 --output-format stream-json |
| Optimize | echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text |
| Force-skill | claude -p "$prompt" --bare --print --append-system-prompt "$content" |
| @-file in prompt | claude -p "grade @fixtures/case-1.md against rubric" --bare (CC 2.1.113 Remote Control autocomplete) |
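A single grading call from the table can be wired up roughly like this (a sketch; the rubric path, output path, and PASS/FAIL convention are hypothetical, not defined by this skill):

```bash
#!/usr/bin/env bash
set -euo pipefail

rubric="$(cat fixtures/rubric.md)"        # assumed rubric location
candidate="$(cat out/skill-output.md)"    # assumed output under test

prompt="Grade the following output against the rubric.

Rubric:
$rubric

Output:
$candidate

Answer PASS or FAIL on the first line, then one sentence of justification."

# One-turn bare call; plain text output keeps downstream parsing trivial.
claude -p "$prompt" --bare --max-turns 1 --output-format text
```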
--output-format stream-json
Emits newline-delimited JSON events (one per token or tool call), which lets a runner score partial output or abort early on a failing probe without waiting for the full response.
claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
| while IFS= read -r line; do
# line is a single JSON event; inspect $.type == "content_block_delta"
jq -r 'select(.type == "content_block_delta") | .delta.text' <<< "$line"
done
Use stream-json over json when:
- grading long outputs and you want incremental scoring,
- piping into another CLI step by step (e.g. `ork:eval-runner`),
- you need per-token timing data alongside the content.
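A sketch of the early-abort pattern (assumptions: the failing probe is the literal string "FAIL" in the streamed text, and events follow the `content_block_delta` shape shown above):

```bash
# Abort the pipeline as soon as a streamed delta contains the failure marker.
claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
| while IFS= read -r line; do
    text="$(jq -r 'select(.type == "content_block_delta") | .delta.text // empty' <<< "$line")"
    if [ -n "$text" ]; then
      printf '%s' "$text"
    fi
    case "$text" in
      *FAIL*)
        echo "failing probe detected; aborting early" >&2
        break   # stop consuming events; the upstream CLI exits on its next write
        ;;
    esac
  done
```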
Invocation Patterns
Load detailed patterns and examples:
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
Grading Schemas
JSON schemas for structured eval output:
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
Pipeline Integration
OrchestKit's eval scripts (`npm run eval:skill`) auto-detect bare mode:

```bash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
```
Bare calls: Trigger classification, force-skill, baseline, all grading.
Never bare: run_with_skill (needs plugin context for routing tests).
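The auto-detection has roughly the following shape (a sketch, not the actual eval-common.sh source; names other than BARE_MODE are illustrative):

```bash
# Detect bare mode once, then let each call site ask for the flag.
if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
  BARE_MODE=true
else
  BARE_MODE=false
fi

bare_flag() {
  # Emit --bare only when the harness decided bare mode is safe for this call.
  [ "$BARE_MODE" = true ] && echo "--bare"
}

# Grading and trigger calls take the flag; run_with_skill-style routing tests never do.
claude -p "$prompt" $(bare_flag) --max-turns 1 --output-format text
```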
CC 2.1.119: --print honors agent tools: / disallowedTools: (M122)
Before CC 2.1.119, --print mode ran with the full default tool set regardless of the agent's frontmatter tools: and disallowedTools:. Bare-eval grading was effectively ungated — graders could call any tool they wanted, even if the agent definition restricted them.
As of 2.1.119, --print enforces the agent's declared tool surface. Implications for eval design:
| Consequence | Action |
|---|---|
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for tools they actually need; whitelist explicitly via the agent's tools: frontmatter |
| Eval results match interactive runs | Reproducibility improves — grading what the model can actually do, not what it could do in an unsandboxed --print |
| --agent <name> also honors permissionMode in --print | Permission-gated tools (Bash, Edit) require either permissionMode: acceptEdits or explicit allowlists in the agent definition |
Migration test:
```bash
# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test
```
If the grader fails with a "tool not permitted" error, add the required tool to the agent's tools: frontmatter and re-run.
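A sketch of such a deliberately tight agent definition (assumption: agents are markdown files with YAML frontmatter under .claude/agents/; the grader-test name matches the command above, the rest of the frontmatter is illustrative):

```bash
# Write a grader agent that may only Read; any Bash/Edit call from the grader
# prompt should now fail under CC >= 2.1.119 --print enforcement.
mkdir -p .claude/agents
cat > .claude/agents/grader-test.md <<'EOF'
---
name: grader-test
description: Minimal grading agent for bare-eval migration testing
tools: Read
---
Grade the provided output against the rubric. Respond PASS or FAIL with one sentence of justification.
EOF

claude -p "$prompt" --bare --print --agent grader-test
```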
CC 2.1.121: CLAUDE_CODE_FORK_SUBAGENT=1 for grader determinism (#1545)
Before CC 2.1.121, the env var only worked in interactive sessions. As of 2.1.121, non-interactive paths (claude -p, SDK) honor it too — each grader invocation gets a fresh forked subagent context.
The cross-eval state-leak problem this fixes:
Without forking, sequential claude -p --bare graders inherit harness state:
| Inherited | Symptom |
|---|---|
| memory MCP query cache | grader sees stale hit from previous run; same fixture grades differently |
| .claude/chain/*.json on disk | grader for "implement" thinks "explore" already ran (file is from previous test) |
| ToolSearch deferred-tool cache | first grader's MCP loads bleed into next grader's tool registry |
| model picker pref | grader N inherits --model=opus from grader N-1 |
This produced a ~5–10% retry rate and non-reproducible scores — the eval baseline drifted between runs, and engineers chased phantom regressions.
Fix: tests/evals/scripts/lib/eval-common.sh exports CLAUDE_CODE_FORK_SUBAGENT=1, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow .github/workflows/orchestkit-eval.yml also sets it at the workflow level. Older CC silently ignores the env var (no-op).
Determinism contract: running the same grader on the same fixture twice in a row produces the same score. Verified by tests/evals/scripts/test-grader-determinism.sh.
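The contract can be spot-checked with a two-run comparison like this (a sketch of the idea behind test-grader-determinism.sh, not its actual contents; the fixture path is hypothetical):

```bash
export CLAUDE_CODE_FORK_SUBAGENT=1   # each -p call gets a fresh forked subagent context

prompt="Grade @fixtures/case-1.md against the rubric. Answer PASS or FAIL on the first line."

# Identical grader, run twice; compare the verdict line only.
verdict1="$(claude -p "$prompt" --bare --max-turns 1 --output-format text | head -n 1)"
verdict2="$(claude -p "$prompt" --bare --max-turns 1 --output-format text | head -n 1)"

if [ "$verdict1" = "$verdict2" ]; then
  echo "deterministic: $verdict1"
else
  echo "non-deterministic: $verdict1 vs $verdict2" >&2
  exit 1
fi
```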
Performance
| Scenario | Without --bare | With --bare | Savings |
|----------|---------------|-------------|---------|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
Rules
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
Troubleshooting
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
Related
- `eval:skill` npm script — unified skill evaluation runner
- `eval:trigger` — trigger accuracy testing
- `eval:quality` — A/B quality comparison
- `optimize-description.sh` — iterative description improvement
- Version compatibility: doctor/references/version-compatibility.md