Behavioral Evals
Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
  - Yes -> Use appEvalTest (AppRig). See creating.md.
  - No -> Use evalTest (TestRig). See creating.md.
- Is it a new test?
  - Yes -> Set policy to USUALLY_PASSES.
  - No -> ALWAYS_PASSES (locks in regression).
- Are you fixing a failure or promoting a test?
  - Fixing -> See fixing.md.
  - Promoting -> See promoting.md.
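
The first three questions of the tree can be read as a small pure function. The sketch below is illustrative only: `ChangeInfo`, `EvalPlan`, and `planEval` are hypothetical names invented for this example and are not part of the eval framework. Only the helper names (`appEvalTest`, `evalTest`), rig names, and policy names come from the tree itself.

```typescript
// Hypothetical types for illustration; not part of the eval framework.
type Helper = "appEvalTest" | "evalTest";
type Rig = "AppRig" | "TestRig";
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface ChangeInfo {
  needsValidation: boolean; // does the prompt/tool change need validation?
  uiHeavy: boolean; // is the scenario UI/interaction heavy?
  isNewTest: boolean; // is this a brand-new eval?
}

interface EvalPlan {
  helper: Helper;
  rig: Rig;
  policy: Policy;
}

// Returns null when normal integration tests are sufficient.
function planEval(change: ChangeInfo): EvalPlan | null {
  if (!change.needsValidation) {
    return null;
  }
  return {
    helper: change.uiHeavy ? "appEvalTest" : "evalTest",
    rig: change.uiHeavy ? "AppRig" : "TestRig",
    // New tests start lenient; existing tests lock in the regression.
    policy: change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}
```

For example, a new eval for a non-UI prompt change maps to `evalTest` with `TestRig` and the `USUALLY_PASSES` policy.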
Quick Checklist
1. Setup Workspace
Seed the workspace with the files the scenario needs, using the files object, so the agent works against a realistic project (e.g., a Node.js project with a package.json).
- Details in creating.md
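
As a rough illustration of seeding, the sketch below builds a small Node.js-style workspace. It assumes the files object is a map from relative path to file contents. That shape, the file names, and the contents are assumptions made for this example; creating.md describes the actual format and how the object is passed to the test helper.

```typescript
// Assumed shape: relative path -> file contents. Confirm against creating.md.
const files: Record<string, string> = {
  // A minimal manifest so the agent recognizes a Node.js project.
  "package.json": JSON.stringify(
    { name: "sample-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  // One source file gives the agent something concrete to read or edit.
  "src/index.ts": "export const greet = (name: string) => `Hello, ${name}`;\n",
};
```

Keeping the seed small makes it easier to attribute an agent decision to the prompt rather than to incidental workspace content.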
2. Write Assertions
Audit agent decisions using rig.setBreakpoint() (AppRig only) or by checking the position (index) of the expected tool calls in the output of rig.readToolLogs().
- Details in creating.md
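
Index verification can be sketched as comparing the positions of tool calls in a log array. The `ToolLogEntry` shape, the helper functions, and the tool names below are hypothetical stand-ins. The real structure returned by rig.readToolLogs() may differ, so treat this as a sketch of the idea rather than the framework's API.

```typescript
// Hypothetical log entry shape; the real rig.readToolLogs() output may differ.
interface ToolLogEntry {
  toolName: string;
  args?: Record<string, unknown>;
}

// Index of the first call to `toolName`, or -1 if it was never called.
function firstCallIndex(logs: ToolLogEntry[], toolName: string): number {
  return logs.findIndex((entry) => entry.toolName === toolName);
}

// True when both tools were called and `earlier` first appears before `later`.
function calledBefore(
  logs: ToolLogEntry[],
  earlier: string,
  later: string,
): boolean {
  const earlierIndex = firstCallIndex(logs, earlier);
  const laterIndex = firstCallIndex(logs, later);
  return earlierIndex !== -1 && laterIndex !== -1 && earlierIndex < laterIndex;
}

// Stand-in data with made-up tool names, not real agent output.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file", args: { path: "package.json" } },
  { toolName: "edit_file", args: { path: "src/index.ts" } },
];
```

In a Vitest test, the result of `calledBefore(sampleLogs, "read_file", "edit_file")` could be passed to `expect(...).toBe(true)` to assert that the agent inspected the project before modifying it.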
3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
- See evals/README.md for running commands.
Bundled Resources
Detailed procedural guides:
- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.