Agent Evaluation Methods
Agent evaluation requires different approaches than traditional software testing. Agents are non-deterministic, may take different valid paths to the same goal, and often have no single correct answer.
Key Finding: 95% Performance Drivers
Research on the BrowseComp benchmark found that three factors explain 95% of the variance in agent performance:
| Factor | Variance explained | Implication |
|--------|--------------------|-------------|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: model upgrades beat raw token increases, and the dominance of token usage helps validate multi-agent architectures, which scale token spend in parallel.
Multi-Dimensional Rubric
| Dimension | Excellent | Good | Acceptable | Failed |
|-----------|-----------|------|------------|--------|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
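To aggregate rubric ratings into a single number, the qualitative levels can be mapped onto a 0-1 scale and combined with per-dimension weights. A minimal sketch; the numeric values and weights below are illustrative assumptions, not part of the rubric:

```python
# Hypothetical mapping from rubric labels to numeric scores (assumed values).
RUBRIC_SCORES = {"Excellent": 1.0, "Good": 0.75, "Acceptable": 0.5, "Failed": 0.0}

# Illustrative per-dimension weights (assumed, not from the rubric itself).
WEIGHTS = {
    "factual_accuracy": 0.4,
    "completeness": 0.3,
    "citation_accuracy": 0.2,
    "tool_efficiency": 0.1,
}

def overall_score(ratings: dict) -> float:
    """Weighted average of rubric ratings, e.g. {'factual_accuracy': 'Good', ...}."""
    return sum(WEIGHTS[dim] * RUBRIC_SCORES[label] for dim, label in ratings.items())
```

Weighting factual accuracy highest reflects one common choice; the right weights depend on the task.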
LLM-as-Judge
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}
Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)
Provide scores and reasoning.
"""
Test Set Design
```python
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"},
]
```
Evaluation Pipeline
```python
def evaluate_agent(agent, test_set):
    """Run each test through the agent and score the output with an LLM judge."""
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)  # per-dimension scores plus "overall"
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,
        })
    return results
```
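The per-test results are more useful once aggregated. A small sketch that reduces a results list (in the shape produced above) to a pass rate and mean overall score; the shape of the input is the only assumption:

```python
def summarize(results: list) -> dict:
    """Aggregate per-test results into a pass rate and mean overall score.

    Expects dicts shaped like {"test": ..., "scores": {"overall": float},
    "passed": bool}, matching the evaluation pipeline's output.
    """
    n = len(results)
    passed = sum(1 for r in results if r["passed"])
    mean_overall = sum(r["scores"]["overall"] for r in results) / n
    return {"pass_rate": passed / n, "mean_overall": mean_overall}
```

Reporting both numbers matters: a high mean can hide a cluster of hard failures, and vice versa.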
Complexity Stratification
| Level | Characteristics |
|-------|-----------------|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very Complex | Extended interaction, deep reasoning |
Context Engineering Evaluation
Test context strategies systematically:
- Run agents with different strategies on same tests
- Compare quality scores, token usage, efficiency
- Identify degradation cliffs at different context sizes
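One way to spot a degradation cliff is to score the same tests at increasing context sizes and flag the first size where quality drops sharply relative to the previous one. A minimal sketch; the input shape and the 0.15 drop threshold are illustrative assumptions:

```python
def find_degradation_cliff(scores_by_size: list, drop: float = 0.15):
    """Return the first context size where quality falls by more than
    `drop` versus the previous size, or None if no cliff is found.

    `scores_by_size` is a list of (context_tokens, quality_score) pairs
    sorted by context size; the default threshold is an assumed value.
    """
    for (_, prev_score), (size, score) in zip(scores_by_size, scores_by_size[1:]):
        if prev_score - score > drop:
            return size
    return None
```

Running this per strategy makes the strategies directly comparable: the one whose cliff appears latest tolerates the largest contexts.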
Continuous Evaluation
- Run evaluations on all agent changes
- Track metrics over time
- Set alerts for quality drops
- Sample production interactions
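A quality-drop alert can be as simple as comparing the latest score against a rolling baseline. A sketch under assumed defaults (window of 5 runs, 0.05 tolerance, neither from the source):

```python
def quality_alert(history: list, current: float,
                  window: int = 5, tolerance: float = 0.05) -> bool:
    """Return True if `current` falls more than `tolerance` below the
    mean of the last `window` historical scores.

    Window size and tolerance are illustrative defaults.
    """
    recent = history[-window:]
    baseline = sum(recent) / len(recent)
    return current < baseline - tolerance
```

Wired into CI or a nightly job, this turns "track metrics over time" into an actionable gate rather than a dashboard to remember to check.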
Avoiding Pitfalls
| Pitfall | Solution |
|---------|----------|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated eval |
Best Practices
- Use multi-dimensional rubrics
- Evaluate outcomes, not specific paths
- Cover complexity levels
- Test with realistic context sizes
- Run evaluations continuously
- Supplement LLM with human review
- Track metrics for trends
- Set clear pass/fail thresholds