LLM Evaluation Skill

Run LLM pipeline evaluation against gold standard datasets using oracle LLM-as-judge scoring. Measures output quality across weighted dimensions, identifies weak steps, and suggests prompt improvements.

Quick Start

# Full evaluation (all test cases, all steps)
/sc:evaluate

# Quick spot check
/sc:evaluate --cases=case_1,case_2 --steps=1,2,3

# Re-evaluate existing results without re-running pipeline
/sc:evaluate --skip-pipeline

# Generate outputs only (no evaluation)
/sc:evaluate --skip-eval

# Specify judge model
/sc:evaluate --judge-model=gpt-4o

# Dry run to preview plan
/sc:evaluate --dry-run

Behavioral Flow

Discover - Find evaluation script, gold standards, and prompt files
Configure - Parse scope (cases, steps, model overrides)
Execute - Run pipeline on gold standard inputs
Evaluate - Score outputs against gold standards via LLM-as-judge
Analyze - Identify weak steps, dimension breakdowns, patterns
Recommend - Suggest specific prompt improvements for low-scoring steps
Report - Generate JSON + Markdown evaluation reports

Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --cases | string | all | Comma-separated test case IDs to evaluate |
| --steps | string | all | Comma-separated step numbers to evaluate |
| --model | string | env default | Override pipeline model |
| --judge-model | string | env default | Override judge/oracle model |
| --skip-pipeline | bool | false | Skip pipeline execution, evaluate existing results |
| --skip-eval | bool | false | Run pipeline only, skip evaluation |
| --dry-run | bool | false | Preview execution plan without API calls |
| --output | string | eval_runs/YYYYMMDD_HHMMSS/ | Output directory |
| --concurrency | int | 5 | Parallel judge calls |
| --threshold | int | 70 | Score threshold for "needs improvement" |
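As an illustration of how a wrapper script might mirror the flag table above, here is a minimal sketch using Python's standard `argparse`. The function names `build_parser` and `split_ids` are hypothetical, and nothing here is taken from the skill's actual implementation; only the flag names and defaults come from the table.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser whose flags and defaults mirror the flag table."""
    parser = argparse.ArgumentParser(prog="evaluate")
    parser.add_argument("--cases", default="all", help="Comma-separated test case IDs")
    parser.add_argument("--steps", default="all", help="Comma-separated step numbers")
    parser.add_argument("--model", default=None, help="Override pipeline model")
    parser.add_argument("--judge-model", default=None, help="Override judge model")
    parser.add_argument("--skip-pipeline", action="store_true")
    parser.add_argument("--skip-eval", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--output", default=None, help="Output directory")
    parser.add_argument("--concurrency", type=int, default=5)
    parser.add_argument("--threshold", type=int, default=70)
    return parser


def split_ids(value: str) -> Optional[list[str]]:
    """Return None for 'all' (meaning no filter), else the list of IDs."""
    if value == "all":
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


# Parse the "quick spot check" invocation from the Quick Start section.
args = build_parser().parse_args(["--cases=case_1,case_2", "--steps=1,2,3"])
print(split_ids(args.cases), split_ids(args.steps), args.threshold)
```

Returning `None` for `all` keeps "no filter" distinct from an empty selection, which is one possible design; a real script may represent this differently.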

Phase 1: Discover Project Structure

Locate evaluation components:

| Component | Common Locations | Purpose |
|-----------|-----------------|---------|
| Evaluation script | scripts/run_eval.py, eval/run.py | Orchestrates pipeline + scoring |
| Gold standards | gold_standards/, test_data/, fixtures/ | Expected outputs |
| Prompts | prompts/, templates/ | Pipeline prompt templates |
| Rubrics | eval/rubrics.py, config/rubrics.yaml | Scoring dimensions and weights |

If no standard structure is found, ask the user to specify the paths.
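The discovery step could be sketched roughly as follows: try each common location in order and report which components are still missing, so the user can be asked for those paths. The names `CANDIDATES`, `discover`, and `missing_components` are illustrative assumptions; the candidate paths are the ones listed in the table.

```python
from pathlib import Path
from typing import Optional

# Candidate locations copied from the component table; keys are hypothetical.
CANDIDATES = {
    "eval_script": ["scripts/run_eval.py", "eval/run.py"],
    "gold_standards": ["gold_standards", "test_data", "fixtures"],
    "prompts": ["prompts", "templates"],
    "rubrics": ["eval/rubrics.py", "config/rubrics.yaml"],
}


def discover(root: Path) -> dict[str, Optional[Path]]:
    """Return the first existing candidate path per component, or None."""
    found: dict[str, Optional[Path]] = {}
    for component, locations in CANDIDATES.items():
        found[component] = next(
            (root / loc for loc in locations if (root / loc).exists()), None
        )
    return found


def missing_components(found: dict[str, Optional[Path]]) -> list[str]:
    """List the components the user would need to locate manually."""
    return [name for name, path in found.items() if path is None]


# Inspect the current directory; anything listed here needs a user-supplied path.
print(missing_components(discover(Path("."))))
```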

Phase 2: Configure Scope

Parse arguments to determine:

Which test cases to run (default: all discovered)
Which pipeline steps to evaluate (default: all)
Model overrides for pipeline and judge
Output directory (default: timestamped)

Create output directory:

OUTPUT_DIR="${output:-eval_runs/$(date +%Y%m%d_%H%M%S)}"
mkdir -p "$OUTPUT_DIR"

Phase 3: Execute Pipeline

Run the pipeline on gold standard inputs:

python <eval_script> \
  --output "$OUTPUT_DIR" \
  --verbose \
  [--cases CASES] \
  [--steps STEPS] \
  [--model MODEL] \
  [--skip-pipeline] \
  [--skip-eval]

API call estimation:

Pipeline: steps x cases API calls
Evaluation: scored_dimensions x cases judge calls

For quick validation, suggest running on 1-2 cases with 2-3 steps first.
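The estimate above can be turned into a small helper for warning the user before a full run. This is a sketch under the two formulas as stated; the function name is hypothetical, and `scored_dimensions` is taken as a single count because the text does not say whether it is per step or a total.

```python
def estimate_api_calls(
    cases: int,
    steps: int,
    scored_dimensions: int,
    skip_pipeline: bool = False,
    skip_eval: bool = False,
) -> dict[str, int]:
    """Apply the two formulas from the text: steps x cases and scored_dimensions x cases."""
    pipeline_calls = 0 if skip_pipeline else steps * cases
    judge_calls = 0 if skip_eval else scored_dimensions * cases
    return {
        "pipeline": pipeline_calls,
        "judge": judge_calls,
        "total": pipeline_calls + judge_calls,
    }


# Example: 10 cases, 5 steps, 4 scored dimensions.
print(estimate_api_calls(cases=10, steps=5, scored_dimensions=4))
# → {'pipeline': 50, 'judge': 40, 'total': 90}
```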

Phase 4: Evaluate with LLM-as-Judge

Compare each step output against its gold standard using an oracle LLM as judge.

Evaluation dimensions (customizable per project):

| Dimension | What It Measures |
|-----------|-----------------|
| Content Agreement | Do outputs cover the same key points? |
| Structure Match | Is the organization/format similar? |
| Detail Accuracy | Are specific claims and data correct? |
| Completeness | Are all expected elements present? |

Each dimension has a weight between 0.0 and 1.0, and the weights for a given step sum to 1.0.
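A weighted step score under this scheme might be computed as below. This is a minimal sketch: the function name, the dimension keys, and the example weights are illustrative assumptions, and per-dimension scores are assumed to be on a 0-100 scale.

```python
import math


def weighted_step_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) using weights that sum to 1.0."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each weight must be between 0.0 and 1.0")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1.0")
    return sum(scores[dim] * weights[dim] for dim in weights)


# Hypothetical weights and judge scores for one step.
weights = {
    "content_agreement": 0.4,
    "structure_match": 0.2,
    "detail_accuracy": 0.3,
    "completeness": 0.1,
}
scores = {
    "content_agreement": 80,
    "structure_match": 60,
    "detail_accuracy": 70,
    "completeness": 90,
}
# 80*0.4 + 60*0.2 + 70*0.3 + 90*0.1, rounded for display.
print(round(weighted_step_score(scores, weights), 2))
```

Validating the weights up front catches a misconfigured rubric before any judge calls are spent on it.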

Phase 5: Analyze Results

Read the evaluation report and analyze:

Overall similarity score across all cases and steps
Per-step scores: highlight any below the threshold (default: 70/100)
Per-case scores: identify consistently weak test cases
Dimension breakdowns for weak steps
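The aggregation described in the list above could look something like this sketch. It assumes results arrive as `(case_id, step, score)` records and that scores are averaged with a plain mean; neither the record shape nor the averaging method is specified by the skill, and `summarize` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def summarize(records: list[tuple[str, int, float]], threshold: float = 70) -> dict:
    """Aggregate (case_id, step, score) records into overall, per-step, and per-case means."""
    by_step: dict[int, list[float]] = defaultdict(list)
    by_case: dict[str, list[float]] = defaultdict(list)
    for case_id, step, score in records:
        by_step[step].append(score)
        by_case[case_id].append(score)

    step_means = {step: mean(values) for step, values in by_step.items()}
    case_means = {case: mean(values) for case, values in by_case.items()}
    # Steps below the threshold, weakest first.
    weak_steps = sorted(
        (step for step, value in step_means.items() if value < threshold),
        key=lambda step: step_means[step],
    )
    return {
        "overall": mean([score for _, _, score in records]),
        "per_step": step_means,
        "per_case": case_means,
        "weak_steps": weak_steps,
    }


# Made-up scores for two cases and two steps.
records = [
    ("case_1", 1, 90),
    ("case_1", 2, 60),
    ("case_2", 1, 80),
    ("case_2", 2, 66),
]
summary = summarize(records)
print(summary["overall"], summary["weak_steps"])
```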

Score interpretation:

| Score Range | Assessment | Action |
|-------------|-----------|--------|
| 85-100 | Excellent | No changes needed |
| 70-84 | Good | Minor tuning possible |
| 60-69 | Needs improvement | Prompt revision recommended |
| Below 60 | Poor | Prompt likely needs rewrite |
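The score bands can be expressed as a simple lookup. In this sketch the function name is hypothetical, and fractional scores are assigned by treating each band's lower bound as inclusive (so 84.5 counts as "Good"), which is an assumption since the table lists integer ranges only.

```python
def assess(score: float) -> tuple[str, str]:
    """Map a 0-100 score to the (assessment, action) pair from the table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        return ("Excellent", "No changes needed")
    if score >= 70:
        return ("Good", "Minor tuning possible")
    if score >= 60:
        return ("Needs improvement", "Prompt revision recommended")
    return ("Poor", "Prompt likely needs rewrite")


print(assess(72))
# → ('Good', 'Minor tuning possible')
```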

Phase 6: Recommend Improvements

For each step scoring below threshold:

Read the current prompt template
Read the gold standard output (expected)
Read the pipeline output (actual)
Compare and identify gaps:
- Missing instructions that gold standard captures
- Overly broad instructions causing divergent output
- Format/structure differences
- Specificity gaps

Present actionable suggestions:

### Step N: <step_name> (Score: XX/100)

**Weakest Dimension**: <dimension> (XX/100)

**Gap Analysis**:
- Gold standard includes <X> but prompt doesn't instruct it
- Output format diverges: gold uses <format>, output uses <other>

**Suggested Prompt Changes**:
1. Add instruction: "<specific instruction>"
2. Clarify format: "<format guidance>"
3. Add example: "<example output snippet>"

Output Structure

eval_runs/YYYYMMDD_HHMMSS/
  results/                    # Pipeline outputs
    case_1/
      step_01_<name>.md
      step_02_<name>.md
      ...
    case_2/
      ...
  evaluation/                 # Judge scores
    evaluation_report.json
    evaluation_report.md
    per_step_scores.csv
    per_case_scores.csv

MCP Integration

PAL MCP (Optional)

| Tool | When | Purpose |
|------|------|---------|
| mcp__pal__thinkdeep | Low-scoring steps | Deep analysis of why outputs diverge |
| mcp__pal__consensus | Prompt revision | Multi-model validation of proposed changes |
| mcp__pal__codereview | Eval script | Review evaluation pipeline code |

Rube MCP (Optional)

| Tool | When | Purpose |
|------|------|---------|
| mcp__rube__RUBE_REMOTE_WORKBENCH | Large eval runs | Process results in Python sandbox |
| mcp__rube__RUBE_MULTI_EXECUTE_TOOL | Notifications | Report results to Slack/email |

Error Handling

| Scenario | Action |
|----------|--------|
| No eval script found | Ask user for script path |
| No gold standards found | Ask user for gold standard directory |
| API rate limit | Reduce concurrency, add delays |
| Pipeline step fails | Log error, continue with remaining steps |
| Judge returns invalid score | Retry once, then flag for manual review |
| Output directory exists | Append timestamp suffix |
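The "retry once, then flag for manual review" rule for invalid judge scores might be implemented along these lines. This is a sketch only: `call_judge` stands in for whatever function actually queries the judge model, and the assumption that a valid reply is a bare number between 0 and 100 is mine, not the skill's.

```python
from typing import Callable, Optional


def parse_score(raw: str) -> Optional[float]:
    """Return the score if the reply is a number in [0, 100], else None."""
    try:
        value = float(raw.strip())
    except ValueError:
        return None
    return value if 0 <= value <= 100 else None


def score_with_retry(call_judge: Callable[[], str], retries: int = 1) -> dict:
    """Call the judge, retrying on invalid replies; flag for review if all fail."""
    for _ in range(retries + 1):
        score = parse_score(call_judge())
        if score is not None:
            return {"score": score, "needs_manual_review": False}
    return {"score": None, "needs_manual_review": True}


# Simulated judge: the first reply is invalid, the retry succeeds.
replies = iter(["n/a", "82"])
print(score_with_retry(lambda: next(replies)))
# → {'score': 82.0, 'needs_manual_review': False}
```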

Guardrails

Always pass --verbose for progress visibility
Warn about API call counts before full runs
Suggest quick validation on subset before full evaluation
Preserve all intermediate outputs for debugging
Never modify gold standard files

Tool Coordination

Bash - Run evaluation scripts
Read - Inspect prompts, gold standards, outputs, reports
Write - Generate reports
Grep - Search for patterns in outputs
PAL MCP - Deep analysis of score gaps
