Skill Grader Skill | Agent Skills

Skill Grader

Evaluate skill test run outputs against expectations and extract implicit claims.

Input Schema

expectations:        # List of verifiable statements
  - "Output includes X"
  - "Skill used script Y"
transcript_path:     # Path to execution transcript
outputs_dir:         # Directory containing output files
eval_prompt:         # Original task prompt

Grading Process

Step 1: Read Context

TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
  CONTENT[FILE] = Read(FILE)

Note: eval_prompt, execution steps, errors, final result.

Step 2: Grade Expectations

For each EXPECTATION in expectations:

EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
  verdict = PASS
Else:
  verdict = FAIL

PASS criteria:

Clear evidence in transcript or outputs
Evidence reflects genuine task completion, not surface compliance
A correct filename with wrong content is FAIL, not PASS

FAIL criteria:

No evidence found
Evidence contradicts expectation
Evidence is superficial (right format, wrong substance)
Cannot be verified from available information

When uncertain: burden of proof is on the expectation to pass.

Step 3: Extract Claims

Beyond predefined expectations, find implicit claims:

For each CLAIM in (TRANSCRIPT + CONTENT):
  CLAIM.type = "factual" | "process" | "quality"
  CLAIM.verified = verify(CLAIM, available_evidence)
  CLAIM.evidence = supporting_or_contradicting_text

Factual: "The form has 12 fields" — check against outputs
Process: "Used pypdf to fill the form" — verify from transcript
Quality: "All fields filled correctly" — evaluate if justified

Flag unverifiable claims.

Step 4: Critique the Evals

After grading, assess whether the evals themselves could improve. Only surface suggestions when there's a clear gap:

Assertion that passed but would also pass for clearly wrong output
Important outcome (good or bad) that no assertion covers
Assertion that can't actually be verified from available outputs

Keep bar high. Flag things the eval author would say "good catch" about.

Step 5: Write Results

Write(outputs_dir + "/../grading.json", RESULTS)

Output Schema

{
  "expectations": [
    {
      "text": "The output includes X",
      "passed": true,
      "evidence": "Found in transcript Step 3: '...'"
    }
  ],
  "summary": {
    "passed": 2,
    "failed": 1,
    "total": 3,
    "pass_rate": 0.67
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in output"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes name",
        "reason": "A hallucinated doc mentioning the name would also pass"
      }
    ],
    "overall": "No suggestions, evals look solid."
  }
}

Field requirements:

expectations[].text, .passed, .evidence — all required (viewer depends on exact names)
summary.pass_rate — float 0.0 to 1.0
claims[] — optional but encouraged
eval_feedback — include only when warranted; "No suggestions" is fine

Error Handling

| Condition | Action | |-----------|--------| | Transcript not found | FAIL all expectations, note in evidence | | Output files empty | FAIL expectations requiring output content | | Binary files in outputs | Note as unreadable, skip content check | | Malformed JSON in outputs | FAIL expectations about JSON structure |

Agent Skills: Skill Grader

Install this agent skill to your local

Skill Files