Agent Skills: Skill Grader

Evaluate skill test run outputs against expectations and extract implicit claims.

UncategorizedID: majesticlabs-dev/majestic-marketplace/skill-grader

Install this agent skill to your local

pnpm dlx add-skill https://github.com/majesticlabs-dev/majestic-marketplace/tree/HEAD/plugins/majestic-tools/skills/skill-grader

Skill Files

Browse the full folder contents for skill-grader.

Download Skill

Loading file tree…

plugins/majestic-tools/skills/skill-grader/SKILL.md

Skill Metadata

Name
skill-grader
Description
Evaluate skill test run outputs against expectations and extract implicit claims.

Skill Grader

Evaluate skill test run outputs against expectations and extract implicit claims.

Input Schema

expectations:        # List of verifiable statements
  - "Output includes X"
  - "Skill used script Y"
transcript_path:     # Path to execution transcript
outputs_dir:         # Directory containing output files
eval_prompt:         # Original task prompt

Grading Process

Step 1: Read Context

TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
  CONTENT[FILE] = Read(FILE)

Note: eval_prompt, execution steps, errors, final result.

Step 2: Grade Expectations

For each EXPECTATION in expectations:

EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
  verdict = PASS
Else:
  verdict = FAIL

PASS criteria:

  • Clear evidence in transcript or outputs
  • Evidence reflects genuine task completion, not surface compliance
  • A correct filename with wrong content is FAIL, not PASS

FAIL criteria:

  • No evidence found
  • Evidence contradicts expectation
  • Evidence is superficial (right format, wrong substance)
  • Cannot be verified from available information

When uncertain: burden of proof is on the expectation to pass.

Step 3: Extract Claims

Beyond predefined expectations, find implicit claims:

For each CLAIM in (TRANSCRIPT + CONTENT):
  CLAIM.type = "factual" | "process" | "quality"
  CLAIM.verified = verify(CLAIM, available_evidence)
  CLAIM.evidence = supporting_or_contradicting_text
  • Factual: "The form has 12 fields" — check against outputs
  • Process: "Used pypdf to fill the form" — verify from transcript
  • Quality: "All fields filled correctly" — evaluate if justified

Flag unverifiable claims.

Step 4: Critique the Evals

After grading, assess whether the evals themselves could improve. Only surface suggestions when there's a clear gap:

  • Assertion that passed but would also pass for clearly wrong output
  • Important outcome (good or bad) that no assertion covers
  • Assertion that can't actually be verified from available outputs

Keep bar high. Flag things the eval author would say "good catch" about.

Step 5: Write Results

Write(outputs_dir + "/../grading.json", RESULTS)

Output Schema

{
  "expectations": [
    {
      "text": "The output includes X",
      "passed": true,
      "evidence": "Found in transcript Step 3: '...'"
    }
  ],
  "summary": {
    "passed": 2,
    "failed": 1,
    "total": 3,
    "pass_rate": 0.67
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in output"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes name",
        "reason": "A hallucinated doc mentioning the name would also pass"
      }
    ],
    "overall": "No suggestions, evals look solid."
  }
}

Field requirements:

  • expectations[].text, .passed, .evidence — all required (viewer depends on exact names)
  • summary.pass_rate — float 0.0 to 1.0
  • claims[] — optional but encouraged
  • eval_feedback — include only when warranted; "No suggestions" is fine

Error Handling

| Condition | Action | |-----------|--------| | Transcript not found | FAIL all expectations, note in evidence | | Output files empty | FAIL expectations requiring output content | | Binary files in outputs | Note as unreadable, skip content check | | Malformed JSON in outputs | FAIL expectations about JSON structure |