Agent Skills: Subagent Testing - TDD for Skills

TDD-style testing methodology for skills using fresh subagent instances

UncategorizedID: athola/claude-night-market/subagent-testing

Install this agent skill to your local

pnpm dlx add-skill https://github.com/athola/claude-night-market/tree/HEAD/plugins/abstract/skills/subagent-testing

Skill Files

Browse the full folder contents for subagent-testing.

Download Skill

Loading file tree…

plugins/abstract/skills/subagent-testing/SKILL.md

Skill Metadata

Name
subagent-testing
Description
'Test skills via TDD in fresh subagents. Use when validating behavior or preventing bias.'

Subagent Testing - TDD for Skills

Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.

Table of Contents

  1. Overview
  2. Why Fresh Instances Matter
  3. Testing Methodology
  4. Quick Start
  5. Detailed Testing Guide
  6. Success Criteria

Overview

Fresh instances prevent priming: Each test uses a new Claude conversation to verify the skill's impact is measured, not conversation history effects.

Why Fresh Instances Matter

The Priming Problem

Running tests in the same conversation creates bias:

  • Prior context influences responses
  • Skill effects get mixed with conversation history
  • Can't isolate skill's true impact

Fresh Instance Benefits

  • Isolation: Each test starts clean
  • Reproducibility: Consistent baseline state
  • Measurement: Clear before/after comparison
  • Validation: Proves skill effectiveness, not priming

Testing Methodology

Three-phase TDD-style approach:

Phase 1: Baseline Testing (RED)

Test without skill to establish baseline behavior.

Phase 2: With-Skill Testing (GREEN)

Test with skill loaded to measure improvements.

Phase 3: Rationalization Testing (REFACTOR)

Test skill's anti-rationalization guardrails.

Quick Start

# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses

# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline

# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work

Detailed Testing Guide

For complete testing patterns, examples, and templates:

Success Criteria

  • Baseline: Document 5+ diverse baseline scenarios
  • Improvement: ≥50% improvement in skill-related metrics
  • Consistency: Results reproducible across fresh instances
  • Rationalization Defense: Guardrails prevent ≥80% of rationalization attempts

See Also

  • skill-authoring: Creating effective skills
  • bulletproof-skill: Anti-rationalization patterns
  • test-skill: Automated skill testing command

Exit Criteria

  • [ ] Baseline (RED) phase documents at least 5 diverse scenarios run in fresh Claude instances without the skill active, with full response text recorded.
  • [ ] With-skill (GREEN) phase uses identical prompts in new fresh instances (not continuations of the baseline conversation) and shows >= 50% improvement on skill-related metrics.
  • [ ] Rationalization (REFACTOR) phase shows skill guardrails blocking >= 80% of rationalization attempts tested across at least 3 pressure scenarios.
  • [ ] Results are reproducible: the same prompts in a new fresh instance produce consistent outcomes, confirming the effect is not conversation-history priming.