Agent Skills: writing-skills

Use when creating new skills, editing existing skills, or verifying skills work - applies TDD to documentation by testing with subagents before writing

ID: withzombies/hyperpowers/writing-skills


skills/writing-skills/SKILL.md

Skill Metadata

Name
writing-skills
Description
Use when creating new skills, editing existing skills, or verifying skills work - applies TDD to documentation by testing with subagents before writing

<skill_overview> Writing skills IS test-driven development applied to process documentation: write the test (pressure scenario), watch it fail (baseline), write the skill, watch it pass, refactor (close loopholes). </skill_overview>

<rigidity_level> LOW FREEDOM - Follow the RED-GREEN-REFACTOR cycle exactly when creating skills. No skill without failing test first. Same Iron Law as TDD. </rigidity_level>

<quick_reference>

| Phase | Action | Verify |
|-------|--------|--------|
| RED | Create pressure scenarios | Document baseline failures |
| RED | Run WITHOUT skill | Agent violates rule |
| GREEN | Write minimal skill | Addresses baseline failures |
| GREEN | Run WITH skill | Agent now complies |
| REFACTOR | Find new rationalizations | Agent still complies |
| REFACTOR | Add explicit counters | Bulletproof against excuses |
| DEPLOY | Commit and optionally PR | Skill ready for use |

Iron Law: NO SKILL WITHOUT FAILING TEST FIRST (applies to new skills AND edits) </quick_reference>

<when_to_use> Create skill when:

  • Technique wasn't intuitively obvious to you
  • You'd reference this again across projects
  • Pattern applies broadly (not project-specific)
  • Others would benefit from this knowledge

Never create for:

  • One-off solutions
  • Standard practices well-documented elsewhere
  • Project-specific conventions (put in CLAUDE.md instead)

Edit existing skill when:

  • Found new rationalization agents use
  • Discovered loophole in current guidance
  • Need to add clarifying examples

ALWAYS test before writing or editing. No exceptions. </when_to_use>

<tdd_mapping> Skills use the exact same TDD cycle as code:

| TDD Concept | Skill Creation |
|-------------|----------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |
| Write test first | Run baseline scenario BEFORE writing skill |
| Watch it fail | Document exact rationalizations agent uses |
| Minimal code | Write skill addressing those specific violations |
| Watch it pass | Verify agent now complies |
| Refactor cycle | Find new rationalizations → plug → re-verify |

REQUIRED BACKGROUND: You MUST understand hyperpowers:test-driven-development before using this skill. </tdd_mapping>

<the_process>

1. RED Phase - Create Failing Test

Create pressure scenarios for subagent:

Task tool with general-purpose agent:

"You are implementing a payment processing feature. User requirements:
- Process credit card payments
- Handle retries on failure
- Log all transactions

[PRESSURE 1: Time] You have 10 minutes before deployment.
[PRESSURE 2: Sunk Cost] You've already written 200 lines of code.
[PRESSURE 3: Authority] Senior engineer said 'just make it work, tests can wait.'

Implement this feature."

Run WITHOUT skill present.

Document baseline behavior:

  • Exact rationalizations agent uses ("tests can wait," "simple feature," etc.)
  • What agent skips (tests, verification, bd task, etc.)
  • Patterns in failure modes

Example baseline result:

Agent response:
"I'll implement the payment processing quickly since time is tight..."
[Skips TDD]
[Skips verification-before-completion]
[Claims done without evidence]

This is your failing test. Agent doesn't follow the workflow without guidance.
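For example, one lightweight way to keep the baseline verbatim is to append it to a notes file; the path and filename below are purely illustrative.

# Illustrative only: store the baseline word-for-word so the GREEN phase
# can target these exact rationalizations.
mkdir -p notes
cat >> notes/baseline-payment-skill.md <<'EOF'
## Baseline (no skill present)
- "I'll implement the payment processing quickly since time is tight..."
- Skipped TDD; skipped verification-before-completion; claimed done without evidence
EOF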


2. GREEN Phase - Write Minimal Skill

Write skill that addresses the SPECIFIC failures from baseline:

Structure:

---
name: skill-name-with-hyphens
description: Use when [specific triggers] - [what skill does]
---

<skill_overview>
One sentence core principle
</skill_overview>

<rigidity_level>
LOW | MEDIUM | HIGH FREEDOM - [What this means]
</rigidity_level>

[Rest of standard XML structure]

Frontmatter rules:

  • Only name and description fields (max 1024 chars total)
  • Name: letters, numbers, hyphens only (no parentheses/special chars)
  • Description: Start with "Use when...", third person, includes triggers
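A rough pre-flight check for these rules might look like the sketch below. It assumes the name: and description: fields each sit on a single line, the length check is approximate (field values only), and the path is a placeholder.

# Sketch: validate frontmatter naming and description rules.
file=skills/skill-name/SKILL.md   # placeholder path
name=$(grep -m1 '^name:' "$file" | sed 's/^name: *//')
desc=$(grep -m1 '^description:' "$file" | sed 's/^description: *//')

echo "$name" | grep -Eq '^[A-Za-z0-9-]+$' || echo "name has invalid characters: $name"
case "$desc" in "Use when"*) ;; *) echo "description should start with 'Use when'" ;; esac
[ $(( ${#name} + ${#desc} )) -le 1024 ] || echo "name + description exceed 1024 chars"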

Description format:

# ❌ BAD: Too abstract, first person
description: I can help with async tests when they're flaky

# ✅ GOOD: Starts with "Use when", describes problem
description: Use when tests have race conditions or pass/fail inconsistently - replaces arbitrary timeouts with condition polling for reliable async tests

Write skill addressing baseline failures:

  • Add explicit counters for rationalizations ("tests can wait" → "NO EXCEPTIONS: tests first")
  • Create quick reference table for scanning
  • Add concrete examples showing failure modes
  • Use XML structure for all sections

Run WITH skill present.

Verify agent now complies:

  • Same pressure scenario
  • Agent now follows workflow
  • No rationalizations from baseline appear

This is your passing test.


3. REFACTOR Phase - Close Loopholes

Find NEW rationalizations:

Run skill with DIFFERENT pressures:

  • Combine 3+ pressures (time + sunk cost + exhaustion)
  • Try meta-rationalizations ("this skill doesn't apply because...")
  • Test with edge cases

Document new failures:

  • What rationalizations appear NOW?
  • What loopholes did agent find?
  • What explicit counters are needed?

Add counters to skill:

<critical_rules>
## Common Excuses

All of these mean: [Action to take]
- "Test can wait" (NO, test first always)
- "Simple feature" (Simple breaks too, test first)
- "Time pressure" (Broken code wastes more time)
[Add ALL rationalizations found during testing]
</critical_rules>

Re-test until bulletproof:

  • Run scenarios again
  • Verify new counters work
  • Agent complies even under combined pressures

4. Quality Checks

Before deployment, verify:

  • [ ] Has <quick_reference> section (scannable table)
  • [ ] Has <rigidity_level> explicit
  • [ ] Has 2-3 <example> tags showing failure modes
  • [ ] Description <500 chars, starts with "Use when..."
  • [ ] Keywords throughout for search (error messages, symptoms, tools)
  • [ ] One excellent code example (not multi-language)
  • [ ] Supporting files only for tools or heavy reference (>100 lines)

Token efficiency:

  • Frequently-loaded skills: <200 words ideally
  • Other skills: <500 words
  • Move heavy content to resources/ files
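A quick way to spot the heavy skills, assuming they live under skills/, is a word-count sweep:

# Word counts for every SKILL.md, smallest first; anything well over
# ~500 words is a candidate for moving content to resources/ files.
wc -w skills/*/SKILL.md | sort -n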

5. Deploy

Commit to git:

git add skills/skill-name/
git commit -m "feat: add [skill-name] skill

Tested with subagents under [pressures used].
Addresses [baseline failures found].

Closes rationalizations:
- [Rationalization 1]
- [Rationalization 2]"

Personal skills: Write to ~/.claude/skills/ for cross-project use

Plugin skills: PR to plugin repository if broadly useful
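As an illustration, installing a finished skill via the personal cross-project path above can be as simple as the following (skill-name is a placeholder):

# Copy the tested skill into the personal skills directory.
mkdir -p ~/.claude/skills/
cp -r skills/skill-name ~/.claude/skills/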

STOP: Before moving to next skill, complete this entire process. No batching untested skills. </the_process>

<examples>
<example>
<scenario>Developer writes skill without testing first</scenario>
<code>
# Developer writes skill:
---
name: always-use-tdd
description: Always write tests first
---

Write tests first. No exceptions.

Then tries to deploy it

</code>

<why_it_fails>

  • No baseline behavior documented (don't know what agent does WITHOUT skill)
  • No verification skill actually works (might not address real rationalizations)
  • Generic guidance ("no exceptions") without specific counters
  • Will likely miss common excuses agents use
  • Violates Iron Law: no skill without failing test first </why_it_fails>
<correction> **Correct approach (RED-GREEN-REFACTOR):**

RED Phase:

  1. Create pressure scenario (time + sunk cost)
  2. Run WITHOUT skill
  3. Document baseline: Agent says "I'll test after since time is tight"

GREEN Phase:

  1. Write skill with explicit counter to that rationalization
  2. Add: "Common excuses: 'Time is tight' → Wrong. Broken code wastes more time. Write test first."
  3. Run WITH skill → agent now writes test first

REFACTOR Phase:

  1. Try new pressure (exhaustion: "this is the 5th feature today")
  2. Agent finds loophole: "these are all similar, I can skip tests"
  3. Add counter: "Similar ≠ identical. Write test for each."
  4. Re-test → bulletproof

What you gain:

  • Know skill addresses real failures (saw baseline)
  • Confident skill works (saw it fix behavior)
  • Closed all loopholes (tested multiple pressures)
  • Ready for production use </correction>
</example>
<example>
<scenario>Developer edits skill without testing changes</scenario>
<code>
# Existing skill works well
# Developer thinks: "I'll just add this section about edge cases"

[Adds 50 lines to skill]

Commits without testing

</code>

<why_it_fails>

  • Don't know if new section actually helps (no baseline)
  • Might introduce contradictions with existing guidance
  • Could make skill less effective (more verbose, less clear)
  • Violates Iron Law: applies to edits too
  • Changes might not address actual rationalization patterns </why_it_fails>
<correction> **Correct approach:**

RED Phase (for edit):

  1. Identify specific failure mode you want to address
  2. Create pressure scenario that triggers it
  3. Run WITH current skill → document how agent fails

GREEN Phase (edit):

  1. Add ONLY content addressing that failure
  2. Keep changes minimal
  3. Run WITH edited skill → verify agent now complies

REFACTOR Phase:

  1. Check edit didn't break existing scenarios
  2. Run previous test cases
  3. Verify all still pass

What you gain:

  • Changes address real problems (saw failure)
  • Know edit helps (saw improvement)
  • Didn't break existing guidance (regression tested)
  • Skill stays bulletproof </correction>
</example>
<example>
<scenario>Skill description too vague for search</scenario>
<code>
---
name: async-testing
description: For testing async code
---

Skill content...

</code>

<why_it_fails>

  • Future Claude won't find this when needed
  • "For testing async code" too abstract (when would Claude search this?)
  • Doesn't describe symptoms or triggers
  • Missing keywords like "flaky," "race condition," "timeout"
  • Won't show up when agent has the actual problem </why_it_fails>
<correction> **Better description:**
---
name: condition-based-waiting
description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently - replaces arbitrary timeouts with condition polling for reliable async tests
---

Why this works:

  • Starts with "Use when" (triggers)
  • Lists symptoms: "race conditions," "pass/fail inconsistently"
  • Describes problem AND solution
  • Keywords: "race conditions," "timing," "inconsistent," "timeouts"
  • Future Claude searching "why are my tests flaky" will find this

What you gain:

  • Skill actually gets found when needed
  • Claude knows when to use it (clear triggers)
  • Search terms match real developer language
  • Description doubles as activation criteria </correction>
</example> </examples>

<skill_types>

Technique

Concrete method with steps to follow.

Examples: condition-based-waiting, hyperpowers:root-cause-tracing

Test approach: Pressure scenarios with combined pressures

Pattern

Way of thinking about problems.

Examples: flatten-with-flags, test-invariants

Test approach: Present problems the pattern solves, verify agent applies pattern

Reference

API docs, syntax guides, tool documentation.

Examples: Office document manipulation, API reference guides

Test approach: Give task requiring reference, verify agent uses it correctly

For detailed testing methodology by skill type: See resources/testing-methodology.md </skill_types>

<file_organization>

Self-Contained Skill

defense-in-depth/
  SKILL.md    # Everything inline

When: All content fits, no heavy reference needed

Skill with Reusable Tool

condition-based-waiting/
  SKILL.md    # Overview + patterns
  example.ts  # Working helpers to adapt

When: Tool is reusable code, not just narrative

Skill with Heavy Reference

pptx/
  SKILL.md       # Overview + workflows
  pptxgenjs.md   # 600 lines API reference
  ooxml.md       # 500 lines XML structure
  scripts/       # Executable tools

When: Reference material too large for inline (>100 lines)

Keep inline:

  • Principles and concepts
  • Code patterns (<50 lines)
  • Everything that fits </file_organization>

<search_optimization>

Claude Search Optimization (CSO)

Future Claude needs to FIND your skill. Optimize for search.

1. Rich Description Field

Format: Start with "Use when..." + triggers + what it does

# ❌ BAD: Too abstract
description: For async testing

# ❌ BAD: First person
description: I can help you with async tests

# ✅ GOOD: Triggers + problem + solution
description: Use when tests have race conditions or pass/fail inconsistently - replaces arbitrary timeouts with condition polling

2. Keyword Coverage

Use words Claude would search for:

  • Error messages: "Hook timed out", "ENOTEMPTY", "race condition"
  • Symptoms: "flaky", "hanging", "zombie", "pollution"
  • Synonyms: "timeout/hang/freeze", "cleanup/teardown/afterEach"
  • Tools: Actual commands, library names, file types
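One way to sanity-check keyword coverage is to search the way a future session would; the paths below are illustrative.

# If the skill doesn't surface for the symptoms it solves, add those keywords.
grep -ril "race condition" skills/ ~/.claude/skills/ 2>/dev/null
grep -ril "flaky" skills/ ~/.claude/skills/ 2>/dev/null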

3. Token Efficiency

Problem: Frequently-referenced skills load into EVERY conversation.

Target word counts:

  • Frequently-loaded: <200 words
  • Other skills: <500 words

Techniques:

  • Move details to tool --help
  • Use cross-references to other skills
  • Compress examples
  • Eliminate redundancy

Verification:

wc -w skills/skill-name/SKILL.md

4. Cross-Referencing

Use skill name only, with explicit markers:

**REQUIRED BACKGROUND:** You MUST understand hyperpowers:test-driven-development
**REQUIRED SUB-SKILL:** Use hyperpowers:debugging-with-tools first

Don't use @ links: Force-loads files immediately, burns context unnecessarily. </search_optimization>

<critical_rules>

Rules That Have No Exceptions

  1. NO SKILL WITHOUT FAILING TEST FIRST → Applies to new skills AND edits
  2. Test with subagents under pressure → Combined pressures (time + sunk cost + authority)
  3. Document baseline behavior → Exact rationalizations, not paraphrases
  4. Write minimal skill addressing baseline → Don't add content not validated by testing
  5. STOP before next skill → Complete RED-GREEN-REFACTOR-DEPLOY for each skill

Common Excuses

All of these mean: STOP. Run baseline test first.

  • "Simple skill, don't need testing" (If simple, testing is fast. Do it.)
  • "Just adding documentation" (Documentation can be wrong. Test it.)
  • "I'll test after I write a few" (Batching untested = deploying untested code)
  • "This is obvious, everyone knows it" (Then baseline will show agent already complies)
  • "Testing is overkill for skills" (TDD applies to documentation too)
  • "I'll adapt while testing" (Violates RED phase. Start over.)
  • "I'll keep untested as reference" (Delete means delete. No exceptions.)

The Iron Law

Same as TDD:

NO SKILL WITHOUT FAILING TEST FIRST

No exceptions for:

  • "Simple additions"
  • "Just adding a section"
  • "Documentation updates"
  • Edits to existing skills

Write skill before testing? Delete it. Start over. </critical_rules>

<verification_checklist> Before deploying ANY skill:

RED Phase:

  • [ ] Created pressure scenarios (3+ combined pressures for discipline skills)
  • [ ] Ran WITHOUT skill present
  • [ ] Documented baseline behavior verbatim (exact rationalizations)
  • [ ] Identified patterns in failures

GREEN Phase:

  • [ ] Name uses only letters, numbers, hyphens
  • [ ] YAML frontmatter: name + description only (max 1024 chars)
  • [ ] Description starts with "Use when..." and includes triggers
  • [ ] Description in third person
  • [ ] Has <quick_reference> section
  • [ ] Has <rigidity_level> explicit
  • [ ] Has 2-3 <example> tags
  • [ ] Addresses specific baseline failures
  • [ ] Ran WITH skill present
  • [ ] Verified agent now complies

REFACTOR Phase:

  • [ ] Tested with different pressures
  • [ ] Found NEW rationalizations
  • [ ] Added explicit counters
  • [ ] Re-tested until bulletproof

Quality:

  • [ ] Keywords throughout for search
  • [ ] One excellent code example (not multi-language)
  • [ ] Token-efficient (check word count)
  • [ ] Supporting files only if needed

Deploy:

  • [ ] Committed to git with descriptive message
  • [ ] Pushed to plugin repository (if applicable)

Can't check all boxes? Return to process and fix. </verification_checklist>

<integration>

**This skill requires:**

  • hyperpowers:test-driven-development (understand TDD before applying to docs)
  • Task tool (for running subagent tests)

This skill is called by:

  • Anyone creating or editing skills
  • Plugin maintainers
  • Users with personal skill repositories

Agents used:

  • general-purpose (for testing skills under pressure) </integration>
<resources>

**Detailed guides:**

  • [Testing methodology by skill type](resources/testing-methodology.md) - How to test disciplines, techniques, patterns, reference skills
  • [Anthropic best practices](resources/anthropic-best-practices.md) - Official skill authoring guidance
  • [Graphviz conventions](resources/graphviz-conventions.dot) - Flowchart style rules

When stuck:

  • Skill seems too simple to test → If simple, testing is fast. Do it anyway.
  • Don't know what pressures to use → Time + sunk cost + authority always work
  • Agent still rationalizes → Add explicit counter for that exact excuse
  • Testing feels like overhead → Same as TDD: testing prevents bigger problems </resources>