Skill Maker
Overview
Creating skills IS Test-Driven Development applied to process documentation.
Write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
What is a Skill?
Skills are modular packages that extend Claude's capabilities with specialized knowledge, workflows, and tools — like "onboarding guides" for specific domains.
Skills provide: Specialized workflows, tool integrations, domain expertise, bundled resources (scripts, references, assets)
Skills are: Reusable techniques, patterns, tools, reference guides
Skills are NOT: Narratives, one-off solutions, or project-specific conventions (use CLAUDE.md for those)
Skill Types: Technique (concrete steps), Pattern (mental models), Reference (API docs/guides), Discipline-Enforcing (rules/requirements)
Create when: Technique wasn't obvious, applies broadly, would be referenced again, ensures consistent behavior across instances
Don't create for: One-off solutions, standard well-documented practices, project-specific conventions
Skill Anatomy
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter: name (required), description (required), allowed-tools (optional)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code (deterministic, token-efficient)
├── references/ - Documentation loaded as needed (schemas, APIs)
└── assets/ - Files used in output (templates, boilerplate)
Frontmatter rules:
name: Letters, numbers, hyphens only (max 64 chars)description: Third-person, starts with "Use when...", includes triggers AND purpose (Anthropic capsdescription+ optionalwhen_to_useat 1536 chars combined; aim well under)allowed-tools: Optional list restricting tool access
Writing style: Use imperative/infinitive form ("To accomplish X, do Y"), not second person
Progressive Disclosure: Metadata always loaded (~100 words) → SKILL.md on trigger (<5k words) → Resources as needed
Reference best practices: Keep SKILL.md lean; if reference files >10k words, include grep patterns in SKILL.md; avoid duplicating content between SKILL.md and references
The Iron Law
NO SKILL WITHOUT A FAILING TEST FIRST
Applies to NEW skills AND EDITS. Write/edit skill before testing? Delete it. Start over. No exceptions — not for "simple additions", "just a section", or "documentation updates". Delete means delete.
RED-GREEN-REFACTOR for Skills
RED: Baseline Testing
Run pressure scenarios WITHOUT the skill. Document exact behavior:
- What choices did the agent make?
- What rationalizations did they use (verbatim)?
- Which pressures triggered violations?
Method: Create new conversation/subagent → present scenario → apply pressure (time constraints, sunk cost, authority, exhaustion) → document failures
GREEN: Write Minimal Skill
Write skill addressing those specific rationalizations. Don't add content for hypothetical cases.
Run same scenarios WITH skill present. Agent should now comply. Document any new violations.
REFACTOR: Close Loopholes
For each new violation: identify rationalization → add explicit counter → re-test until bulletproof.
For discipline-enforcing skills: Build rationalization table, create "Red Flags" list, add "No exceptions" counters, address "spirit vs letter" arguments.
Skill Creation Process
Step 1: Understanding
Clarify concrete examples of usage. Ask: "What should this skill support?", "What triggers it?", "How would it be used?"
UltraThink before creating structure:
"Let me ultrathink what this skill should really accomplish and how it fits the ecosystem."
Question fundamentals: Is this a skill or project docs? What's reusable vs one-off? How will Claude discover it? What rationalizations will agents use? What are we NOT including?
Step 2: Plan Resources
For each use case: How would you execute from scratch? What scripts, references, or assets would help when executing repeatedly?
Step 3: Initialize Structure
mkdir -p dev-workflow/skills/skill-name/{scripts,references,assets}
Create SKILL.md with frontmatter template:
---
name: skill-name
description: Use when [triggers] - [what it does]
allowed-tools: Read, Write, Edit, Glob, Grep
---
# Skill Name
## Overview
[Core principle in 1-2 sentences]
## When to Use
[Activation triggers; when NOT to use]
## [Main content sections]
Step 4: Execute RED-GREEN-REFACTOR
Follow the cycle above: baseline test (RED) → write content addressing failures (GREEN) → close loopholes (REFACTOR).
Skill content must answer: What is the purpose? When should it be used? How should Claude use it?
Step 5: Iterate After Deployment
Use skill on real tasks → notice struggles → baseline test again → update → verify.
Claude Search Optimization (CSO)
Note: "CSO" is kisune's framing, not an official Anthropic term. The underlying mechanics are official: Claude matches user intent against the frontmatter
description(and optionalwhen_to_use) of available skills to decide which to auto-invoke. The body of SKILL.md is NOT scanned for matching — padding the body with keywords does not help activation. Only frontmatter participates.
Description field rules
- Start with "Use when..." in third person (not second person)
- Include concrete triggers, error messages, and symptom keywords a user or agent would search for in the description itself (body keywords are wasted for matching)
- Describe the problem the skill solves, not implementation details
- Anthropic caps
description+when_to_useat 1536 chars combined; aim for 200-400 chars to leave headroom
Naming rules
- Use verb-first or gerund form:
creating-skills,testing-async-code,dispatching-agents - Letters, numbers, and hyphens only; max 64 chars
- Avoid vague labels (
helper,utils,common)
Token efficiency targets
Official guidance: Anthropic recommends SKILL.md body under 500 lines and progressive disclosure (move bulk into
references/). The word-count tiers below are kisune heuristics layered on top, not Anthropic policy.
| Skill type | Kisune target word count | |---|---| | Getting-started / bootstrap | < 150 words | | Frequently-loaded (auto-trigger common workflows) | < 500 words | | Domain / reference | < 1000 words | | Discipline-enforcing | up to 1500 words (rationalization tables justify length) |
Hard ceiling either way: 500 lines of SKILL.md body. If you're approaching that, move material to references/.
If SKILL.md exceeds the target, move reference material to a references/ subfolder and link to it.
Cross-referencing rules
- Reference other skills by name only (e.g., "use
dev-workflow:test-driven-development") - Avoid
@filename.mdsyntax — it force-loads files into context and defeats progressive disclosure - Use a "REQUIRED SUB-SKILL:" marker when a sub-skill must be invoked
Testing by Skill Type
| Type | Test With | Success Criteria | |------|-----------|-----------------| | Discipline | Academic questions + pressure scenarios (3+ combined: time, sunk cost, exhaustion) | Agent follows rule under maximum pressure | | Technique | Application scenarios, variations, missing info tests | Agent applies technique to new scenario | | Pattern | Recognition, application, counter-examples | Agent knows when/how to apply (and when NOT to) | | Reference | Retrieval, application, gap testing | Agent finds and correctly applies information |
Skill Creation Checklist
Use TodoWrite to create todos for EACH item.
RED Phase:
- [ ] Create pressure scenarios (3+ combined pressures for discipline skills)
- [ ] Run scenarios WITHOUT skill - document baseline verbatim
- [ ] Identify patterns in rationalizations/failures
GREEN Phase:
- [ ] Valid frontmatter: name (letters/numbers/hyphens), description ("Use when...", third person, <1536 chars combined with
when_to_use; aim for 200-400) - [ ] Keywords throughout for search (errors, symptoms, tools)
- [ ] Clear overview with core principle
- [ ] Address specific baseline failures from RED
- [ ] One excellent example (not multi-language)
- [ ] Run scenarios WITH skill - verify compliance
REFACTOR Phase:
- [ ] Add counters for new rationalizations
- [ ] Build rationalization table and red flags list
- [ ] Re-test until bulletproof
Quality Checks:
- [ ] Overview answers: What? When? How?
- [ ] Quick reference table or bullets
- [ ] Common mistakes section
- [ ] No narrative storytelling
- [ ] Token count within targets
Deployment:
- [ ] Commit skill to git
- [ ] Update documentation if needed
Anti-Patterns
- Narrative examples ("In session 2025-10-03...") — too specific, not reusable
- Multi-language dilution (example-js, example-py, example-go) — mediocre quality, maintenance burden
- Generic labels (helper1, step3) — use semantic names
- Untested skills — guarantees issues in production
- Vague descriptions ("Helps with coding") — Claude won't know when to activate
Integration with Dev-Workflow
Leverage existing skills: test-driven-development for testing, review for code review, documentation for docs, brainstorming for design exploration.
Activation context: Consider when skill activates in the workflow (planning, implementation, quality) and whether it auto-activates or requires explicit invocation.
Tool restrictions by phase:
- Planning:
Read, Write, Glob, Grep - Implementation:
Read, Write, Edit, Glob, Grep, Bash - Quality:
Read, Grep, Glob
The Bottom Line
Creating skills IS TDD for process documentation. Same Iron Law, same cycle (RED → GREEN → REFACTOR), same benefits. If you follow TDD for code, follow it for skills.