Calibrate - Post-Launch AI Feature Calibration
Core Philosophy
Calibration happens after launch, not before.
The mistake: Building elaborate systems to perfectly calibrate AI behavior before launch. The reality: You learn what quality means by shipping to users and seeing what they actually need.
The Calibration Loop:
- Deploy at current agency level
- Monitor performance in prod
- Analyze and learn
- Calibrate system
- Test changes
- Consider agency increase
- Repeat
Entry Point
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Calibration happens after launch, not before.
What do you need?
1. Document error patterns
→ Analyze failures, categorize, plan fixes
2. Review eval performance
→ Are evals catching real issues? Missing patterns?
3. Agency promotion decision
→ Is this feature ready for more autonomy?
4. Quick calibration check
→ Is the system behaving as expected?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Parse intent from context:
- If user mentions "errors" or "failures" or "bugs" → Flow 1
- If user mentions "evals" or "tests" or "coverage" → Flow 2
- If user mentions "promote" or "increase" or "V2" → Flow 3
- If user mentions "check" or "status" or "quick" → Flow 4
Command-line shortcuts:
/calibrate→ Show entry menu/calibrate --errors→ Flow 1 (error patterns)/calibrate --evals→ Flow 2 (eval review)/calibrate --promote→ Flow 3 (agency promotion)/calibrate --quick→ Flow 4 (quick check)
Flow 1: Document Error Patterns
Step 1: Gather Error Data
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN DOCUMENTATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's catalog what's going wrong.
Where are you seeing errors?
• User feedback / complaints
• Support tickets
• Monitoring alerts
• Manual review
• User corrections / overrides
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
- "What specific failures have you observed?"
- "How often does this happen? (rare, occasional, frequent)"
- "What's the impact when it fails?"
- "Are there patterns in WHEN it fails?"
Step 2: Categorize Errors
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR CATEGORIES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Common error categories for AI features:
• Hallucination - AI makes things up
• Wrong context - AI misreads the situation
• Tone mismatch - Output style is wrong
• Scope creep - AI goes beyond boundaries
• Missing information - AI lacks needed context
• Confidence miscalibration - Too certain or uncertain
• Edge case - Scenario not covered
• Adversarial - Intentional misuse
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 3: Analyze Root Causes
For each error pattern, determine:
- Likely reason: Why is this happening?
- Potential fix: Prompt change? Context? Guardrail? Training data?
- Priority: P1 (critical), P2 (important), P3 (nice to fix)
Output: Error Pattern Table
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Analysis Date: [date]
Data Source: [where errors were observed]
| Error Pattern | Category | Likely Reason | Potential Fix | Priority |
|---------------|----------|---------------|---------------|----------|
| [description] | [type] | [why] | [how to fix] | P1 |
| [description] | [type] | [why] | [how to fix] | P2 |
| [description] | [type] | [why] | [how to fix] | P3 |
PATTERN ANALYSIS:
- Most common category: [X]
- Emerging pattern: [Y]
- Regression from last period: [Z]
RECOMMENDED ACTIONS:
1. [P1 action]
2. [P2 action]
3. [P3 action]
ADD TO EVALS:
- [ ] Add test case for [error pattern 1]
- [ ] Add test case for [error pattern 2]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 2: Review Eval Performance
Step 1: Current Eval State
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL PERFORMANCE REVIEW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's see if your evals are working.
Current state:
• How many test cases do you have?
• What's your pass rate?
• When did you last update evals?
• Are you seeing failures in prod that evals missed?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
- "Are current evals catching real issues?"
- "What new patterns have emerged since launch?"
- "Are you passing 100%?" (If yes, evals may be too easy)
- "How often do you run evals?"
Step 2: Gap Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL GAP ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Check coverage across categories:
□ Happy path - Common successful scenarios
□ Edge cases - Unusual but valid inputs
□ Adversarial - Intentional misuse attempts
□ Boundary - Out of scope handling
□ Regression - Previously fixed issues
□ Production errors - Real failures observed
Missing categories = gaps in coverage
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Output: Eval Assessment
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL ASSESSMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Test Cases: [count]
Pass Rate: [%]
Last Updated: [date]
COVERAGE ASSESSMENT:
| Category | Coverage | Status |
|----------|----------|--------|
| Happy path | [X] cases | ✅ Good |
| Edge cases | [X] cases | ⚠️ Needs work |
| Adversarial | [X] cases | ❌ Missing |
| Boundary | [X] cases | ✅ Good |
| Regression | [X] cases | ⚠️ Needs work |
EFFECTIVENESS:
- Catching real issues? [Yes/No/Partially]
- False positive rate: [%]
- Prod errors missed: [list]
RECOMMENDATIONS:
1. Add [X] test cases for [gap]
2. Update [Y] tests that are stale
3. Remove [Z] tests that are redundant
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 3: Agency Promotion Decision
Promotion Checklist
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGENCY PROMOTION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Considering: V[current] → V[target]
Let's verify readiness.
QUALITY METRICS
□ Accuracy/quality stable for 4+ weeks?
□ No new error patterns emerging?
□ User corrections decreasing?
SAFETY & TRUST
□ Confident in known failure modes?
□ Override mechanism working well?
□ User feedback positive?
OPERATIONAL READINESS
□ Monitoring in place for new level?
□ Rollback plan ready?
□ Team aligned on promotion?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
For each item, ask:
- "What's your evidence?"
- "How long have you observed this?"
- "What would change your answer?"
Output: Promotion Verdict
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROMOTION VERDICT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Current: V[n] | Target: V[n+1]
VERDICT: [READY ✅ / NOT READY ❌ / NEEDS WORK ⚠️]
✅ PASSING:
- [criteria met with evidence]
- [criteria met with evidence]
❌ BLOCKING:
- [criteria not met + what's needed]
- [criteria not met + what's needed]
⚠️ RISKS IF PROMOTED NOW:
- [risk + mitigation needed]
RECOMMENDATION:
[Clear recommendation with reasoning]
NEXT STEPS:
1. [action if ready]
2. [action if not ready]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 4: Quick Calibration Check
Health Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUICK CALIBRATION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fast health check for [feature name]
• Current quality metric: [X%]
• Trend: [↑ improving / → stable / ↓ degrading]
• Any alerts triggered? [Y/N]
• User feedback signals: [positive/neutral/negative]
• Override rate: [X%] (is this expected?)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Output: Health Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATION STATUS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Agency Level: V[n]
Check Date: [date]
STATUS: [HEALTHY ✅ / ATTENTION ⚠️ / DEGRADED ❌]
METRICS:
| Metric | Value | Trend | Status |
|--------|-------|-------|--------|
| Quality | [X%] | [→] | ✅ |
| Override rate | [X%] | [↓] | ✅ |
| User satisfaction | [X] | [→] | ⚠️ |
ALERTS:
- [any triggered alerts]
ACTION NEEDED:
- [none / specific action]
NEXT CHECK: [recommended cadence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Calibration Cadence
Weekly: Quick check (Flow 4)
- Is quality stable?
- Any new alerts?
- User feedback signals?
Monthly: Eval review (Flow 2)
- Are evals catching real issues?
- New patterns to add?
- Stale tests to update?
Quarterly: Deep calibration
- Error pattern analysis (Flow 1)
- Agency promotion consideration (Flow 3)
- Strategic calibration review
Challenge Patterns
"Quality looks fine" → "What's your correction rate? What's the trend? Fine compared to what baseline?"
"No complaints" → "Are users actually using it? Are they silently working around it?"
"Ready to promote" → "Show me the data. How long has quality been stable? What failure modes have you validated?"
"Evals are passing" → "100% pass rate? That might mean evals are too easy. When did you last add new test cases?"
Integration with Other Commands
Before /calibrate:
/agency-ladder- Define the ladder first/spec --ai- Ensure spec includes calibration plan
Related:
/ai-health-check- Pre-launch validation/start-evals- Set up eval infrastructure
Attribution
Framework: CC/CD (Continuous Calibration/Continuous Development) Source: Aishwarya Naresh Reganti & Kiriti Badam (Lenny's Newsletter) Adaptation: Post-launch calibration workflows
Remember
- Calibration is continuous, not one-time
- Error patterns are the richest signal
- User corrections show where AI fails
- Evals must evolve with the product
- Promotion requires proven reliability
- 100% pass rate means evals are too easy