Cost Verification Auditor Skill

Cost Verification Auditor

Verify that token cost estimates are within ±20% of actual Claude API usage.

When to Use

✅ Use for:

Validating token estimation systems after implementation
Pre-deployment cost accuracy checks
Debugging unexpected API bills
Periodic estimation drift detection

❌ NOT for:

Looking up model pricing (use pricing docs)
Budget planning or forecasting
Cost optimization strategies
Comparing models by price

Core Audit Process

Decision Tree

Has estimator? ──No──→ Build estimator first (see Calibration Guidelines)
      │
     Yes
      ↓
Define 3+ test cases (simple/medium/complex)
      ↓
Estimate BEFORE execution (no peeking!)
      ↓
Execute against real API
      ↓
Calculate variance: (actual - estimated) / estimated
      ↓
Variance ≤ ±20%? ──Yes──→ PASS ✓
      │
     No
      ↓
Apply fixes from Anti-Patterns section
      ↓
Re-run verification

Variance Formula

const inputVariance = (actual.inputTokens - estimate.inputTokens) / estimate.inputTokens;
const outputVariance = (actual.outputTokens - estimate.outputTokens) / estimate.outputTokens;
const costVariance = (actual.totalCost - estimate.totalCost) / estimate.totalCost;

// PASS if both input AND output within ±20%
const passed = Math.abs(inputVariance) <= 0.20 && Math.abs(outputVariance) <= 0.20;

Common Anti-Patterns

Anti-Pattern: The 500-Token Overhead Myth

Novice thinking: "Claude Code adds ~500 tokens overhead, so add that to every estimate."

Reality: Direct API calls have ~10 token overhead. The 500+ overhead is ONLY when using Claude Code's full context (system prompts, tools, conversation history).

Timeline:

Pre-2025: Many tutorials used 500+ token estimates
2025+: Direct API overhead is minimal (~10 tokens)

What to use instead: | Context | Overhead | |---------|----------| | Direct API call | ~10 tokens | | With system prompt | 50-200 tokens | | With tools/functions | 100-500 tokens | | Claude Code full context | 500-2000 tokens |

How to detect: Consistent 40-90% overestimation = overhead too high.

Anti-Pattern: Per-Node Accuracy Obsession

Novice thinking: "Every node must be within ±20% or the estimator is broken."

Reality: LLM output length is non-deterministic. Per-node output variance of 30-50% is normal. What matters is aggregate cost accuracy.

What to use instead:

Focus on total DAG cost variance (should be ±20%)
Accept per-node output variance up to ±40%
Use constrained prompts ("list exactly 3") to reduce variance

How to detect: Input estimates accurate, output varies wildly = normal LLM behavior.

Anti-Pattern: Peeking Before Estimating

Novice thinking: "Let me run the API call first to see what tokens we get, then build the estimator."

Reality: This produces perfectly-fitted estimates that fail on new prompts. Estimation must happen BEFORE execution.

Correct approach:

Estimate based on prompt length and heuristics
Execute API call
Compare variance
Adjust heuristics if needed

Calibration Guidelines

Input Token Estimation

// Calibrated 2026-01-30
const inputTokens = Math.ceil(prompt.length / CHARS_PER_TOKEN) + OVERHEAD;

| Text Type | CHARS_PER_TOKEN | Notes | |-----------|-----------------|-------| | English prose | 4.0 | Most consistent | | Code | 3.0-3.5 | Symbols tokenize differently | | Mixed | 3.5 | Balanced (recommended default) | | JSON/structured | 3.0 | Punctuation heavy |

Output Token Estimation

| Prompt Constraint | Multiplier | Notes | |-------------------|------------|-------| | "List exactly N items" | 0.8x input | Highly constrained | | "Brief summary" | 1.0x input | Moderate | | "Explain in detail" | 2-3x input | Expansive | | Unconstrained | 1.5x input | Variable |

Always: Minimum 100 output tokens for any meaningful response.

Model Behavior

| Model | Output Tendency | |-------|-----------------| | Claude Opus | Longer, more detailed | | Claude Sonnet | Balanced | | Claude Haiku | Concise, efficient |

Quick Fixes

| Symptom | Cause | Fix | |---------|-------|-----| | Overestimating by 40%+ | Overhead too high | Reduce from 500 → 10 | | Underestimating inputs | Chars/token too high | Reduce from 4.0 → 3.5 | | Output wildly varies | LLM non-determinism | Use constrained prompts | | Total cost accurate but per-node off | Normal aggregation | Accept it, focus on totals |

Verification Checklist

[ ] 3+ test cases (simple, medium, complex)
[ ] Estimates run BEFORE API calls
[ ] Variance formula: (actual - estimated) / estimated
[ ] Target: ±20% for input AND output
[ ] Report includes actionable recommendations

References

See /references/calibration-data.md for detailed calibration tables and historical data.

Agent Skills: Cost Verification Auditor

Install this agent skill to your local

Skill Files