Anthropic Performance Tuning
Overview
Claude latency has two components: time to first token (TTFT) and tokens per second (TPS). Different strategies target each.
Latency Benchmarks (approximate)
| Model | TTFT (p50) | TTFT (p95) | Output TPS | |-------|-----------|-----------|------------| | Claude Haiku 4.5 | 200ms | 600ms | ~150 | | Claude Sonnet 4 | 400ms | 1.2s | ~90 | | Claude Opus 4 | 800ms | 2.5s | ~40 |
Optimization Strategies
Instructions
Step 1: Always Stream
// Streaming delivers the first token ASAP — user sees response instantly
// instead of waiting for the full response to generate
const stream = client.messages.stream({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
messages,
});
// First token arrives in ~400ms (Sonnet)
// Full response may take 5-10s, but user sees progress immediately
for await (const event of stream) {
if (event.type === 'content_block_delta') {
yield event.delta.text;
}
}
Step 2: Prompt Caching — Faster TTFT
// Cached prompts skip re-processing — dramatically lower TTFT for large system prompts
const message = await client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
system: [{
type: 'text',
text: largeSystemPrompt, // 10K+ tokens
cache_control: { type: 'ephemeral' },
}],
messages,
}, {
headers: { 'claude-beta': 'prompt-caching-2024-07-31' },
});
// TTFT drops from ~2s to ~500ms on cache hit with large prompts
Step 3: Use Haiku for Speed-Critical Paths
// Haiku is 2-4x faster than Sonnet with 80% quality for many tasks
// Use for: classification, extraction, simple Q&A, routing decisions
const route = await client.messages.create({
model: 'claude-haiku-4-5-20251001', // 200ms TTFT
max_tokens: 10,
system: 'Classify the intent. Reply with exactly one word: search, create, update, delete.',
messages: [{ role: 'user', content: userInput }],
});
// Then use Sonnet/Opus for the actual task
Step 4: Reuse Client Instance
// BAD — creates new connection pool per request
app.get('/api/chat', async (req, res) => {
const client = new Anthropic(); // DON'T
// ...
});
// GOOD — single client shared across requests
const client = new Anthropic(); // Module-level singleton
app.get('/api/chat', async (req, res) => {
const message = await client.messages.create({ ... });
// ...
});
Step 5: Parallel Requests
// When you need multiple independent Claude calls, fire them in parallel
const [summary, sentiment, entities] = await Promise.all([
client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
messages: [{ role: 'user', content: `Summarize: ${text}` }] }),
client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 20,
messages: [{ role: 'user', content: `Sentiment (positive/negative/neutral): ${text}` }] }),
client.messages.create({ model: 'claude-haiku-4-5-20251001', max_tokens: 200,
messages: [{ role: 'user', content: `Extract named entities from: ${text}` }] }),
]);
Step 6: Minimize Output Tokens
// Fewer output tokens = faster response
system: 'Be extremely concise. Use bullet points, not paragraphs.',
// Set tight max_tokens
max_tokens: 256, // Don't use 4096 for short answers
Output
- Streaming enabled for all user-facing responses (first token in ~400ms with Sonnet)
- Prompt caching reducing TTFT for large system prompts
- Model routing to Haiku for speed-critical classification/routing tasks
- Client instance reused across requests (no per-request connection overhead)
- Parallel requests firing independent Claude calls concurrently
Error Handling
| Issue | Cause | Fix |
|-------|-------|-----|
| TTFT > 3s | Large uncached prompt | Enable prompt caching |
| Slow output | Using Opus for simple tasks | Downgrade to Haiku/Sonnet |
| Timeouts | Long generation + default timeout | new Anthropic({ timeout: 120_000 }) |
| 529 overloaded | API capacity | SDK auto-retries; add fallback model |
Examples
See Latency Benchmarks table and six numbered strategy sections above, each with complete TypeScript code examples.
Resources
Next Steps
See clade-deploy-integration for production deployment patterns.
Prerequisites
- Completed
clade-install-auth - User-facing application where latency matters
- Understanding of streaming and async patterns