Mistral AI Performance Tuning
Overview
Optimize Mistral AI API response times and throughput. The key levers are model selection (Mistral Small at ~200ms TTFT vs. Large at ~500ms), prompt length (fewer input tokens means faster TTFT), streaming (lower perceived latency), caching (near-zero latency on repeated requests), and concurrent request management.
Prerequisites
- Mistral API integration in production
- Understanding of RPM/TPM limits for your tier
- Application architecture supporting streaming
Instructions
Step 1: Model Selection by Latency Budget
const MODELS_BY_USE_CASE: Record<string, { model: string; ttftMs: string; note: string }> = {
  realtime_chat:   { model: 'mistral-small-latest', ttftMs: '~200ms', note: '256k ctx, cheapest' },
  code_completion: { model: 'codestral-latest',     ttftMs: '~150ms', note: 'Optimized for code + FIM' },
  code_agents:     { model: 'devstral-latest',      ttftMs: '~300ms', note: 'Agentic coding tasks' },
  reasoning:       { model: 'mistral-large-latest', ttftMs: '~500ms', note: '256k ctx, strongest' },
  vision:          { model: 'pixtral-large-latest', ttftMs: '~600ms', note: 'Image + text multimodal' },
  embeddings:      { model: 'mistral-embed',        ttftMs: '~50ms',  note: '1024-dim, batch-friendly' },
  edge_devices:    { model: 'ministral-latest',     ttftMs: '~100ms', note: '3B-14B, fastest' },
};
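One way to apply the table is to choose a model from a latency budget at request time. The following is a minimal sketch, not part of the Mistral SDK: `pickModelForBudget`, `ModelOption`, and `CHAT_MODELS` are hypothetical names, and the TTFT numbers are the rough figures from the table above, so measure them in your own environment before relying on them.

```typescript
// Approximate TTFT figures taken from the table above; treat them as rough
// planning numbers, not guarantees.
interface ModelOption {
  model: string;
  approxTtftMs: number;
}

// Ordered by preference: strongest model first.
const CHAT_MODELS: ModelOption[] = [
  { model: 'mistral-large-latest', approxTtftMs: 500 },
  { model: 'mistral-small-latest', approxTtftMs: 200 },
];

// Hypothetical helper: returns the first model (in preference order) whose
// approximate TTFT fits the budget, or the fastest model if none fit.
function pickModelForBudget(
  budgetMs: number,
  options: ModelOption[] = CHAT_MODELS,
): string {
  if (options.length === 0) throw new Error('options must not be empty');
  const fit = options.find(o => o.approxTtftMs <= budgetMs);
  if (fit) return fit.model;
  return options.reduce((a, b) => (b.approxTtftMs < a.approxTtftMs ? b : a)).model;
}
```

Listing the options in preference order keeps the policy explicit: the strongest model is used whenever the budget allows, and the fallback never fails outright.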
Step 2: Streaming for User-Facing Responses
Streaming reduces perceived latency from 1-2s (full response) to ~200ms (first token):
import { Mistral } from '@mistralai/mistralai';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// Yields text deltas as they arrive from the streaming endpoint
async function* streamChat(messages: any[], model = 'mistral-small-latest') {
  const stream = await client.chat.stream({ model, messages });
  for await (const chunk of stream) {
    const content = chunk.data?.choices?.[0]?.delta?.content;
    // delta.content may be absent or non-string; forward plain text only
    if (typeof content === 'string' && content) yield content;
  }
}

// Web Response with SSE
function streamToSSE(messages: any[]): Response {
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const text of streamChat(messages)) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
      } catch (err) {
        // Surface upstream failures to the consumer instead of hanging the stream
        controller.error(err);
      }
    },
  });
  return new Response(readable, {
    headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' },
  });
}
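On the consuming side, the frames written by `streamToSSE` have to be reassembled, because a network chunk can end in the middle of a frame. The sketch below is one possible client-side parser for exactly the `data: {"text": ...}` and `data: [DONE]` frames shown above; `SseTextParser` is a hypothetical name and it does not handle other SSE fields such as `event:` or `id:`.

```typescript
// Hypothetical parser for the frames emitted by streamToSSE.
class SseTextParser {
  private buffer = '';
  done = false;

  // Feed one decoded network chunk; returns the text pieces it completed.
  push(chunk: string): string[] {
    const out: string[] = [];
    if (this.done) return out;
    this.buffer += chunk;
    let sep: number;
    // Frames are separated by a blank line ("\n\n").
    while ((sep = this.buffer.indexOf('\n\n')) !== -1) {
      const frame = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      if (!frame.startsWith('data: ')) continue; // skip anything that is not a data frame
      const payload = frame.slice('data: '.length);
      if (payload === '[DONE]') {
        this.done = true;
        break;
      }
      const parsed = JSON.parse(payload) as { text?: unknown };
      if (typeof parsed.text === 'string') out.push(parsed.text);
    }
    return out;
  }
}
```

In a browser, each chunk read from `response.body` would be decoded with `TextDecoder` (using `{ stream: true }`) and passed to `push`, appending the returned pieces to the UI as they arrive.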
Step 3: Response Caching
import { createHash } from 'crypto';
import { LRUCache } from 'lru-cache';

const cache = new LRUCache<string, any>({
  max: 5000,
  ttl: 3_600_000, // 1 hour
});

async function cachedChat(
  messages: any[],
  model: string,
  temperature = 0,
): Promise<any> {
  // Only cache deterministic requests
  if (temperature > 0) {
    return client.chat.complete({ model, messages, temperature });
  }

  const key = createHash('sha256')
    .update(JSON.stringify({ model, messages }))
    .digest('hex');

  const cached = cache.get(key);
  if (cached !== undefined) {
    console.debug('Cache HIT');
    return cached;
  }

  const result = await client.chat.complete({ model, messages, temperature: 0 });
  cache.set(key, result);
  return result;
}
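Because the cache key is a hash of the exact request, prompts that differ only in spacing or trailing newlines miss the cache. A possible refinement, sketched here under the assumption that whitespace is not significant in your prompts (it can be for code or preformatted text), is to normalize message content before hashing. `normalizedCacheKey` and `ChatMessage` are hypothetical names introduced for illustration.

```typescript
import { createHash } from 'node:crypto';

interface ChatMessage {
  role: string;
  content: string;
}

// Hypothetical helper: builds a cache key that ignores insignificant
// whitespace differences, so near-identical prompts share one entry.
function normalizedCacheKey(model: string, messages: ChatMessage[]): string {
  const normalized = messages.map(m => ({
    role: m.role,
    content: m.content.replace(/\s+/g, ' ').trim(),
  }));
  return createHash('sha256')
    .update(JSON.stringify({ model, messages: normalized }))
    .digest('hex');
}
```

The model name stays in the key so that responses from different models are never mixed.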
Step 4: Prompt Length Optimization
// Shorter prompts = faster TTFT and lower cost
function optimizePrompt(systemPrompt: string, maxChars = 500): string {
  return systemPrompt
    .replace(/[ \t]+/g, ' ')    // Collapse runs of spaces and tabs (newlines are kept)
    .replace(/\n\s*\n/g, '\n')  // Remove blank lines
    .trim()
    .slice(0, maxChars);        // Hard cap; may cut off trailing instructions
}

// Trim conversation history to the last N turns (one turn = user + assistant message)
function trimHistory(messages: any[], maxTurns = 10): any[] {
  const system = messages.filter(m => m.role === 'system');
  const history = messages.filter(m => m.role !== 'system').slice(-maxTurns * 2);
  return [...system, ...history];
}

// Impact: reducing input from 4000 to 500 tokens typically cuts TTFT by roughly 30-50%
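Trimming by turn count ignores how long each message is. A variant that trims against a token budget is sketched below. It assumes a rough heuristic of about 4 characters per token for English text, which is an approximation and not Mistral's tokenizer; `estimateTokens` and `trimToTokenBudget` are hypothetical helpers.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Rough heuristic (assumption): ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keeps all system messages, then the most recent
// non-system messages that fit in the remaining token budget.
function trimToTokenBudget(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');
  let remaining = maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);

  const kept: ChatMessage[] = [];
  // Walk from newest to oldest and stop at the first message that does not fit.
  for (const m of [...rest].reverse()) {
    const cost = estimateTokens(m.content);
    if (cost > remaining) break;
    kept.unshift(m);
    remaining -= cost;
  }
  return [...system, ...kept];
}
```

For exact counts, a real tokenizer would replace `estimateTokens`; the trimming logic stays the same.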
Step 5: Concurrent Request Queue
import PQueue from 'p-queue';

// intervalCap enforces the RPM limit; concurrency bounds in-flight requests
const queue = new PQueue({
  concurrency: 10,
  interval: 60_000,
  intervalCap: 100, // RPM limit for your workspace
});

async function queuedChat(messages: any[], model = 'mistral-small-latest') {
  return queue.add(() => client.chat.complete({ model, messages }));
}

// Process 100 requests respecting RPM
const prompts = Array.from({ length: 100 }, (_, i) => `Question ${i}`);
const results = await Promise.all(
  prompts.map(p => queuedChat([{ role: 'user', content: p }]))
);
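With a 60-second interval, the queue may start all 100 requests in the first moments of the window and then sit idle, which can still trip token-per-minute limits. One option is to express the same RPM as a shorter window so requests are spread out. The sketch below is an assumption-laden helper (`rateWindowForRpm` is a hypothetical name) that produces `interval` and `intervalCap` values for PQueue; it rounds down, so the effective rate can be somewhat below the limit.

```typescript
interface RateWindow {
  interval: number;    // window length in ms
  intervalCap: number; // max request starts per window
}

// Hypothetical helper: converts an RPM limit into a short window so requests
// are spread across the minute instead of bursting at the start of it.
function rateWindowForRpm(rpm: number, windowMs = 1_000): RateWindow {
  if (rpm <= 0) throw new Error('rpm must be positive');
  const perWindow = Math.floor((rpm * windowMs) / 60_000);
  if (perWindow >= 1) return { interval: windowMs, intervalCap: perWindow };
  // Fewer than one request per window: allow one request per longer interval.
  return { interval: Math.ceil(60_000 / rpm), intervalCap: 1 };
}

// Usage with p-queue (not executed here):
//   new PQueue({ concurrency: 10, ...rateWindowForRpm(120) })
```

The trade-off is lower peak throughput in exchange for a steadier request rate.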
Step 6: Batch API for Non-Realtime Workloads
Use the Batch API for 50% cost savings when latency is not critical:
// Batch API processes requests asynchronously (minutes to hours)
// Supports: /v1/chat/completions, /v1/embeddings, /v1/fim/completions, /v1/moderations
// See mistral-webhooks-events for full batch implementation
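Batch jobs take their requests from an uploaded input file. The sketch below only prepares that file's content and makes no API calls. It assumes the input format is JSONL with one object per line containing a `custom_id` and a `body` holding the request payload; check this against the current Mistral batch documentation before use. `buildBatchJsonl` and `BatchLine` are hypothetical names.

```typescript
// Assumed shape of one batch input line (verify against the batch docs).
interface BatchLine {
  custom_id: string;
  body: {
    messages: { role: string; content: string }[];
    max_tokens?: number;
  };
}

// Hypothetical helper: turns a list of prompts into JSONL text, using the
// array index as custom_id so results can be matched back to inputs.
function buildBatchJsonl(prompts: string[], maxTokens = 256): string {
  return prompts
    .map((content, i): BatchLine => ({
      custom_id: String(i),
      body: { max_tokens: maxTokens, messages: [{ role: 'user', content }] },
    }))
    .map(line => JSON.stringify(line))
    .join('\n');
}
```

`JSON.stringify` escapes newlines inside prompts, so each request stays on a single line as JSONL requires.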
Step 7: FIM (Fill-in-the-Middle) for Code
// Codestral supports FIM, which is faster than full chat for code completion
const response = await client.fim.complete({
  model: 'codestral-latest',
  prompt: 'function fibonacci(n) {\n if (n <= 1) return n;\n',
  suffix: '\n}\n',
  maxTokens: 100,
});
// Returns only the middle part: minimal tokens, minimal latency
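In an editor integration, `prompt` and `suffix` usually come from the text before and after the cursor. The following sketch shows one way to derive them while capping how much context is sent, since shorter inputs keep latency low. `splitAtCursor`, `FimInput`, and the default character limits are hypothetical choices, not SDK features.

```typescript
interface FimInput {
  prompt: string; // text before the cursor
  suffix: string; // text after the cursor
}

// Hypothetical helper: splits a source file at the cursor offset and keeps
// only the nearest context on each side.
function splitAtCursor(
  source: string,
  cursor: number,
  maxPrefixChars = 2000,
  maxSuffixChars = 500,
): FimInput {
  // Clamp the cursor into the valid range [0, source.length].
  const pos = Math.min(Math.max(cursor, 0), source.length);
  return {
    prompt: source.slice(Math.max(0, pos - maxPrefixChars), pos),
    suffix: source.slice(pos, pos + maxSuffixChars),
  };
}
```

The result can be spread into the FIM request, for example `client.fim.complete({ model: 'codestral-latest', ...splitAtCursor(text, offset), maxTokens: 100 })`.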
Performance Benchmarks
| Optimization | Typical Impact |
|-------------|----------------|
| mistral-small vs mistral-large | 2-4x faster TTFT |
| Streaming vs non-streaming | 5-10x perceived speed |
| Response caching (temp=0) | 100x faster (cache hit) |
| Prompt trimming (4k to 500 tokens) | 30-50% faster TTFT |
| Batch API | Not faster, but 50% cheaper |
| FIM vs chat for code | 2-3x fewer tokens |
Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| 429 rate_limit_exceeded | RPM/TPM cap hit | Use PQueue with interval cap |
| High TTFT (>1s) | Prompt too long or large model | Trim prompt, use mistral-small |
| Stream disconnected | Network timeout | Implement reconnection |
| Cache thrashing | High cardinality prompts | Increase cache size, or normalize prompts before hashing to raise the hit rate |
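For 429 responses and dropped connections, a common complement to client-side queuing is retrying with exponential backoff and jitter. The sketch below is generic and not tied to the Mistral SDK's error types: `backoffDelayMs` and `withRetry` are hypothetical helpers, and the caller supplies `isRetryable` because how a 429 is detected depends on the error shape your SDK version throws.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt), often called "full jitter".
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Hypothetical wrapper: retries fn while isRetryable(err) is true, up to
// maxAttempts total calls, then rethrows the last error.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Jitter spreads retries from many clients over time, so they do not all hit the API again at the same instant.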
Output
- Model selection optimized for latency requirements
- Streaming endpoints for perceived speed
- LRU response cache for deterministic requests
- Prompt optimization reducing token count
- Concurrent request queue respecting RPM limits