Cohere Cost Tuning
Overview
Optimize Cohere costs through model selection, token budgets, embedding compression, and usage monitoring. Cohere pricing is token-based with separate input/output rates.
Prerequisites
- Cohere production key (trial is free but limited)
- Access to dashboard.cohere.com billing page
Cohere Pricing Model
Key principle: Cohere charges per token. Input tokens and output tokens have different rates. Embed, Rerank, and Classify have separate pricing based on search units.
| Tier | Access | Rate Limits | Cost | |------|--------|-------------|------| | Trial | Free | 5-20 calls/min, 1000/month | $0 | | Production | Metered | 1000 calls/min, unlimited | Per-token |
Model Cost Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|-------|----------------------|------------------------|----------|
| command-r7b-12-2024 | Lowest | Lowest | High-volume, simple tasks |
| command-r-08-2024 | Low | Low | RAG, cost-effective |
| command-r-plus-08-2024 | Medium | Medium | Complex reasoning |
| command-a-03-2025 | Higher | Higher | Best quality |
Non-Chat Pricing
| Endpoint | Pricing Unit | Notes | |----------|-------------|-------| | Embed | Per input token | Batch 96 texts to minimize calls | | Rerank | Per search unit | 1 query + N docs = 1 search unit | | Classify | Per classification | Charges per input classified |
Instructions
Strategy 1: Model Tiering
type CostTier = 'economy' | 'standard' | 'premium';
function selectModel(tier: CostTier): string {
switch (tier) {
case 'economy': return 'command-r7b-12-2024'; // ~5x cheaper
case 'standard': return 'command-r-08-2024'; // Good balance
case 'premium': return 'command-a-03-2025'; // Best quality
}
}
// Route by use case
function routeModel(task: string): string {
// High-volume, simple tasks → cheapest model
if (['classify', 'extract', 'summarize-short'].includes(task)) {
return selectModel('economy');
}
// RAG, moderate complexity
if (['rag', 'search', 'qa'].includes(task)) {
return selectModel('standard');
}
// Complex reasoning, user-facing
return selectModel('premium');
}
Strategy 2: Token Budget Controls
import { CohereClientV2 } from 'cohere-ai';
const cohere = new CohereClientV2();
// Set maxTokens to prevent runaway generation costs
async function budgetedChat(message: string, maxOutputTokens = 500) {
const response = await cohere.chat({
model: 'command-r-08-2024',
messages: [{ role: 'user', content: message }],
maxTokens: maxOutputTokens, // Hard limit on output tokens
});
// Track actual usage
const usage = response.usage?.billedUnits;
console.log(`Tokens: in=${usage?.inputTokens} out=${usage?.outputTokens}`);
return response;
}
Strategy 3: Embedding Cost Reduction
// 1. Use int8 embeddings (same quality, cheaper storage)
const response = await cohere.embed({
model: 'embed-v4.0',
texts: documents,
inputType: 'search_document',
embeddingTypes: ['int8'], // 75% less storage than float
});
// 2. Batch to 96 per call (minimize API calls)
// 3. Cache embeddings (they're deterministic — embed once, use forever)
// 4. Use embed-multilingual-v3.0 if you don't need v4 features
Strategy 4: Usage Monitoring
class CohereUsageTracker {
private usage: Record<string, { inputTokens: number; outputTokens: number; calls: number }> = {};
private dailyBudget: number;
constructor(dailyBudgetUSD: number) {
this.dailyBudget = dailyBudgetUSD;
}
track(endpoint: string, billedUnits: { inputTokens?: number; outputTokens?: number }) {
if (!this.usage[endpoint]) {
this.usage[endpoint] = { inputTokens: 0, outputTokens: 0, calls: 0 };
}
this.usage[endpoint].inputTokens += billedUnits.inputTokens ?? 0;
this.usage[endpoint].outputTokens += billedUnits.outputTokens ?? 0;
this.usage[endpoint].calls++;
}
getReport(): string {
return Object.entries(this.usage)
.map(([ep, u]) =>
`${ep}: ${u.calls} calls, ${u.inputTokens} in, ${u.outputTokens} out`
)
.join('\n');
}
estimateDailyCost(): number {
// Rough estimate — check cohere.com/pricing for exact rates
const chatIn = (this.usage['chat']?.inputTokens ?? 0) / 1_000_000;
const chatOut = (this.usage['chat']?.outputTokens ?? 0) / 1_000_000;
const embedIn = (this.usage['embed']?.inputTokens ?? 0) / 1_000_000;
// Multiply by per-million-token rates from pricing page
return (chatIn * 0.5) + (chatOut * 1.5) + (embedIn * 0.1); // example rates
}
}
// Wrap all API calls
const tracker = new CohereUsageTracker(10); // $10/day budget
async function trackedChat(params: any) {
const response = await cohere.chat(params);
tracker.track('chat', response.usage?.billedUnits ?? {});
if (tracker.estimateDailyCost() > tracker['dailyBudget'] * 0.8) {
console.warn('WARNING: Approaching daily Cohere budget limit');
}
return response;
}
Strategy 5: Rerank Before RAG (Skip Embed for Small Corpora)
// If you have < 1000 documents, skip embedding entirely
// Rerank is cheaper than Embed + vector search for small collections
async function cheapRAG(query: string, corpus: string[]) {
// 1 search unit instead of N embed calls
const ranked = await cohere.rerank({
model: 'rerank-v3.5',
query,
documents: corpus,
topN: 3,
});
const docs = ranked.results.map((r, i) => ({
id: `doc-${i}`,
data: { text: corpus[r.index] },
}));
// Use cheaper model for generation
return cohere.chat({
model: 'command-r-08-2024', // Not command-a (cheaper)
messages: [{ role: 'user', content: query }],
documents: docs,
maxTokens: 300,
});
}
Cost Optimization Checklist
- [ ] Use
command-r7bfor simple tasks,command-aonly for complex ones - [ ] Set
maxTokenson all chat calls - [ ] Batch embed calls (96 texts per request)
- [ ] Cache embeddings (deterministic — compute once)
- [ ] Use
int8embeddings for storage - [ ] Monitor
usage.billedUnitsin every response - [ ] Set daily budget alerts
- [ ] Use
rerankinstead ofembedfor small corpora (< 1000 docs)
Output
- Model tiering by cost/quality
- Token budget controls preventing runaway costs
- Usage tracking with daily budget alerts
- Cost-effective RAG with rerank pre-filtering
Error Handling
| Issue | Cause | Solution | |-------|-------|----------| | Unexpected bill spike | No maxTokens | Set maxTokens on all chat calls | | High embed costs | Individual texts | Batch to 96 per call | | Budget exceeded | No monitoring | Track billedUnits per response | | Over-provisioned model | Using premium everywhere | Tier models by task complexity |
Resources
Next Steps
For architecture patterns, see cohere-reference-architecture.