OpenRouter Caching Strategy
Overview
OpenRouter charges per token, so caching repeated requests can cut costs substantially. Requests sent with temperature=0 and the same model, messages, and parameters usually produce the same output, which makes them safe to cache. This skill covers in-memory caching, persistent caching with a TTL, and Anthropic prompt caching via OpenRouter.
In-Memory Cache
import os, hashlib, json, time
from typing import Optional

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the OpenAI SDK can be pointed at it.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)


class LLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        # key -> (response, time stored)
        self._cache: dict[str, tuple[dict, float]] = {}
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, messages: list, **kwargs) -> str:
        blob = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        k = self._key(model, messages, **kwargs)
        if k in self._cache:
            data, ts = self._cache[k]
            if time.time() - ts < self._ttl:
                self.hits += 1
                return data
            del self._cache[k]  # expired entry
        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        k = self._key(model, messages, **kwargs)
        self._cache[k] = (response, time.time())


cache = LLMCache(ttl_seconds=1800)


def cached_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Only cache deterministic requests (temperature=0)."""
    kwargs.setdefault("temperature", 0)
    kwargs.setdefault("max_tokens", 1024)
    # A caller-supplied non-zero temperature bypasses the cache entirely.
    cacheable = kwargs["temperature"] == 0
    if cacheable:
        cached = cache.get(model, messages, **kwargs)
        if cached is not None:
            return cached
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "usage": {"prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens},
    }
    if cacheable:
        cache.set(model, messages, result, **kwargs)
    return result
Persistent Cache with Redis
import redis, json, hashlib

r = redis.Redis(host="localhost", port=6379, db=0)


def redis_cached_completion(messages, model="openai/gpt-4o-mini", ttl=3600, **kwargs):
    """Cache in Redis with automatic TTL expiry.

    Reuses the `client` object created in the in-memory example.
    """
    kwargs["temperature"] = 0  # Must be deterministic
    blob = json.dumps({"m": model, "msgs": messages, **kwargs}, sort_keys=True)
    key = f"or:{hashlib.sha256(blob.encode()).hexdigest()}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
    }
    r.setex(key, ttl, json.dumps(result))  # store the value together with its TTL
    return result
Anthropic Prompt Caching via OpenRouter
Anthropic models on OpenRouter support prompt caching: content blocks marked with cache_control (typically a large system prompt) are cached server-side, which reduces the input cost of the cached tokens by 90% on cache hits.
# Mark large static content blocks with cache_control.
# Assumes `client` from the first example and `large_context`, a long static string.
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert. Here is the full source:\n" + large_context,
                    "cache_control": {"type": "ephemeral"},  # Cache this block
                }
            ],
        },
        {"role": "user", "content": "What does the main() function do?"},
    ],
    max_tokens=1024,
)
# First call: cache_creation_input_tokens charged at 1.25x
# Subsequent: cache_read_input_tokens charged at 0.1x (90% savings)
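Because the first call pays a 1.25x premium, it helps to check when prompt caching breaks even. The sketch below works through that arithmetic. The function name `prompt_cache_cost` and the example price are hypothetical; only the 1.25x and 0.1x multipliers come from the comments above, and the sketch assumes every call after the first is a cache hit.

```python
def prompt_cache_cost(n_calls: int, cached_tokens: int, price_per_token: float) -> tuple[float, float]:
    """Return (cost_with_caching, cost_without_caching) for the cached prefix only.

    Assumes the first call writes the cache (1.25x) and every later call
    reads it (0.1x). Real traffic can miss if the cache entry has expired.
    """
    if n_calls <= 0:
        return 0.0, 0.0
    base = cached_tokens * price_per_token  # cost of the prefix at the normal input price
    with_cache = base * 1.25 + base * 0.1 * (n_calls - 1)
    without_cache = base * n_calls
    return with_cache, without_cache


# Hypothetical numbers: a 10,000-token prefix at $3 per million input tokens.
for calls in (1, 2, 10):
    cached, uncached = prompt_cache_cost(calls, 10_000, 3 / 1_000_000)
    print(f"{calls:>2} calls: ${cached:.4f} with caching vs ${uncached:.4f} without")
```

Under these assumptions a single call costs more with caching (1.25x vs 1x), and caching is already cheaper from the second call on (1.35x vs 2x).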
Cache Key Design
import hashlib, json


def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic cache key. Include everything that affects output.

    Include: model ID (with variant like :floor), messages, temperature,
    max_tokens, top_p, transforms, provider routing.
    Exclude: stream (doesn't affect content), HTTP-Referer, X-Title.
    """
    canonical = json.dumps({
        "model": model, "messages": messages,
        "temperature": params.get("temperature", 0),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        # OpenRouter-specific options can change the output, so they belong in the key too
        "transforms": params.get("transforms"),
        "provider": params.get("provider"),
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
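One way to sanity-check a key function is to assert the properties it should have. The sketch below uses a simplified stand-in, `demo_key` (a hypothetical name, not part of any library), and checks that the key ignores dict ordering and `stream` but changes with the temperature and the model variant.

```python
import hashlib
import json


def demo_key(model: str, messages: list, **params) -> str:
    """Simplified stand-in for a cache key function (hypothetical name)."""
    params.pop("stream", None)  # stream changes delivery, not content
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


msgs_a = [{"role": "user", "content": "hi"}]
msgs_b = [{"content": "hi", "role": "user"}]  # same message, different dict order

same = demo_key("openai/gpt-4o-mini", msgs_a, temperature=0)

# sort_keys=True makes the key independent of dict ordering.
assert same == demo_key("openai/gpt-4o-mini", msgs_b, temperature=0)
# Excluded parameters do not change the key.
assert same == demo_key("openai/gpt-4o-mini", msgs_a, temperature=0, stream=True)
# Anything that affects the output does.
assert same != demo_key("openai/gpt-4o-mini", msgs_a, temperature=0.7)
assert same != demo_key("openai/gpt-4o-mini:floor", msgs_a, temperature=0)
```

One caveat with JSON-based keys: `0` and `0.0` serialize differently, so `temperature=0` and `temperature=0.0` hash to different keys unless numeric parameters are normalized first.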
Cache Invalidation
| Trigger | Action | Why |
|---------|--------|-----|
| Model version update | Flush keys for that model | New version may give different outputs |
| System prompt change | Flush all keys | Output semantics changed |
| TTL expiry | Automatic eviction | Prevents stale data |
| Manual purge | r.delete(key) or clear by prefix | Debugging or policy change |
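The "clear by prefix" and key-versioning ideas in the table can be sketched as follows. This is a minimal sketch, not a definitive implementation: `versioned_key` and `purge_prefix` are hypothetical helper names, and `r` is assumed to be a `redis.Redis` client, of which only `scan_iter` and `delete` are used.

```python
from typing import Any


def versioned_key(version: str, digest: str) -> str:
    """Prefix keys with a version so that bumping it orphans every old entry."""
    return f"or:{version}:{digest}"


def purge_prefix(r: Any, prefix: str, batch_size: int = 500) -> int:
    """Delete every key that starts with `prefix`; return the number deleted.

    `r` is assumed to be a redis.Redis client. SCAN walks the keyspace
    incrementally, unlike KEYS, which blocks the server while it runs.
    """
    deleted = 0
    batch = []
    for key in r.scan_iter(match=f"{prefix}*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += r.delete(*batch)
            batch.clear()
    if batch:  # flush the final partial batch
        deleted += r.delete(*batch)
    return deleted
```

With versioned keys, a system prompt change only requires bumping the version string; entries under the old prefix then expire through their TTL, or can be removed right away with something like `purge_prefix(r, "or:v1:")`.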
Error Handling
| Error | Cause | Fix |
|-------|-------|-----|
| Stale cache response | TTL too long | Reduce TTL or version cache keys |
| Cache miss storm | Cold start or invalidation | Warm cache with common queries at deploy |
| Redis connection error | Redis down | Fall through to direct API call |
| Non-deterministic cache | temperature > 0 cached | Only cache when temperature=0 |
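The "fall through to direct API call" fix for a Redis outage can be sketched as below. `get_or_compute` is a hypothetical helper name, `r` is assumed to be a `redis.Redis` client, and `compute` stands for whatever function performs the real OpenRouter request.

```python
import json
from typing import Any, Callable


def get_or_compute(r: Any, key: str, compute: Callable[[], dict], ttl: int = 3600) -> dict:
    """Return the cached value for `key`, or call `compute()` and try to cache it.

    `r` is assumed to be a redis.Redis client. Any cache failure is treated
    as a miss, so a Redis outage never blocks the request itself.
    """
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    except Exception:
        # With redis-py, catching redis.exceptions.RedisError would be narrower.
        pass  # cache unavailable: fall through to the real call

    result = compute()  # e.g. the OpenRouter request
    try:
        r.setex(key, ttl, json.dumps(result))
    except Exception:
        pass  # best effort: failing to store must not fail the request
    return result
```

The broad `except Exception` keeps the sketch free of a hard dependency on the Redis library; in production code, narrowing it and logging the failure would make outages visible instead of silently raising API spend.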
Enterprise Considerations
- Only cache deterministic requests (temperature=0) -- non-zero temperatures produce different outputs each time
- Use Anthropic prompt caching for large system prompts (RAG context) -- 90% cost reduction on cache hits
- Set TTL based on content freshness needs (30 min for dynamic, 24h for reference data)
- Track cache hit rate to justify caching infrastructure cost
- Use Redis or Memcached for multi-instance deployments; in-memory only works for single-process
- Version cache keys when updating system prompts or switching model versions
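Tracking the hit rate and choosing a TTL per content type, as suggested in the list above, can be sketched like this. `CacheStats`, `TTL_BY_CONTENT`, its category names, and the per-call cost are all hypothetical; only the 30-minute and 24-hour TTLs come from the list.

```python
from dataclasses import dataclass

# TTLs in seconds, taken from the guidance above; the category names are hypothetical.
TTL_BY_CONTENT = {"dynamic": 30 * 60, "reference": 24 * 60 * 60}


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from cache (0.0 when there were none)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def estimated_savings(self, avg_cost_per_call: float) -> float:
        """API spend avoided, assuming each hit replaced one average-cost call."""
        return self.hits * avg_cost_per_call


# Hypothetical traffic: 300 hits and 700 misses at an average of $0.002 per call.
stats = CacheStats(hits=300, misses=700)
print(f"hit rate: {stats.hit_rate:.0%}")
print(f"estimated savings: ${stats.estimated_savings(0.002):.2f}")
```

Comparing the estimated savings against the running cost of Redis or Memcached gives a rough basis for deciding whether the caching infrastructure pays for itself.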