Agent Skills: OpenRouter Caching Strategy

|

UncategorizedID: jeremylongshore/claude-code-plugins-plus-skills/openrouter-caching-strategy

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy

Skill Files

Browse the full folder contents for openrouter-caching-strategy.

Download Skill

Loading file tree…

plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy/SKILL.md

Skill Metadata

Name
openrouter-caching-strategy
Description
|

OpenRouter Caching Strategy

Overview

OpenRouter charges per token, so caching identical or similar requests can dramatically cut costs. Deterministic requests (temperature=0) with the same model and messages produce identical outputs -- these are safe to cache. This skill covers in-memory caching, persistent caching with TTL, and Anthropic prompt caching via OpenRouter.

In-Memory Cache

import os, hashlib, json, time
from typing import Optional
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

class LLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._cache: dict[str, tuple[dict, float]] = {}
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, messages: list, **kwargs) -> str:
        blob = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        k = self._key(model, messages, **kwargs)
        if k in self._cache:
            data, ts = self._cache[k]
            if time.time() - ts < self._ttl:
                self.hits += 1
                return data
            del self._cache[k]
        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        k = self._key(model, messages, **kwargs)
        self._cache[k] = (response, time.time())

cache = LLMCache(ttl_seconds=1800)

def cached_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Only cache deterministic requests (temperature=0)."""
    kwargs.setdefault("temperature", 0)
    kwargs.setdefault("max_tokens", 1024)

    cached = cache.get(model, messages, **kwargs)
    if cached:
        return cached

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "usage": {"prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens},
    }
    cache.set(model, messages, result, **kwargs)
    return result

Persistent Cache with Redis

import redis, json, hashlib

r = redis.Redis(host="localhost", port=6379, db=0)

def redis_cached_completion(messages, model="openai/gpt-4o-mini", ttl=3600, **kwargs):
    """Cache in Redis with automatic TTL expiry."""
    kwargs["temperature"] = 0  # Must be deterministic
    key = f"or:{hashlib.sha256(json.dumps({'m': model, 'msgs': messages, **kwargs}, sort_keys=True).encode()).hexdigest()}"

    cached = r.get(key)
    if cached:
        return json.loads(cached)

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
    }
    r.setex(key, ttl, json.dumps(result))
    return result

Anthropic Prompt Caching via OpenRouter

Anthropic models on OpenRouter support prompt caching -- large system prompts are cached server-side, reducing input cost by 90% on cache hits.

# Mark large static content blocks with cache_control
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert. Here is the full source:\n" + large_context,
                    "cache_control": {"type": "ephemeral"},  # Cache this block
                }
            ],
        },
        {"role": "user", "content": "What does the main() function do?"},
    ],
    max_tokens=1024,
)
# First call: cache_creation_input_tokens charged at 1.25x
# Subsequent: cache_read_input_tokens charged at 0.1x (90% savings)

Cache Key Design

def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic cache key. Include everything that affects output.

    Include: model ID (with variant like :floor), messages, temperature,
    max_tokens, top_p, transforms, provider routing.
    Exclude: stream (doesn't affect content), HTTP-Referer, X-Title.
    """
    canonical = json.dumps({
        "model": model, "messages": messages,
        "temperature": params.get("temperature", 0),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

Cache Invalidation

| Trigger | Action | Why | |---------|--------|-----| | Model version update | Flush keys for that model | New version may give different outputs | | System prompt change | Flush all keys | Output semantics changed | | TTL expiry | Automatic eviction | Prevents stale data | | Manual purge | r.delete(key) or clear by prefix | Debugging or policy change |

Error Handling

| Error | Cause | Fix | |-------|-------|-----| | Stale cache response | TTL too long | Reduce TTL or version cache keys | | Cache miss storm | Cold start or invalidation | Warm cache with common queries at deploy | | Redis connection error | Redis down | Fall through to direct API call | | Non-deterministic cache | temperature > 0 cached | Only cache when temperature=0 |

Enterprise Considerations

  • Only cache deterministic requests (temperature=0) -- non-zero temperatures produce different outputs each time
  • Use Anthropic prompt caching for large system prompts (RAG context) -- 90% cost reduction on cache hits
  • Set TTL based on content freshness needs (30 min for dynamic, 24h for reference data)
  • Track cache hit rate to justify caching infrastructure cost
  • Use Redis or Memcached for multi-instance deployments; in-memory only works for single-process
  • Version cache keys when updating system prompts or switching model versions

References