Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-10-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
  "ai": {
    "binding": "AI"
  }
}
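After adding the binding, run wrangler types to regenerate your Env typings (or install @cloudflare/workers-types) so the Ai binding type used below resolves.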
2. Run Your First Model
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });
    return Response.json(response);
  },
};
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
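When the client should receive plain text rather than raw server-sent events, the stream can be re-shaped inside the Worker. A minimal sketch, assuming each SSE line has the form data: {"response":"<token>"} and the stream terminates with data: [DONE] (verify against the current Workers AI docs):

function sseToText(stream: ReadableStream<Uint8Array>): ReadableStream<Uint8Array> {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  let buffer = '';

  return stream.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        buffer += decoder.decode(chunk, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // keep any partial line for the next chunk
        for (const line of lines) {
          if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
          const payload = JSON.parse(line.slice(6)) as { response?: string };
          if (payload.response) controller.enqueue(encoder.encode(payload.response));
        }
      },
    })
  );
}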
Workers AI API Reference
env.AI.run()
Run an AI model inference.
Signature:
async env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
Parameters:
- model (string, required) - Model ID (e.g., @cf/meta/llama-3.1-8b-instruct)
- inputs (object, required) - Model-specific inputs
- options (object, optional) - Additional options:
  - gateway (object) - AI Gateway configuration
    - id (string) - Gateway ID
    - skipCache (boolean) - Skip AI Gateway cache
Returns:
- Non-streaming: Promise<ModelOutput> - JSON response
- Streaming: ReadableStream - Server-sent events stream
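Because the return type is a union, narrow it at runtime before using the result. A short sketch:

const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  prompt: 'Hello',
});

if (result instanceof ReadableStream) {
  // Streaming was requested - pass the SSE stream through
  return new Response(result, {
    headers: { 'content-type': 'text/event-stream' },
  });
}

// Otherwise it's a plain JSON model output
return Response.json(result);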
Text Generation Models
Input Format:
{
  messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
  prompt?: string; // Deprecated, use messages
  stream?: boolean; // Default: false
  max_tokens?: number; // Max tokens to generate
  temperature?: number; // 0.0-1.0, default varies by model
  top_p?: number; // 0.0-1.0
  top_k?: number;
}
Output Format (Non-Streaming):
{
  response: string; // Generated text
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is TypeScript?' },
  ],
  stream: false,
});

console.log(response.response);
Text Embeddings Models
Input Format:
{
  text: string | string[]; // Single text or array of texts
}
Output Format:
{
  shape: number[]; // [batch_size, embedding_dimensions]
  data: number[][]; // Array of embedding vectors
}
Example:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Hello world', 'Cloudflare Workers'],
});

console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
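Continuing the example above, the returned vectors can be compared directly, e.g. with cosine similarity, when a full vector database is unnecessary:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// ~1.0 = near-identical meaning, ~0.0 = unrelated
const similarity = cosineSimilarity(embeddings.data[0], embeddings.data[1]);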
Image Generation Models
Input Format:
{
  prompt: string; // Text description
  num_steps?: number; // Default: 20
  guidance?: number; // CFG scale, default: 7.5
  strength?: number; // For img2img, default: 1.0
  image?: number[]; // For img2img: image bytes as an array of 8-bit integers
}
Output Format:
- Stable Diffusion models: binary image data (PNG)
- @cf/black-forest-labs/flux-1-schnell: JSON with a base64-encoded JPEG in its image field
Example:
const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});

// flux-1-schnell returns { image: string }, base64-encoded
const bytes = Uint8Array.from(atob(result.image), (c) => c.charCodeAt(0));
return new Response(bytes, {
  headers: { 'content-type': 'image/jpeg' },
});
Vision Models
Input Format:
{
  messages: Array<{
    role: 'user' | 'assistant';
    content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
  }>;
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        { type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
      ],
    },
  ],
});
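To build the data URL from a user upload, convert the bytes to base64 first. A minimal sketch, assuming the image arrives as a multipart form field named image (the field name and PNG fallback are illustrative):

async function imageToDataUrl(request: Request): Promise<string> {
  const form = await request.formData();
  const file = form.get('image') as File;
  const bytes = new Uint8Array(await file.arrayBuffer());

  // Build a binary string, then base64-encode it with btoa
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return `data:${file.type || 'image/png'};base64,${btoa(binary)}`;
}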
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|-------|----------|------------|------|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|-------|-----------|----------|------------|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
Image Generation
| Model | Best For | Rate Limit | Speed |
|-------|----------|------------|-------|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
Vision Models
| Model | Best For | Rate Limit |
|-------|----------|------------|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
Common Patterns
Pattern 1: Chat Completion with History
app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{
    messages: Array<{ role: string; content: string }>;
  }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages,
    stream: true,
  });

  return new Response(response, {
    headers: { 'content-type': 'text/event-stream' },
  });
});
Pattern 2: RAG (Retrieval Augmented Generation)
// Step 1: Generate embeddings for the user query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [userQuery],
});
const vector = embeddings.data[0];

// Step 2: Search Vectorize
const matches = await env.VECTORIZE.query(vector, { topK: 3 });

// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    {
      role: 'system',
      content: `Answer using this context:\n${context}`,
    },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(response, {
  headers: { 'content-type': 'text/event-stream' },
});
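The ingestion side of the same pipeline looks like this - embed the documents and upsert them into Vectorize so Step 2 has something to query. A sketch; the docs array and metadata shape are illustrative assumptions:

const docs = [{ id: 'doc-1', text: 'Cloudflare Workers run on a global edge network.' }];

const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: docs.map((d) => d.text),
});

await env.VECTORIZE.upsert(
  docs.map((d, i) => ({
    id: d.id,
    values: data[i],
    metadata: { text: d.text }, // stored so matches can rebuild the context
  }))
);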
Pattern 3: Structured Output with Zod
import { z } from 'zod';

const RecipeSchema = z.object({
  name: z.string(),
  ingredients: z.array(z.string()),
  instructions: z.array(z.string()),
  prepTime: z.number(),
});

app.post('/recipe', async (c) => {
  const { dish } = await c.req.json<{ dish: string }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      {
        role: 'user',
        content: `Generate a recipe for ${dish}. Return ONLY valid JSON matching this schema: ${JSON.stringify(RecipeSchema.shape)}`,
      },
    ],
  });

  // Parse and validate
  const recipe = RecipeSchema.parse(JSON.parse(response.response));
  return c.json(recipe);
});
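Models sometimes wrap JSON output in markdown code fences, which makes JSON.parse throw. A hedged hardening sketch for the route above: strip fences first, then use safeParse to surface a typed validation error instead of an exception:

function extractJson(raw: string): unknown {
  const cleaned = raw
    .replace(/^```(?:json)?\s*/i, '')
    .replace(/\s*```$/, '')
    .trim();
  return JSON.parse(cleaned);
}

const parsed = RecipeSchema.safeParse(extractJson(response.response));
if (!parsed.success) {
  // Retry logic or a 422 response could go here
  throw new Error(`Model returned invalid recipe JSON: ${parsed.error.message}`);
}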
Pattern 4: Image Generation + R2 Storage
app.post('/generate-image', async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  // Generate image (flux-1-schnell returns a base64-encoded JPEG)
  const result = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt,
  });
  const imageBytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));

  // Store in R2
  const key = `images/${Date.now()}.jpg`;
  await c.env.BUCKET.put(key, imageBytes, {
    httpMetadata: { contentType: 'image/jpeg' },
  });

  return c.json({
    success: true,
    url: `https://your-domain.com/${key}`,
  });
});
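A companion route can serve the stored object back from R2 rather than relying on a hard-coded domain. A sketch, reusing the BUCKET binding from above:

app.get('/images/:key', async (c) => {
  const object = await c.env.BUCKET.get(`images/${c.req.param('key')}`);
  if (!object) return c.notFound();

  return new Response(object.body, {
    headers: { 'content-type': object.httpMetadata?.contentType ?? 'image/jpeg' },
  });
});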
AI Gateway Integration
AI Gateway provides caching, logging, and analytics for AI requests.
Setup:
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  {
    gateway: {
      id: 'my-gateway', // Your gateway ID
      skipCache: false, // Use cache
    },
  }
);
Benefits:
- ✅ Cost Tracking - Monitor neurons usage per request
- ✅ Caching - Reduce duplicate inference costs
- ✅ Logging - Debug and analyze AI requests
- ✅ Rate Limiting - Additional layer of protection
- ✅ Analytics - Request patterns and performance
Access Gateway Logs:
const gateway = env.AI.gateway('my-gateway');
// aiGatewayLogId is populated after a run routed through the gateway
const logId = env.AI.aiGatewayLogId;

// Send feedback (1 = positive, -1 = negative)
await gateway.patchLog(logId, {
  feedback: 1,
  metadata: { comment: 'Great response' },
});
Rate Limits & Pricing
Rate Limits (per minute)
| Task Type | Default Limit | Notes |
|-----------|---------------|-------|
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
| Speech Recognition | 720/min | Whisper models |
Pricing (Neurons-Based)
Free Tier:
- 10,000 neurons per day
- Resets daily at 00:00 UTC
Paid Tier:
- $0.011 per 1,000 neurons
- 10,000 neurons/day included
- Unlimited usage above free allocation
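Worked example: a day that consumes 1,000,000 neurons costs roughly (1,000,000 - 10,000) / 1,000 × $0.011 ≈ $10.89, since the first 10,000 neurons are included.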
Example Costs:
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|--------------------|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
Production Checklist
Before Deploying
- [ ] Enable AI Gateway for cost tracking and logging
- [ ] Implement streaming for all text generation endpoints
- [ ] Add rate limit retry with exponential backoff
- [ ] Validate input length to prevent token limit errors
- [ ] Set appropriate CPU limits (Workers paid plans default to 30s CPU time, configurable up to 5 min)
- [ ] Monitor neurons usage in Cloudflare dashboard
- [ ] Test error handling for model unavailable, rate limits
- [ ] Add input sanitization to prevent prompt injection
- [ ] Configure CORS if using from browser
- [ ] Plan for scale - upgrade to Paid plan if needed
Error Handling
async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError: Error | undefined;

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;
      const message = lastError.message.toLowerCase();

      // Rate limit - retry with exponential backoff
      if (message.includes('429') || message.includes('rate limit')) {
        const delay = Math.pow(2, i) * 1000;
        await new Promise((resolve) => setTimeout(resolve, delay));
        continue;
      }

      // Other errors - throw immediately
      throw error;
    }
  }

  throw lastError!;
}
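Usage is a drop-in replacement for a direct call:

const response = await runAIWithRetry(env, '@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Summarize this ticket.' }],
});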
Monitoring
app.use('*', async (c, next) => {
  const start = Date.now();
  await next();

  // Log AI usage
  console.log({
    path: c.req.path,
    duration: Date.now() - start,
    logId: c.env.AI.aiGatewayLogId,
  });
});
OpenAI Compatibility
Workers AI supports OpenAI-compatible endpoints.
Using OpenAI SDK:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Chat completions
const completion = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

// Embeddings
const embeddings = await openai.embeddings.create({
  model: '@cf/baai/bge-base-en-v1.5',
  input: 'Hello world',
});
Endpoints:
- /v1/chat/completions - Text generation
- /v1/embeddings - Text embeddings
Vercel AI SDK Integration
npm install workers-ai-provider ai
import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate text
const result = await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});

// Stream text
const stream = streamText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Tell me a story',
});
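In a Worker, the streamed result can be returned directly; the exact helper method depends on your ai package version (check its docs):

// e.g. with recent versions of the ai SDK:
return stream.toTextStreamResponse();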
Limits Summary
| Feature | Limit |
|---------|-------|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |