Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-10-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
  "ai": {
    "binding": "AI"
  }
}
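After adding the binding, run wrangler types to regenerate your Env typings (or install @cloudflare/workers-types) so the Ai binding type used below resolves.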
2. Run Your First Model
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });
    return Response.json(response);
  },
};
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
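When the client should receive plain text rather than raw server-sent events, the stream can be re-shaped inside the Worker. A minimal sketch, assuming each SSE line has the form data: {"response":"<token>"} and the stream terminates with data: [DONE] (verify against the current Workers AI docs):

function sseToText(stream: ReadableStream<Uint8Array>): ReadableStream<Uint8Array> {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  let buffer = '';

  return stream.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        buffer += decoder.decode(chunk, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // keep any partial line for the next chunk
        for (const line of lines) {
          if (!line.startsWith('data: ') || line === 'data: [DONE]') continue;
          const payload = JSON.parse(line.slice(6)) as { response?: string };
          if (payload.response) controller.enqueue(encoder.encode(payload.response));
        }
      },
    })
  );
}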
Workers AI API Reference
env.AI.run()
Run an AI model inference.
Signature:
async env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>
Parameters:
- model (string, required) - Model ID (e.g., @cf/meta/llama-3.1-8b-instruct)
- inputs (object, required) - Model-specific inputs
- options (object, optional) - Additional options:
  - gateway (object) - AI Gateway configuration
    - id (string) - Gateway ID
    - skipCache (boolean) - Skip AI Gateway cache
Returns:
- Non-streaming: Promise<ModelOutput> - JSON response
- Streaming: ReadableStream - Server-sent events stream
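Because the return type is a union, narrow it at runtime before using the result. A short sketch:

const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  prompt: 'Hello',
});

if (result instanceof ReadableStream) {
  // Streaming was requested - pass the SSE stream through
  return new Response(result, {
    headers: { 'content-type': 'text/event-stream' },
  });
}

// Otherwise it's a plain JSON model output
return Response.json(result);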
Text Generation Models
Input Format:
{
  messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
  prompt?: string; // Deprecated, use messages
  stream?: boolean; // Default: false
  max_tokens?: number; // Max tokens to generate
  temperature?: number; // 0.0-1.0, default varies by model
  top_p?: number; // 0.0-1.0
  top_k?: number;
}
Output Format (Non-Streaming):
{
  response: string; // Generated text
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is TypeScript?' },
  ],
  stream: false,
});

console.log(response.response);
Text Embeddings Models
Input Format:
{
  text: string | string[]; // Single text or array of texts
}
Output Format:
{
  shape: number[]; // [batch_size, embedding_dimensions]
  data: number[][]; // Array of embedding vectors
}
Example:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Hello world', 'Cloudflare Workers'],
});

console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
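Continuing the example above, the returned vectors can be compared directly, e.g. with cosine similarity, when a full vector database is unnecessary:

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// ~1.0 = near-identical meaning, ~0.0 = unrelated
const similarity = cosineSimilarity(embeddings.data[0], embeddings.data[1]);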
Image Generation Models
Input Format:
{
  prompt: string; // Text description
  num_steps?: number; // Default: 20
  guidance?: number; // CFG scale, default: 7.5
  strength?: number; // For img2img, default: 1.0
  image?: number[]; // For img2img: image bytes as an array of 8-bit integers
}
Output Format:
- Stable Diffusion models: binary image data (PNG)
- @cf/black-forest-labs/flux-1-schnell: JSON with a base64-encoded JPEG in its image field
Example:
const result = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});

// flux-1-schnell returns { image: string }, base64-encoded
const bytes = Uint8Array.from(atob(result.image), (c) => c.charCodeAt(0));
return new Response(bytes, {
  headers: { 'content-type': 'image/jpeg' },
});
Vision Models
Input Format:
{
  messages: Array<{
    role: 'user' | 'assistant';
    content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
  }>;
}
Example:
const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        { type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
      ],
    },
  ],
});
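To build the data URL from a user upload, convert the bytes to base64 first. A minimal sketch, assuming the image arrives as a multipart form field named image (the field name and PNG fallback are illustrative):

async function imageToDataUrl(request: Request): Promise<string> {
  const form = await request.formData();
  const file = form.get('image') as File;
  const bytes = new Uint8Array(await file.arrayBuffer());

  // Build a binary string, then base64-encode it with btoa
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return `data:${file.type || 'image/png'};base64,${btoa(binary)}`;
}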
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|-------|----------|------------|------|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|-------|-----------|----------|------------|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
Image Generation
| Model | Best For | Rate Limit | Speed |
|-------|----------|------------|-------|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
Vision Models
| Model | Best For | Rate Limit |
|-------|----------|------------|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
Common Patterns
Pattern 1: Chat Completion with History
app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{
    messages: Array<{ role: string; content: string }>;
  }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages,
    stream: true,
  });

  return new Response(response, {
    headers: { 'content-type': 'text/event-stream' },
  });
});
Pattern 2: RAG (Retrieval Augmented Generation)
// Step 1: Generate embeddings for the user query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [userQuery],
});
const vector = embeddings.data[0];

// Step 2: Search Vectorize
const matches = await env.VECTORIZE.query(vector, { topK: 3 });

// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    {
      role: 'system',
      content: `Answer using this context:\n${context}`,
    },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(response, {
  headers: { 'content-type': 'text/event-stream' },
});
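The ingestion side of the same pipeline looks like this - embed the documents and upsert them into Vectorize so Step 2 has something to query. A sketch; the docs array and metadata shape are illustrative assumptions:

const docs = [{ id: 'doc-1', text: 'Cloudflare Workers run on a global edge network.' }];

const { data } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: docs.map((d) => d.text),
});

await env.VECTORIZE.upsert(
  docs.map((d, i) => ({
    id: d.id,
    values: data[i],
    metadata: { text: d.text }, // stored so matches can rebuild the context
  }))
);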
Pattern 3: Structured Output with Zod
import { z } from 'zod';

const RecipeSchema = z.object({
  name: z.string(),
  ingredients: z.array(z.string()),
  instructions: z.array(z.string()),
  prepTime: z.number(),
});

app.post('/recipe', async (c) => {
  const { dish } = await c.req.json<{ dish: string }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      {
        role: 'user',
        content: `Generate a recipe for ${dish}. Return ONLY valid JSON matching this schema: ${JSON.stringify(RecipeSchema.shape)}`,
      },
    ],
  });

  // Parse and validate
  const recipe = RecipeSchema.parse(JSON.parse(response.response));
  return c.json(recipe);
});
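Models sometimes wrap JSON output in markdown code fences, which makes JSON.parse throw. A hedged hardening sketch for the route above: strip fences first, then use safeParse to surface a typed validation error instead of an exception:

function extractJson(raw: string): unknown {
  const cleaned = raw
    .replace(/^```(?:json)?\s*/i, '')
    .replace(/\s*```$/, '')
    .trim();
  return JSON.parse(cleaned);
}

const parsed = RecipeSchema.safeParse(extractJson(response.response));
if (!parsed.success) {
  // Retry logic or a 422 response could go here
  throw new Error(`Model returned invalid recipe JSON: ${parsed.error.message}`);
}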
Pattern 4: Image Generation + R2 Storage
app.post('/generate-image', async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  // Generate image (flux-1-schnell returns a base64-encoded JPEG)
  const result = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt,
  });
  const imageBytes = Uint8Array.from(atob(result.image), (ch) => ch.charCodeAt(0));

  // Store in R2
  const key = `images/${Date.now()}.jpg`;
  await c.env.BUCKET.put(key, imageBytes, {
    httpMetadata: { contentType: 'image/jpeg' },
  });

  return c.json({
    success: true,
    url: `https://your-domain.com/${key}`,
  });
});
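A companion route can serve the stored object back from R2 rather than relying on a hard-coded domain. A sketch, reusing the BUCKET binding from above:

app.get('/images/:key', async (c) => {
  const object = await c.env.BUCKET.get(`images/${c.req.param('key')}`);
  if (!object) return c.notFound();

  return new Response(object.body, {
    headers: { 'content-type': object.httpMetadata?.contentType ?? 'image/jpeg' },
  });
});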
AI Gateway Integration
AI Gateway provides caching, logging, and analytics for AI requests.
Setup:
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  {
    gateway: {
      id: 'my-gateway', // Your gateway ID
      skipCache: false, // Use cache
    },
  }
);
Benefits:
- ✅ Cost Tracking - Monitor neurons usage per request
- ✅ Caching - Reduce duplicate inference costs
- ✅ Logging - Debug and analyze AI requests
- ✅ Rate Limiting - Additional layer of protection
- ✅ Analytics - Request patterns and performance
Access Gateway Logs:
const gateway = env.AI.gateway('my-gateway');
// aiGatewayLogId is populated after a run routed through the gateway
const logId = env.AI.aiGatewayLogId;

// Send feedback (1 = positive, -1 = negative)
await gateway.patchLog(logId, {
  feedback: 1,
  metadata: { comment: 'Great response' },
});
Rate Limits & Pricing
Rate Limits (per minute)
| Task Type | Default Limit | Notes |
|-----------|---------------|-------|
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
| Speech Recognition | 720/min | Whisper models |
Pricing (Neurons-Based)
Free Tier:
- 10,000 neurons per day
- Resets daily at 00:00 UTC
Paid Tier:
- $0.011 per 1,000 neurons
- 10,000 neurons/day included
- Unlimited usage above free allocation
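Worked example: a day that consumes 1,000,000 neurons costs roughly (1,000,000 - 10,000) / 1,000 × $0.011 ≈ $10.89, since the first 10,000 neurons are included.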
Example Costs:
| Model | Input (1M tokens) | Output (1M tokens) |
|-------|-------------------|--------------------|
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
Production Checklist
Before Deploying
- [ ] Enable AI Gateway for cost tracking and logging
- [ ] Implement streaming for all text generation endpoints
- [ ] Add rate limit retry with exponential backoff
- [ ] Validate input length to prevent token limit errors
- [ ] Set appropriate CPU limits (Workers paid plans default to 30s CPU time, configurable up to 5 min)
- [ ] Monitor neurons usage in Cloudflare dashboard
- [ ] Test error handling for model unavailable, rate limits
- [ ] Add input sanitization to prevent prompt injection
- [ ] Configure CORS if using from browser
- [ ] Plan for scale - upgrade to Paid plan if needed
Error Handling
async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError: Error | undefined;

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;
      const message = lastError.message.toLowerCase();

      // Rate limit - retry with exponential backoff
      if (message.includes('429') || message.includes('rate limit')) {
        const delay = Math.pow(2, i) * 1000;
        await new Promise((resolve) => setTimeout(resolve, delay));
        continue;
      }

      // Other errors - throw immediately
      throw error;
    }
  }

  throw lastError!;
}
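Usage is a drop-in replacement for a direct call:

const response = await runAIWithRetry(env, '@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Summarize this ticket.' }],
});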
Monitoring
app.use('*', async (c, next) => {
  const start = Date.now();
  await next();

  // Log AI usage
  console.log({
    path: c.req.path,
    duration: Date.now() - start,
    logId: c.env.AI.aiGatewayLogId,
  });
});
OpenAI Compatibility
Workers AI supports OpenAI-compatible endpoints.
Using OpenAI SDK:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Chat completions
const completion = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

// Embeddings
const embeddings = await openai.embeddings.create({
  model: '@cf/baai/bge-base-en-v1.5',
  input: 'Hello world',
});
Endpoints:
- /v1/chat/completions - Text generation
- /v1/embeddings - Text embeddings
Vercel AI SDK Integration
npm install workers-ai-provider ai
import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate text
const result = await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});

// Stream text
const stream = streamText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Tell me a story',
});
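In a Worker, the streamed result can be returned directly; the exact helper method depends on your ai package version (check its docs):

// e.g. with recent versions of the ai SDK:
return stream.toTextStreamResponse();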
Limits Summary
| Feature | Limit |
|---------|-------|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |