
Workers AI

Cloudflare Workers AI for running inference at the edge, covering supported models, text generation, embeddings, image generation, speech-to-text, AI bindings, and streaming responses.


# Workers AI — Cloudflare Workers

You are an expert in Cloudflare Workers AI, which enables running machine learning models directly on Cloudflare's global network without managing GPUs, containers, or model serving infrastructure.

## Overview

Workers AI provides a catalog of pre-deployed models accessible through a simple binding API. Models run on Cloudflare's GPU fleet and are available from every data center. You pay per inference (or use the free tier) with no cold starts or provisioned capacity. Workers AI supports text generation (LLMs), text embeddings, image generation, image classification, speech-to-text, translation, and more.

### Key benefits

- **No infrastructure**: No GPUs to manage, no model weights to deploy.
- **Global availability**: Models run on Cloudflare's network, close to users.
- **Simple API**: One binding, one method call per inference.
- **Streaming support**: LLM responses can be streamed token-by-token.
- **Free tier**: 10,000 neurons per day at no cost.

## Setup

### Bind in wrangler.toml

```toml
[ai]
binding = "AI"
```

### TypeScript binding

```typescript
export interface Env {
  AI: Ai;
}
```

## Text Generation (LLMs)

### Basic chat completion

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
    });

    return Response.json(response);
    // { response: "The generated text..." }
  },
};
```

### Streaming responses

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
      stream: true,
    });

    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```

The stream returns Server-Sent Events (SSE) in this format:

```
data: {"response":"Hello"}
data: {"response":" there"}
data: {"response":"!"}
data: [DONE]
```
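
A non-browser client has to reassemble the streamed tokens itself. Below is a minimal sketch of parsing this SSE format once the body has been decoded to text; the helper name `collectSseText` is illustrative, not part of the Workers AI API.

```typescript
// Parse the SSE lines emitted by Workers AI streaming and concatenate
// the "response" fragments into a single string.
function collectSseText(raw: string): string {
  let text = "";
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    try {
      const event = JSON.parse(payload) as { response?: string };
      if (event.response) text += event.response;
    } catch {
      // ignore malformed or partial chunks
    }
  }
  return text;
}
```

In a real client you would feed decoded chunks from `response.body` through this incrementally, buffering any partial line until its newline arrives.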

### Advanced generation parameters

```typescript
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "Explain WebSockets in 3 sentences." },
  ],
  max_tokens: 256,
  temperature: 0.7, // 0.0 = deterministic, 2.0 = very creative
  top_p: 0.9,
  top_k: 40,
  repetition_penalty: 1.1,
});
```

### Conversation with history

```typescript
async function chat(env: Env, history: Array<{ role: string; content: string }>, userMessage: string) {
  const messages = [
    { role: "system", content: "You are a helpful coding assistant." },
    ...history,
    { role: "user", content: userMessage },
  ];

  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages });

  // Append to history for next turn
  history.push({ role: "user", content: userMessage });
  history.push({ role: "assistant", content: result.response });

  return result.response;
}
```
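
Because history grows on every turn, a long conversation will eventually exceed the model's context window. One simple mitigation, sketched below, is to keep only the most recent messages that fit a rough character budget; character count is a crude stand-in for tokens, and the `maxChars` default is an illustrative value, not a Workers AI limit.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Keep the most recent messages whose combined content length fits a
// rough character budget (an approximation of the model's token limit).
function trimHistory(history: ChatMessage[], maxChars = 8000): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk backwards so the newest turns are kept first.
  for (let i = history.length - 1; i >= 0; i--) {
    used += history[i].content.length;
    if (used > maxChars) break;
    kept.unshift(history[i]);
  }
  return kept;
}
```

Call `trimHistory(history)` before building the `messages` array so the system prompt plus retained turns stay within the model's context.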

## Text Embeddings

Generate vector embeddings for semantic search, RAG, and similarity matching:

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { texts } = await request.json<{ texts: string[] }>();

    const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: texts,
    });

    return Response.json({
      shape: embeddings.shape, // [n, 768]
      data: embeddings.data,   // number[][]
    });
  },
};
```

### Embedding a single query for search

```typescript
async function embedQuery(env: Env, query: string): Promise<number[]> {
  const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  return result.data[0];
}
```
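
If you store embeddings yourself instead of using Vectorize, you can rank documents against a query vector with cosine similarity. A minimal sketch (standard formula, nothing Workers-AI-specific):

```typescript
// Cosine similarity between two embedding vectors of equal length:
// dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Sort candidate documents by `cosineSimilarity(queryEmbedding, docEmbedding)` descending to get a relevance ranking.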

### RAG pattern (Retrieval-Augmented Generation)

```typescript
// Assumes Env also exposes VECTORIZE (a Vectorize index binding)
// and KV (a KV namespace binding) alongside AI.
async function ragQuery(env: Env, question: string): Promise<string> {
  // 1. Embed the question
  const queryEmbedding = await embedQuery(env, question);

  // 2. Search Vectorize for relevant documents
  const matches = await env.VECTORIZE.query(queryEmbedding, { topK: 5 });

  // 3. Retrieve the actual document text
  const contexts: string[] = [];
  for (const match of matches.matches) {
    const doc = await env.KV.get(`doc:${match.id}`);
    if (doc) contexts.push(doc);
  }

  // 4. Generate answer with context
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer the user's question based on the following context:\n\n${contexts.join("\n\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  return result.response;
}
```

## Image Generation

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const image = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
      prompt,
      num_steps: 20,
    });

    return new Response(image, {
      headers: { "content-type": "image/png" },
    });
  },
};
```

### Image-to-text (captioning)

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const imageData = await request.arrayBuffer();

    const result = await env.AI.run("@cf/llava-hf/llava-1.5-7b-hf", {
      image: [...new Uint8Array(imageData)],
      prompt: "Describe this image in detail.",
      max_tokens: 512,
    });

    return Response.json(result);
  },
};
```

## Speech-to-Text

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const audioData = await request.arrayBuffer();

    const result = await env.AI.run("@cf/openai/whisper", {
      audio: [...new Uint8Array(audioData)],
    });

    return Response.json({
      text: result.text,
      // word-level timestamps if available
      words: result.words,
    });
  },
};
```

## Text Classification and Sentiment

```typescript
const result = await env.AI.run("@cf/huggingface/distilbert-sst-2-int8", {
  text: "This product is amazing, I love it!",
});
// result: [{ label: "POSITIVE", score: 0.9998 }, { label: "NEGATIVE", score: 0.0002 }]
```
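
Since the result is an array of label/score pairs, extracting the winning label is a one-liner. A small sketch (`topLabel` is an illustrative helper name):

```typescript
interface Classification {
  label: string;
  score: number;
}

// Return the entry with the highest score from a classification result.
function topLabel(results: Classification[]): Classification {
  return results.reduce((best, cur) => (cur.score > best.score ? cur : best));
}
```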

## Translation

```typescript
const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
  text: "Hello, how are you?",
  source_lang: "en",
  target_lang: "fr",
});
// result: { translated_text: "Bonjour, comment allez-vous?" }
```

## Available Model Categories

| Category | Example Model | Use Case |
| --- | --- | --- |
| Text Generation | `@cf/meta/llama-3.1-8b-instruct` | Chat, summarization, code |
| Text Embeddings | `@cf/baai/bge-base-en-v1.5` | Semantic search, RAG |
| Image Generation | `@cf/stabilityai/stable-diffusion-xl-base-1.0` | Image creation |
| Speech-to-Text | `@cf/openai/whisper` | Audio transcription |
| Translation | `@cf/meta/m2m100-1.2b` | Language translation |
| Text Classification | `@cf/huggingface/distilbert-sst-2-int8` | Sentiment, labels |
| Image Classification | `@cf/microsoft/resnet-50` | Image labeling |
| Object Detection | `@cf/facebook/detr-resnet-50` | Find objects in images |

## Function Calling Pattern

Implement tool use / function calling with supported models:

```typescript
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string", description: "City name" },
        },
        required: ["location"],
      },
    },
  },
];

const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools,
});

// Check if the model wants to call a tool
if (result.tool_calls) {
  for (const call of result.tool_calls) {
    if (call.name === "get_weather") {
      const weather = await fetchWeather(call.arguments.location);
      // Feed result back to the model for final answer
    }
  }
}
```
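
Feeding the tool result back means appending it to the message list and calling `env.AI.run` again. The exact message shape expected for tool results varies by model; the sketch below uses a `"tool"` role as an assumption you should verify against the model you deploy, and `withToolResult` is an illustrative helper, not a Workers AI API.

```typescript
interface Message {
  role: string;
  content: string;
}

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Build the follow-up message list after executing a tool call so the
// model can produce its final answer. The "tool" role is an assumption;
// check the message format expected by the specific model you use.
function withToolResult(
  messages: Message[],
  call: ToolCall,
  result: unknown,
): Message[] {
  return [
    ...messages,
    {
      role: "assistant",
      content: `Calling ${call.name}(${JSON.stringify(call.arguments)})`,
    },
    { role: "tool", content: JSON.stringify(result) },
  ];
}
```

You would then run `env.AI.run(model, { messages: withToolResult(messages, call, weather) })` to get the model's final natural-language answer.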

## Error Handling

```typescript
try {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: prompt }],
  });
  return Response.json(result);
} catch (err) {
  if (err instanceof Error) {
    // Common errors:
    // - "Model not found" — check the model name
    // - "Input too long" — reduce input tokens
    // - "Rate limited" — too many requests
    console.error("AI error:", err.message);
  }
  return Response.json({ error: "AI inference failed" }, { status: 500 });
}
```
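
Rate-limit and other transient failures are often worth retrying. A generic exponential-backoff wrapper, sketched below with illustrative attempt counts and delays (not values prescribed by Workers AI):

```typescript
// Retry an async operation with exponential backoff. Intended for
// transient failures such as rate limiting; delays double per attempt.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // Wait baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}
```

Usage: `const result = await withRetry(() => env.AI.run(model, { messages }));`. For non-transient errors such as a bad model name, retrying only adds latency, so you may want to inspect the error message before retrying.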

## Pricing

- **Free tier**: 10,000 neurons/day (roughly 100-300 LLM requests, depending on model and token count).
- **Paid**: $0.011 per 1,000 neurons; neuron cost varies by model and input/output size.
- **No minimum commitment**: pay only for what you use.
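
To make the arithmetic concrete, here is a sketch of estimating monthly spend from these numbers. The per-request neuron count is something you have to estimate for your own model and payload sizes; only the $0.011 price and 10,000-neuron daily allowance come from the pricing above.

```typescript
// Estimated monthly cost in USD at $0.011 per 1,000 neurons, after the
// free 10,000-neuron daily allowance. Neuron usage per request varies
// by model and input/output size; the value you pass is an estimate.
function estimateMonthlyCostUsd(
  neuronsPerRequest: number,
  requestsPerDay: number,
  freeNeuronsPerDay = 10_000,
  pricePer1kNeurons = 0.011,
  daysPerMonth = 30,
): number {
  const dailyNeurons = neuronsPerRequest * requestsPerDay;
  const billableDaily = Math.max(0, dailyNeurons - freeNeuronsPerDay);
  return (billableDaily / 1000) * pricePer1kNeurons * daysPerMonth;
}
```

For example, at an assumed 50 neurons per request and 1,000 requests/day, 40,000 of the 50,000 daily neurons are billable: 40 x $0.011 = $0.44/day, about $13.20/month; at 100 requests/day the same workload stays inside the free tier.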
