# Workers AI
Cloudflare Workers AI for running inference at the edge, covering supported models, text generation, embeddings, image generation, speech-to-text, AI bindings, and streaming responses.
You are an expert in Cloudflare Workers AI, which enables running machine learning models directly on Cloudflare's global network without managing GPUs, containers, or model serving infrastructure.
## Key Points
- **No infrastructure**: No GPUs to manage, no model weights to deploy.
- **Global availability**: Models run on Cloudflare's network, close to users.
- **Simple API**: One binding, one method call per inference.
- **Streaming support**: LLM responses can be streamed token-by-token.
- **Free tier**: 10,000 neurons per day at no cost (roughly 100-300 LLM requests, depending on model and token count).
- **Paid**: $0.011 per 1,000 neurons. Neuron cost varies by model and input/output size.
- **No minimum commitment**: Pay only for what you use.
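To make the pricing concrete, here is a back-of-envelope estimator. This is a sketch, not an official calculator: the function name is ours, and it assumes the 10,000-neuron daily free allowance is deducted before billing.

```typescript
// Estimate daily Workers AI cost at the published rate of $0.011 per 1,000
// neurons, assuming the 10,000-neuron/day free allowance applies first.
function estimateDailyCostUSD(neuronsPerDay: number): number {
  const FREE_NEURONS_PER_DAY = 10_000;
  const PRICE_PER_1K_NEURONS = 0.011; // USD
  const billable = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  return (billable / 1_000) * PRICE_PER_1K_NEURONS;
}

// 50,000 neurons/day: 40,000 billable, about $0.44/day
console.log(estimateDailyCostUSD(50_000));
```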
## Quick Example
```toml
[ai]
binding = "AI"
```
```typescript
export interface Env {
  AI: Ai;
}
```
## Core Philosophy

### Overview

Workers AI provides a catalog of pre-deployed models accessible through a simple binding API. Models run on Cloudflare's GPU fleet and are available from every data center. You pay per inference (or use the free tier) with no cold starts or provisioned capacity. Workers AI supports text generation (LLMs), text embeddings, image generation, image classification, speech-to-text, translation, and more.
## Setup

### Bind in wrangler.toml

```toml
[ai]
binding = "AI"
```

### TypeScript binding

```typescript
export interface Env {
  AI: Ai;
}
```
## Text Generation (LLMs)

### Basic chat completion

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
    });

    return Response.json(response);
    // { response: "The generated text..." }
  },
};
```
### Streaming responses

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: prompt },
      ],
      stream: true,
    });

    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```
The stream returns Server-Sent Events (SSE) in this format:

```
data: {"response":"Hello"}
data: {"response":" there"}
data: {"response":"!"}
data: [DONE]
```
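If you buffer the stream instead of proxying it to the client, the events can be parsed back into text. A minimal sketch, assuming the event shape shown above; the `extractTokens` helper is ours, not part of the Workers AI API:

```typescript
// Parse a buffered SSE payload of the shape shown above back into tokens.
function extractTokens(ssePayload: string): string[] {
  const tokens: string[] = [];
  for (const line of ssePayload.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines and other fields
    const data = line.slice("data: ".length).trim();
    if (data === "[DONE]") break; // end-of-stream sentinel
    const event = JSON.parse(data) as { response?: string };
    if (typeof event.response === "string") tokens.push(event.response);
  }
  return tokens;
}

const payload =
  'data: {"response":"Hello"}\n' +
  'data: {"response":" there"}\n' +
  'data: {"response":"!"}\n' +
  "data: [DONE]\n";
console.log(extractTokens(payload).join("")); // prints "Hello there!"
```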
### Advanced generation parameters

```typescript
const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [
    { role: "system", content: "You are a concise technical writer." },
    { role: "user", content: "Explain WebSockets in 3 sentences." },
  ],
  max_tokens: 256,
  temperature: 0.7, // 0.0 = deterministic, 2.0 = very creative
  top_p: 0.9,
  top_k: 40,
  repetition_penalty: 1.1,
});
```
### Conversation with history

```typescript
async function chat(
  env: Env,
  history: Array<{ role: string; content: string }>,
  userMessage: string,
) {
  const messages = [
    { role: "system", content: "You are a helpful coding assistant." },
    ...history,
    { role: "user", content: userMessage },
  ];

  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages });

  // Append to history for next turn
  history.push({ role: "user", content: userMessage });
  history.push({ role: "assistant", content: result.response });

  return result.response;
}
```
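Unbounded history will eventually exceed the model's context window. A simple mitigation is to keep only the most recent turns before building `messages`; this sketch uses an arbitrary message cap (a token-based budget would be more precise):

```typescript
// Keep only the most recent messages so the conversation stays within the
// model's context window. The cap of 20 is illustrative, not a model limit.
function trimHistory<T>(history: T[], maxMessages = 20): T[] {
  return history.length <= maxMessages
    ? history
    : history.slice(history.length - maxMessages);
}
```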
## Text Embeddings

Generate vector embeddings for semantic search, RAG, and similarity matching:

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { texts } = await request.json<{ texts: string[] }>();

    const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: texts,
    });

    return Response.json({
      shape: embeddings.shape, // [n, 768]
      data: embeddings.data,   // number[][]
    });
  },
};
```
### Embedding a single query for search

```typescript
async function embedQuery(env: Env, query: string): Promise<number[]> {
  const result = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: [query],
  });
  return result.data[0];
}
```
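Once you have embeddings, similarity matching typically means cosine similarity between vectors. A self-contained sketch for illustration; in practice a vector database such as Vectorize computes this for you:

```typescript
// Cosine similarity between two equal-length embedding vectors
// (e.g. the 768-dimensional output of bge-base-en-v1.5).
// Returns 1 for identical direction, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // prints 1
console.log(cosineSimilarity([1, 0], [0, 1])); // prints 0
```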
### RAG pattern (Retrieval-Augmented Generation)

```typescript
async function ragQuery(env: Env, question: string): Promise<string> {
  // 1. Embed the question
  const queryEmbedding = await embedQuery(env, question);

  // 2. Search Vectorize for relevant documents
  const matches = await env.VECTORIZE.query(queryEmbedding, { topK: 5 });

  // 3. Retrieve the actual document text
  const contexts: string[] = [];
  for (const match of matches.matches) {
    const doc = await env.KV.get(`doc:${match.id}`);
    if (doc) contexts.push(doc);
  }

  // 4. Generate answer with context
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: `Answer the user's question based on the following context:\n\n${contexts.join("\n\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  return result.response;
}
```
## Image Generation

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const image = await env.AI.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
      prompt,
      num_steps: 20,
    });

    return new Response(image, {
      headers: { "content-type": "image/png" },
    });
  },
};
```
### Image-to-text (captioning)

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const imageData = await request.arrayBuffer();

    const result = await env.AI.run("@cf/llava-hf/llava-1.5-7b-hf", {
      image: [...new Uint8Array(imageData)],
      prompt: "Describe this image in detail.",
      max_tokens: 512,
    });

    return Response.json(result);
  },
};
```
## Speech-to-Text

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const audioData = await request.arrayBuffer();

    const result = await env.AI.run("@cf/openai/whisper", {
      audio: [...new Uint8Array(audioData)],
    });

    return Response.json({
      text: result.text,
      // word-level timestamps if available
      words: result.words,
    });
  },
};
```
## Text Classification and Sentiment

```typescript
const result = await env.AI.run("@cf/huggingface/distilbert-sst-2-int8", {
  text: "This product is amazing, I love it!",
});
// result: [{ label: "POSITIVE", score: 0.9998 }, { label: "NEGATIVE", score: 0.0002 }]
```
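The result is an array of label/score pairs, so picking the winner is a small reduction. A sketch; `topLabel` and the `Classification` interface are our helpers for the result shape shown above, not part of the Workers AI API:

```typescript
// Pick the highest-scoring label from a classification result of the
// [{ label, score }, ...] shape shown above.
interface Classification {
  label: string;
  score: number;
}

function topLabel(results: Classification[]): Classification {
  return results.reduce((best, cur) => (cur.score > best.score ? cur : best));
}

const sentiment = topLabel([
  { label: "POSITIVE", score: 0.9998 },
  { label: "NEGATIVE", score: 0.0002 },
]);
console.log(sentiment.label); // prints "POSITIVE"
```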
## Translation

```typescript
const result = await env.AI.run("@cf/meta/m2m100-1.2b", {
  text: "Hello, how are you?",
  source_lang: "en",
  target_lang: "fr",
});
// result: { translated_text: "Bonjour, comment allez-vous?" }
```
## Available Model Categories

| Category | Example Model | Use Case |
|---|---|---|
| Text Generation | `@cf/meta/llama-3.1-8b-instruct` | Chat, summarization, code |
| Text Embeddings | `@cf/baai/bge-base-en-v1.5` | Semantic search, RAG |
| Image Generation | `@cf/stabilityai/stable-diffusion-xl-base-1.0` | Image creation |
| Speech-to-Text | `@cf/openai/whisper` | Audio transcription |
| Translation | `@cf/meta/m2m100-1.2b` | Language translation |
| Text Classification | `@cf/huggingface/distilbert-sst-2-int8` | Sentiment, labels |
| Image Classification | `@cf/microsoft/resnet-50` | Image labeling |
| Object Detection | `@cf/facebook/detr-resnet-50` | Find objects in images |
## Function Calling Pattern

Implement tool use / function calling with supported models:

```typescript
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string", description: "City name" },
        },
        required: ["location"],
      },
    },
  },
];

const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools,
});

// Check if the model wants to call a tool
if (result.tool_calls) {
  for (const call of result.tool_calls) {
    if (call.name === "get_weather") {
      const weather = await fetchWeather(call.arguments.location);
      // Feed result back to the model for final answer
    }
  }
}
```
## Error Handling

```typescript
try {
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: prompt }],
  });
  return Response.json(result);
} catch (err) {
  if (err instanceof Error) {
    // Common errors:
    // - "Model not found" — check the model name
    // - "Input too long" — reduce input tokens
    // - "Rate limited" — too many requests
    console.error("AI error:", err.message);
  }
  return Response.json({ error: "AI inference failed" }, { status: 500 });
}
```
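For transient failures such as rate limits, a retry wrapper can help. A sketch under our own assumptions: the helper and its backoff constants are illustrative, not a Workers AI recommendation, and retries add latency and neuron cost, so keep attempts low:

```typescript
// Retry an async operation with exponential backoff (250ms, 500ms, 1000ms, ...).
// Rethrows the last error once all attempts are exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```

Usage: `const result = await withRetry(() => env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages }));`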
## Related Skills

- **Durable Objects**: Cloudflare Durable Objects for stateful edge computing, covering constructor patterns, storage API, WebSocket support, alarm handlers, consistency guarantees, and use cases like rate limiting, collaboration, and game state.
- **Workers D1**: Cloudflare D1 serverless SQLite database for Workers, covering schema management, migrations, queries, prepared statements, batch operations, local development, replication, backups, and performance optimization.
- **Workers Fundamentals**: Cloudflare Workers runtime fundamentals including V8 isolates, wrangler CLI, project setup, local development, deployment, environment variables, secrets, and compatibility dates.
- **Workers KV**: Cloudflare Workers KV namespace for globally distributed key-value storage, including read/write patterns, caching strategies, TTL, list operations, metadata, bulk operations, and the eventual consistency model.
- **Workers Patterns**: Production patterns for Cloudflare Workers including queue consumers, cron triggers, email workers, browser rendering, Hyperdrive database connection pooling, Vectorize vector search, and the analytics engine.
- **Workers R2**: Cloudflare R2 object storage with S3-compatible API, covering bucket operations, multipart uploads, presigned URLs, public buckets, lifecycle rules, event notifications, and cost optimization compared to S3.