Fireworks AI
"Fireworks AI: fast inference, function calling, grammar mode, JSON output, OpenAI-compatible API, fine-tuning"
Core Philosophy
Fireworks AI provides fast, reliable inference for open-source and custom models with an OpenAI-compatible API. Its standout feature is grammar mode — constrained decoding that guarantees outputs match a JSON schema or regular expression at the token level. Use Fireworks when you need structured outputs from open-source models, fast inference with function calling, or fine-tuned model hosting. The API feels like OpenAI, so migration is straightforward. Fireworks excels at production workloads that demand both speed and output reliability.
Setup
Use the OpenAI SDK with the Fireworks base URL:
import OpenAI from "openai";
const fireworks = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY!,
baseURL: "https://api.fireworks.ai/inference/v1",
});
// Basic chat completion
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Write a TypeScript debounce function." },
],
max_tokens: 512,
temperature: 0.6,
});
console.log(response.choices[0].message.content);
Environment variables:
FIREWORKS_API_KEY=fw_...
Key Techniques
Streaming
const stream = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "user", content: "Explain event-driven architecture in detail." },
],
max_tokens: 1024,
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) process.stdout.write(content);
}
JSON Mode
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{
role: "system",
content: "Extract structured data from the text. Respond with valid JSON.",
},
{
role: "user",
content:
"Jane Doe, age 32, works at TechCorp as a senior engineer in San Francisco.",
},
],
response_format: { type: "json_object" },
max_tokens: 256,
});
const person = JSON.parse(response.choices[0].message.content!);
console.log(person);
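Note that JSON.parse throws a cryptic SyntaxError if the completion was cut off by max_tokens, so it is worth checking finish_reason before parsing. A minimal sketch (parseJsonCompletion is our own helper, not part of the SDK):

```typescript
// Hypothetical helper: fail loudly when a completion was truncated by
// max_tokens instead of surfacing a cryptic SyntaxError from JSON.parse.
function parseJsonCompletion<T>(choice: {
  finish_reason: string | null;
  message: { content: string | null };
}): T {
  if (choice.finish_reason === "length") {
    throw new Error("Completion truncated by max_tokens; raise the limit and retry.");
  }
  if (!choice.message.content) {
    throw new Error("Completion returned no content.");
  }
  return JSON.parse(choice.message.content) as T;
}
```

Usage with the response above: `const person = parseJsonCompletion<{ name: string }>(response.choices[0]);`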
Grammar Mode (Constrained JSON Schema)
// Grammar mode guarantees the output matches the schema at the token level
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{
role: "user",
content: "Extract: The iPhone 15 Pro costs $999, has a 6.1-inch display, and comes in titanium.",
},
],
response_format: {
type: "json_object",
schema: {
type: "object",
properties: {
product_name: { type: "string" },
price: { type: "number" },
screen_size: { type: "string" },
material: { type: "string" },
features: {
type: "array",
items: { type: "string" },
},
},
required: ["product_name", "price"],
},
},
max_tokens: 256,
});
const product = JSON.parse(response.choices[0].message.content!);
// Guaranteed to have product_name (string) and price (number)
Function Calling
const response = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "user", content: "Book a flight from NYC to London for next Friday." },
],
tools: [
{
type: "function",
function: {
name: "search_flights",
description: "Search for available flights between two cities",
parameters: {
type: "object",
properties: {
origin: { type: "string", description: "Departure city or airport code" },
destination: { type: "string", description: "Arrival city or airport code" },
date: { type: "string", description: "Flight date in YYYY-MM-DD format" },
passengers: { type: "number", description: "Number of passengers" },
},
required: ["origin", "destination", "date"],
},
},
},
{
type: "function",
function: {
name: "book_flight",
description: "Book a specific flight by ID",
parameters: {
type: "object",
properties: {
flight_id: { type: "string" },
passenger_name: { type: "string" },
},
required: ["flight_id", "passenger_name"],
},
},
},
],
tool_choice: "auto",
max_tokens: 512,
});
const toolCalls = response.choices[0].message.tool_calls;
if (toolCalls) {
const results = await Promise.all(
toolCalls.map(async (tc) => {
const args = JSON.parse(tc.function.arguments);
let result;
if (tc.function.name === "search_flights") {
result = await searchFlights(args);
} else if (tc.function.name === "book_flight") {
result = await bookFlight(args);
}
return {
role: "tool" as const,
tool_call_id: tc.id,
content: JSON.stringify(result),
};
})
);
// Continue the conversation with tool results
const followUp = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [
{ role: "user", content: "Book a flight from NYC to London for next Friday." },
response.choices[0].message,
...results,
],
tools: [/* same tools */],
max_tokens: 512,
});
console.log(followUp.choices[0].message.content);
}
Model Selection
// Llama 3.1 70B — best quality open-source
const quality = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
messages: [{ role: "user", content: "Complex analysis task..." }],
max_tokens: 1024,
});
// Llama 3.1 8B — fastest, cheapest
const fast = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
messages: [{ role: "user", content: "Quick classification..." }],
max_tokens: 64,
});
// Mixtral MoE — good reasoning at moderate cost
const balanced = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/mixtral-8x22b-instruct",
messages: [{ role: "user", content: "Multi-step reasoning task..." }],
max_tokens: 1024,
});
// FireFunction V2 — optimized for function calling
const funcModel = await fireworks.chat.completions.create({
model: "accounts/fireworks/models/firefunction-v2",
messages: [{ role: "user", content: "What tools should I use?" }],
tools: [/* ... */],
max_tokens: 512,
});
Fine-Tuning
// Upload training data via the Fireworks API
const uploadResponse = await fetch(
"https://api.fireworks.ai/inference/v1/files",
{
method: "POST",
headers: {
Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}`,
},
body: (() => {
const form = new FormData();
form.append("file", new Blob([trainingJsonl], { type: "application/jsonl" }), "train.jsonl");
form.append("purpose", "fine-tune");
return form;
})(),
}
);
const file = await uploadResponse.json();
// Create fine-tuning job
const jobResponse = await fetch(
"https://api.fireworks.ai/inference/v1/fine-tuning/jobs",
{
method: "POST",
headers: {
Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
training_file: file.id,
hyperparameters: {
n_epochs: 3,
learning_rate_multiplier: 1.0,
batch_size: 8,
},
suffix: "my-custom-model",
}),
}
);
const job = await jobResponse.json();
console.log("Fine-tuning job:", job.id);
Embeddings
const embeddingResponse = await fireworks.embeddings.create({
model: "nomic-ai/nomic-embed-text-v1.5",
input: ["What is machine learning?", "How do neural networks work?"],
});
const vectors = embeddingResponse.data.map((d) => d.embedding);
console.log("Dimensions:", vectors[0].length);
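Embedding vectors are typically compared with cosine similarity. A small self-contained helper (this is plain math, not a Fireworks SDK function):

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; higher means more semantically similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With the response above, `cosineSimilarity(vectors[0], vectors[1])` scores how close the two questions are in embedding space.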
Batch Requests for Throughput
// Process many requests efficiently with Promise.allSettled
const prompts = [
"Summarize document 1...",
"Summarize document 2...",
"Summarize document 3...",
];
const results = await Promise.allSettled(
prompts.map((prompt) =>
fireworks.chat.completions.create({
model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
messages: [{ role: "user", content: prompt }],
max_tokens: 256,
})
)
);
for (const [i, result] of results.entries()) {
if (result.status === "fulfilled") {
console.log(`Result ${i}:`, result.value.choices[0].message.content);
} else {
console.error(`Failed ${i}:`, result.reason);
}
}
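Promise.allSettled fires every request at once; for very large batches it can help to cap the number of in-flight requests. A minimal worker-pool sketch (mapWithConcurrency is our own helper, not a Fireworks API):

```typescript
// Run an async mapper over items with at most `limit` promises in flight,
// preserving input order in the settled results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index synchronously
      try {
        results[i] = { status: "fulfilled", value: await fn(items[i]) };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Swapping it into the batch example above is a one-line change: `mapWithConcurrency(prompts, 8, (prompt) => fireworks.chat.completions.create({ ... }))`.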
Best Practices
- Use grammar mode (JSON schema in response_format) for any structured extraction — it is Fireworks' killer feature and guarantees valid output.
- Use firefunction-v2 for function calling workloads — it is specifically optimized for tool use.
- Use the OpenAI SDK — Fireworks is fully compatible, so leverage existing OpenAI code and tooling.
- Set max_tokens explicitly — avoid unexpected costs from long generations.
- Use 8B models for simple tasks — classification, extraction, and formatting do not need 70B parameters.
- Batch concurrent requests with Promise.allSettled — Fireworks handles high concurrency well.
- Pin model versions for production stability. Model names without versions may be updated.
- Log usage from responses for cost tracking and optimization.
Anti-Patterns
- Using plain JSON mode when grammar mode is available — grammar mode with a schema is strictly more reliable than JSON mode alone.
- Not using the OpenAI SDK — writing raw fetch calls loses type safety, retry logic, and streaming helpers.
- Ignoring finish_reason — check for "length" (truncated output) and handle accordingly.
- Using 70B models for batch classification — 8B models are dramatically cheaper and often equally accurate for simple tasks.
- Not handling rate limits — implement exponential backoff. Fireworks returns standard HTTP 429 responses.
- Sending function calling requests to non-function models — use firefunction-v2 or Llama 3.1 instruct models for tool use. Not all models support it.
- Fine-tuning without enough data — quality matters more than quantity, but aim for 200+ diverse examples.
- Constructing complex grammar patterns by hand — use JSON schema in response_format instead of writing raw GBNF grammars. The API handles the conversion.
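A minimal retry-with-backoff wrapper for the 429 case might look like this (withBackoff is a sketch of ours, not an SDK feature; it assumes the thrown error carries a numeric `status` field, as the OpenAI SDK's APIError does):

```typescript
// Retry an async call on HTTP 429, doubling the delay after each attempt.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit = err?.status === 429;
      if (!isRateLimit || attempt >= maxRetries) throw err;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `await withBackoff(() => fireworks.chat.completions.create({ ... }))`.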
Related Skills
Anthropic Claude API
"Anthropic Claude API: messages API, tool use, streaming, vision, system prompts, extended thinking, batches, Node SDK"
Google Gemini API
"Google Gemini API: generateContent, multimodal (images/video/audio), function calling, streaming, embeddings, context caching"
Groq
"Groq: ultra-fast inference, OpenAI-compatible API, Llama/Mixtral models, tool use, JSON mode, streaming"
OpenAI API
"OpenAI API: chat completions, function calling/tools, streaming, embeddings, vision, JSON mode, assistants, Node SDK"
Replicate
"Replicate: run open-source models, image generation (Flux/SDXL), predictions API, webhooks, streaming, Node SDK"
Together AI
"Together AI: inference API, open-source LLMs (Llama/Mistral), chat completions, embeddings, fine-tuning, JSON mode"