
# Fireworks AI

"Fireworks AI: fast inference, function calling, grammar mode, JSON output, OpenAI-compatible API, fine-tuning"

## Quick Summary
Fireworks AI provides **fast, reliable inference** for open-source and custom models with an **OpenAI-compatible API**. Its standout feature is **grammar mode** — constrained decoding that guarantees outputs match a JSON schema or regular expression at the token level. Use Fireworks when you need structured outputs from open-source models, fast inference with function calling, or fine-tuned model hosting. The API feels like OpenAI, so migration is straightforward. Fireworks excels at production workloads that demand both speed and output reliability.

## Key Points

- **Use grammar mode** (JSON schema in `response_format`) for any structured extraction — it is Fireworks' killer feature and guarantees valid output.
- **Use `firefunction-v2`** for function calling workloads — it is specifically optimized for tool use.
- **Use the OpenAI SDK** — Fireworks is fully compatible, so leverage existing OpenAI code and tooling.
- **Set `max_tokens` explicitly** — avoid unexpected costs from long generations.
- **Use 8B models for simple tasks** — classification, extraction, and formatting do not need 70B parameters.
- **Batch concurrent requests with `Promise.allSettled`** — Fireworks handles high concurrency well.
- **Pin model versions** for production stability. Model names without versions may be updated.
- **Log `usage` from responses** for cost tracking and optimization.

## Anti-Patterns

- **Using plain JSON mode when grammar mode is available** — grammar mode with a schema is strictly more reliable than JSON mode alone.
- **Not using the OpenAI SDK** — writing raw fetch calls loses type safety, retry logic, and streaming helpers.
- **Ignoring `finish_reason`** — check for `"length"` (truncated output) and handle accordingly.
- **Using 70B models for batch classification** — 8B models are dramatically cheaper and often equally accurate for simple tasks.

## Quick Example

```bash
FIREWORKS_API_KEY=fw_...
```

# Fireworks AI Skill

## Core Philosophy

Fireworks AI provides fast, reliable inference for open-source and custom models with an OpenAI-compatible API. Its standout feature is grammar mode — constrained decoding that guarantees outputs match a JSON schema or regular expression at the token level. Use Fireworks when you need structured outputs from open-source models, fast inference with function calling, or fine-tuned model hosting. The API feels like OpenAI, so migration is straightforward. Fireworks excels at production workloads that demand both speed and output reliability.

## Setup

Use the OpenAI SDK with the Fireworks base URL:

```typescript
import OpenAI from "openai";

const fireworks = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY!,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

// Basic chat completion
const response = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a TypeScript debounce function." },
  ],
  max_tokens: 512,
  temperature: 0.6,
});

console.log(response.choices[0].message.content);
```

Environment variables:

```bash
FIREWORKS_API_KEY=fw_...
```

## Key Techniques

### Streaming

```typescript
const stream = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [
    { role: "user", content: "Explain event-driven architecture in detail." },
  ],
  max_tokens: 1024,
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```

### JSON Mode

```typescript
const response = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [
    {
      role: "system",
      content: "Extract structured data from the text. Respond with valid JSON.",
    },
    {
      role: "user",
      content:
        "Jane Doe, age 32, works at TechCorp as a senior engineer in San Francisco.",
    },
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const person = JSON.parse(response.choices[0].message.content!);
console.log(person);
```
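
JSON mode guarantees syntactically valid JSON, but not that the parsed object has the shape you expect. A minimal hand-rolled guard for the extraction above (the field names here are an assumption based on the prompt, not a contract the API enforces):

```typescript
// Shape we expect the model to produce for the example prompt above.
// These field names are an assumption; JSON mode does not enforce them.
interface Person {
  name: string;
  age: number;
  employer: string;
}

// Narrowing type guard: verifies the parsed JSON actually matches Person.
function isPerson(value: unknown): value is Person {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.age === "number" &&
    typeof v.employer === "string"
  );
}

// Usage: parse, then validate before trusting the shape.
const parsed: unknown = JSON.parse(
  '{"name":"Jane Doe","age":32,"employer":"TechCorp"}'
);
if (!isPerson(parsed)) {
  throw new Error("Model returned JSON with an unexpected shape");
}
console.log(parsed.name); // safe: narrowed to Person
```

Grammar mode (next section) makes this kind of guard largely unnecessary, since the schema is enforced at decode time.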

### Grammar Mode (Constrained JSON Schema)

```typescript
// Grammar mode guarantees the output matches the schema at the token level
const response = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [
    {
      role: "user",
      content: "Extract: The iPhone 15 Pro costs $999, has a 6.1-inch display, and comes in titanium.",
    },
  ],
  response_format: {
    type: "json_object",
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        price: { type: "number" },
        screen_size: { type: "string" },
        material: { type: "string" },
        features: {
          type: "array",
          items: { type: "string" },
        },
      },
      required: ["product_name", "price"],
    },
  },
  max_tokens: 256,
});

const product = JSON.parse(response.choices[0].message.content!);
// Guaranteed to have product_name (string) and price (number)
```

### Function Calling

```typescript
const response = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [
    { role: "user", content: "Book a flight from NYC to London for next Friday." },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "search_flights",
        description: "Search for available flights between two cities",
        parameters: {
          type: "object",
          properties: {
            origin: { type: "string", description: "Departure city or airport code" },
            destination: { type: "string", description: "Arrival city or airport code" },
            date: { type: "string", description: "Flight date in YYYY-MM-DD format" },
            passengers: { type: "number", description: "Number of passengers" },
          },
          required: ["origin", "destination", "date"],
        },
      },
    },
    {
      type: "function",
      function: {
        name: "book_flight",
        description: "Book a specific flight by ID",
        parameters: {
          type: "object",
          properties: {
            flight_id: { type: "string" },
            passenger_name: { type: "string" },
          },
          required: ["flight_id", "passenger_name"],
        },
      },
    },
  ],
  tool_choice: "auto",
  max_tokens: 512,
});

const toolCalls = response.choices[0].message.tool_calls;
if (toolCalls) {
  const results = await Promise.all(
    toolCalls.map(async (tc) => {
      const args = JSON.parse(tc.function.arguments);

      // searchFlights and bookFlight are application-defined implementations
      // of the tools declared above.
      let result;
      if (tc.function.name === "search_flights") {
        result = await searchFlights(args);
      } else if (tc.function.name === "book_flight") {
        result = await bookFlight(args);
      }

      return {
        role: "tool" as const,
        tool_call_id: tc.id,
        content: JSON.stringify(result),
      };
    })
  );

  // Continue the conversation with tool results
  const followUp = await fireworks.chat.completions.create({
    model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages: [
      { role: "user", content: "Book a flight from NYC to London for next Friday." },
      response.choices[0].message,
      ...results,
    ],
    tools: [/* same tools */],
    max_tokens: 512,
  });

  console.log(followUp.choices[0].message.content);
}
```
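
As the tool list grows, the if/else dispatch above gets unwieldy. One alternative is a handler table keyed by function name; the stub handlers below are hypothetical stand-ins for real implementations:

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Map each tool name from the `tools` array to its implementation.
// These stubs stand in for real searchFlights/bookFlight logic.
const toolHandlers: Record<string, ToolHandler> = {
  search_flights: async (args) => ({ flights: [], query: args }),
  book_flight: async (args) => ({ confirmed: true, booking: args }),
};

// Dispatch one tool call the way the tool_calls loop above does:
// parse the stringified arguments, run the handler, stringify the result.
async function runTool(name: string, rawArgs: string): Promise<string> {
  const handler = toolHandlers[name];
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  const result = await handler(JSON.parse(rawArgs));
  return JSON.stringify(result); // tool messages carry stringified JSON
}
```

Adding a tool then becomes one entry in the table plus one entry in the `tools` array, with no branching logic to update.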

### Model Selection

```typescript
// Llama 3.1 70B — best quality open-source
const quality = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  messages: [{ role: "user", content: "Complex analysis task..." }],
  max_tokens: 1024,
});

// Llama 3.1 8B — fastest, cheapest
const fast = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
  messages: [{ role: "user", content: "Quick classification..." }],
  max_tokens: 64,
});

// Mixtral MoE — good reasoning at moderate cost
const balanced = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/mixtral-8x22b-instruct",
  messages: [{ role: "user", content: "Multi-step reasoning task..." }],
  max_tokens: 1024,
});

// FireFunction V2 — optimized for function calling
const funcModel = await fireworks.chat.completions.create({
  model: "accounts/fireworks/models/firefunction-v2",
  messages: [{ role: "user", content: "What tools should I use?" }],
  tools: [/* ... */],
  max_tokens: 512,
});
```
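
A small router can encode this cost/quality tradeoff so call sites do not hard-code model IDs. The tier names below are my own labels; the model IDs come from the examples above:

```typescript
type Tier = "fast" | "balanced" | "quality" | "tools";

// Central place to pin model versions: update IDs here, not at call sites.
const MODELS: Record<Tier, string> = {
  fast: "accounts/fireworks/models/llama-v3p1-8b-instruct",
  balanced: "accounts/fireworks/models/mixtral-8x22b-instruct",
  quality: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  tools: "accounts/fireworks/models/firefunction-v2",
};

function pickModel(tier: Tier): string {
  return MODELS[tier];
}

// Usage: fireworks.chat.completions.create({ model: pickModel("fast"), ... })
```

This also makes the "pin model versions" best practice a one-line change when a new version ships.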

### Fine-Tuning

```typescript
// trainingJsonl is a JSONL string of chat-format training examples,
// prepared elsewhere in your application.

// Upload training data via the Fireworks API
const uploadResponse = await fetch(
  "https://api.fireworks.ai/inference/v1/files",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}`,
    },
    body: (() => {
      const form = new FormData();
      form.append("file", new Blob([trainingJsonl], { type: "application/jsonl" }), "train.jsonl");
      form.append("purpose", "fine-tune");
      return form;
    })(),
  }
);

const file = await uploadResponse.json();

// Create fine-tuning job
const jobResponse = await fetch(
  "https://api.fireworks.ai/inference/v1/fine-tuning/jobs",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
      training_file: file.id,
      hyperparameters: {
        n_epochs: 3,
        learning_rate_multiplier: 1.0,
        batch_size: 8,
      },
      suffix: "my-custom-model",
    }),
  }
);

const job = await jobResponse.json();
console.log("Fine-tuning job:", job.id);
```
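
The upload above sends training examples as JSONL, one JSON object per line. A small helper to serialize them, assuming each record follows the OpenAI-style `{"messages": [...]}` chat format:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Serialize one training conversation per line: {"messages":[...]}
function toTrainingJsonl(examples: ChatMessage[][]): string {
  return examples
    .map((messages) => JSON.stringify({ messages }))
    .join("\n");
}

const trainingJsonl = toTrainingJsonl([
  [
    { role: "user", content: "Classify: great product!" },
    { role: "assistant", content: "positive" },
  ],
  [
    { role: "user", content: "Classify: arrived broken." },
    { role: "assistant", content: "negative" },
  ],
]);
console.log(trainingJsonl.split("\n").length); // → 2
```

This is a sketch of the data-preparation step, not a Fireworks API; check the fine-tuning docs for the exact record schema your base model expects.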

### Embeddings

```typescript
const embeddingResponse = await fireworks.embeddings.create({
  model: "nomic-ai/nomic-embed-text-v1.5",
  input: ["What is machine learning?", "How do neural networks work?"],
});

const vectors = embeddingResponse.data.map((d) => d.embedding);
console.log("Dimensions:", vectors[0].length);
```
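
Embedding vectors are typically compared with cosine similarity. A self-contained helper (independent of the API response above):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // → 1
```

Usage with the response above would be `cosineSimilarity(vectors[0], vectors[1])`, e.g. to rank documents against a query embedding.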

### Batch Requests for Throughput

```typescript
// Process many requests efficiently with Promise.allSettled
const prompts = [
  "Summarize document 1...",
  "Summarize document 2...",
  "Summarize document 3...",
];

const results = await Promise.allSettled(
  prompts.map((prompt) =>
    fireworks.chat.completions.create({
      model: "accounts/fireworks/models/llama-v3p1-8b-instruct",
      messages: [{ role: "user", content: prompt }],
      max_tokens: 256,
    })
  )
);

for (const [i, result] of results.entries()) {
  if (result.status === "fulfilled") {
    console.log(`Result ${i}:`, result.value.choices[0].message.content);
  } else {
    console.error(`Failed ${i}:`, result.reason);
  }
}
```
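
Under high concurrency you may still hit HTTP 429 rate limits, which call for exponential backoff. A generic retry wrapper, sketched under the assumption that the thrown error exposes a numeric `status` field (as the OpenAI SDK's `APIError` does):

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry fn on 429 responses with exponential backoff: 500ms, 1s, 2s, ...
// Any non-429 error, or exhausting maxAttempts, rethrows immediately.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxAttempts - 1) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage with the batch pattern above (fireworks client assumed in scope):
// withRetry(() => fireworks.chat.completions.create({ /* ... */ }))
```

Wrapping each call inside the `Promise.allSettled` map keeps per-request failures isolated while still backing off on rate limits.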

## Best Practices

- Use grammar mode (JSON schema in `response_format`) for any structured extraction — it is Fireworks' killer feature and guarantees valid output.
- Use `firefunction-v2` for function calling workloads — it is specifically optimized for tool use.
- Use the OpenAI SDK — Fireworks is fully compatible, so leverage existing OpenAI code and tooling.
- Set `max_tokens` explicitly — avoid unexpected costs from long generations.
- Use 8B models for simple tasks — classification, extraction, and formatting do not need 70B parameters.
- Batch concurrent requests with `Promise.allSettled` — Fireworks handles high concurrency well.
- Pin model versions for production stability. Model names without versions may be updated.
- Log `usage` from responses for cost tracking and optimization.

## Anti-Patterns

- Using plain JSON mode when grammar mode is available — grammar mode with a schema is strictly more reliable than JSON mode alone.
- Not using the OpenAI SDK — writing raw fetch calls loses type safety, retry logic, and streaming helpers.
- Ignoring `finish_reason` — check for `"length"` (truncated output) and handle accordingly.
- Using 70B models for batch classification — 8B models are dramatically cheaper and often equally accurate for simple tasks.
- Not handling rate limits — implement exponential backoff. Fireworks returns standard HTTP 429 responses.
- Sending function calling requests to non-function models — use `firefunction-v2` or Llama 3.1 instruct models for tool use. Not all models support it.
- Fine-tuning without enough data — like all fine-tuning, quality matters more than quantity, but aim for 200+ diverse examples.
- Constructing complex grammar patterns by hand — use JSON schema in `response_format` instead of writing raw GBNF grammars. The API handles the conversion.
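
The `finish_reason` check above is mechanical to wire in. A tiny helper, sketched for OpenAI-compatible responses:

```typescript
// On OpenAI-compatible APIs, finish_reason is "stop" for a normal end,
// "length" when max_tokens was hit, or "tool_calls" for tool invocations.
function isTruncated(finishReason: string | null | undefined): boolean {
  return finishReason === "length";
}

// Usage with a chat completion response:
//   const choice = response.choices[0];
//   if (isTruncated(choice.finish_reason)) {
//     // retry with a larger max_tokens, or stitch a continuation request
//   }
console.log(isTruncated("length")); // → true
console.log(isTruncated("stop")); // → false
```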
