
# Groq

"Groq: ultra-fast inference, OpenAI-compatible API, Llama/Mixtral models, tool use, JSON mode, streaming"

## Quick Summary
Groq provides **ultra-fast inference** on custom LPU (Language Processing Unit) hardware. Use Groq when **latency is the primary concern** — it delivers tokens 5-10x faster than GPU-based providers. The API is fully **OpenAI-compatible**, making migration trivial. Groq hosts popular open-source models (Llama, Mixtral, Gemma). Build latency-sensitive features — real-time chat, autocomplete, live analysis — on Groq. Trade model variety for raw speed.

## Key Points

- **Use Groq for latency-critical paths** — real-time chat UIs, autocomplete, live data processing. It is the fastest inference provider.
- **Use `llama-3.1-8b-instant` for simple tasks** — classification, extraction, and formatting. It is extremely fast on Groq.
- **Use `llama-3.3-70b-versatile` for quality** — complex reasoning, analysis, and generation tasks.
- **Monitor rate limits** — Groq enforces per-minute token limits. Check response headers for `x-ratelimit-remaining-tokens`.
- **Set `max_tokens`** to avoid hitting context window limits, especially on Mixtral (32K context).
- **Use JSON mode** for structured outputs — Groq's implementation is reliable across all hosted models.
- **Log `usage` from responses** to track token consumption and optimize prompts.
- **Use streaming** even though Groq is fast — it still improves perceived responsiveness for long outputs.

## Anti-Patterns

- **Using Groq as your only provider** — model selection is limited to what Groq hosts. Have a fallback for models not available on Groq.
- **Ignoring rate limits in production** — Groq's free tier and even paid tiers have strict per-minute limits. Implement queuing and backoff.
- **Sending very long contexts expecting GPU-level flexibility** — Groq context windows are model-dependent. Check limits before sending large prompts.
- **Not leveraging the OpenAI compatibility** — if you already use the OpenAI SDK, just swap the base URL. Do not rewrite your integration from scratch.

## Quick Example

```bash
GROQ_API_KEY=gsk_...
```

# Groq Skill

## Core Philosophy

Groq provides ultra-fast inference on custom LPU (Language Processing Unit) hardware. Use Groq when latency is the primary concern — it delivers tokens 5-10x faster than GPU-based providers. The API is fully OpenAI-compatible, making migration trivial. Groq hosts popular open-source models (Llama, Mixtral, Gemma). Build latency-sensitive features — real-time chat, autocomplete, live analysis — on Groq. Trade model variety for raw speed.

## Setup

Use the Groq SDK or the OpenAI SDK with a custom base URL:

```typescript
import Groq from "groq-sdk";

const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY!,
});

// Basic chat completion
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise technical assistant." },
    { role: "user", content: "What is a bloom filter?" },
  ],
  temperature: 0.5,
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```

OpenAI SDK compatibility:

```typescript
import OpenAI from "openai";

const groq = new OpenAI({
  apiKey: process.env.GROQ_API_KEY!,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Hello!" }],
});
```

Environment variables:

```bash
GROQ_API_KEY=gsk_...
```

## Key Techniques

### Streaming

```typescript
const stream = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Write a concise guide to WebSockets in Node.js." },
  ],
  max_tokens: 1024,
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
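When streaming, you usually want both incremental display and the final full text. Below is a minimal sketch of accumulating deltas; a mock async iterable stands in for the SDK stream (the chunk shape mirrors the `chunk.choices[0]?.delta?.content` access above, but the helper itself is illustrative, not part of the Groq SDK):

```typescript
// Shape of the streamed chunks, mirroring the fields accessed above.
interface StreamChunk {
  choices: { delta: { content?: string } }[];
}

// Accumulate streamed deltas into the final text while emitting each one.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onDelta: (text: string) => void = () => {}
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      onDelta(content); // e.g. process.stdout.write(content)
      full += content;
    }
  }
  return full;
}

// Mock stream standing in for the object returned by create({ stream: true })
async function* mockStream(): AsyncGenerator<StreamChunk> {
  for (const piece of ["Web", "Sockets ", "guide"]) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}
```

With the real SDK, pass the object returned by `create({ stream: true })` directly as `stream`.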

### JSON Mode

```typescript
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    {
      role: "system",
      content:
        "You extract event details from text. Respond with a JSON object containing: title, date, time, location.",
    },
    {
      role: "user",
      content:
        "The React Summit is happening on June 14th at 9am at the Amsterdam Conference Center.",
    },
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const event = JSON.parse(response.choices[0].message.content!);
console.log(event);
// { title: "React Summit", date: "June 14th", time: "9am", location: "Amsterdam Conference Center" }
```
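JSON mode guarantees syntactically valid JSON, not that the object has the fields you prompted for. A small type-guard sketch for the event shape used above (the field names come from the prompt; the guard itself is an assumption, not a Groq API feature):

```typescript
interface EventDetails {
  title: string;
  date: string;
  time: string;
  location: string;
}

// Narrow an unknown parsed value to EventDetails, rejecting missing or mistyped fields.
function isEventDetails(value: unknown): value is EventDetails {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (["title", "date", "time", "location"] as const).every(
    (key) => typeof v[key] === "string"
  );
}

// Parse model output defensively instead of trusting its shape.
function parseEvent(raw: string): EventDetails {
  const parsed: unknown = JSON.parse(raw);
  if (!isEventDetails(parsed)) {
    throw new Error("Model returned JSON with an unexpected shape");
  }
  return parsed;
}
```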

### Tool Use (Function Calling)

```typescript
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "What's the stock price of AAPL and MSFT?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_stock_price",
        description: "Get the current stock price for a given ticker symbol",
        parameters: {
          type: "object",
          properties: {
            ticker: {
              type: "string",
              description: "Stock ticker symbol (e.g., AAPL, MSFT)",
            },
          },
          required: ["ticker"],
        },
      },
    },
  ],
  tool_choice: "auto",
  max_tokens: 512,
});

const toolCalls = response.choices[0].message.tool_calls;
if (toolCalls) {
  const toolResults = await Promise.all(
    toolCalls.map(async (tc) => {
      const args = JSON.parse(tc.function.arguments);
      // getStockPrice is your own data-fetching function (not shown here)
      const price = await getStockPrice(args.ticker);
      return {
        role: "tool" as const,
        tool_call_id: tc.id,
        content: JSON.stringify({ ticker: args.ticker, price }),
      };
    })
  );

  const followUp = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      { role: "user", content: "What's the stock price of AAPL and MSFT?" },
      response.choices[0].message,
      ...toolResults,
    ],
    max_tokens: 512,
  });
  console.log(followUp.choices[0].message.content);
}
```
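Note that `tc.function.arguments` is model-generated text, so `JSON.parse` can throw or yield the wrong shape. A defensive parsing sketch (the `ticker` field matches the tool schema above; the helper is illustrative):

```typescript
// Defensively extract the ticker argument from a model-generated JSON string.
function parseTickerArgs(raw: string): { ticker: string } | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { ticker?: unknown }).ticker === "string"
    ) {
      return { ticker: (parsed as { ticker: string }).ticker };
    }
    return null; // valid JSON, but not the shape the tool schema requires
  } catch {
    return null; // malformed JSON from the model
  }
}
```

Returning `null` lets you send the model an error tool result and ask it to retry, instead of crashing the request handler.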

### Parallel Tool Calls

```typescript
// Groq supports parallel tool calls — the model can invoke multiple tools at once
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    {
      role: "user",
      content: "Compare the weather in Tokyo, London, and Sydney right now.",
    },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
  parallel_tool_calls: true,
  max_tokens: 512,
});

// response.choices[0].message.tool_calls will contain multiple calls
const calls = response.choices[0].message.tool_calls ?? [];
console.log(`Model requested ${calls.length} parallel tool calls`);
```

### Model Selection

```typescript
// Llama 3.3 70B — best quality, great for complex tasks
const complex = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Analyze this code for security issues..." }],
  max_tokens: 1024,
});

// Llama 3.1 8B — fastest, great for simple tasks
const fast = await groq.chat.completions.create({
  model: "llama-3.1-8b-instant",
  messages: [{ role: "user", content: "Classify: is this spam? 'You won a prize!'" }],
  max_tokens: 64,
});

// Mixtral 8x7B — good balance of speed and reasoning
const balanced = await groq.chat.completions.create({
  model: "mixtral-8x7b-32768",
  messages: [{ role: "user", content: "Summarize this article..." }],
  max_tokens: 512,
});

// Gemma 2 9B — Google's efficient model
const gemma = await groq.chat.completions.create({
  model: "gemma2-9b-it",
  messages: [{ role: "user", content: "Explain map, filter, reduce." }],
  max_tokens: 512,
});
```
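The tiering above can be centralized in a small router so call sites declare intent instead of hard-coding model IDs. The IDs below are the ones listed in this guide (hosted-model availability changes over time, so verify them against Groq's current model list); the task categories are illustrative assumptions:

```typescript
type TaskKind = "simple" | "complex" | "balanced" | "vision";

// Map task intent to a Groq-hosted model ID (IDs as listed in this guide).
function pickModel(task: TaskKind): string {
  switch (task) {
    case "simple": // classification, extraction, formatting
      return "llama-3.1-8b-instant";
    case "complex": // reasoning, analysis, generation
      return "llama-3.3-70b-versatile";
    case "balanced": // mid-tier speed/quality trade-off
      return "mixtral-8x7b-32768";
    case "vision": // image understanding
      return "llama-3.2-90b-vision-preview";
  }
}
```

Usage: `groq.chat.completions.create({ model: pickModel("simple"), ... })`, which keeps a model deprecation a one-line change.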

### Multi-Turn Conversation with Rate Limit Handling

```typescript
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY! });

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const history: Message[] = [
  { role: "system", content: "You are a helpful tutor." },
];

async function chat(userMessage: string): Promise<string> {
  history.push({ role: "user", content: userMessage });

  try {
    const response = await groq.chat.completions.create({
      model: "llama-3.3-70b-versatile",
      messages: history,
      max_tokens: 1024,
    });

    const reply = response.choices[0].message.content!;
    history.push({ role: "assistant", content: reply });

    // Log token usage to track consumption
    console.log("Usage:", response.usage);

    return reply;
  } catch (error: unknown) {
    if (error instanceof Groq.RateLimitError) {
      // Groq has per-minute token and request limits
      console.warn("Rate limited. Waiting before retry...");
      await new Promise((r) => setTimeout(r, 10000));
      history.pop(); // drop the user message so the retry doesn't duplicate it
      return chat(userMessage); // Retry (unbounded here; cap attempts in production)
    }
    throw error;
  }
}
```
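A fixed 10-second sleep works, but capped exponential backoff recovers faster when limits reset early and backs off harder under sustained pressure. A provider-agnostic sketch; `isRateLimitError` is a placeholder for whatever check your SDK supports (e.g. `error instanceof Groq.RateLimitError`):

```typescript
// Retry an async operation with capped exponential backoff on rate-limit errors.
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRateLimitError: (e: unknown) => boolean,
  maxRetries = 4,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up on non-rate-limit errors or when retries are exhausted
      if (attempt >= maxRetries || !isRateLimitError(error)) throw error;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage sketch: `withBackoff(() => groq.chat.completions.create({ ... }), (e) => e instanceof Groq.RateLimitError)`.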

### Vision with Llama Models

```typescript
const response = await groq.chat.completions.create({
  model: "llama-3.2-90b-vision-preview",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe what you see in this image.",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/photo.jpg",
          },
        },
      ],
    },
  ],
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```

## Best Practices

- Use Groq for latency-critical paths — real-time chat UIs, autocomplete, live data processing. It is the fastest inference provider.
- Use `llama-3.1-8b-instant` for simple tasks — classification, extraction, and formatting. It is extremely fast on Groq.
- Use `llama-3.3-70b-versatile` for quality — complex reasoning, analysis, and generation tasks.
- Monitor rate limits — Groq enforces per-minute token limits. Check response headers for `x-ratelimit-remaining-tokens`.
- Set `max_tokens` to avoid hitting context window limits, especially on Mixtral (32K context).
- Use JSON mode for structured outputs — Groq's implementation is reliable across all hosted models.
- Log `usage` from responses to track token consumption and optimize prompts.
- Use streaming even though Groq is fast — it still improves perceived responsiveness for long outputs.

## Anti-Patterns

- Using Groq as your only provider — model selection is limited to what Groq hosts. Have a fallback for models not available on Groq.
- Ignoring rate limits in production — Groq's free tier and even paid tiers have strict per-minute limits. Implement queuing and backoff.
- Sending very long contexts expecting GPU-level flexibility — Groq context windows are model-dependent. Check limits before sending large prompts.
- Not leveraging the OpenAI compatibility — if you already use the OpenAI SDK, just swap the base URL. Do not rewrite your integration from scratch.
- Using 70B models for trivial tasks — the 8B model on Groq is faster and cheaper, and handles simple tasks equally well.
- Tight-polling for streaming chunks — use async iteration (`for await`) and let the runtime handle backpressure.
- Assuming all OpenAI features work — some advanced features (structured outputs with strict schema, assistants) may not be supported. Test your specific use case.
- Not implementing fallback logic — Groq occasionally has capacity constraints. Have a backup provider ready.
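The fallback advice above can be a thin wrapper: try Groq first, switch to a backup provider on failure. Both arguments are placeholders for your own request functions; this is an illustrative pattern, not a Groq SDK feature:

```typescript
// Try the primary provider; on any failure, fall back to a secondary one.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  onFallback: (error: unknown) => void = () => {}
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    onFallback(error); // surface the failure so fallbacks don't go unnoticed
    return fallback();
  }
}
```

For example, `withFallback(() => callGroq(prompt), () => callBackupProvider(prompt))`, where `callGroq` and `callBackupProvider` are hypothetical functions wrapping your two providers behind the same return type.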
