# Groq
"Groq: ultra-fast inference, OpenAI-compatible API, Llama/Mixtral models, tool use, JSON mode, streaming"
## Core Philosophy
Groq provides ultra-fast inference on custom LPU (Language Processing Unit) hardware. Use Groq when latency is the primary concern — it delivers tokens 5-10x faster than GPU-based providers. The API is fully OpenAI-compatible, making migration trivial. Groq hosts popular open-source models (Llama, Mixtral, Gemma). Build latency-sensitive features — real-time chat, autocomplete, live analysis — on Groq. Trade model variety for raw speed.
## Setup

Use the Groq SDK or the OpenAI SDK with a custom base URL:

```ts
import Groq from "groq-sdk";

const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY!,
});

// Basic chat completion
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise technical assistant." },
    { role: "user", content: "What is a bloom filter?" },
  ],
  temperature: 0.5,
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```
OpenAI SDK compatibility:

```ts
import OpenAI from "openai";

const groq = new OpenAI({
  apiKey: process.env.GROQ_API_KEY!,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Hello!" }],
});
```
Environment variables:

```bash
GROQ_API_KEY=gsk_...
```
## Key Techniques

### Streaming

```ts
const stream = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Write a concise guide to WebSockets in Node.js." },
  ],
  max_tokens: 1024,
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
### JSON Mode

```ts
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    {
      role: "system",
      content:
        "You extract event details from text. Respond with a JSON object containing: title, date, time, location.",
    },
    {
      role: "user",
      content:
        "The React Summit is happening on June 14th at 9am at the Amsterdam Conference Center.",
    },
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const event = JSON.parse(response.choices[0].message.content!);
console.log(event);
// { title: "React Summit", date: "June 14th", time: "9am", location: "Amsterdam Conference Center" }
```
### Tool Use (Function Calling)

```ts
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "What's the stock price of AAPL and MSFT?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_stock_price",
        description: "Get the current stock price for a given ticker symbol",
        parameters: {
          type: "object",
          properties: {
            ticker: {
              type: "string",
              description: "Stock ticker symbol (e.g., AAPL, MSFT)",
            },
          },
          required: ["ticker"],
        },
      },
    },
  ],
  tool_choice: "auto",
  max_tokens: 512,
});

const toolCalls = response.choices[0].message.tool_calls;
if (toolCalls) {
  const toolResults = await Promise.all(
    toolCalls.map(async (tc) => {
      const args = JSON.parse(tc.function.arguments);
      const price = await getStockPrice(args.ticker);
      return {
        role: "tool" as const,
        tool_call_id: tc.id,
        content: JSON.stringify({ ticker: args.ticker, price }),
      };
    })
  );

  const followUp = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      { role: "user", content: "What's the stock price of AAPL and MSFT?" },
      response.choices[0].message,
      ...toolResults,
    ],
    max_tokens: 512,
  });

  console.log(followUp.choices[0].message.content);
}
```
### Parallel Tool Calls

```ts
// Groq supports parallel tool calls — the model can invoke multiple tools at once
const response = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    {
      role: "user",
      content: "Compare the weather in Tokyo, London, and Sydney right now.",
    },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string" },
          },
          required: ["city"],
        },
      },
    },
  ],
  parallel_tool_calls: true,
  max_tokens: 512,
});

// response.choices[0].message.tool_calls will contain multiple calls
const calls = response.choices[0].message.tool_calls ?? [];
console.log(`Model requested ${calls.length} parallel tool calls`);
```
### Model Selection

```ts
// Llama 3.3 70B — best quality, great for complex tasks
const complex = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Analyze this code for security issues..." }],
  max_tokens: 1024,
});

// Llama 3.1 8B — fastest, great for simple tasks
const fast = await groq.chat.completions.create({
  model: "llama-3.1-8b-instant",
  messages: [{ role: "user", content: "Classify: is this spam? 'You won a prize!'" }],
  max_tokens: 64,
});

// Mixtral 8x7B — good balance of speed and reasoning
const balanced = await groq.chat.completions.create({
  model: "mixtral-8x7b-32768",
  messages: [{ role: "user", content: "Summarize this article..." }],
  max_tokens: 512,
});

// Gemma 2 9B — Google's efficient model
const gemma = await groq.chat.completions.create({
  model: "gemma2-9b-it",
  messages: [{ role: "user", content: "Explain map, filter, reduce." }],
  max_tokens: 512,
});
```
### Multi-Turn Conversation with Rate Limit Handling

```ts
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY! });

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const history: Message[] = [
  { role: "system", content: "You are a helpful tutor." },
];

async function chat(userMessage: string): Promise<string> {
  history.push({ role: "user", content: userMessage });
  try {
    const response = await groq.chat.completions.create({
      model: "llama-3.3-70b-versatile",
      messages: history,
      max_tokens: 1024,
    });
    const reply = response.choices[0].message.content!;
    history.push({ role: "assistant", content: reply });

    // Log token usage for this call
    console.log("Usage:", response.usage);
    return reply;
  } catch (error: unknown) {
    if (error instanceof Groq.RateLimitError) {
      // Groq has per-minute token and request limits
      console.warn("Rate limited. Waiting before retry...");
      await new Promise((r) => setTimeout(r, 10000));
      history.pop(); // drop the user message; chat() will push it again
      return chat(userMessage); // retry (recurses until the limit clears)
    }
    throw error;
  }
}
```
### Vision with Llama Models

```ts
const response = await groq.chat.completions.create({
  model: "llama-3.2-90b-vision-preview",
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe what you see in this image.",
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/photo.jpg",
          },
        },
      ],
    },
  ],
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```
## Best Practices

- Use Groq for latency-critical paths — real-time chat UIs, autocomplete, live data processing. It is the fastest inference provider.
- Use `llama-3.1-8b-instant` for simple tasks — classification, extraction, and formatting. It is extremely fast on Groq.
- Use `llama-3.3-70b-versatile` for quality — complex reasoning, analysis, and generation tasks.
- Monitor rate limits — Groq enforces per-minute token limits. Check response headers for `x-ratelimit-remaining-tokens`.
- Set `max_tokens` to avoid hitting context window limits, especially on Mixtral (32K context).
- Use JSON mode for structured outputs — Groq's implementation is reliable across all hosted models.
- Log `usage` from responses to track token consumption and optimize prompts.
- Use streaming even though Groq is fast — it still improves perceived responsiveness for long outputs.
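To act on the rate-limit advice above, read the limit headers off the raw HTTP response. The sketch below assumes the SDK exposes an OpenAI-style `.withResponse()` helper and that the header names match Groq's documented `x-ratelimit-*` family; verify both against the current SDK and docs before relying on them.

```typescript
// Parse Groq-style rate-limit headers (header names assumed from Groq's docs)
function parseRateLimit(headers: Headers): {
  remainingTokens: number | null;
  remainingRequests: number | null;
} {
  const toNum = (v: string | null) => (v === null ? null : Number(v));
  return {
    remainingTokens: toNum(headers.get("x-ratelimit-remaining-tokens")),
    remainingRequests: toNum(headers.get("x-ratelimit-remaining-requests")),
  };
}

// Usage sketch (assumes .withResponse(), as in the OpenAI Node SDK):
// const { data, response } = await groq.chat.completions
//   .create({ model: "llama-3.1-8b-instant", messages, max_tokens: 64 })
//   .withResponse();
// const { remainingTokens } = parseRateLimit(response.headers);
// if ((remainingTokens ?? Infinity) < 2000) {
//   // running low: queue the next request instead of firing immediately
// }
```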
## Anti-Patterns
- Using Groq as your only provider — model selection is limited to what Groq hosts. Have a fallback for models not available on Groq.
- Ignoring rate limits in production — Groq's free tier and even paid tiers have strict per-minute limits. Implement queuing and backoff.
- Sending very long contexts expecting GPU-level flexibility — Groq context windows are model-dependent. Check limits before sending large prompts.
- Not leveraging the OpenAI compatibility — if you already use the OpenAI SDK, just swap the base URL. Do not rewrite your integration from scratch.
- Using 70B models for trivial tasks — the 8B model on Groq is faster and cheaper, and handles simple tasks equally well.
- Tight-polling for streaming chunks — use async iteration (`for await`) and let the runtime handle backpressure.
- Assuming all OpenAI features work — some advanced features (structured outputs with strict schema, assistants) may not be supported. Test your specific use case.
- Not implementing fallback logic — Groq occasionally has capacity constraints. Have a backup provider ready.
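The queuing/backoff and fallback points above can be combined in one small wrapper. This is a provider-agnostic sketch: `withBackoff` retries any async call with exponential backoff plus jitter, and `callGroq` / `callFallbackProvider` in the usage note are hypothetical helpers standing in for your actual client calls.

```typescript
// Retry an async call with exponential backoff and jitter (sketch)
async function withBackoff<T>(
  fn: () => Promise<T>,
  { retries = 3, baseMs = 1000 }: { retries?: number; baseMs?: number } = {}
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // delays of baseMs, 2*baseMs, 4*baseMs, ... plus up to 250ms of jitter
      const delayMs = baseMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage sketch — fall through to a backup provider if Groq stays unavailable:
// const reply = await withBackoff(() => callGroq(messages)).catch(() =>
//   callFallbackProvider(messages)
// );
```

In production you would typically retry only on rate-limit or transient errors (e.g. by checking the error class) rather than on every failure.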