# Together AI

> Together AI: inference API, open-source LLMs (Llama/Mistral), chat completions, embeddings, fine-tuning, JSON mode
Together AI provides fast, cost-effective inference for **open-source models** — Llama, Mistral, Mixtral, CodeLlama, and more. Use it when you want open-source model capabilities without managing GPU infrastructure. The API is **OpenAI-compatible**, so existing code migrates easily. Together also offers fine-tuning, making it a full platform for training and deploying custom models. Choose Together for cost-sensitive workloads, open-source model access, and when you need JSON mode with open models.
## Key Points
- **Use Turbo variants** (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`) for faster inference at the same quality.
- **Use JSON mode with a schema** for reliable structured extraction — Together enforces the schema at the token level.
- **Start with 8B models** for classification, extraction, and simple tasks. Scale to 70B+ only when quality demands it.
- **Use `stop` sequences** to control output length and format — open-source models sometimes ramble without them.
- **Set `max_tokens` explicitly** — defaults vary by model and can be unexpectedly high.
- **Prepare fine-tuning data in JSONL format** with `{"messages": [...]}` objects matching the chat format.
- **Test multiple models** — Together hosts many open-source options; the best model depends on your specific task.
- **Use the OpenAI SDK compatibility** to make it trivial to switch between providers.
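The `{"messages": [...]}` JSONL shape for fine-tuning data can be sketched as follows. The `toJsonl` helper and the example conversations are illustrative assumptions, not part of the Together SDK:

```typescript
// Build fine-tuning records in the {"messages": [...]} JSONL shape.
// Each training example is one JSON object per line.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function toJsonl(examples: ChatMessage[][]): string {
  return examples
    .map((messages) => JSON.stringify({ messages }))
    .join("\n");
}

const jsonl = toJsonl([
  [
    { role: "user", content: "Translate to French: hello" },
    { role: "assistant", content: "bonjour" },
  ],
  [
    { role: "user", content: "Translate to French: goodbye" },
    { role: "assistant", content: "au revoir" },
  ],
]);
console.log(jsonl);
```

Write the result to a `.jsonl` file before uploading it for fine-tuning.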
## Quick Example
```bash
TOGETHER_API_KEY=...
```
## Setup

Install the SDK and configure:
```typescript
import Together from "together-ai";

const together = new Together({
  apiKey: process.env.TOGETHER_API_KEY!,
});

// Basic chat completion
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a TypeScript function to debounce." },
  ],
  max_tokens: 512,
  temperature: 0.7,
});

console.log(response.choices[0].message.content);
```
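The install step itself isn't shown above; assuming npm, it would typically be:

```shell
npm install together-ai
```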
You can also use the OpenAI SDK:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY!,
  baseURL: "https://api.together.xyz/v1",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Hello!" }],
});
```
Environment variables:

```bash
TOGETHER_API_KEY=...
```
## Key Techniques

### Streaming

```typescript
const stream = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Explain distributed consensus." }],
  max_tokens: 1024,
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
### JSON Mode

```typescript
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    {
      role: "system",
      content: "Extract structured data from the user's text. Respond in JSON.",
    },
    {
      role: "user",
      content:
        "The meeting is on March 15th at 2pm with Alice and Bob about Q1 planning.",
    },
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const data = JSON.parse(response.choices[0].message.content!);
console.log(data);
// { date: "March 15th", time: "2pm", attendees: ["Alice", "Bob"], topic: "Q1 planning" }
```
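Even with JSON mode on, defensive parsing costs little. A hypothetical helper (an assumption for illustration, not a Together SDK API) that also tolerates Markdown-fenced JSON, which models sometimes emit when JSON mode is off:

```typescript
// Parse model output as JSON, stripping a Markdown code fence if present.
// This helper is illustrative; it is not part of the Together SDK.
function parseModelJson<T = unknown>(raw: string): T {
  const trimmed = raw.trim();
  // Match an optional ```json ... ``` wrapper around the payload
  const fenced = trimmed.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  const body = fenced ? fenced[1] : trimmed;
  return JSON.parse(body) as T;
}

const direct = parseModelJson<{ date: string }>('{"date": "March 15th"}');
const wrapped = parseModelJson<{ date: string }>(
  '```json\n{"date": "March 15th"}\n```'
);
```

Use it in place of a bare `JSON.parse(response.choices[0].message.content!)`.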
### JSON Schema (Structured Output)

```typescript
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    {
      role: "system",
      content: "Extract product information from the description.",
    },
    {
      role: "user",
      content:
        "The Sony WH-1000XM5 headphones cost $349 and have 30-hour battery life.",
    },
  ],
  response_format: {
    type: "json_object",
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        brand: { type: "string" },
        price: { type: "number" },
        features: { type: "array", items: { type: "string" } },
      },
      required: ["product_name", "brand", "price"],
    },
  },
  max_tokens: 256,
});
```
### Embeddings

```typescript
const embedding = await together.embeddings.create({
  model: "togethercomputer/m2-bert-80M-8k-retrieval",
  input: "What is retrieval augmented generation?",
});

const vector = embedding.data[0].embedding;
console.log("Dimensions:", vector.length);

// Batch embeddings
const batchEmbeddings = await together.embeddings.create({
  model: "togethercomputer/m2-bert-80M-8k-retrieval",
  input: [
    "First document about machine learning",
    "Second document about web development",
    "Third document about cloud computing",
  ],
});
```
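Embedding vectors are typically compared with cosine similarity; a minimal, SDK-independent implementation:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Rank documents for retrieval by scoring a query embedding against each document embedding from the batch call above.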
### Model Selection

```typescript
// Fast and cheap — great for simple tasks
const fast = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
  messages: [{ role: "user", content: "Classify sentiment: I love this!" }],
  max_tokens: 10,
});

// Balanced — good quality at reasonable cost
const balanced = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Write a detailed code review." }],
  max_tokens: 1024,
});

// Coding focused
const coding = await together.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-32B-Instruct",
  messages: [
    {
      role: "user",
      content: "Implement a red-black tree in TypeScript with insert and delete.",
    },
  ],
  max_tokens: 2048,
});

// Mixture of Experts — strong reasoning
const moe = await together.chat.completions.create({
  model: "mistralai/Mixtral-8x22B-Instruct-v0.1",
  messages: [{ role: "user", content: "Analyze this business case..." }],
  max_tokens: 1024,
});
```
### Fine-Tuning

```typescript
import { readFileSync } from "node:fs";

// Upload training data (JSONL format)
const file = await together.files.upload({
  file: new File(
    [readFileSync("./training_data.jsonl")],
    "training_data.jsonl",
    { type: "application/jsonl" }
  ),
  purpose: "fine-tune",
});

// Create a fine-tuning job
const job = await together.fineTuning.create({
  training_file: file.id,
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  n_epochs: 3,
  learning_rate: 1e-5,
  batch_size: 4,
  suffix: "my-custom-model",
});
console.log("Job ID:", job.id);

// Check job status
const status = await together.fineTuning.retrieve(job.id);
console.log("Status:", status.status);

// Once complete, use your fine-tuned model
const response = await together.chat.completions.create({
  model: status.output_name!, // your fine-tuned model name
  messages: [{ role: "user", content: "..." }],
});
```
### Text Completions (Non-Chat)

```typescript
const completion = await together.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  prompt: "The three laws of thermodynamics are:\n1.",
  max_tokens: 256,
  stop: ["\n\n"],
  temperature: 0.5,
});

console.log(completion.choices[0].text);
```
### Multi-Turn Conversation

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const history: Message[] = [
  { role: "system", content: "You are a Python tutor. Be encouraging but precise." },
];

async function chat(userMessage: string): Promise<string> {
  history.push({ role: "user", content: userMessage });
  const response = await together.chat.completions.create({
    model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages: history,
    max_tokens: 512,
    temperature: 0.7,
  });
  const reply = response.choices[0].message.content!;
  history.push({ role: "assistant", content: reply });
  return reply;
}

await chat("How do Python generators work?");
await chat("Can you show me an example with fibonacci?");
```
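Unbounded history will eventually exceed the model's context window. One hedged approach (the 8-message budget is an arbitrary illustration) keeps the system prompt plus only the most recent turns:

```typescript
interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep system messages plus the last `maxNonSystem` conversation messages.
function trimHistory(history: Msg[], maxNonSystem = 8): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxNonSystem)];
}
```

Call `trimHistory(history)` to build the `messages` array before each request; for precise budgets, count tokens rather than messages.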
## Anti-Patterns
- Using huge models for simple tasks — classification and extraction work great on 8B models at a fraction of the cost.
- Not using JSON mode for structured output — asking for JSON in the prompt alone is unreliable with open-source models.
- Ignoring model-specific prompt formats — Llama, Mistral, and other models have different chat templates. The chat API handles this, but raw completions require correct formatting.
- Fine-tuning with too little data — aim for at least 100-500 high-quality examples. Below that, few-shot prompting is often better.
- Not setting `stop` tokens — open-source models may generate beyond the expected response boundary without explicit stop sequences.
- Using the completions API when chat is available — the chat API applies the correct prompt template automatically.
- Sending excessively long system prompts on small models — 8B models have limited instruction-following capacity with long system prompts. Keep them concise.
- Not checking for rate limits — Together has per-minute token limits. Implement retry logic with exponential backoff.
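The rate-limit advice above can be sketched as a generic retry wrapper. Retrying on any error is a simplification; in practice, inspect the error for an HTTP 429 status before retrying:

```typescript
// Retry an async call with exponential backoff and jitter.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Delays of ~500ms, 1s, 2s, ... scaled by random jitter in [0.5, 1.5)
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrap any SDK call, e.g. `await withRetry(() => together.chat.completions.create({ ... }))`.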
## Related Skills

- **Anthropic Claude API**: messages API, tool use, streaming, vision, system prompts, extended thinking, batches, Node SDK
- **Fireworks AI**: fast inference, function calling, grammar mode, JSON output, OpenAI-compatible API, fine-tuning
- **Google Gemini API**: generateContent, multimodal (images/video/audio), function calling, streaming, embeddings, context caching
- **Groq**: ultra-fast inference, OpenAI-compatible API, Llama/Mixtral models, tool use, JSON mode, streaming
- **OpenAI API**: chat completions, function calling/tools, streaming, embeddings, vision, JSON mode, assistants, Node SDK
- **Replicate**: run open-source models, image generation (Flux/SDXL), predictions API, webhooks, streaming, Node SDK