
# Together AI

> Together AI: inference API, open-source LLMs (Llama/Mistral), chat completions, embeddings, fine-tuning, JSON mode

## Quick Summary
Together AI provides fast, cost-effective inference for **open-source models** — Llama, Mistral, Mixtral, CodeLlama, and more. Use it when you want open-source model capabilities without managing GPU infrastructure. The API is **OpenAI-compatible**, so existing code migrates easily. Together also offers fine-tuning, making it a full platform for training and deploying custom models. Choose Together for cost-sensitive workloads, open-source model access, and when you need JSON mode with open models.

## Key Points

- **Use Turbo variants** (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`) for faster inference at near-identical quality.
- **Use JSON mode with a schema** for reliable structured extraction — Together enforces the schema at the token level.
- **Start with 8B models** for classification, extraction, and simple tasks. Scale to 70B+ only when quality demands it.
- **Use `stop` sequences** to control output length and format — open-source models sometimes ramble without them.
- **Set `max_tokens` explicitly** — defaults vary by model and can be unexpectedly high.
- **Prepare fine-tuning data in JSONL format** with `{"messages": [...]}` objects matching the chat format.
- **Test multiple models** — Together hosts many open-source options; the best model depends on your specific task.
- **Use the OpenAI-compatible endpoint** so switching between providers stays trivial.

## Pitfalls to Avoid

- **Using huge models for simple tasks** — classification and extraction work well on 8B models at a fraction of the cost.
- **Not using JSON mode for structured output** — asking for JSON in the prompt alone is unreliable with open-source models.
- **Ignoring model-specific prompt formats** — Llama, Mistral, and other models have different chat templates. The chat API handles this, but raw completions require correct formatting.
- **Fine-tuning with too little data** — aim for at least 100-500 high-quality examples. Below that, few-shot prompting is often better.

## Quick Example

```bash
TOGETHER_API_KEY=...
```

# Together AI Skill

## Core Philosophy

Together AI provides fast, cost-effective inference for open-source models — Llama, Mistral, Mixtral, CodeLlama, and more. Use it when you want open-source model capabilities without managing GPU infrastructure. The API is OpenAI-compatible, so existing code migrates easily. Together also offers fine-tuning, making it a full platform for training and deploying custom models. Choose Together for cost-sensitive workloads, open-source model access, and when you need JSON mode with open models.

## Setup

Install the SDK and configure:

```typescript
import Together from "together-ai";

const together = new Together({
  apiKey: process.env.TOGETHER_API_KEY!,
});

// Basic chat completion
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a TypeScript function to debounce." },
  ],
  max_tokens: 512,
  temperature: 0.7,
});

console.log(response.choices[0].message.content);
```

You can also use the OpenAI SDK:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY!,
  baseURL: "https://api.together.xyz/v1",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Hello!" }],
});
```

Environment variables:

```bash
TOGETHER_API_KEY=...
```

## Key Techniques

### Streaming

```typescript
const stream = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Explain distributed consensus." }],
  max_tokens: 1024,
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
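When you need the complete reply as well as live output, the streamed deltas can be accumulated as they arrive. A minimal self-contained sketch — the `fakeStream` generator here is a stand-in for the SDK's stream, included only so the snippet runs on its own:

```typescript
// Stand-in for the SDK's streamed chunks (illustrative, not the real API).
async function* fakeStream() {
  for (const piece of ["Consensus ", "is ", "agreement."]) {
    yield { choices: [{ delta: { content: piece } }] };
  }
}

// Accumulate delta fragments into the full reply string.
async function collect(
  stream: AsyncIterable<{ choices: { delta?: { content?: string } }[] }>
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) full += content;
  }
  return full;
}

(async () => {
  console.log(await collect(fakeStream())); // "Consensus is agreement."
})();
```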

### JSON Mode

```typescript
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    {
      role: "system",
      content: "Extract structured data from the user's text. Respond in JSON.",
    },
    {
      role: "user",
      content: "The meeting is on March 15th at 2pm with Alice and Bob about Q1 planning.",
    },
  ],
  response_format: { type: "json_object" },
  max_tokens: 256,
});

const data = JSON.parse(response.choices[0].message.content!);
console.log(data);
// { date: "March 15th", time: "2pm", attendees: ["Alice", "Bob"], topic: "Q1 planning" }
```
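JSON mode guarantees syntactically valid JSON, but not that the keys match what you asked for, so it is worth validating the parsed shape at runtime. A hedged sketch — the `Meeting` interface and `parseMeeting` helper are illustrative, not part of the SDK:

```typescript
interface Meeting {
  date: string;
  time: string;
  attendees: string[];
  topic: string;
}

// Narrow untyped JSON into the expected shape, failing loudly on drift.
function parseMeeting(raw: string): Meeting {
  const data = JSON.parse(raw);
  if (
    typeof data.date !== "string" ||
    typeof data.time !== "string" ||
    !Array.isArray(data.attendees) ||
    typeof data.topic !== "string"
  ) {
    throw new Error("Model output did not match the expected Meeting shape");
  }
  return data as Meeting;
}

const meeting = parseMeeting(
  '{"date":"March 15th","time":"2pm","attendees":["Alice","Bob"],"topic":"Q1 planning"}'
);
console.log(meeting.attendees.length); // 2
```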

### JSON Schema (Structured Output)

```typescript
const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [
    {
      role: "system",
      content: "Extract product information from the description.",
    },
    {
      role: "user",
      content: "The Sony WH-1000XM5 headphones cost $349 and have 30-hour battery life.",
    },
  ],
  response_format: {
    type: "json_object",
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        brand: { type: "string" },
        price: { type: "number" },
        features: { type: "array", items: { type: "string" } },
      },
      required: ["product_name", "brand", "price"],
    },
  },
  max_tokens: 256,
});
```

### Embeddings

```typescript
const embedding = await together.embeddings.create({
  model: "togethercomputer/m2-bert-80M-8k-retrieval",
  input: "What is retrieval augmented generation?",
});

const vector = embedding.data[0].embedding;
console.log("Dimensions:", vector.length);

// Batch embeddings
const batchEmbeddings = await together.embeddings.create({
  model: "togethercomputer/m2-bert-80M-8k-retrieval",
  input: [
    "First document about machine learning",
    "Second document about web development",
    "Third document about cloud computing",
  ],
});
```
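Embedding vectors are typically compared with cosine similarity for retrieval. A small helper for ranking the batch results above — this is a generic utility, not part of the SDK:

```typescript
// Cosine similarity: 1 for identical direction, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

To rank documents for a query, compute the similarity between the query embedding and each document embedding, then sort descending.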

### Model Selection

```typescript
// Fast and cheap — great for simple tasks
const fast = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
  messages: [{ role: "user", content: "Classify sentiment: I love this!" }],
  max_tokens: 10,
});

// Balanced — good quality at reasonable cost
const balanced = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Write a detailed code review." }],
  max_tokens: 1024,
});

// Coding focused
const coding = await together.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-32B-Instruct",
  messages: [
    {
      role: "user",
      content: "Implement a red-black tree in TypeScript with insert and delete.",
    },
  ],
  max_tokens: 2048,
});

// Mixture of Experts — strong reasoning
const moe = await together.chat.completions.create({
  model: "mistralai/Mixtral-8x22B-Instruct-v0.1",
  messages: [{ role: "user", content: "Analyze this business case..." }],
  max_tokens: 1024,
});
```
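One way to make this tiering systematic is a small task-to-model map. The tiers below are a sketch derived from the examples above; model availability changes, so check Together's current catalog before relying on specific names:

```typescript
// Hypothetical task categories for this application — adjust to your workload.
type Task = "classify" | "extract" | "chat" | "code";

const MODEL_FOR_TASK: Record<Task, string> = {
  classify: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", // cheap and fast
  extract: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  // 8B handles extraction well
  chat: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",    // balanced quality/cost
  code: "Qwen/Qwen2.5-Coder-32B-Instruct",                 // coding-focused
};

function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}

console.log(pickModel("classify"));
// "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
```

Routing by task keeps the default cheap and makes upgrades a one-line change per task category.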

### Fine-Tuning

```typescript
import { readFileSync } from "node:fs";

// Upload training data (JSONL format)
const file = await together.files.upload({
  file: new File(
    [readFileSync("./training_data.jsonl")],
    "training_data.jsonl",
    { type: "application/jsonl" }
  ),
  purpose: "fine-tune",
});

// Create a fine-tuning job
const job = await together.fineTuning.create({
  training_file: file.id,
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  n_epochs: 3,
  learning_rate: 1e-5,
  batch_size: 4,
  suffix: "my-custom-model",
});

console.log("Job ID:", job.id);

// Check job status
const status = await together.fineTuning.retrieve(job.id);
console.log("Status:", status.status);

// Once complete, use your fine-tuned model
const response = await together.chat.completions.create({
  model: status.output_name!, // your fine-tuned model name
  messages: [{ role: "user", content: "..." }],
});
```
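The `training_data.jsonl` file above must contain one `{"messages": [...]}` object per line. A sketch of generating it — the example rows are made up, and the `ChatExample`/`toJsonl` names are illustrative:

```typescript
import { writeFileSync } from "node:fs";

interface ChatExample {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// One JSON object per line — JSONL, not a JSON array.
function toJsonl(examples: ChatExample[]): string {
  return examples.map((e) => JSON.stringify(e)).join("\n");
}

const examples: ChatExample[] = [
  {
    messages: [
      { role: "user", content: "Classify sentiment: great product!" },
      { role: "assistant", content: "positive" },
    ],
  },
  {
    messages: [
      { role: "user", content: "Classify sentiment: arrived broken." },
      { role: "assistant", content: "negative" },
    ],
  },
];

writeFileSync("./training_data.jsonl", toJsonl(examples));
```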

### Text Completions (Non-Chat)

```typescript
const completion = await together.completions.create({
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
  prompt: "The three laws of thermodynamics are:\n1.",
  max_tokens: 256,
  stop: ["\n\n"],
  temperature: 0.5,
});

console.log(completion.choices[0].text);
```

### Multi-Turn Conversation

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const history: Message[] = [
  { role: "system", content: "You are a Python tutor. Be encouraging but precise." },
];

async function chat(userMessage: string): Promise<string> {
  history.push({ role: "user", content: userMessage });

  const response = await together.chat.completions.create({
    model: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages: history,
    max_tokens: 512,
    temperature: 0.7,
  });

  const reply = response.choices[0].message.content!;
  history.push({ role: "assistant", content: reply });
  return reply;
}

await chat("How do Python generators work?");
await chat("Can you show me an example with fibonacci?");
```
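Unbounded history will eventually overflow the model's context window. A simple sliding-window trim that always keeps the system prompt and drops the oldest turns first — the character budget here is a crude stand-in for real token counting, which is an assumption of this sketch:

```typescript
interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the system message plus the most recent turns that fit the budget.
function trimHistory(history: Msg[], maxChars: number): Msg[] {
  const [system, ...rest] = history;
  const kept: Msg[] = [];
  let total = system.content.length;
  // Walk backwards from the newest turn so recent context survives.
  for (let i = rest.length - 1; i >= 0; i--) {
    total += rest[i].content.length;
    if (total > maxChars) break;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```

Call `trimHistory(history, budget)` before each request instead of sending the full history; a token-based budget (via a real tokenizer) would be more precise.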

## Best Practices

- Use Turbo variants (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`) for faster inference at near-identical quality.
- Use JSON mode with a schema for reliable structured extraction — Together enforces the schema at the token level.
- Start with 8B models for classification, extraction, and simple tasks. Scale to 70B+ only when quality demands it.
- Use `stop` sequences to control output length and format — open-source models sometimes ramble without them.
- Set `max_tokens` explicitly — defaults vary by model and can be unexpectedly high.
- Prepare fine-tuning data in JSONL format with `{"messages": [...]}` objects matching the chat format.
- Test multiple models — Together hosts many open-source options; the best model depends on your specific task.
- Use the OpenAI-compatible endpoint so switching between providers stays trivial.

## Anti-Patterns

- Using huge models for simple tasks — classification and extraction work great on 8B models at a fraction of the cost.
- Not using JSON mode for structured output — asking for JSON in the prompt alone is unreliable with open-source models.
- Ignoring model-specific prompt formats — Llama, Mistral, and other models have different chat templates. The chat API handles this, but raw completions require correct formatting.
- Fine-tuning with too little data — aim for at least 100-500 high-quality examples. Below that, few-shot prompting is often better.
- Not setting stop tokens — open-source models may generate beyond the expected response boundary without explicit stop sequences.
- Using the completions API when chat is available — the chat API applies the correct prompt template automatically.
- Sending excessively long system prompts to small models — 8B models have limited instruction-following capacity with long system prompts. Keep them concise.
- Not handling rate limits — Together has per-minute token limits. Implement retry logic with exponential backoff.
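The rate-limit point deserves a concrete shape. A generic retry wrapper with exponential backoff and jitter — deciding which errors are retryable depends on the SDK's error type, which this sketch deliberately leaves open (it retries everything):

```typescript
// Retry an async call with exponential backoff plus random jitter.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up after the final attempt; otherwise back off and retry.
      if (attempt >= maxAttempts - 1) throw err;
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (hypothetical): wrap any SDK call.
// const response = await withRetry(() =>
//   together.chat.completions.create({ model, messages })
// );
```

In production you would inspect the thrown error and only retry rate-limit or transient network failures, ideally honoring any `Retry-After` information the API provides.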
