# OpenAI API
OpenAI API integration patterns for chat completions, embeddings, and assistants
You are an expert in OpenAI API integration for building LLM-powered applications.
## Key Points
- Store API keys in environment variables, never in source code.
- Set `max_tokens` to control costs and prevent runaway responses.
- Use `temperature: 0` for deterministic outputs in testing and data extraction.
- Implement retry logic with exponential backoff for rate limits (429) and server errors (5xx).
- Track token usage from `response.usage` for cost monitoring.
- Use the latest model versions (e.g., `gpt-4o`) for best price-performance.
- Validate and sanitize user input before including it in prompts.
- Set request timeouts to avoid hanging on slow responses.
- Exceeding context window limits without truncation logic causes API errors.
- Not handling `null` content in response choices leads to runtime crashes.
- Ignoring rate limit headers (`x-ratelimit-remaining`) results in unnecessary 429 errors.
- Hardcoding model names makes upgrades painful; use configuration or constants.
## Quick Example
```typescript
import OpenAI from "openai";
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
```
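The client can also carry timeout and retry settings so they are configured once rather than per call. A configuration sketch, assuming the official `openai` Node SDK; the specific values are illustrative, not recommendations:

```typescript
import OpenAI from "openai";

// Timeout and retry behavior set once, at client construction.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 30_000, // fail fast instead of hanging on slow responses (ms)
  maxRetries: 2,   // SDK-level retries for transient errors
});
```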
```typescript
const messages: OpenAI.ChatCompletionMessageParam[] = [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript." },
];
```
## Overview
The OpenAI API provides access to GPT-4, GPT-4o, and other models through a REST API and official SDKs. Integration involves managing API keys, constructing message arrays, handling token limits, and processing responses efficiently.
## Core Concepts
### Authentication and Client Setup
```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
```
### Message Structure
OpenAI uses a role-based message array with system, user, and assistant roles:
```typescript
const messages: OpenAI.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a helpful coding assistant." },
  { role: "user", content: "Explain async/await in JavaScript." },
];
```
### Chat Completions
```typescript
async function chat(prompt: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.7,
    max_tokens: 1024,
  });
  return response.choices[0].message.content ?? "";
}
```
### Structured Outputs with `response_format`
```typescript
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "List 3 programming languages as JSON" }],
  response_format: { type: "json_object" },
});
// Avoid a non-null assertion here: content can be null.
const data = JSON.parse(response.choices[0].message.content ?? "{}");
```
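Because `json_object` mode improves but does not guarantee well-formed output, parsing defensively avoids runtime crashes. A sketch (the helper name is mine, not part of the SDK):

```typescript
// Parse model output defensively: null content and malformed JSON are both
// expected failure modes, so the caller gets `null` instead of an exception.
function parseModelJson<T>(content: string | null | undefined): T | null {
  if (!content) return null;
  try {
    return JSON.parse(content) as T;
  } catch {
    return null; // malformed JSON despite response_format
  }
}
```

A caller would then validate the parsed value against its expected schema before using it.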
### Streaming Responses
```typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
### Token Counting and Cost Management
```typescript
import { encoding_for_model } from "tiktoken";

function countTokens(text: string, model: string = "gpt-4o"): number {
  const enc = encoding_for_model(model as any);
  const tokens = enc.encode(text);
  enc.free();
  return tokens.length;
}
```
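Token counts translate directly into cost. A sketch of per-request cost estimation from the fields in `response.usage`; the per-million-token prices here are placeholders, not current OpenAI pricing:

```typescript
// Usage shape as returned in response.usage.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// PLACEHOLDER prices per million tokens — look up real pricing before use.
const PRICE_PER_MTOK = { input: 2.5, output: 10 };

function estimateCostUSD(usage: Usage): number {
  return (
    (usage.prompt_tokens / 1_000_000) * PRICE_PER_MTOK.input +
    (usage.completion_tokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}
```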
### Retry Logic with Exponential Backoff
```typescript
async function chatWithRetry(
  messages: OpenAI.ChatCompletionMessageParam[],
  maxRetries = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages,
      });
      return response.choices[0].message.content ?? "";
    } catch (error: any) {
      if (error.status === 429 || error.status >= 500) {
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise((r) => setTimeout(r, delay));
        continue;
      }
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}
```
## Implementation Patterns
### Conversation History Management
```typescript
class ConversationManager {
  private messages: OpenAI.ChatCompletionMessageParam[] = [];
  // Token budget for a token-based trim; unused by the count-based trimHistory below.
  private maxTokens = 8000;

  constructor(systemPrompt: string) {
    this.messages = [{ role: "system", content: systemPrompt }];
  }

  async send(userMessage: string): Promise<string> {
    this.messages.push({ role: "user", content: userMessage });
    this.trimHistory();
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: this.messages,
    });
    const reply = response.choices[0].message.content ?? "";
    this.messages.push({ role: "assistant", content: reply });
    return reply;
  }

  private trimHistory(): void {
    while (this.messages.length > 20) {
      // Keep system message, remove oldest user/assistant pair
      this.messages.splice(1, 2);
    }
  }
}
```
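`trimHistory` above caps the message count; trimming by token budget is usually more accurate. A sketch using the rough 4-characters-per-token heuristic (an assumption; a real implementation would count with tiktoken as shown earlier):

```typescript
type Msg = { role: string; content: string };

// ~4 chars per token is a crude English-text heuristic, good enough for trimming.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Drop the oldest non-system messages until the estimated total fits the budget.
function trimToBudget(messages: Msg[], budget: number): Msg[] {
  const [system, ...rest] = messages;
  let total =
    approxTokens(system.content) +
    rest.reduce((n, m) => n + approxTokens(m.content), 0);
  while (rest.length > 0 && total > budget) {
    total -= approxTokens(rest.shift()!.content);
  }
  return [system, ...rest];
}
```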
### Batch Processing
```typescript
async function batchProcess(prompts: string[], concurrency = 5): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < prompts.length; i += concurrency) {
    const batch = prompts.slice(i, i + concurrency);
    const batchResults = await Promise.all(batch.map((p) => chat(p)));
    results.push(...batchResults);
  }
  return results;
}
```
## Best Practices
- Store API keys in environment variables, never in source code.
- Set `max_tokens` to control costs and prevent runaway responses.
- Use `temperature: 0` for deterministic outputs in testing and data extraction.
- Implement retry logic with exponential backoff for rate limits (429) and server errors (5xx).
- Track token usage from `response.usage` for cost monitoring.
- Use the latest model versions (e.g., `gpt-4o`) for best price-performance.
- Validate and sanitize user input before including it in prompts.
- Set request timeouts to avoid hanging on slow responses.
## Core Philosophy
The OpenAI API is a stateless request-response interface. Every call is independent; the API has no memory of previous requests. Conversation continuity is your responsibility -- you maintain the message history, decide what to include, and manage the token budget. This stateless design is both a constraint and a feature: it means you have full control over what context the model sees, but it also means you must implement trimming, summarization, or windowing when conversations grow long.
Treat cost and latency as first-class design parameters, not afterthoughts. Every token in the prompt costs money and adds latency; every token in the response costs more money and adds more latency. Setting max_tokens is not just about preventing runaway responses -- it is about expressing your intent for how long the response should be. Tracking response.usage per request, aggregating costs, and setting per-user or per-session budgets are production requirements, not nice-to-haves.
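Per-session budgeting can be as simple as an accumulator checked before each call. A minimal sketch (the class and threshold are illustrative, not from the SDK):

```typescript
// Accumulates total_tokens per session and refuses further calls past the budget.
class TokenBudget {
  private used = 0;

  constructor(private limit: number) {}

  // Call after each response with response.usage.total_tokens.
  record(totalTokens: number): void {
    this.used += totalTokens;
  }

  // Check before each API call.
  allow(): boolean {
    return this.used < this.limit;
  }
}
```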
Reliability engineering is mandatory for any production integration. The API will return 429 (rate limit) and 5xx (server error) responses. Your code must handle these gracefully with exponential backoff and retry logic. It will occasionally return null content in response choices. It will sometimes produce malformed JSON even when response_format: json_object is set. Defensive coding -- null checks, parse validation, timeout handling -- is the difference between a demo and a production system.
## Anti-Patterns
- **No retry logic for transient errors**: Calling the API once and propagating any error directly to the user. Rate limits (429) and server errors (5xx) are expected in normal operation and should be retried with exponential backoff. Only authentication errors (401) and validation errors (400) should fail immediately.
- **Unbounded conversation history**: Appending every user and assistant message to the history array without ever trimming. As the conversation grows, it eventually exceeds the context window, causing API errors. Implement a sliding window, summarization, or message count limit.
- **Hardcoding model names across the codebase**: Scattering `"gpt-4o"` as a string literal in every API call instead of defining it once in configuration. When a new model is released or pricing changes, every callsite must be found and updated.
- **Ignoring `response.usage` for cost tracking**: Making API calls without recording token usage. Without tracking, there is no way to detect cost anomalies, set budgets, or optimize prompt length. Log `prompt_tokens`, `completion_tokens`, and `total_tokens` from every response.
- **Trusting `response_format: json_object` without validation**: Assuming that setting `response_format` to `json_object` guarantees valid, schema-compliant JSON. The format constraint improves reliability but does not eliminate malformed output. Always parse with `JSON.parse` in a try-catch and validate against your schema.
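The hardcoding anti-pattern is avoided with a single module-level configuration. A sketch; the aliases and model names are illustrative:

```typescript
// One place to change when models or pricing shift.
const MODEL_CONFIG = {
  chat: "gpt-4o",
  cheap: "gpt-4o-mini",
} as const;

type ModelAlias = keyof typeof MODEL_CONFIG;

// Callsites then use MODEL_CONFIG.chat instead of a string literal.
```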
## Common Pitfalls
- Exceeding context window limits without truncation logic causes API errors.
- Not handling `null` content in response choices leads to runtime crashes.
- Ignoring rate limit headers (`x-ratelimit-remaining`) results in unnecessary 429 errors.
- Hardcoding model names makes upgrades painful; use configuration or constants.
- Sending conversation history without trimming accumulates costs rapidly.
- Not setting `max_tokens` can result in unexpectedly long (and expensive) responses.
- Using `JSON.parse` on model output without validation; models can produce malformed JSON even with `response_format: json_object`.
## Related Skills
- **Anthropic API**: Anthropic Claude API integration for messages, streaming, and tool use
- **Embeddings**: Text embeddings and semantic search with vector databases for LLM applications
- **Function Calling**: Function/tool calling patterns for connecting LLMs to external APIs and data sources
- **LangChain**: LangChain orchestration for chains, agents, memory, and retrieval workflows
- **RAG Pipeline**: Building retrieval-augmented generation pipelines with document ingestion, retrieval, and synthesis
- **Streaming**: Streaming LLM responses with SSE, WebSockets, and backpressure handling