# OpenAI API
OpenAI API integration patterns for chat completions, embeddings, and assistants
You are an expert in OpenAI API integration for building LLM-powered applications.
## Key Points
- Store API keys in environment variables, never in source code.
- Set `max_tokens` to control costs and prevent runaway responses.
- Use `temperature: 0` for deterministic outputs in testing and data extraction.
- Implement retry logic with exponential backoff for rate limits (429) and server errors (5xx).
- Track token usage from `response.usage` for cost monitoring.
- Use the latest model versions (e.g., `gpt-4o`) for best price-performance.
- Validate and sanitize user input before including it in prompts.
- Set request timeouts to avoid hanging on slow responses.
- Exceeding context window limits without truncation logic causes API errors.
- Not handling `null` content in response choices leads to runtime crashes.
- Ignoring rate limit headers (`x-ratelimit-remaining`) results in unnecessary 429 errors.
- Hardcoding model names makes upgrades painful; use configuration or constants.
## Quick Example
```typescript
import OpenAI from "openai";
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
```
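The client can also carry timeout and retry settings so they are configured once rather than per call. A configuration sketch, assuming the official `openai` Node SDK; the specific values are illustrative, not recommendations:

```typescript
import OpenAI from "openai";

// Timeout and retry behavior set once, at client construction.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 30_000, // fail fast instead of hanging on slow responses (ms)
  maxRetries: 2,   // SDK-level retries for transient errors
});
```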
```typescript
const messages: OpenAI.ChatCompletionMessageParam[] = [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript." },
];
```
## Overview
The OpenAI API provides access to GPT-4, GPT-4o, and other models through a REST API and official SDKs. Integration involves managing API keys, constructing message arrays, handling token limits, and processing responses efficiently.
## Core Concepts
### Authentication and Client Setup
```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
```
### Message Structure
OpenAI uses a role-based message array with system, user, and assistant roles:
```typescript
const messages: OpenAI.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a helpful coding assistant." },
  { role: "user", content: "Explain async/await in JavaScript." },
];
```
### Chat Completions
```typescript
async function chat(prompt: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.7,
    max_tokens: 1024,
  });
  return response.choices[0].message.content ?? "";
}
```
### Structured Outputs with `response_format`
```typescript
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "List 3 programming languages as JSON" }],
  response_format: { type: "json_object" },
});
// Avoid a non-null assertion here: content can be null.
const data = JSON.parse(response.choices[0].message.content ?? "{}");
```
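Because `json_object` mode improves but does not guarantee well-formed output, parsing defensively avoids runtime crashes. A sketch (the helper name is mine, not part of the SDK):

```typescript
// Parse model output defensively: null content and malformed JSON are both
// expected failure modes, so the caller gets `null` instead of an exception.
function parseModelJson<T>(content: string | null | undefined): T | null {
  if (!content) return null;
  try {
    return JSON.parse(content) as T;
  } catch {
    return null; // malformed JSON despite response_format
  }
}
```

A caller would then validate the parsed value against its expected schema before using it.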
### Streaming Responses
```typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
### Token Counting and Cost Management
```typescript
import { encoding_for_model } from "tiktoken";

function countTokens(text: string, model: string = "gpt-4o"): number {
  const enc = encoding_for_model(model as any);
  const tokens = enc.encode(text);
  enc.free();
  return tokens.length;
}
```
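Token counts translate directly into cost. A sketch of per-request cost estimation from the fields in `response.usage`; the per-million-token prices here are placeholders, not current OpenAI pricing:

```typescript
// Usage shape as returned in response.usage.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// PLACEHOLDER prices per million tokens — look up real pricing before use.
const PRICE_PER_MTOK = { input: 2.5, output: 10 };

function estimateCostUSD(usage: Usage): number {
  return (
    (usage.prompt_tokens / 1_000_000) * PRICE_PER_MTOK.input +
    (usage.completion_tokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}
```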
### Retry Logic with Exponential Backoff
```typescript
async function chatWithRetry(
  messages: OpenAI.ChatCompletionMessageParam[],
  maxRetries = 3
): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages,
      });
      return response.choices[0].message.content ?? "";
    } catch (error: any) {
      if (error.status === 429 || error.status >= 500) {
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise((r) => setTimeout(r, delay));
        continue;
      }
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}
```
## Implementation Patterns
### Conversation History Management
```typescript
class ConversationManager {
  private messages: OpenAI.ChatCompletionMessageParam[] = [];
  // Token budget for a token-based trim; unused by the count-based trimHistory below.
  private maxTokens = 8000;

  constructor(systemPrompt: string) {
    this.messages = [{ role: "system", content: systemPrompt }];
  }

  async send(userMessage: string): Promise<string> {
    this.messages.push({ role: "user", content: userMessage });
    this.trimHistory();
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: this.messages,
    });
    const reply = response.choices[0].message.content ?? "";
    this.messages.push({ role: "assistant", content: reply });
    return reply;
  }

  private trimHistory(): void {
    while (this.messages.length > 20) {
      // Keep system message, remove oldest user/assistant pair
      this.messages.splice(1, 2);
    }
  }
}
```
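`trimHistory` above caps the message count; trimming by token budget is usually more accurate. A sketch using the rough 4-characters-per-token heuristic (an assumption; a real implementation would count with tiktoken as shown earlier):

```typescript
type Msg = { role: string; content: string };

// ~4 chars per token is a crude English-text heuristic, good enough for trimming.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Drop the oldest non-system messages until the estimated total fits the budget.
function trimToBudget(messages: Msg[], budget: number): Msg[] {
  const [system, ...rest] = messages;
  let total =
    approxTokens(system.content) +
    rest.reduce((n, m) => n + approxTokens(m.content), 0);
  while (rest.length > 0 && total > budget) {
    total -= approxTokens(rest.shift()!.content);
  }
  return [system, ...rest];
}
```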
### Batch Processing
```typescript
async function batchProcess(prompts: string[], concurrency = 5): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < prompts.length; i += concurrency) {
    const batch = prompts.slice(i, i + concurrency);
    const batchResults = await Promise.all(batch.map((p) => chat(p)));
    results.push(...batchResults);
  }
  return results;
}
```
## Best Practices
- Store API keys in environment variables, never in source code.
- Set `max_tokens` to control costs and prevent runaway responses.
- Use `temperature: 0` for deterministic outputs in testing and data extraction.
- Implement retry logic with exponential backoff for rate limits (429) and server errors (5xx).
- Track token usage from `response.usage` for cost monitoring.
- Use the latest model versions (e.g., `gpt-4o`) for best price-performance.
- Validate and sanitize user input before including it in prompts.
- Set request timeouts to avoid hanging on slow responses.
## Core Philosophy
The OpenAI API is a stateless request-response interface. Every call is independent; the API has no memory of previous requests. Conversation continuity is your responsibility -- you maintain the message history, decide what to include, and manage the token budget. This stateless design is both a constraint and a feature: it means you have full control over what context the model sees, but it also means you must implement trimming, summarization, or windowing when conversations grow long.
Treat cost and latency as first-class design parameters, not afterthoughts. Every token in the prompt costs money and adds latency; every token in the response costs more money and adds more latency. Setting max_tokens is not just about preventing runaway responses -- it is about expressing your intent for how long the response should be. Tracking response.usage per request, aggregating costs, and setting per-user or per-session budgets are production requirements, not nice-to-haves.
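Per-session budgeting can be as simple as an accumulator checked before each call. A minimal sketch (the class and threshold are illustrative, not from the SDK):

```typescript
// Accumulates total_tokens per session and refuses further calls past the budget.
class TokenBudget {
  private used = 0;

  constructor(private limit: number) {}

  // Call after each response with response.usage.total_tokens.
  record(totalTokens: number): void {
    this.used += totalTokens;
  }

  // Check before each API call.
  allow(): boolean {
    return this.used < this.limit;
  }
}
```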
Reliability engineering is mandatory for any production integration. The API will return 429 (rate limit) and 5xx (server error) responses. Your code must handle these gracefully with exponential backoff and retry logic. It will occasionally return null content in response choices. It will sometimes produce malformed JSON even when response_format: json_object is set. Defensive coding -- null checks, parse validation, timeout handling -- is the difference between a demo and a production system.
## Anti-Patterns
- **No retry logic for transient errors**: Calling the API once and propagating any error directly to the user. Rate limits (429) and server errors (5xx) are expected in normal operation and should be retried with exponential backoff. Only authentication errors (401) and validation errors (400) should fail immediately.
- **Unbounded conversation history**: Appending every user and assistant message to the history array without ever trimming. As the conversation grows, it eventually exceeds the context window, causing API errors. Implement a sliding window, summarization, or message count limit.
- **Hardcoding model names across the codebase**: Scattering `"gpt-4o"` as a string literal in every API call instead of defining it once in configuration. When a new model is released or pricing changes, every callsite must be found and updated.
- **Ignoring `response.usage` for cost tracking**: Making API calls without recording token usage. Without tracking, there is no way to detect cost anomalies, set budgets, or optimize prompt length. Log `prompt_tokens`, `completion_tokens`, and `total_tokens` from every response.
- **Trusting `response_format: json_object` without validation**: Assuming that setting `response_format` to `json_object` guarantees valid, schema-compliant JSON. The format constraint improves reliability but does not eliminate malformed output. Always parse with `JSON.parse` in a try-catch and validate against your schema.
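The hardcoding anti-pattern is avoided with a single module-level configuration. A sketch; the aliases and model names are illustrative:

```typescript
// One place to change when models or pricing shift.
const MODEL_CONFIG = {
  chat: "gpt-4o",
  cheap: "gpt-4o-mini",
} as const;

type ModelAlias = keyof typeof MODEL_CONFIG;

// Callsites then use MODEL_CONFIG.chat instead of a string literal.
```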
## Common Pitfalls
- Exceeding context window limits without truncation logic causes API errors.
- Not handling `null` content in response choices leads to runtime crashes.
- Ignoring rate limit headers (`x-ratelimit-remaining`) results in unnecessary 429 errors.
- Hardcoding model names makes upgrades painful; use configuration or constants.
- Sending conversation history without trimming accumulates costs rapidly.
- Not setting `max_tokens` can result in unexpectedly long (and expensive) responses.
- Using `JSON.parse` on model output without validation; models can produce malformed JSON even with `response_format: json_object`.
## Related Skills
- **Anthropic API**: Anthropic Claude API integration for messages, streaming, and tool use
- **Embeddings**: Text embeddings and semantic search with vector databases for LLM applications
- **Function Calling**: Function/tool calling patterns for connecting LLMs to external APIs and data sources
- **LangChain**: LangChain orchestration for chains, agents, memory, and retrieval workflows
- **RAG Pipeline**: Building retrieval-augmented generation pipelines with document ingestion, retrieval, and synthesis
- **Streaming**: Streaming LLM responses with SSE, WebSockets, and backpressure handling