
OpenAI TTS

"OpenAI TTS: text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats"


OpenAI TTS Skill

Core Philosophy

OpenAI's Text-to-Speech API provides high-quality voice synthesis with a simple, consistent interface. It prioritizes ease of use and integration with the broader OpenAI ecosystem. Build with these principles:

  • Simplicity over configuration — The API uses sensible defaults. Pick a voice, send text, receive audio. No complex voice settings to tune.
  • Stream for responsiveness — The streaming endpoint returns audio as it generates, enabling real-time playback in applications.
  • Match voice to context — Each of the six voices has a distinct character. Choose deliberately based on your application's tone and audience.
  • Use HD when quality matters — The tts-1-hd model produces higher fidelity audio at the cost of increased latency. Use tts-1 for real-time scenarios.

Setup

Install the OpenAI SDK and configure:

import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { writeFile } from "node:fs/promises";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

Key Techniques

Basic Text-to-Speech

Generate speech and save to a file:

async function textToSpeech(
  text: string,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy",
  format: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm" = "mp3"
): Promise<Buffer> {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    response_format: format,
  });

  const arrayBuffer = await response.arrayBuffer();
  return Buffer.from(arrayBuffer);
}

// Usage
const audio = await textToSpeech(
  "Welcome to our application. Let me walk you through the key features.",
  "nova",
  "mp3"
);
await writeFile("welcome.mp3", audio);

HD Quality Generation

Use the HD model for narration, podcasts, and content where quality is critical:

async function generateHDSpeech(
  text: string,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
): Promise<Buffer> {
  const response = await openai.audio.speech.create({
    model: "tts-1-hd",
    voice,
    input: text,
    response_format: "flac",
    speed: 1.0,
  });

  const arrayBuffer = await response.arrayBuffer();
  return Buffer.from(arrayBuffer);
}

Streaming Audio to File

Stream audio directly to disk for large text inputs:

async function streamSpeechToFile(
  text: string,
  outputPath: string,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy"
): Promise<void> {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    response_format: "mp3",
  });

  // The SDK returns a fetch-style Response. Stream its body to disk
  // instead of buffering the whole file in memory (Node 17+ for fromWeb).
  const nodeStream = Readable.fromWeb(response.body as any);
  await pipeline(nodeStream, createWriteStream(outputPath));
}

Streaming to an HTTP Response

Serve generated audio directly to a client in an Express-style handler:

import type { Request, Response } from "express";

async function handleTTSRequest(req: Request, res: Response): Promise<void> {
  const { text, voice = "alloy" } = req.body;

  if (!text || text.length > 4096) {
    res.status(400).json({ error: "Text required, max 4096 characters" });
    return;
  }

  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    response_format: "mp3",
  });

  res.setHeader("Content-Type", "audio/mpeg");

  // Stream the body straight to the client rather than buffering it;
  // Node applies chunked transfer encoding automatically when no
  // Content-Length is set.
  await pipeline(Readable.fromWeb(response.body as any), res);
}

Speed Control

Adjust playback speed between 0.25x and 4.0x:

async function generateWithSpeed(
  text: string,
  speed: number,
  voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy"
): Promise<Buffer> {
  const clampedSpeed = Math.max(0.25, Math.min(4.0, speed));

  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: text,
    speed: clampedSpeed,
    response_format: "mp3",
  });

  return Buffer.from(await response.arrayBuffer());
}

Batch Generation with Multiple Voices

Generate the same text in multiple voices for comparison or multi-character scenarios:

type Voice = "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";

interface VoiceResult {
  voice: Voice;
  audio: Buffer;
}

async function generateMultiVoice(
  text: string,
  voices: Voice[]
): Promise<VoiceResult[]> {
  const results = await Promise.all(
    voices.map(async (voice) => {
      const response = await openai.audio.speech.create({
        model: "tts-1-hd",
        voice,
        input: text,
        response_format: "mp3",
      });
      return {
        voice,
        audio: Buffer.from(await response.arrayBuffer()),
      };
    })
  );
  return results;
}

// Generate a dialogue with different voices per character
async function generateDialogue(
  lines: Array<{ speaker: Voice; text: string }>
): Promise<Buffer[]> {
  const audioSegments: Buffer[] = [];
  for (const line of lines) {
    const response = await openai.audio.speech.create({
      model: "tts-1-hd",
      voice: line.speaker,
      input: line.text,
      response_format: "mp3",
    });
    audioSegments.push(Buffer.from(await response.arrayBuffer()));
  }
  return audioSegments;
}
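`generateMultiVoice` above fires every request at once with `Promise.all`, which can trip API rate limits for large voice lists. A minimal sketch of a concurrency-limited mapper (`mapWithConcurrency` is a hypothetical utility, not part of the OpenAI SDK; the limit value is an arbitrary choice):

```typescript
// Run async tasks over `items` with at most `limit` in flight at a time,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index; index claiming is
    // synchronous, so no two workers process the same item.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}
```

Usage: `await mapWithConcurrency(voices, 3, (voice) => textToSpeech(text, voice))` keeps at most three TTS requests in flight.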

Chunked Processing for Long Text

Split long documents into chunks that respect sentence boundaries:

function splitTextIntoChunks(text: string, maxChars: number = 4000): string[] {
  // Match sentences ending in . ! or ?, plus any trailing text that
  // lacks terminal punctuation (otherwise it would be silently dropped).
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars) {
      if (current) chunks.push(current.trim());
      current = sentence;
    } else {
      current += sentence;
    }
  }
  if (current) chunks.push(current.trim());
  return chunks;
}

async function generateLongAudio(
  text: string,
  voice: Voice
): Promise<Buffer[]> {
  const chunks = splitTextIntoChunks(text);
  const audioBuffers: Buffer[] = [];

  for (const chunk of chunks) {
    const response = await openai.audio.speech.create({
      model: "tts-1-hd",
      voice,
      input: chunk,
      response_format: "mp3",
    });
    audioBuffers.push(Buffer.from(await response.arrayBuffer()));
  }

  return audioBuffers;
}
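The per-chunk buffers from `generateLongAudio` can then be joined. MP3 is a frame-based format, so byte-concatenating same-voice, same-model segments generally plays back as one continuous stream (a sketch; for gapless joins or mixed formats, re-mux with an audio tool such as ffmpeg instead):

```typescript
// Join same-format audio segments into a single buffer.
function concatAudioSegments(segments: Buffer[]): Buffer {
  if (segments.length === 0) {
    throw new Error("No audio segments to join");
  }
  return Buffer.concat(segments);
}

// Usage with generateLongAudio from above:
// const merged = concatAudioSegments(await generateLongAudio(text, "nova"));
// await writeFile("narration.mp3", merged);
```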

Best Practices

  • Voice selection guide — alloy: versatile neutral. echo: warm male. fable: expressive British. onyx: deep authoritative. nova: friendly female. shimmer: clear and gentle. Test each for your use case.
  • Use opus for web streaming — Opus provides the best quality-to-size ratio for web applications and real-time communication.
  • Use pcm for audio pipelines — When feeding output into audio processing tools, use raw PCM (24kHz, 16-bit, mono) to avoid decode overhead.
  • Respect the 4096 character limit — Split longer text at sentence boundaries. Never split mid-word or mid-sentence.
  • Cache generated audio — Hash the input text + voice + model + speed to create cache keys. Identical inputs produce identical outputs.
  • Rate limit with queues — Use a request queue with concurrency limits to stay within API rate limits during batch generation.
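The caching bullet above can be sketched as a hash-based key (`ttsCacheKey` is a hypothetical helper; check your cache — filesystem, Redis, etc. — for this key before calling the API):

```typescript
import { createHash } from "node:crypto";

// Identical text + voice + model + speed always produces identical audio,
// so a content hash of those inputs is a safe cache key.
function ttsCacheKey(
  text: string,
  voice: string,
  model: string,
  speed: number
): string {
  return createHash("sha256")
    .update(JSON.stringify({ text, voice, model, speed }))
    .digest("hex");
}
```

Usage: `const key = ttsCacheKey(text, "nova", "tts-1", 1.0);` — serve the cached audio on a hit, and write the generated buffer back under `key` on a miss.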

Anti-Patterns

  • Using tts-1-hd for real-time chat — The HD model adds significant latency. Use tts-1 for conversational interfaces where responsiveness matters more than maximum fidelity.
  • Ignoring response_format — Defaulting to mp3 when your pipeline needs raw PCM wastes CPU on decode. Match the format to your consumption pattern.
  • Generating then discarding — Do not generate audio speculatively. Each request is billed on the input characters. Generate on demand or cache results.
  • Splitting text at character boundaries — Splitting mid-sentence produces unnatural pauses and intonation breaks. Always split at sentence or paragraph boundaries.
  • Embedding API keys in client-side code — Route all TTS requests through your backend. The API key grants access to your full OpenAI account.
  • Ignoring speed parameter for accessibility — Hardcoding speed at 1.0 excludes users who need slower speech. Expose speed as a user preference.
