OpenAI TTS
"OpenAI TTS: text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats"
OpenAI TTS Skill
Core Philosophy
OpenAI's Text-to-Speech API provides high-quality voice synthesis with a simple, consistent interface. It prioritizes ease of use and integration with the broader OpenAI ecosystem. Build with these principles:
- Simplicity over configuration — The API uses sensible defaults. Pick a voice, send text, receive audio. No complex voice settings to tune.
- Stream for responsiveness — The streaming endpoint returns audio as it generates, enabling real-time playback in applications.
- Match voice to context — Each of the six voices has a distinct character. Choose deliberately based on your application's tone and audience.
- Use HD when quality matters — The tts-1-hd model produces higher fidelity audio at the cost of increased latency. Use tts-1 for real-time scenarios.
Setup
Install the OpenAI SDK and configure:
import OpenAI from "openai";
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { writeFile } from "node:fs/promises";
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
Key Techniques
Basic Text-to-Speech
Generate speech and save to a file:
async function textToSpeech(
text: string,
voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy",
format: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm" = "mp3"
): Promise<Buffer> {
const response = await openai.audio.speech.create({
model: "tts-1",
voice,
input: text,
response_format: format,
});
const arrayBuffer = await response.arrayBuffer();
return Buffer.from(arrayBuffer);
}
// Usage
const audio = await textToSpeech(
"Welcome to our application. Let me walk you through the key features.",
"nova",
"mp3"
);
await writeFile("welcome.mp3", audio);
HD Quality Generation
Use the HD model for narration, podcasts, and content where quality is critical:
async function generateHDSpeech(
text: string,
voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer"
): Promise<Buffer> {
const response = await openai.audio.speech.create({
model: "tts-1-hd",
voice,
input: text,
response_format: "flac",
speed: 1.0,
});
const arrayBuffer = await response.arrayBuffer();
return Buffer.from(arrayBuffer);
}
Streaming Audio to File
Stream audio directly to disk for large text inputs:
async function streamSpeechToFile(
text: string,
outputPath: string,
voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy"
): Promise<void> {
const response = await openai.audio.speech.create({
model: "tts-1",
voice,
input: text,
response_format: "mp3",
});
// Pipe the response body as it arrives instead of buffering it all first.
if (!response.body) throw new Error("Empty response body");
const nodeStream = Readable.fromWeb(response.body as any);
const fileStream = createWriteStream(outputPath);
await pipeline(nodeStream, fileStream);
}
Streaming to an HTTP Response
Serve generated audio directly to a client in an Express-style handler:
import type { Request, Response } from "express";
async function handleTTSRequest(req: Request, res: Response): Promise<void> {
const { text, voice = "alloy" } = req.body;
if (!text || text.length > 4096) {
res.status(400).json({ error: "Text required, max 4096 characters" });
return;
}
const response = await openai.audio.speech.create({
model: "tts-1",
voice,
input: text,
response_format: "mp3",
});
res.setHeader("Content-Type", "audio/mpeg");
// res.send sets Content-Length itself; do not also force chunked encoding.
const buffer = Buffer.from(await response.arrayBuffer());
res.send(buffer);
}
Speed Control
Adjust playback speed between 0.25x and 4.0x:
async function generateWithSpeed(
text: string,
speed: number,
voice: "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer" = "alloy"
): Promise<Buffer> {
const clampedSpeed = Math.max(0.25, Math.min(4.0, speed));
const response = await openai.audio.speech.create({
model: "tts-1",
voice,
input: text,
speed: clampedSpeed,
response_format: "mp3",
});
return Buffer.from(await response.arrayBuffer());
}
Batch Generation with Multiple Voices
Generate the same text in multiple voices for comparison or multi-character scenarios:
type Voice = "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";
interface VoiceResult {
voice: Voice;
audio: Buffer;
}
async function generateMultiVoice(
text: string,
voices: Voice[]
): Promise<VoiceResult[]> {
const results = await Promise.all(
voices.map(async (voice) => {
const response = await openai.audio.speech.create({
model: "tts-1-hd",
voice,
input: text,
response_format: "mp3",
});
return {
voice,
audio: Buffer.from(await response.arrayBuffer()),
};
})
);
return results;
}
// Generate a dialogue with different voices per character
async function generateDialogue(
lines: Array<{ speaker: Voice; text: string }>
): Promise<Buffer[]> {
const audioSegments: Buffer[] = [];
for (const line of lines) {
const response = await openai.audio.speech.create({
model: "tts-1-hd",
voice: line.speaker,
input: line.text,
response_format: "mp3",
});
audioSegments.push(Buffer.from(await response.arrayBuffer()));
}
return audioSegments;
}
Chunked Processing for Long Text
Split long documents into chunks that respect sentence boundaries:
function splitTextIntoChunks(text: string, maxChars: number = 4000): string[] {
// Match full sentences, plus any trailing fragment without end punctuation.
const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];
const chunks: string[] = [];
let current = "";
for (const sentence of sentences) {
if (current.length + sentence.length > maxChars) {
if (current) chunks.push(current.trim());
current = sentence;
} else {
current += sentence;
}
}
if (current) chunks.push(current.trim());
return chunks;
}
async function generateLongAudio(
text: string,
voice: Voice
): Promise<Buffer[]> {
const chunks = splitTextIntoChunks(text);
const audioBuffers: Buffer[] = [];
for (const chunk of chunks) {
const response = await openai.audio.speech.create({
model: "tts-1-hd",
voice,
input: chunk,
response_format: "mp3",
});
audioBuffers.push(Buffer.from(await response.arrayBuffer()));
}
return audioBuffers;
}
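generateLongAudio returns one buffer per chunk; to produce a single file, the segments still need to be joined. A minimal sketch (concatAudioChunks is an illustrative helper, not part of the SDK; naive concatenation of MP3 segments plays in most players, but re-encode with an audio tool such as ffmpeg when seamless output matters):

```typescript
// Naive assembly of the per-chunk MP3 buffers from generateLongAudio.
// MP3 frames are self-contained, so most players handle back-to-back
// segments, though gapless playback is not guaranteed.
function concatAudioChunks(buffers: Buffer[]): Buffer {
  return Buffer.concat(buffers);
}
```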
Best Practices
- Voice selection guide — alloy: versatile neutral; echo: warm male; fable: expressive British; onyx: deep authoritative; nova: friendly female; shimmer: clear and gentle. Test each for your use case.
- Use opus for web streaming — Opus provides the best quality-to-size ratio for web applications and real-time communication.
- Use pcm for audio pipelines — When feeding output into audio processing tools, use raw PCM (24kHz, 16-bit, mono) to avoid decode overhead.
- Respect the 4096 character limit — Split longer text at sentence boundaries. Never split mid-word or mid-sentence.
- Cache generated audio — Hash the input text + voice + model + speed to create cache keys. Identical inputs produce identical outputs.
- Rate limit with queues — Use a request queue with concurrency limits to stay within API rate limits during batch generation.
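The rate-limiting advice above can be sketched as a small concurrency limiter (runWithConcurrency is an illustrative helper, not part of the SDK; a library such as p-limit covers the same ground in production):

```typescript
// Minimal concurrency limiter for batch generation. At most `limit`
// tasks run at once; the rest wait their turn.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Each worker pulls the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```

To use it, wrap each openai.audio.speech.create call in a zero-argument async function and pick a limit that matches your rate tier.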
Anti-Patterns
- Using tts-1-hd for real-time chat — The HD model adds significant latency. Use tts-1 for conversational interfaces where responsiveness matters more than maximum fidelity.
- Ignoring response_format — Defaulting to mp3 when your pipeline needs raw PCM wastes CPU on decode. Match the format to your consumption pattern.
- Generating then discarding — Do not generate audio speculatively. Each request costs tokens. Generate on demand or cache results.
- Splitting text at character boundaries — Splitting mid-sentence produces unnatural pauses and intonation breaks. Always split at sentence or paragraph boundaries.
- Embedding API keys in client-side code — Route all TTS requests through your backend. The API key grants access to your full OpenAI account.
- Ignoring speed parameter for accessibility — Hardcoding speed at 1.0 excludes users who need slower speech. Expose speed as a user preference.
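The caching guidance above — hash text + voice + model + speed — can be sketched as follows (makeTTSCacheKey is an illustrative name, not part of the SDK):

```typescript
import { createHash } from "node:crypto";

// Deterministic cache key over every input that affects the audio output.
// Identical text + voice + model + speed produce identical audio, so a
// cache hit can skip the API call entirely.
function makeTTSCacheKey(
  text: string,
  voice: string,
  model: string,
  speed: number
): string {
  return createHash("sha256")
    .update(JSON.stringify({ text, voice, model, speed }))
    .digest("hex");
}
```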
Install this skill directly: skilldb add voice-speech-services-skills
Related Skills
Amazon Polly
"Amazon Polly: AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming"
AssemblyAI
"AssemblyAI: speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis"
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
"Deepgram: speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming"
ElevenLabs
"ElevenLabs: AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming"
Google Cloud Text to Speech
"Google Cloud Text-to-Speech: WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual"