# ElevenLabs

> AI voice synthesis: text-to-speech, voice cloning, streaming audio, voice design, multilingual support, WebSocket streaming.
## Core Philosophy

ElevenLabs provides some of the most natural-sounding AI voice synthesis available. The platform excels at voice cloning, multilingual speech, and low-latency streaming. Build with these principles:

- **Voice quality first** — Use the highest-fidelity model appropriate for your latency budget: Eleven Multilingual v2 for quality, Eleven Turbo v2.5 for speed.
- **Stream everything** — Never wait for full audio generation. Use chunked transfer or WebSocket streaming to deliver audio as it is produced.
- **Clone responsibly** — Voice cloning requires consent. Use instant voice cloning for prototyping and professional voice cloning for production.
- **Cache aggressively** — Identical text + voice + model + settings produces identical audio. Cache results to save quota and latency.
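The cache-aggressively principle can be sketched as a content-hash keyed wrapper. `Synthesize` below is a hypothetical stand-in for any generation function; an in-memory `Map` keeps the sketch minimal, where a real deployment would likely use disk or object storage:

```typescript
import { createHash } from "node:crypto";

// Hypothetical signature: any (text, voiceId) -> audio Buffer producer fits.
type Synthesize = (text: string, voiceId: string) => Promise<Buffer>;

const audioCache = new Map<string, Buffer>();

function cacheKey(
  text: string,
  voiceId: string,
  modelId: string,
  settings: object
): string {
  // Identical inputs yield identical audio, so a content hash is a safe key.
  return createHash("sha256")
    .update(JSON.stringify({ text, voiceId, modelId, settings }))
    .digest("hex");
}

async function cachedSynthesize(
  synthesize: Synthesize,
  text: string,
  voiceId: string,
  modelId: string,
  settings: object
): Promise<Buffer> {
  const key = cacheKey(text, voiceId, modelId, settings);
  const hit = audioCache.get(key);
  if (hit) return hit;
  const audio = await synthesize(text, voiceId);
  audioCache.set(key, audio);
  return audio;
}
```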
## Setup

Install the official SDK (`npm install elevenlabs`) and configure authentication:

```typescript
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});
```

The streaming examples below also rely on Node's built-in stream utilities and the `ws` package (`npm install ws`):

```typescript
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { createWriteStream } from "node:fs";
import { WebSocket } from "ws";
```
## Key Techniques

### Basic Text-to-Speech

Generate speech from text and collect the audio into a single buffer:

```typescript
async function generateSpeech(text: string, voiceId: string): Promise<Buffer> {
  const audio = await client.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_multilingual_v2",
    output_format: "mp3_44100_128",
    voice_settings: {
      stability: 0.5,
      similarity_boost: 0.75,
      style: 0.0,
      use_speaker_boost: true,
    },
  });

  // The SDK returns the audio as an async iterable of chunks
  const chunks: Buffer[] = [];
  for await (const chunk of audio) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
```
### Streaming Audio to a File

Stream generated audio directly to disk without buffering the entire response. `pipeline` handles backpressure and closes the file on completion or error:

```typescript
async function streamToFile(
  text: string,
  voiceId: string,
  outputPath: string
): Promise<void> {
  const audioStream = await client.textToSpeech.convertAsStream(voiceId, {
    text,
    model_id: "eleven_turbo_v2_5",
    output_format: "mp3_22050_32",
  });

  await pipeline(Readable.from(audioStream), createWriteStream(outputPath));
}
```
### WebSocket Streaming for Real-Time Applications

Use the input-streaming WebSocket endpoint for the lowest latency. Send text chunks as they arrive and receive audio chunks immediately:

```typescript
interface WSMessage {
  audio?: string;                // base64-encoded audio chunk
  isFinal?: boolean;             // set on the last message of the stream
  normalizedAlignment?: unknown;
}

function createRealtimeStream(
  voiceId: string,
  onAudioChunk: (chunk: Buffer) => void
): {
  sendText: (text: string) => void;
  flush: () => void;
  close: () => Promise<void>;
} {
  const modelId = "eleven_turbo_v2_5";
  const wsUrl =
    `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input` +
    `?model_id=${modelId}&output_format=pcm_24000`;
  const ws = new WebSocket(wsUrl);

  let resolveClose!: () => void;
  const closePromise = new Promise<void>((r) => (resolveClose = r));

  ws.on("open", () => {
    // The first message carries voice settings and the API key;
    // a single space primes the stream without producing audio.
    ws.send(
      JSON.stringify({
        text: " ",
        voice_settings: { stability: 0.5, similarity_boost: 0.75 },
        xi_api_key: process.env.ELEVENLABS_API_KEY,
      })
    );
  });

  ws.on("message", (data) => {
    const msg: WSMessage = JSON.parse(data.toString());
    if (msg.audio) {
      onAudioChunk(Buffer.from(msg.audio, "base64"));
    }
    if (msg.isFinal) {
      resolveClose();
    }
  });

  return {
    sendText: (text: string) => ws.send(JSON.stringify({ text })),
    // flush forces generation of any buffered text; an empty text
    // message would instead signal end-of-stream
    flush: () => ws.send(JSON.stringify({ text: " ", flush: true })),
    close: async () => {
      ws.send(JSON.stringify({ text: "" })); // end-of-stream marker
      await closePromise;
      ws.close();
    },
  };
}
```
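With `output_format=pcm_24000` the stream above yields raw PCM, which ElevenLabs documents as 16-bit mono little-endian samples. To save the collected chunks as a playable file, one option is to prepend a minimal WAV header; a sketch under that format assumption:

```typescript
// Wrap raw 16-bit mono little-endian PCM in a minimal 44-byte WAV header.
function pcmToWav(pcm: Buffer, sampleRate = 24000): Buffer {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);  // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size (PCM)
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28);  // byte rate = rate * 1 ch * 2 bytes
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);      // data chunk size
  return Buffer.concat([header, pcm]);
}
```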
### Voice Cloning

Clone a voice from audio samples using instant voice cloning:

```typescript
import { createReadStream } from "node:fs";

async function cloneVoice(
  name: string,
  samplePaths: string[],
  description: string
): Promise<string> {
  const files = samplePaths.map((p) => createReadStream(p));
  const voice = await client.voices.add({
    name,
    description,
    files,
    labels: JSON.stringify({ accent: "neutral", use_case: "narration" }),
  });
  return voice.voice_id;
}
```
### Listing and Managing Voices

```typescript
async function listAvailableVoices() {
  const response = await client.voices.getAll();
  return response.voices.map((v) => ({
    id: v.voice_id,
    name: v.name,
    category: v.category,
    labels: v.labels,
    previewUrl: v.preview_url,
  }));
}

async function deleteClonedVoice(voiceId: string): Promise<void> {
  await client.voices.delete(voiceId);
}
```
### Multilingual Speech Generation

Generate speech in other languages. `eleven_multilingual_v2` detects the language from the text itself; the `language_code` parameter, which pins the output language explicitly, is only supported on certain models such as Turbo v2.5:

```typescript
async function generateMultilingual(
  text: string,
  voiceId: string,
  languageCode: string
): Promise<Buffer> {
  const audio = await client.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_turbo_v2_5",
    language_code: languageCode, // e.g. "de" or "es"; enforced on Turbo v2.5
    output_format: "mp3_44100_128",
  });

  const chunks: Buffer[] = [];
  for await (const chunk of audio) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
```
## Best Practices

- **Choose the right model** — Use `eleven_turbo_v2_5` for conversational or real-time applications where latency matters. Use `eleven_multilingual_v2` when voice quality and language coverage are paramount.
- **Tune voice settings per use case** — Higher stability (0.7-1.0) for narration and audiobooks. Lower stability (0.3-0.5) for conversational and expressive speech.
- **Use appropriate output formats** — `pcm_24000` for real-time playback pipelines. `mp3_44100_128` for storage and download. `ulaw_8000` for telephony.
- **Handle rate limits gracefully** — Implement exponential backoff. The API returns 429 status codes when quota is exceeded.
- **Monitor character usage** — Track usage via the `/v1/user/subscription` endpoint to avoid unexpected quota exhaustion.
- **Provide high-quality clone samples** — Use clean, noise-free recordings of 1-3 minutes for instant cloning. More samples and longer duration improve professional cloning quality.
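The rate-limit advice can be sketched as a small retry wrapper. The 429 detection assumes the thrown error exposes a numeric `statusCode` property, so adapt the predicate to the SDK's actual error shape:

```typescript
// Retry a request with exponential backoff on HTTP 429 (rate limited).
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const rateLimited = err?.statusCode === 429; // assumed error shape
      if (!rateLimited || attempt >= maxRetries) throw err;
      // Exponential delay (500ms, 1s, 2s, ...) plus up to 100ms of jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

Usage would wrap any call in the techniques above, e.g. `withBackoff(() => generateSpeech(text, voiceId))`.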
## Anti-Patterns

- **Generating full documents in a single request** — Break long text into paragraphs or sentences. Long inputs increase latency and risk timeouts.
- **Ignoring voice settings** — Using default settings for all use cases produces inconsistent quality. Tune stability and similarity boost per voice and context.
- **Polling for completion** — Use streaming endpoints instead of generating and then downloading. Streaming delivers first audio bytes in under 300ms.
- **Storing API keys client-side** — Never embed keys in frontend code. Proxy through your backend.
- **Skipping audio format selection** — Defaulting to high-bitrate formats wastes bandwidth for telephony or mobile use cases. Match format to delivery channel.
- **Cloning voices without consent** — Always obtain explicit permission from the voice owner. ElevenLabs enforces consent verification for professional cloning.
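The first anti-pattern implies a text chunker. A naive sentence-based sketch is below; real inputs may need a smarter segmenter that handles abbreviations, decimals, and quotes:

```typescript
// Split long text into sentence-aligned chunks of at most maxChars characters,
// so each chunk can be synthesized as a separate request.
function chunkText(text: string, maxChars = 500): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```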
## Related Skills

- **Amazon Polly** — AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming
- **AssemblyAI** — Speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis
- **Cartesia** — Ultra-low-latency voice API for real-time text-to-speech and voice cloning
- **Deepgram** — Speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming
- **Google Cloud Text-to-Speech** — WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual
- **OpenAI TTS** — Text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats