
ElevenLabs

"ElevenLabs: AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming"

Quick Summary
ElevenLabs provides the most natural-sounding AI voice synthesis available. The platform excels at voice cloning, multilingual speech, and low-latency streaming. Build with these principles:

## Key Points

- **Voice quality first** — Use the highest-fidelity model appropriate for your latency budget. Eleven Multilingual v2 for quality, Eleven Turbo v2.5 for speed.
- **Stream everything** — Never wait for full audio generation. Use chunked transfer or WebSocket streaming to deliver audio as it is produced.
- **Clone responsibly** — Voice cloning requires consent. Use instant voice cloning for prototyping and professional voice cloning for production.
- **Cache aggressively** — Identical text + voice + model + settings produces identical audio. Cache results to save quota and latency.
- **Tune voice settings per use case** — Higher stability (0.7-1.0) for narration and audiobooks. Lower stability (0.3-0.5) for conversational and expressive speech.
- **Use appropriate output formats** — `pcm_24000` for real-time playback pipelines. `mp3_44100_128` for storage and download. `ulaw_8000` for telephony.
- **Handle rate limits gracefully** — Implement exponential backoff. The API returns 429 status codes when quota is exceeded.
- **Monitor character usage** — Track usage via the `/v1/user/subscription` endpoint to avoid unexpected quota exhaustion.
- **Provide high-quality clone samples** — Use clean, noise-free recordings of 1-3 minutes for instant cloning. More samples and longer duration improve professional cloning quality.
- **Don't generate full documents in a single request** — Break long text into paragraphs or sentences. Long inputs increase latency and risk timeouts.
- **Don't ignore voice settings** — Using default settings for all use cases produces inconsistent quality. Tune stability and similarity boost per voice and context.
- **Don't poll for completion** — Use streaming endpoints instead of generating and then downloading. Streaming delivers first audio bytes in under 300ms.

## Quick Example

```typescript
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});
```

```typescript
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { createWriteStream } from "node:fs";
import { WebSocket } from "ws";
```
Full skill (236 lines): `skilldb get voice-speech-services-skills/ElevenLabs` — paste the output into your CLAUDE.md or agent config.

ElevenLabs Skill

Core Philosophy

ElevenLabs provides the most natural-sounding AI voice synthesis available. The platform excels at voice cloning, multilingual speech, and low-latency streaming. Build with these principles:

  • Voice quality first — Use the highest-fidelity model appropriate for your latency budget. Eleven Multilingual v2 for quality, Eleven Turbo v2.5 for speed.
  • Stream everything — Never wait for full audio generation. Use chunked transfer or WebSocket streaming to deliver audio as it is produced.
  • Clone responsibly — Voice cloning requires consent. Use instant voice cloning for prototyping and professional voice cloning for production.
  • Cache aggressively — Identical text + voice + model + settings produces identical audio. Cache results to save quota and latency.

Setup

Install the official SDK and configure authentication:

import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

For streaming use cases, install additional dependencies:

import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { createWriteStream } from "node:fs";
import { WebSocket } from "ws";

Key Techniques

Basic Text-to-Speech

Generate speech from text and save to a file:

async function generateSpeech(text: string, voiceId: string): Promise<Buffer> {
  const audio = await client.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_multilingual_v2",
    output_format: "mp3_44100_128",
    voice_settings: {
      stability: 0.5,
      similarity_boost: 0.75,
      style: 0.0,
      use_speaker_boost: true,
    },
  });

  const chunks: Buffer[] = [];
  for await (const chunk of audio) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
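
The "cache aggressively" principle can be sketched with a content-hash key, since identical text + voice + model + settings always produce identical audio. The helper names here (`cacheKey`, `audioCache`) are illustrative, not part of the SDK:

```typescript
import { createHash } from "node:crypto";

// The key captures everything that determines the audio output.
function cacheKey(
  text: string,
  voiceId: string,
  modelId: string,
  settings: Record<string, unknown>
): string {
  const payload = JSON.stringify([text, voiceId, modelId, settings]);
  return createHash("sha256").update(payload).digest("hex");
}

// Minimal in-memory cache; swap for Redis or disk storage in production.
const audioCache = new Map<string, Buffer>();

async function cachedSpeech(
  key: string,
  generate: () => Promise<Buffer>
): Promise<Buffer> {
  const hit = audioCache.get(key);
  if (hit) return hit;
  const audio = await generate();
  audioCache.set(key, audio);
  return audio;
}
```

A cache hit skips the API call entirely, saving both character quota and round-trip latency.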

Streaming Audio to a File

Stream generated audio directly to disk without buffering the entire response:

async function streamToFile(
  text: string,
  voiceId: string,
  outputPath: string
): Promise<void> {
  const audioStream = await client.textToSpeech.convertAsStream(voiceId, {
    text,
    model_id: "eleven_turbo_v2_5",
    output_format: "mp3_22050_32",
  });

  // pipeline handles backpressure and resolves only once the file is fully written.
  await pipeline(Readable.from(audioStream), createWriteStream(outputPath));
}

WebSocket Streaming for Real-Time Applications

Use the input-streaming WebSocket endpoint for the lowest latency. Send text chunks as they arrive and receive audio chunks immediately:

interface WSMessage {
  audio?: string;
  isFinal?: boolean;
  normalizedAlignment?: unknown;
}

function createRealtimeStream(
  voiceId: string,
  onAudioChunk: (chunk: Buffer) => void
): {
  sendText: (text: string) => void;
  flush: () => void;
  close: () => Promise<void>;
} {
  const modelId = "eleven_turbo_v2_5";
  const wsUrl =
    `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input` +
    `?model_id=${modelId}&output_format=pcm_24000`;

  const ws = new WebSocket(wsUrl);
  let resolveClose: () => void;
  const closePromise = new Promise<void>((r) => (resolveClose = r));

  ws.on("open", () => {
    ws.send(
      JSON.stringify({
        text: " ",
        voice_settings: { stability: 0.5, similarity_boost: 0.75 },
        xi_api_key: process.env.ELEVENLABS_API_KEY,
      })
    );
  });

  ws.on("message", (data) => {
    const msg: WSMessage = JSON.parse(data.toString());
    if (msg.audio) {
      onAudioChunk(Buffer.from(msg.audio, "base64"));
    }
    if (msg.isFinal) {
      resolveClose();
    }
  });

  return {
    sendText: (text: string) => ws.send(JSON.stringify({ text })),
    flush: () => ws.send(JSON.stringify({ text: " ", flush: true })),
    close: async () => {
      ws.send(JSON.stringify({ text: "" }));
      await closePromise;
      ws.close();
    },
  };
}
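
Since the WebSocket example requests pcm_24000 (16-bit mono samples at 24 kHz), received byte counts map directly to playback time. A small illustrative helper, useful for buffering decisions in a real-time player:

```typescript
// pcm_24000 is 16-bit (2-byte) mono samples at 24 kHz.
const PCM_SAMPLE_RATE = 24_000;
const PCM_BYTES_PER_SAMPLE = 2;

// Total playback duration represented by a set of raw PCM chunks.
function pcmDurationSeconds(chunks: Buffer[]): number {
  const totalBytes = chunks.reduce((sum, c) => sum + c.length, 0);
  return totalBytes / (PCM_SAMPLE_RATE * PCM_BYTES_PER_SAMPLE);
}
```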

Voice Cloning

Clone a voice from audio samples using instant voice cloning:

import { createReadStream } from "node:fs";

async function cloneVoice(
  name: string,
  samplePaths: string[],
  description: string
): Promise<string> {
  const files = samplePaths.map((p) => createReadStream(p));

  const voice = await client.voices.add({
    name,
    description,
    files,
    labels: JSON.stringify({ accent: "neutral", use_case: "narration" }),
  });

  return voice.voice_id;
}

Listing and Managing Voices

async function listAvailableVoices() {
  const response = await client.voices.getAll();
  return response.voices.map((v) => ({
    id: v.voice_id,
    name: v.name,
    category: v.category,
    labels: v.labels,
    previewUrl: v.preview_url,
  }));
}

async function deleteClonedVoice(voiceId: string): Promise<void> {
  await client.voices.delete(voiceId);
}

Multilingual Speech Generation

Generate speech in different languages using the multilingual model:

async function generateMultilingual(
  text: string,
  voiceId: string,
  languageCode: string
): Promise<Buffer> {
  const audio = await client.textToSpeech.convert(voiceId, {
    text,
    model_id: "eleven_multilingual_v2",
    language_code: languageCode,
    output_format: "mp3_44100_128",
  });

  const chunks: Buffer[] = [];
  for await (const chunk of audio) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

Best Practices

  • Choose the right model — Use eleven_turbo_v2_5 for conversational or real-time applications where latency matters. Use eleven_multilingual_v2 when voice quality and language coverage are paramount.
  • Tune voice settings per use case — Higher stability (0.7-1.0) for narration and audiobooks. Lower stability (0.3-0.5) for conversational and expressive speech.
  • Use appropriate output formats — pcm_24000 for real-time playback pipelines. mp3_44100_128 for storage and download. ulaw_8000 for telephony.
  • Handle rate limits gracefully — Implement exponential backoff. The API returns 429 status codes when quota is exceeded.
  • Monitor character usage — Track usage via the /v1/user/subscription endpoint to avoid unexpected quota exhaustion.
  • Provide high-quality clone samples — Use clean, noise-free recordings of 1-3 minutes for instant cloning. More samples and longer duration improve professional cloning quality.
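
The rate-limit advice above can be sketched as a generic retry wrapper with capped exponential backoff. The `statusCode`/`status` fields are an assumption about the error shape thrown by the SDK; check your version:

```typescript
// Capped exponential backoff: 1s, 2s, 4s, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry only on 429 (quota exceeded); rethrow everything else immediately.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.statusCode ?? err?.status;
      if (status !== 429 || attempt + 1 >= maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrap any quota-consuming call, e.g. `withRetry(() => generateSpeech(text, voiceId))`.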

Anti-Patterns

  • Generating full documents in a single request — Break long text into paragraphs or sentences. Long inputs increase latency and risk timeouts.
  • Ignoring voice settings — Using default settings for all use cases produces inconsistent quality. Tune stability and similarity boost per voice and context.
  • Polling for completion — Use streaming endpoints instead of generating and then downloading. Streaming delivers first audio bytes in under 300ms.
  • Storing API keys client-side — Never embed keys in frontend code. Proxy through your backend.
  • Skipping audio format selection — Defaulting to high-bitrate formats wastes bandwidth for telephony or mobile use cases. Match format to delivery channel.
  • Cloning voices without consent — Always obtain explicit permission from the voice owner. ElevenLabs enforces consent verification for professional cloning.
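
The first anti-pattern, oversized single requests, can be avoided with a simple sentence-boundary chunker. This helper is illustrative, not part of the SDK; feed each chunk to the API in order:

```typescript
// Split long text into chunks of at most maxLen characters,
// breaking on sentence boundaries where possible. A single
// sentence longer than maxLen is kept whole rather than split.
function chunkText(text: string, maxLen = 500): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const s of sentences) {
    if (current && current.length + s.length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```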

Install this skill directly: skilldb add voice-speech-services-skills
