
Cartesia

Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning

Quick Summary
You are an expert in integrating Cartesia for real-time voice synthesis and speech generation.

## Quick Example

```bash
npm install @cartesia/cartesia-js
# or
pip install cartesia
```

```python
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")
```

Cartesia — Voice & Speech

You are an expert in integrating Cartesia for real-time voice synthesis and speech generation.

Core Philosophy

Overview

Cartesia provides an ultra-low-latency text-to-speech API built on the Sonic model family. It supports streaming audio output, voice cloning from short samples, multilingual synthesis, and WebSocket-based real-time connections. Cartesia is designed for conversational AI, voice agents, and any application where latency matters.

Setup & Configuration

Installation

```bash
npm install @cartesia/cartesia-js
# or
pip install cartesia
```

Authentication

Sign up at cartesia.ai and retrieve your API key from the dashboard.

```python
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")
```

```javascript
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: "your-api-key" });
```

Environment Setup

```bash
export CARTESIA_API_KEY="your-api-key"
```

```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])
```

Core Patterns

Basic Text-to-Speech (REST)

```python
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

audio_data = client.tts.bytes(
    model_id="sonic-2",
    transcript="Hello, welcome to our application.",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

with open("output.wav", "wb") as f:
    f.write(audio_data)
```

Streaming TTS via WebSocket

```python
import asyncio
import os

from cartesia import AsyncCartesia

async def stream_speech():
    client = AsyncCartesia(api_key=os.environ["CARTESIA_API_KEY"])
    ws = await client.tts.websocket()

    voice_id = "a0e99841-438c-4a64-b679-ae501e7d6091"

    async for chunk in ws.send(
        model_id="sonic-2",
        transcript="This is a streaming example with very low latency.",
        voice_id=voice_id,
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 24000,
        },
        stream=True,
    ):
        # Process each audio chunk as it arrives.
        # process_audio_chunk is a placeholder for your own
        # playback or buffering handler.
        process_audio_chunk(chunk)

    await ws.close()
    await client.close()

asyncio.run(stream_speech())
```

Streaming TTS (JavaScript)

```javascript
import Cartesia from "@cartesia/cartesia-js";

const cartesia = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });

const websocket = await cartesia.tts.websocket({
  container: "raw",
  encoding: "pcm_f32le",
  sampleRate: 44100,
});

const response = await websocket.send({
  modelId: "sonic-2",
  voice: { mode: "id", id: "a0e99841-438c-4a64-b679-ae501e7d6091" },
  transcript: "Hello from Cartesia!",
});

response.on("message", (chunk) => {
  // Handle streaming audio chunk
});
```

Voice Cloning from an Audio Sample

# Clone a voice from a short audio clip (minimum ~5 seconds)
cloned_voice = client.voices.clone(
    clip=open("voice_sample.wav", "rb"),
    name="my-cloned-voice",
    description="Cloned from a sample recording",
)

# Use the cloned voice for synthesis
audio_data = client.tts.bytes(
    model_id="sonic-2",
    transcript="This should sound like the original speaker.",
    voice_id=cloned_voice["id"],
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

Voice Embedding Mixing

```python
# Retrieve voice embeddings and mix them
voice_a = client.voices.get(id="voice-id-a")
voice_b = client.voices.get(id="voice-id-b")

embedding_a = voice_a["embedding"]
embedding_b = voice_b["embedding"]

# Blend two voices (50/50 mix)
mixed_embedding = [
    (a + b) / 2 for a, b in zip(embedding_a, embedding_b)
]

audio_data = client.tts.bytes(
    model_id="sonic-2",
    transcript="This is a blended voice.",
    voice_id={"mode": "embedding", "embedding": mixed_embedding},
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)
```

Listing Available Voices

```python
voices = client.voices.list()
for voice in voices:
    print(f"{voice['id']}: {voice['name']} — {voice['description']}")
```

Best Practices

  • Use WebSocket connections for real-time applications. The WebSocket API provides significantly lower time-to-first-byte compared to REST, making it essential for voice agents and conversational flows.
  • Choose the right output format for your use case. Use raw PCM for streaming pipelines where you handle playback yourself, and wav for file-based workflows. Match the sample rate to your playback target (e.g., 8000 Hz for telephony, 44100 Hz for high-fidelity playback).
  • Reuse WebSocket connections across multiple utterances. Opening a new connection per request adds unnecessary latency. Send multiple transcripts over the same connection and use context IDs to track responses.
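When several utterances share one connection, responses can interleave, so chunks must be routed back to their originating request by context ID. A minimal sketch of that bookkeeping, using plain dicts (the exact message shape — `context_id` and `data` keys — is an assumption for illustration, not the documented wire format):

```python
from collections import defaultdict

def route_chunks(messages):
    """Group streamed audio chunks by the context ID of the
    request each chunk belongs to, preserving arrival order
    within each stream."""
    streams = defaultdict(list)
    for msg in messages:
        streams[msg["context_id"]].append(msg["data"])
    return dict(streams)

# Interleaved chunks from two in-flight requests on one connection
messages = [
    {"context_id": "utt-1", "data": b"\x00\x01"},
    {"context_id": "utt-2", "data": b"\x02"},
    {"context_id": "utt-1", "data": b"\x03"},
]
streams = route_chunks(messages)
```

Each per-context list can then be concatenated or fed to a player independently, so a slow consumer for one utterance never blocks another.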

Common Pitfalls

  • Ignoring connection lifecycle management. Failing to properly close WebSocket connections leads to resource leaks. Always call close() on both the WebSocket and the client when done, especially in async contexts.
  • Using overly short clips for voice cloning. Cartesia needs a minimum of roughly 5 seconds of clean speech for reliable cloning. Clips with background noise or multiple speakers produce poor results. Use clean, single-speaker recordings of 10-30 seconds for best quality.
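One way to make the close-on-every-path guarantee structural rather than a discipline is an async context manager, so the connection is closed even when synthesis raises mid-stream. A sketch with a stand-in client (`FakeWebSocket` and `open_ws` are illustrative names, not SDK API):

```python
import asyncio
from contextlib import asynccontextmanager

class FakeWebSocket:
    """Stand-in for a real TTS WebSocket connection."""
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

@asynccontextmanager
async def open_ws(make_ws):
    # The finally block runs on success, error, or cancellation,
    # so the connection is always released.
    ws = make_ws()
    try:
        yield ws
    finally:
        await ws.close()

async def main():
    ws_ref = {}
    try:
        async with open_ws(FakeWebSocket) as ws:
            ws_ref["ws"] = ws
            raise RuntimeError("synthesis failed mid-stream")
    except RuntimeError:
        pass
    return ws_ref["ws"].closed

closed = asyncio.run(main())
```

The same pattern applies to the client object itself: wrap it once at startup and every code path inherits correct cleanup.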

Anti-Patterns

Using the service without understanding its pricing model. Cloud services bill differently — per request, per GB, per seat. Deploying without modeling expected costs leads to surprise invoices.

Hardcoding configuration instead of using environment variables. API keys, endpoints, and feature flags change between environments. Hardcoded values break deployments and leak secrets.

Ignoring the service's rate limits and quotas. Every external API has throughput limits. Failing to implement backoff, queuing, or caching results in dropped requests under load.
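A minimal exponential-backoff wrapper illustrating the retry pattern; the error type and the flaky stand-in function are assumptions for the sketch (a real integration would catch the SDK's rate-limit exception or inspect HTTP 429 responses):

```python
import time

def with_backoff(call, max_attempts=4, base_delay=0.01):
    """Retry a callable with exponential backoff, re-raising
    after the final failed attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for a 429 / rate-limit error
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated endpoint that rejects the first two calls
attempts = {"n": 0}

def flaky_tts():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return b"audio-bytes"

result = with_backoff(flaky_tts)
```

Adding random jitter to the delay avoids synchronized retry storms when many workers hit the limit at once.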

Treating the service as always available. External services go down. Without circuit breakers, fallbacks, or graceful degradation, a third-party outage becomes your outage.

Coupling your architecture to a single provider's API. Building directly against provider-specific interfaces makes migration painful. Wrap external services in thin adapter layers.
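A sketch of such a thin adapter: application code depends on a small interface, and the vendor SDK is confined to one backend class (all names here — `SpeechBackend`, `SpeechService`, the fake backend — are illustrative, not part of any SDK):

```python
from typing import Protocol

class SpeechBackend(Protocol):
    """The only surface the application depends on."""
    def synthesize(self, text: str, voice: str) -> bytes: ...

class FakeCartesiaBackend:
    """Stand-in for a backend that would wrap the vendor SDK;
    swapping providers means writing another class like this."""
    def synthesize(self, text: str, voice: str) -> bytes:
        return f"cartesia:{voice}:{text}".encode()

class SpeechService:
    """App-facing service; knows nothing about any vendor API."""
    def __init__(self, backend: SpeechBackend):
        self._backend = backend

    def say(self, text: str, voice: str = "default") -> bytes:
        return self._backend.synthesize(text, voice)

svc = SpeechService(FakeCartesiaBackend())
audio = svc.say("hello")
```

The adapter also gives tests a seam: unit tests inject a fake backend and never touch the network.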
