PlayHT
Integrate PlayHT's voice API for text-to-speech, voice cloning, and real-time audio streaming
You are an expert in integrating PlayHT for text-to-speech synthesis, voice cloning, and streaming audio generation.
Overview
PlayHT provides a text-to-speech API powered by its PlayHT 2.0 and Play3.0 model families. It supports high-fidelity voice synthesis, instant voice cloning from short samples, real-time streaming via SSE and gRPC, emotion and style controls, and a library of pre-built voices across multiple languages. PlayHT is used for voice agents, audiobook generation, accessibility tools, and content creation.
Setup & Configuration
Installation
pip install pyht
# or
npm install playht
Authentication
Sign up at play.ht and obtain your API key and user ID from the API Access page.
import os
from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});
Environment Setup
export PLAYHT_API_KEY="your-api-key"
export PLAYHT_USER_ID="your-user-id"
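Because both variables are required, it can help to fail fast with a clear message before constructing a client. A minimal sketch; the helper name `load_playht_credentials` is ours, not part of the SDK:

```python
import os


def load_playht_credentials() -> tuple[str, str]:
    """Read PlayHT credentials from the environment, failing fast if absent."""
    user_id = os.environ.get("PLAYHT_USER_ID")
    api_key = os.environ.get("PLAYHT_API_KEY")
    missing = [
        name
        for name, value in [("PLAYHT_USER_ID", user_id), ("PLAYHT_API_KEY", api_key)]
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return user_id, api_key
```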
Core Patterns
Basic Text-to-Speech (Streaming)
import os

from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
    format="wav",
    sample_rate=44100,
)

# Stream audio chunks to disk as they arrive
with open("output.wav", "wb") as f:
    for chunk in client.tts("Hello, welcome to PlayHT!", options):
        f.write(chunk)

client.close()
Streaming TTS (JavaScript)
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});

const stream = await PlayHT.stream("Hello from PlayHT!", {
  voiceEngine: "Play3.0-mini",
  voiceId:
    "s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
  outputFormat: "mp3",
  sampleRate: 44100,
});

const chunks: Buffer[] = [];
for await (const chunk of stream) {
  chunks.push(chunk);
}
const audioBuffer = Buffer.concat(chunks);
Real-Time Streaming with gRPC
import os

from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
    format="mulaw",  # Telephony-friendly format
    sample_rate=8000,
)

# gRPC streaming for lowest latency
for chunk in client.tts(
    "This is a real-time streaming example with minimal latency.",
    options,
):
    send_to_audio_pipeline(chunk)  # placeholder: forward to your downstream consumer

client.close()
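For real-time agents, time-to-first-byte is the latency number that matters. Here is a sketch of measuring it over any chunk iterator; the stand-in generator below is ours, and in practice you would pass `client.tts(text, options)`:

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, concatenated audio)."""
    start = time.monotonic()
    it = iter(chunks)
    first = next(it)                 # blocks until the first audio chunk lands
    ttfb = time.monotonic() - start
    audio = first + b"".join(it)     # drain the rest of the stream
    return ttfb, audio


# Stand-in generator; replace with client.tts(text, options) in real use.
def fake_stream() -> Iterator[bytes]:
    yield b"\x00" * 160
    yield b"\x01" * 160


ttfb, audio = measure_ttfb(fake_stream())
```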
Instant Voice Cloning
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});

// Clone a voice from a file URL or uploaded file
const clonedVoice = await PlayHT.clone("my-cloned-voice", {
  sourceUrl: "https://example.com/voice-sample.wav",
  // Alternatively, provide a local file path via the API dashboard
});
console.log("Cloned voice ID:", clonedVoice.id);

// Use the cloned voice
const stream = await PlayHT.stream("Now speaking with the cloned voice.", {
  voiceEngine: "Play3.0-mini",
  voiceId: clonedVoice.id,
});
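Since clone quality depends heavily on the sample, it can be worth validating a WAV sample's duration before uploading it. A sketch using only the standard library; `check_clone_sample` and the 10-second floor are our own (the floor echoes the best practice below, not an API-enforced limit):

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clone_sample(path: str, min_seconds: float = 10.0) -> None:
    """Raise if the sample is likely too short for an accurate clone."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"Sample is {duration:.1f}s; at least {min_seconds:.0f}s of clean "
            "speech is recommended for accurate cloning."
        )
```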
Listing Available Voices
const voices = await PlayHT.listVoices();

for (const voice of voices) {
  console.log(`${voice.id}: ${voice.name} (${voice.language})`);
}

// Filter by language
const spanishVoices = voices.filter((v) => v.language?.startsWith("es"));
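The same filtering is straightforward in Python once you have fetched the voice list. The `voices` data below is made up for illustration; the real entries come from the API:

```python
from collections import defaultdict

# Illustrative stand-in for the voice list returned by the API.
voices = [
    {"id": "v1", "name": "Jennifer", "language": "en-US"},
    {"id": "v2", "name": "Lucia", "language": "es-ES"},
    {"id": "v3", "name": "Mateo", "language": "es-MX"},
]

# Group voices by base language code ("en-US" -> "en").
by_language = defaultdict(list)
for voice in voices:
    by_language[voice["language"].split("-")[0]].append(voice)

spanish_voices = by_language["es"]  # matches both es-ES and es-MX
```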
Choosing a Voice Engine
# Play3.0-mini — fastest, best for real-time voice agents
options_fast = TTSOptions(
    voice="s3://...",
    voice_engine="Play3.0-mini",
)

# Play3.0 — higher quality, slightly more latency
options_quality = TTSOptions(
    voice="s3://...",
    voice_engine="Play3.0",
)
Emotion and Style Control
const stream = await PlayHT.stream("I'm so excited to tell you this!", {
  voiceEngine: "Play3.0-mini",
  voiceId: "...",
  emotion: "excited",
  styleGuidance: 20, // 1-30, higher = stronger style adherence
  speed: 1.1, // Playback speed multiplier
});
Best Practices
- Select the right voice engine for your latency requirements. Use Play3.0-mini for real-time conversational agents where time-to-first-byte matters. Use Play3.0 when audio quality is the priority and a few hundred extra milliseconds of latency is acceptable.
- Provide clean, single-speaker audio samples for voice cloning. At least 10 seconds of clear speech without background noise or music produces the most accurate clones. Longer samples (30-60 seconds) improve fidelity further.
- Always close the client when done. The Python pyht client holds open gRPC connections. Call client.close() to release resources, or use it as a context manager where supported.
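The close-on-exit rule above can be enforced with `contextlib.closing`, which works with any object exposing a `close()` method. Demonstrated here with a stub standing in for `pyht.Client`, so the pattern is runnable without credentials:

```python
from contextlib import closing


class StubClient:
    """Stand-in for pyht.Client; anything with close() works with closing()."""

    def __init__(self):
        self.closed = False

    def tts(self, text, options=None):
        yield b"audio-chunk"

    def close(self):
        self.closed = True


client = StubClient()
with closing(client):
    audio = b"".join(client.tts("Hello"))
# close() has been called here, even if the loop had raised.
```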
Common Pitfalls
- Confusing voice ID formats across engines. PlayHT voice IDs are S3-style URIs for cloned voices but simple string IDs for stock voices. Passing a stock voice ID to a cloning-only engine (or vice versa) results in errors. Always verify the voice ID matches the selected voiceEngine.
- Not handling streaming backpressure. When piping PlayHT audio to a slower consumer (e.g., a WebSocket to a browser), chunks can accumulate in memory. Buffer audio chunks and apply backpressure or drop frames to avoid memory exhaustion in long-running sessions.
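The backpressure pitfall can be handled with a bounded buffer that drops the oldest frames when the consumer falls behind. A minimal single-threaded sketch using `collections.deque`; a real pipeline would run producer and consumer in separate threads or async tasks:

```python
from collections import deque
from typing import Optional


class FrameBuffer:
    """Bounded audio buffer: oldest frames are dropped once capacity is hit."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)  # deque evicts from the left
        self.dropped = 0

    def push(self, chunk: bytes) -> None:
        if len(self.frames) == self.frames.maxlen:
            self.dropped += 1                   # count the frame about to fall off
        self.frames.append(chunk)

    def pop(self) -> Optional[bytes]:
        return self.frames.popleft() if self.frames else None


buf = FrameBuffer(max_frames=3)
for i in range(5):                              # producer outpaces the consumer
    buf.push(bytes([i]))
```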
Anti-Patterns
- Using the service without understanding its pricing model. Cloud services bill differently — per request, per GB, per seat. Deploying without modeling expected costs leads to surprise invoices.
- Hardcoding configuration instead of using environment variables. API keys, endpoints, and feature flags change between environments. Hardcoded values break deployments and leak secrets.
- Ignoring the service's rate limits and quotas. Every external API has throughput limits. Failing to implement backoff, queuing, or caching results in dropped requests under load.
- Treating the service as always available. External services go down. Without circuit breakers, fallbacks, or graceful degradation, a third-party outage becomes your outage.
- Coupling your architecture to a single provider's API. Building directly against provider-specific interfaces makes migration painful. Wrap external services in thin adapter layers.
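The rate-limit point usually translates into retry-with-backoff around synthesis calls. A generic sketch; the retriable exception type, retry count, and delays are placeholders to adapt to the SDK's actual errors:

```python
import random
import time


def with_backoff(fn, retries: int = 4, base_delay: float = 0.5,
                 sleep=time.sleep, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise                            # out of retries: re-raise
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)


# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

result = with_backoff(flaky, sleep=lambda _: None)  # skip real sleeping in the demo
```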
Install this skill directly: skilldb add voice-speech-services-skills
Related Skills
Amazon Polly
AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming
AssemblyAI
Speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
Speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming
ElevenLabs
AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming
Google Cloud Text-to-Speech
WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual