PlayHT
Integrate PlayHT's voice API for text-to-speech, voice cloning, and real-time audio streaming
You are an expert in integrating PlayHT for text-to-speech synthesis, voice cloning, and streaming audio generation.
Overview
PlayHT provides a text-to-speech API powered by its PlayHT 2.0 and Play3.0 model families. It supports high-fidelity voice synthesis, instant voice cloning from short samples, real-time streaming via SSE and gRPC, emotion and style controls, and a library of pre-built voices across multiple languages. PlayHT is used for voice agents, audiobook generation, accessibility tools, and content creation.
Setup & Configuration
Installation
pip install pyht
# or
npm install playht
Authentication
Sign up at play.ht and obtain your API key and user ID from the API Access page.
import os
from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});
Environment Setup
export PLAYHT_API_KEY="your-api-key"
export PLAYHT_USER_ID="your-user-id"
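Because both variables are required, it can help to fail fast with a clear message before constructing a client. A minimal sketch; the helper name `load_playht_credentials` is ours, not part of the SDK:

```python
import os


def load_playht_credentials() -> tuple[str, str]:
    """Read PlayHT credentials from the environment, failing fast if absent."""
    user_id = os.environ.get("PLAYHT_USER_ID")
    api_key = os.environ.get("PLAYHT_API_KEY")
    missing = [
        name
        for name, value in [("PLAYHT_USER_ID", user_id), ("PLAYHT_API_KEY", api_key)]
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return user_id, api_key
```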
Core Patterns
Basic Text-to-Speech (Streaming)
import os

from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
    format="wav",
    sample_rate=44100,
)

# Stream audio chunks to disk as they arrive
with open("output.wav", "wb") as f:
    for chunk in client.tts("Hello, welcome to PlayHT!", options):
        f.write(chunk)

client.close()
Streaming TTS (JavaScript)
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});

const stream = await PlayHT.stream("Hello from PlayHT!", {
  voiceEngine: "Play3.0-mini",
  voiceId:
    "s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
  outputFormat: "mp3",
  sampleRate: 44100,
});

const chunks: Buffer[] = [];
for await (const chunk of stream) {
  chunks.push(chunk);
}
const audioBuffer = Buffer.concat(chunks);
Real-Time Streaming with gRPC
import os

from pyht import Client
from pyht.client import TTSOptions

client = Client(
    user_id=os.environ["PLAYHT_USER_ID"],
    api_key=os.environ["PLAYHT_API_KEY"],
)

options = TTSOptions(
    voice="s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d3720/jennifersarahanneconleyoriginal/manifest.json",
    format="mulaw",  # Telephony-friendly format
    sample_rate=8000,
)

# gRPC streaming for lowest latency
for chunk in client.tts(
    "This is a real-time streaming example with minimal latency.",
    options,
):
    send_to_audio_pipeline(chunk)  # placeholder: forward to your downstream consumer

client.close()
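For real-time agents, time-to-first-byte is the latency number that matters. Here is a sketch of measuring it over any chunk iterator; the stand-in generator below is ours, and in practice you would pass `client.tts(text, options)`:

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, concatenated audio)."""
    start = time.monotonic()
    it = iter(chunks)
    first = next(it)                 # blocks until the first audio chunk lands
    ttfb = time.monotonic() - start
    audio = first + b"".join(it)     # drain the rest of the stream
    return ttfb, audio


# Stand-in generator; replace with client.tts(text, options) in real use.
def fake_stream() -> Iterator[bytes]:
    yield b"\x00" * 160
    yield b"\x01" * 160


ttfb, audio = measure_ttfb(fake_stream())
```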
Instant Voice Cloning
import * as PlayHT from "playht";

PlayHT.init({
  apiKey: process.env.PLAYHT_API_KEY!,
  userId: process.env.PLAYHT_USER_ID!,
});

// Clone a voice from a file URL or uploaded file
const clonedVoice = await PlayHT.clone("my-cloned-voice", {
  sourceUrl: "https://example.com/voice-sample.wav",
  // Alternatively, provide a local file path via the API dashboard
});
console.log("Cloned voice ID:", clonedVoice.id);

// Use the cloned voice
const stream = await PlayHT.stream("Now speaking with the cloned voice.", {
  voiceEngine: "Play3.0-mini",
  voiceId: clonedVoice.id,
});
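Since clone quality depends heavily on the sample, it can be worth validating a WAV sample's duration before uploading it. A sketch using only the standard library; `check_clone_sample` and the 10-second floor are our own (the floor echoes the best practice below, not an API-enforced limit):

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clone_sample(path: str, min_seconds: float = 10.0) -> None:
    """Raise if the sample is likely too short for an accurate clone."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"Sample is {duration:.1f}s; at least {min_seconds:.0f}s of clean "
            "speech is recommended for accurate cloning."
        )
```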
Listing Available Voices
const voices = await PlayHT.listVoices();

for (const voice of voices) {
  console.log(`${voice.id}: ${voice.name} (${voice.language})`);
}

// Filter by language
const spanishVoices = voices.filter((v) => v.language?.startsWith("es"));
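The same filtering is straightforward in Python once you have fetched the voice list. The `voices` data below is made up for illustration; the real entries come from the API:

```python
from collections import defaultdict

# Illustrative stand-in for the voice list returned by the API.
voices = [
    {"id": "v1", "name": "Jennifer", "language": "en-US"},
    {"id": "v2", "name": "Lucia", "language": "es-ES"},
    {"id": "v3", "name": "Mateo", "language": "es-MX"},
]

# Group voices by base language code ("en-US" -> "en").
by_language = defaultdict(list)
for voice in voices:
    by_language[voice["language"].split("-")[0]].append(voice)

spanish_voices = by_language["es"]  # matches both es-ES and es-MX
```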
Choosing a Voice Engine
# Play3.0-mini — fastest, best for real-time voice agents
options_fast = TTSOptions(
    voice="s3://...",
    voice_engine="Play3.0-mini",
)

# Play3.0 — higher quality, slightly more latency
options_quality = TTSOptions(
    voice="s3://...",
    voice_engine="Play3.0",
)
Emotion and Style Control
const stream = await PlayHT.stream("I'm so excited to tell you this!", {
  voiceEngine: "Play3.0-mini",
  voiceId: "...",
  emotion: "excited",
  styleGuidance: 20, // 1-30, higher = stronger style adherence
  speed: 1.1, // Playback speed multiplier
});
Best Practices
- Select the right voice engine for your latency requirements. Use Play3.0-mini for real-time conversational agents where time-to-first-byte matters. Use Play3.0 when audio quality is the priority and a few hundred extra milliseconds of latency is acceptable.
- Provide clean, single-speaker audio samples for voice cloning. At least 10 seconds of clear speech without background noise or music produces the most accurate clones. Longer samples (30-60 seconds) improve fidelity further.
- Always close the client when done. The Python pyht client holds open gRPC connections. Call client.close() to release resources, or use it as a context manager where supported.
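The close-on-exit rule above can be enforced with `contextlib.closing`, which works with any object exposing a `close()` method. Demonstrated here with a stub standing in for `pyht.Client`, so the pattern is runnable without credentials:

```python
from contextlib import closing


class StubClient:
    """Stand-in for pyht.Client; anything with close() works with closing()."""

    def __init__(self):
        self.closed = False

    def tts(self, text, options=None):
        yield b"audio-chunk"

    def close(self):
        self.closed = True


client = StubClient()
with closing(client):
    audio = b"".join(client.tts("Hello"))
# close() has been called here, even if the loop had raised.
```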
Common Pitfalls
- Confusing voice ID formats across engines. PlayHT voice IDs are S3-style URIs for cloned voices but simple string IDs for stock voices. Passing a stock voice ID to a cloning-only engine (or vice versa) results in errors. Always verify the voice ID matches the selected voiceEngine.
- Not handling streaming backpressure. When piping PlayHT audio to a slower consumer (e.g., a WebSocket to a browser), chunks can accumulate in memory. Buffer audio chunks and apply backpressure or drop frames to avoid memory exhaustion in long-running sessions.
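The backpressure pitfall can be handled with a bounded buffer that drops the oldest frames when the consumer falls behind. A minimal single-threaded sketch using `collections.deque`; a real pipeline would run producer and consumer in separate threads or async tasks:

```python
from collections import deque
from typing import Optional


class FrameBuffer:
    """Bounded audio buffer: oldest frames are dropped once capacity is hit."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)  # deque evicts from the left
        self.dropped = 0

    def push(self, chunk: bytes) -> None:
        if len(self.frames) == self.frames.maxlen:
            self.dropped += 1                   # count the frame about to fall off
        self.frames.append(chunk)

    def pop(self) -> Optional[bytes]:
        return self.frames.popleft() if self.frames else None


buf = FrameBuffer(max_frames=3)
for i in range(5):                              # producer outpaces the consumer
    buf.push(bytes([i]))
```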
Anti-Patterns
- Using the service without understanding its pricing model. Cloud services bill differently — per request, per GB, per seat. Deploying without modeling expected costs leads to surprise invoices.
- Hardcoding configuration instead of using environment variables. API keys, endpoints, and feature flags change between environments. Hardcoded values break deployments and leak secrets.
- Ignoring the service's rate limits and quotas. Every external API has throughput limits. Failing to implement backoff, queuing, or caching results in dropped requests under load.
- Treating the service as always available. External services go down. Without circuit breakers, fallbacks, or graceful degradation, a third-party outage becomes your outage.
- Coupling your architecture to a single provider's API. Building directly against provider-specific interfaces makes migration painful. Wrap external services in thin adapter layers.
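The rate-limit point usually translates into retry-with-backoff around synthesis calls. A generic sketch; the retriable exception type, retry count, and delays are placeholders to adapt to the SDK's actual errors:

```python
import random
import time


def with_backoff(fn, retries: int = 4, base_delay: float = 0.5,
                 sleep=time.sleep, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise                            # out of retries: re-raise
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)


# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

result = with_backoff(flaky, sleep=lambda _: None)  # skip real sleeping in the demo
```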
Install this skill directly: skilldb add voice-speech-services-skills
Related Skills
Amazon Polly
AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming
AssemblyAI
Speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
Speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming
ElevenLabs
AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming
Google Cloud Text-to-Speech
WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual