Google Cloud Text to Speech
"Google Cloud Text-to-Speech: WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual"
Core Philosophy
Google Cloud Text-to-Speech converts text into natural-sounding speech using WaveNet and Neural2 models. It supports SSML for fine-grained speech control, audio profiles for optimizing output for different devices, and a wide range of languages and voices. Build with these principles:
- Use Neural2 voices by default — Neural2 provides the best quality-to-cost ratio. Use WaveNet only when Neural2 is unavailable for your language. Use Standard voices only for high-volume, cost-sensitive scenarios.
- Control speech with SSML — Plain text gives you no control over pronunciation, pauses, or emphasis. Use SSML for any production-quality speech output.
- Optimize with audio profiles — Apply device-specific audio profiles to optimize playback quality for headphones, phone speakers, car audio, or smart speakers.
- Batch and cache — Synthesis calls cost money. Cache results and batch requests where possible.
Setup
Install the Google Cloud client library with `npm install @google-cloud/text-to-speech`, then initialize the client:
import textToSpeech from "@google-cloud/text-to-speech";
import { writeFile } from "node:fs/promises";
const client = new textToSpeech.TextToSpeechClient({
keyFilename: process.env.GOOGLE_APPLICATION_CREDENTIALS,
});
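The constructor silently falls back to Application Default Credentials when `keyFilename` is undefined, which surfaces as a confusing error only at the first API call. A small guard (a sketch; the helper name is ours, and you should adapt it if you authenticate via Workload Identity or another ADC source) fails fast instead:

```typescript
// Fail fast if no credential file is configured. Sketch only: deployments
// using Workload Identity or other ADC sources won't set this variable.
function assertCredentialsConfigured(
  env: Record<string, string | undefined> = process.env
): string {
  const keyFile = env.GOOGLE_APPLICATION_CREDENTIALS;
  if (!keyFile || keyFile.trim() === "") {
    throw new Error(
      "GOOGLE_APPLICATION_CREDENTIALS is not set; point it at a service-account JSON key"
    );
  }
  return keyFile;
}
```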
Key Techniques
Basic Text-to-Speech
Synthesize speech from plain text:
async function synthesizeSpeech(
text: string,
languageCode: string = "en-US",
voiceName: string = "en-US-Neural2-C"
): Promise<Buffer> {
const [response] = await client.synthesizeSpeech({
input: { text },
voice: {
languageCode,
name: voiceName,
},
audioConfig: {
audioEncoding: "MP3",
sampleRateHertz: 24000,
speakingRate: 1.0,
pitch: 0.0,
volumeGainDb: 0.0,
},
});
return Buffer.from(response.audioContent as Uint8Array);
}
// Usage
const audio = await synthesizeSpeech("Hello, welcome to our service.");
await writeFile("output.mp3", audio);
SSML Speech Synthesis
Use SSML for precise control over pronunciation, pauses, emphasis, and prosody:
async function synthesizeSSML(ssml: string): Promise<Buffer> {
const [response] = await client.synthesizeSpeech({
input: { ssml },
voice: {
languageCode: "en-US",
name: "en-US-Neural2-F",
},
audioConfig: {
audioEncoding: "MP3",
sampleRateHertz: 24000,
},
});
return Buffer.from(response.audioContent as Uint8Array);
}
// Build SSML dynamically
// Escape text content so `<`, `>`, and `&` don't break the SSML document
// (see the Anti-Patterns section). Attribute values (rate, pitch, breakTime)
// are assumed to come from trusted application code, not user input.
function escapeText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}
function buildSSML(segments: Array<{
  text: string;
  rate?: string;
  pitch?: string;
  emphasis?: "strong" | "moderate" | "reduced";
  breakTime?: string;
}>): string {
  let ssml = "<speak>";
  for (const segment of segments) {
    if (segment.breakTime) {
      ssml += `<break time="${segment.breakTime}"/>`;
    }
    if (segment.emphasis) {
      ssml += `<emphasis level="${segment.emphasis}">`;
    }
    const text = escapeText(segment.text);
    if (segment.rate || segment.pitch) {
      const rate = segment.rate ? ` rate="${segment.rate}"` : "";
      const pitch = segment.pitch ? ` pitch="${segment.pitch}"` : "";
      ssml += `<prosody${rate}${pitch}>${text}</prosody>`;
    } else {
      ssml += text;
    }
    if (segment.emphasis) {
      ssml += "</emphasis>";
    }
  }
  ssml += "</speak>";
  return ssml;
}
// Example: generate a notification announcement
const ssml = buildSSML([
{ text: "Attention.", emphasis: "strong", breakTime: "500ms" },
{ text: "Your order has been shipped.", rate: "medium", pitch: "+2st" },
{ breakTime: "300ms", text: "" },
{ text: "Expected delivery is tomorrow.", rate: "slow" },
]);
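Beyond prosody and emphasis, SSML's `<say-as>` tag controls how numbers, dates, and codes are verbalized. The helper below is a sketch (the function name is ours) wrapping escaped text in a `<say-as>` element; `interpret-as` values such as `cardinal`, `characters`, and `date` are standard SSML:

```typescript
type SayAs = "cardinal" | "ordinal" | "characters" | "date" | "time" | "telephone";

// Wrap text in <say-as> so e.g. "20240115" is read as a date, not a number.
// The optional format attribute applies to dates and times (e.g. "yyyymmdd").
function sayAs(text: string, interpretAs: SayAs, formatAttr?: string): string {
  // Escape XML-significant characters in the text content.
  const safe = text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  const format = formatAttr ? ` format="${formatAttr}"` : "";
  return `<say-as interpret-as="${interpretAs}"${format}>${safe}</say-as>`;
}

// Example: spell out a confirmation code and read a compact date naturally.
const ssmlDoc = `<speak>Your code is ${sayAs("A1B2", "characters")}, issued ${sayAs("20240115", "date", "yyyymmdd")}.</speak>`;
```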
Audio Profiles
Optimize output for specific playback devices:
type AudioProfile =
| "wearable-class-device"
| "handset-class-device"
| "headphone-class-device"
| "small-bluetooth-speaker-class-device"
| "medium-bluetooth-speaker-class-device"
| "large-home-entertainment-class-device"
| "large-automotive-class-device"
| "telephony-class-application";
async function synthesizeForDevice(
text: string,
profile: AudioProfile
): Promise<Buffer> {
const [response] = await client.synthesizeSpeech({
input: { text },
voice: {
languageCode: "en-US",
name: "en-US-Neural2-D",
},
audioConfig: {
audioEncoding: "MP3",
effectsProfileId: [profile],
sampleRateHertz: 24000,
},
});
return Buffer.from(response.audioContent as Uint8Array);
}
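In practice the profile choice follows from the application context, and the sample rate should follow along with it (8000 Hz for telephony, 24000 Hz elsewhere). The mapping below is our suggestion, not an official recommendation; tune the pairings by listening tests on your actual devices:

```typescript
type PlaybackTarget = "phone-call" | "mobile-app" | "web-headphones" | "car" | "smart-speaker";

// Map an application-level playback target to a matched audio profile and
// sample rate. Pairings are illustrative suggestions, not official guidance.
function audioConfigFor(target: PlaybackTarget): { effectsProfileId: string[]; sampleRateHertz: number } {
  switch (target) {
    case "phone-call":
      return { effectsProfileId: ["telephony-class-application"], sampleRateHertz: 8000 };
    case "mobile-app":
      return { effectsProfileId: ["handset-class-device"], sampleRateHertz: 24000 };
    case "web-headphones":
      return { effectsProfileId: ["headphone-class-device"], sampleRateHertz: 24000 };
    case "car":
      return { effectsProfileId: ["large-automotive-class-device"], sampleRateHertz: 24000 };
    case "smart-speaker":
      return { effectsProfileId: ["medium-bluetooth-speaker-class-device"], sampleRateHertz: 24000 };
  }
}
```

Spread the result into the `audioConfig` of a `synthesizeSpeech` request alongside the encoding.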
Listing Available Voices
Query available voices filtered by language:
interface VoiceInfo {
name: string;
languageCodes: string[];
ssmlGender: string;
naturalSampleRateHertz: number;
}
async function listVoices(languageCode?: string): Promise<VoiceInfo[]> {
const [response] = await client.listVoices({
languageCode: languageCode ?? "",
});
return (response.voices ?? []).map((v) => ({
name: v.name ?? "",
languageCodes: v.languageCodes ?? [],
ssmlGender: v.ssmlGender ?? "NEUTRAL",
naturalSampleRateHertz: v.naturalSampleRateHertz ?? 24000,
}));
}
async function findBestVoice(
languageCode: string,
gender: "MALE" | "FEMALE" | "NEUTRAL"
): Promise<string | null> {
const voices = await listVoices(languageCode);
const neural2 = voices.find(
(v) => v.name.includes("Neural2") && v.ssmlGender === gender
);
const wavenet = voices.find(
(v) => v.name.includes("Wavenet") && v.ssmlGender === gender
);
return neural2?.name ?? wavenet?.name ?? voices[0]?.name ?? null;
}
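The preference order in findBestVoice relies on Google's naming convention, where the voice name encodes the model family (e.g. "en-US-Neural2-C"). That ordering can be factored into a pure ranking helper; this is a sketch with names of our own, and since naming is a convention rather than a guarantee, verify candidates against actual `listVoices` output:

```typescript
// Rank a voice name by the model family encoded in it: Neural2 > Wavenet > Standard.
function voiceTier(name: string): number {
  if (name.includes("Neural2")) return 3;
  if (name.includes("Wavenet")) return 2;
  if (name.includes("Standard")) return 1;
  return 0; // unknown family: rank last
}

// Pick the highest-tier voice from a list of candidate names (hypothetical helper).
function pickBest(names: string[]): string | null {
  if (names.length === 0) return null;
  return [...names].sort((a, b) => voiceTier(b) - voiceTier(a))[0];
}
```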
Multilingual Speech
Generate speech segments in multiple languages, pairing each segment with a language-appropriate voice:
interface MultilingualSegment {
text: string;
languageCode: string;
voiceName: string;
}
async function synthesizeMultilingual(
segments: MultilingualSegment[]
): Promise<Buffer[]> {
const results: Buffer[] = [];
for (const segment of segments) {
const [response] = await client.synthesizeSpeech({
input: { text: segment.text },
voice: {
languageCode: segment.languageCode,
name: segment.voiceName,
},
audioConfig: {
audioEncoding: "MP3",
sampleRateHertz: 24000,
},
});
results.push(Buffer.from(response.audioContent as Uint8Array));
}
return results;
}
// Usage with language-specific voices
const segments: MultilingualSegment[] = [
{ text: "Hello and welcome.", languageCode: "en-US", voiceName: "en-US-Neural2-C" },
{ text: "Bienvenue sur notre plateforme.", languageCode: "fr-FR", voiceName: "fr-FR-Neural2-A" },
{ text: "Willkommen auf unserer Plattform.", languageCode: "de-DE", voiceName: "de-DE-Neural2-B" },
];
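Joining the per-segment buffers into one playable file depends on the encoding: naively concatenated MP3 buffers play in many decoders, but the robust approach is to request LINEAR16 (which the API returns as WAV; strip or account for each 44-byte header) and wrap the joined PCM samples in a single RIFF/WAVE header. The header layout below is the standard 44-byte PCM WAV header; the helper name is ours:

```typescript
// Join headerless 16-bit PCM buffers and prepend a standard 44-byte RIFF/WAVE header.
function pcmToWav(pcmChunks: Buffer[], sampleRate: number, channels = 1): Buffer {
  const data = Buffer.concat(pcmChunks);
  const byteRate = sampleRate * channels * 2; // 16-bit = 2 bytes per sample
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + data.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt subchunk size (PCM)
  header.writeUInt16LE(1, 20);               // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(channels * 2, 32);    // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(data.length, 40);     // data subchunk size
  return Buffer.concat([header, data]);
}
```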
Batch Synthesis with Caching
Synthesize multiple texts with result caching:
import { createHash } from "node:crypto";
const audioCache = new Map<string, Buffer>();
function cacheKey(text: string, voice: string, encoding: string): string {
return createHash("sha256")
.update(`${text}|${voice}|${encoding}`)
.digest("hex");
}
async function synthesizeBatch(
items: Array<{ text: string; voiceName: string }>,
encoding: "MP3" | "OGG_OPUS" | "LINEAR16" = "MP3"
): Promise<Map<string, Buffer>> {
const results = new Map<string, Buffer>();
for (const item of items) {
const key = cacheKey(item.text, item.voiceName, encoding);
if (audioCache.has(key)) {
results.set(item.text, audioCache.get(key)!);
continue;
}
const langCode = item.voiceName.split("-").slice(0, 2).join("-");
const [response] = await client.synthesizeSpeech({
input: { text: item.text },
voice: { languageCode: langCode, name: item.voiceName },
audioConfig: { audioEncoding: encoding },
});
const buffer = Buffer.from(response.audioContent as Uint8Array);
audioCache.set(key, buffer);
results.set(item.text, buffer);
}
return results;
}
Best Practices
- Prefer Neural2 over WaveNet — Neural2 voices are newer, higher quality, and often cheaper. Check availability for your target language before falling back to WaveNet.
- Use SSML for production content — SSML gives you control over pauses, pronunciation, and emphasis that plain text cannot provide. Invest in learning SSML tags.
- Apply audio profiles — A single audio profile string can dramatically improve perceived quality on specific devices. Always set one for known playback targets.
- Set appropriate sample rates — 24000 Hz for high quality. 16000 Hz for telephony. 8000 Hz for legacy systems. Higher is not always better when bandwidth is limited.
- Cache synthesized audio — Google charges per character. Hash your inputs and cache results to avoid re-synthesizing identical content.
- Use OGG_OPUS for web — Opus provides better quality at lower bitrates than MP3. Use it for web applications where browser support is available.
Anti-Patterns
- Using Standard voices when Neural2 is available — Standard voices are noticeably lower quality. The cost difference is minimal for most applications.
- Sending text longer than 5000 bytes — The API has a per-request limit. Split long text at sentence boundaries and concatenate the resulting audio.
- Ignoring SSML escaping — Characters like `<`, `>`, and `&` must be escaped in SSML. Failing to escape them causes synthesis errors.
- Hardcoding voice names — Voice availability changes over time. Use `listVoices` to dynamically select voices, or maintain a fallback list.
- Skipping audio profiles for known devices — Not setting an audio profile when you know the target device misses a free quality improvement.
- Ignoring quotas and billing — Google Cloud TTS has per-minute and per-day quotas. Monitor usage and set billing alerts to avoid surprises.
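The 5000-byte limit above can be handled with a simple sentence-boundary chunker. A sketch (the helper name is ours): note the limit is measured in bytes, so multi-byte UTF-8 text must be measured with `Buffer.byteLength`, not `.length`:

```typescript
// Split text into chunks under maxBytes, preferring sentence boundaries.
// A single sentence longer than maxBytes is emitted as-is; split such
// sentences further (e.g. at commas) in production code.
function chunkText(text: string, maxBytes = 4500): string[] {
  // Match sentences including their terminal punctuation and trailing space.
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && Buffer.byteLength(current + sentence, "utf8") > maxBytes) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Synthesize each chunk separately, then concatenate the resulting audio as described in the multilingual section.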
Install this skill directly: skilldb add voice-speech-services-skills
Related Skills
Amazon Polly
"Amazon Polly: AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming"
AssemblyAI
"AssemblyAI: speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis"
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
"Deepgram: speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming"
ElevenLabs
"ElevenLabs: AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming"
OpenAI TTS
"OpenAI TTS: text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats"