

Google Cloud Text-to-Speech Skill

Core Philosophy

Google Cloud Text-to-Speech converts text into natural-sounding speech using WaveNet and Neural2 models. It supports SSML for fine-grained speech control, audio profiles for optimizing output for different devices, and a wide range of languages and voices. Build with these principles:

  • Use Neural2 voices by default — Neural2 provides the best quality-to-cost ratio. Use WaveNet only when Neural2 is unavailable for your language. Use Standard voices only for high-volume, cost-sensitive scenarios.
  • Control speech with SSML — Plain text gives you no control over pronunciation, pauses, or emphasis. Use SSML for any production-quality speech output.
  • Optimize with audio profiles — Apply device-specific audio profiles to optimize playback quality for headphones, phone speakers, car audio, or smart speakers.
  • Batch and cache — Synthesis calls cost money. Cache results and batch requests where possible.

Setup

Install the client library (npm install @google-cloud/text-to-speech), then create a client:

import textToSpeech from "@google-cloud/text-to-speech";
import { writeFile } from "node:fs/promises";

const client = new textToSpeech.TextToSpeechClient({
  keyFilename: process.env.GOOGLE_APPLICATION_CREDENTIALS,
});

Key Techniques

Basic Text-to-Speech

Synthesize speech from plain text:

async function synthesizeSpeech(
  text: string,
  languageCode: string = "en-US",
  voiceName: string = "en-US-Neural2-C"
): Promise<Buffer> {
  const [response] = await client.synthesizeSpeech({
    input: { text },
    voice: {
      languageCode,
      name: voiceName,
    },
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000,
      speakingRate: 1.0,
      pitch: 0.0,
      volumeGainDb: 0.0,
    },
  });

  return Buffer.from(response.audioContent as Uint8Array);
}

// Usage
const audio = await synthesizeSpeech("Hello, welcome to our service.");
await writeFile("output.mp3", audio);

SSML Speech Synthesis

Use SSML for precise control over pronunciation, pauses, emphasis, and prosody:

async function synthesizeSSML(ssml: string): Promise<Buffer> {
  const [response] = await client.synthesizeSpeech({
    input: { ssml },
    voice: {
      languageCode: "en-US",
      name: "en-US-Neural2-F",
    },
    audioConfig: {
      audioEncoding: "MP3",
      sampleRateHertz: 24000,
    },
  });

  return Buffer.from(response.audioContent as Uint8Array);
}

// Escape characters reserved in SSML/XML (see the anti-pattern on escaping)
function escapeSSML(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build SSML dynamically
function buildSSML(segments: Array<{
  text: string;
  rate?: string;
  pitch?: string;
  emphasis?: "strong" | "moderate" | "reduced";
  breakTime?: string;
}>): string {
  let ssml = "<speak>";

  for (const segment of segments) {
    const text = escapeSSML(segment.text);

    if (segment.breakTime) {
      ssml += `<break time="${segment.breakTime}"/>`;
    }

    if (segment.emphasis) {
      ssml += `<emphasis level="${segment.emphasis}">`;
    }

    if (segment.rate || segment.pitch) {
      const rate = segment.rate ? ` rate="${segment.rate}"` : "";
      const pitch = segment.pitch ? ` pitch="${segment.pitch}"` : "";
      ssml += `<prosody${rate}${pitch}>${text}</prosody>`;
    } else {
      ssml += text;
    }

    if (segment.emphasis) {
      ssml += "</emphasis>";
    }
  }

  ssml += "</speak>";
  return ssml;
}

// Example: generate a notification announcement
const ssml = buildSSML([
  { text: "Attention.", emphasis: "strong", breakTime: "500ms" },
  { text: "Your order has been shipped.", rate: "medium", pitch: "+2st" },
  { breakTime: "300ms", text: "" },
  { text: "Expected delivery is tomorrow.", rate: "slow" },
]);

Audio Profiles

Optimize output for specific playback devices:

type AudioProfile =
  | "wearable-class-device"
  | "handset-class-device"
  | "headphone-class-device"
  | "small-bluetooth-speaker-class-device"
  | "medium-bluetooth-speaker-class-device"
  | "large-home-entertainment-class-device"
  | "large-automotive-class-device"
  | "telephony-class-application";

async function synthesizeForDevice(
  text: string,
  profile: AudioProfile
): Promise<Buffer> {
  const [response] = await client.synthesizeSpeech({
    input: { text },
    voice: {
      languageCode: "en-US",
      name: "en-US-Neural2-D",
    },
    audioConfig: {
      audioEncoding: "MP3",
      effectsProfileId: [profile],
      sampleRateHertz: 24000,
    },
  });

  return Buffer.from(response.audioContent as Uint8Array);
}

Listing Available Voices

Query available voices filtered by language:

interface VoiceInfo {
  name: string;
  languageCodes: string[];
  ssmlGender: string;
  naturalSampleRateHertz: number;
}

async function listVoices(languageCode?: string): Promise<VoiceInfo[]> {
  const [response] = await client.listVoices({
    languageCode: languageCode ?? "",
  });

  return (response.voices ?? []).map((v) => ({
    name: v.name ?? "",
    languageCodes: v.languageCodes ?? [],
    ssmlGender: v.ssmlGender ?? "NEUTRAL",
    naturalSampleRateHertz: v.naturalSampleRateHertz ?? 24000,
  }));
}

async function findBestVoice(
  languageCode: string,
  gender: "MALE" | "FEMALE" | "NEUTRAL"
): Promise<string | null> {
  const voices = await listVoices(languageCode);
  const neural2 = voices.find(
    (v) => v.name.includes("Neural2") && v.ssmlGender === gender
  );
  const wavenet = voices.find(
    (v) => v.name.includes("Wavenet") && v.ssmlGender === gender
  );
  return neural2?.name ?? wavenet?.name ?? voices[0]?.name ?? null;
}

Multilingual Speech

Synthesize speech segments in multiple languages, pairing each segment with a language-specific voice:

interface MultilingualSegment {
  text: string;
  languageCode: string;
  voiceName: string;
}

async function synthesizeMultilingual(
  segments: MultilingualSegment[]
): Promise<Buffer[]> {
  const results: Buffer[] = [];

  for (const segment of segments) {
    const [response] = await client.synthesizeSpeech({
      input: { text: segment.text },
      voice: {
        languageCode: segment.languageCode,
        name: segment.voiceName,
      },
      audioConfig: {
        audioEncoding: "MP3",
        sampleRateHertz: 24000,
      },
    });

    results.push(Buffer.from(response.audioContent as Uint8Array));
  }

  return results;
}

// Usage with language-specific voices
const segments: MultilingualSegment[] = [
  { text: "Hello and welcome.", languageCode: "en-US", voiceName: "en-US-Neural2-C" },
  { text: "Bienvenue sur notre plateforme.", languageCode: "fr-FR", voiceName: "fr-FR-Neural2-A" },
  { text: "Willkommen auf unserer Plattform.", languageCode: "de-DE", voiceName: "de-DE-Neural2-B" },
];
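
The per-segment buffers can then be stitched together for sequential playback. MP3 frame streams can generally be concatenated directly into a playable stream (unlike LINEAR16/WAV output, where each buffer carries its own header); a minimal sketch:

```typescript
// Join MP3 segment buffers into one playable stream. This works because
// MP3 frames are self-contained; do NOT do this with LINEAR16/WAV output,
// whose per-file headers would be repeated mid-stream.
function concatMp3(segments: Buffer[]): Buffer {
  return Buffer.concat(segments);
}
```

For example, `concatMp3(await synthesizeMultilingual(segments))` yields a single buffer covering all three greetings.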

Batch Synthesis with Caching

Synthesize multiple texts with result caching:

import { createHash } from "node:crypto";

const audioCache = new Map<string, Buffer>();

function cacheKey(text: string, voice: string, encoding: string): string {
  return createHash("sha256")
    .update(`${text}|${voice}|${encoding}`)
    .digest("hex");
}

async function synthesizeBatch(
  items: Array<{ text: string; voiceName: string }>,
  encoding: "MP3" | "OGG_OPUS" | "LINEAR16" = "MP3"
): Promise<Map<string, Buffer>> {
  const results = new Map<string, Buffer>();

  for (const item of items) {
    const key = cacheKey(item.text, item.voiceName, encoding);

    if (audioCache.has(key)) {
      results.set(item.text, audioCache.get(key)!);
      continue;
    }

    const langCode = item.voiceName.split("-").slice(0, 2).join("-");
    const [response] = await client.synthesizeSpeech({
      input: { text: item.text },
      voice: { languageCode: langCode, name: item.voiceName },
      audioConfig: { audioEncoding: encoding },
    });

    const buffer = Buffer.from(response.audioContent as Uint8Array);
    audioCache.set(key, buffer);
    results.set(item.text, buffer);
  }

  return results;
}
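
The in-memory Map above is lost on restart. For persistence, the same hash key can name a file on disk; a sketch (the CACHE_DIR path is an assumption, point it wherever suits your deployment):

```typescript
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Disk-backed variant of the cache above: one file per hash key.
// CACHE_DIR is illustrative, not part of any API.
const CACHE_DIR = "./tts-cache";

async function cachedAudio(key: string): Promise<Buffer | null> {
  try {
    return await readFile(join(CACHE_DIR, `${key}.bin`));
  } catch {
    return null; // cache miss
  }
}

async function storeAudio(key: string, audio: Buffer): Promise<void> {
  await mkdir(CACHE_DIR, { recursive: true });
  await writeFile(join(CACHE_DIR, `${key}.bin`), audio);
}
```

Check cachedAudio before calling synthesizeSpeech, and storeAudio after, using the same cacheKey hash.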

Best Practices

  • Prefer Neural2 over WaveNet — Neural2 voices are newer, higher quality, and often cheaper. Check availability for your target language before falling back to WaveNet.
  • Use SSML for production content — SSML gives you control over pauses, pronunciation, and emphasis that plain text cannot provide. Invest in learning SSML tags.
  • Apply audio profiles — A single audio profile string can dramatically improve perceived quality on specific devices. Always set one for known playback targets.
  • Set appropriate sample rates — 24000 Hz for high quality. 16000 Hz for telephony. 8000 Hz for legacy systems. Higher is not always better when bandwidth is limited.
  • Cache synthesized audio — Google charges per character. Hash your inputs and cache results to avoid re-synthesizing identical content.
  • Use OGG_OPUS for web — Opus provides better quality at lower bitrates than MP3. Use it for web applications where browser support is available.
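
The sample-rate and encoding guidance above can be captured in one helper. A sketch, following the rates listed in this document; the target names ("hifi", "web", "telephony", "legacy") are illustrative, not a Google Cloud API concept:

```typescript
// Map a playback target to a reasonable audioConfig.
type PlaybackTarget = "hifi" | "web" | "telephony" | "legacy";

interface AudioSettings {
  audioEncoding: "MP3" | "OGG_OPUS" | "LINEAR16";
  sampleRateHertz: number;
  effectsProfileId?: string[];
}

function audioConfigFor(target: PlaybackTarget): AudioSettings {
  switch (target) {
    case "hifi":
      // High quality for headphones / home audio.
      return {
        audioEncoding: "MP3",
        sampleRateHertz: 24000,
        effectsProfileId: ["headphone-class-device"],
      };
    case "web":
      // Opus beats MP3 at low bitrates where browsers support it.
      return { audioEncoding: "OGG_OPUS", sampleRateHertz: 24000 };
    case "telephony":
      return {
        audioEncoding: "LINEAR16",
        sampleRateHertz: 16000,
        effectsProfileId: ["telephony-class-application"],
      };
    case "legacy":
      return { audioEncoding: "LINEAR16", sampleRateHertz: 8000 };
  }
}
```

Spread the result into the audioConfig field of a synthesizeSpeech request.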

Anti-Patterns

  • Using Standard voices when Neural2 is available — Standard voices are noticeably lower quality. The cost difference is minimal for most applications.
  • Sending text longer than 5000 bytes — The API has a per-request limit. Split long text at sentence boundaries and concatenate the resulting audio.
  • Ignoring SSML escaping — Characters like <, >, and & must be escaped in SSML. Failing to escape them causes synthesis errors.
  • Hardcoding voice names — Voice availability changes over time. Use listVoices to dynamically select voices, or maintain a fallback list.
  • Skipping audio profiles for known devices — Not setting an audio profile when you know the target device misses free quality improvement.
  • Ignoring quotas and billing — Google Cloud TTS has per-minute and per-day quotas. Monitor usage and set billing alerts to avoid surprises.
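
The 5000-byte limit can be handled by splitting input at sentence boundaries before synthesis. A sketch, measuring with Buffer.byteLength since the limit is in bytes, not characters; the default of 4800 leaves headroom and is my own choice:

```typescript
// Split text into chunks under maxBytes, breaking at sentence boundaries.
// Assumes sentences end with ".", "!", or "?"; a single sentence longer
// than maxBytes is emitted as its own (oversized) chunk.
function chunkText(text: string, maxBytes = 4800): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current + sentence;
    if (current && Buffer.byteLength(candidate, "utf8") > maxBytes) {
      chunks.push(current.trim());
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Synthesize each chunk separately, then concatenate the resulting audio buffers.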
