Amazon Polly Skill

Core Philosophy

Amazon Polly is AWS's text-to-speech service providing neural and standard voices across dozens of languages. It integrates naturally with the AWS ecosystem and offers unique features like custom lexicons, speech marks for lip-sync, and direct S3 output for batch tasks. Build with these principles:

  • Neural voices for quality — Always use neural engine voices when available for your language. Standard voices exist for backward compatibility and edge cases.
  • Leverage speech marks — Polly uniquely provides word-level timing, viseme (mouth shape), and SSML mark data. Use these for animation, subtitles, and synchronized experiences.
  • Use lexicons for domain terminology — Custom pronunciation lexicons let you control how Polly pronounces brand names, acronyms, and technical terms without modifying input text.
  • Integrate with AWS services — Store output in S3, trigger synthesis from Lambda, and use SQS for batch queues. Polly fits naturally in AWS architectures.

Setup

Install the Polly client from the AWS SDK for JavaScript v3 (`npm install @aws-sdk/client-polly`), then create a client:

import {
  PollyClient,
  SynthesizeSpeechCommand,
  DescribeVoicesCommand,
  PutLexiconCommand,
  GetSpeechSynthesisTaskCommand,
  StartSpeechSynthesisTaskCommand,
} from "@aws-sdk/client-polly";
import { writeFile } from "node:fs/promises";
import { Readable } from "node:stream";

const polly = new PollyClient({
  region: process.env.AWS_REGION ?? "us-east-1",
});

Key Techniques

Basic Text-to-Speech

Synthesize speech and save to file:

async function synthesizeSpeech(
  text: string,
  voiceId: string = "Joanna",
  engine: "neural" | "standard" = "neural"
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: engine,
    OutputFormat: "mp3",
    SampleRate: "24000",
  });

  const response = await polly.send(command);

  if (!response.AudioStream) {
    throw new Error("No audio stream returned");
  }

  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Usage
const audio = await synthesizeSpeech(
  "Welcome to our application. Your account is ready.",
  "Matthew",
  "neural"
);
await writeFile("welcome.mp3", audio);

SSML Synthesis

Use SSML for advanced speech control:

async function synthesizeSSML(
  ssml: string,
  voiceId: string = "Joanna"
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: ssml,
    TextType: "ssml",
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "mp3",
    SampleRate: "24000",
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Build SSML with Polly-specific features
function buildPollySSML(options: {
  text: string;
  newscasterStyle?: boolean;
  conversationalStyle?: boolean;
  whisper?: boolean;
  breathe?: boolean;
}): string {
  let ssml = "<speak>";

  if (options.newscasterStyle) {
    ssml += '<amazon:domain name="news">';
    ssml += options.text;
    ssml += "</amazon:domain>";
  } else if (options.conversationalStyle) {
    ssml += '<amazon:domain name="conversational">';
    ssml += options.text;
    ssml += "</amazon:domain>";
  } else if (options.whisper) {
    ssml += '<amazon:effect name="whispered">';
    ssml += options.text;
    ssml += "</amazon:effect>";
  } else {
    ssml += options.text;
  }

  ssml += "</speak>";
  return ssml;
}

// Newscaster voice
const newsSSML = buildPollySSML({
  text: "Today's top story. Markets reached an all-time high.",
  newscasterStyle: true,
});

Speech Marks for Synchronization

Generate timing data for word highlighting, subtitles, or lip-sync animation:

interface SpeechMark {
  time: number;
  type: "word" | "sentence" | "viseme" | "ssml";
  start?: number;
  end?: number;
  value: string;
}

async function getSpeechMarks(
  text: string,
  voiceId: string = "Joanna",
  markTypes: Array<"word" | "sentence" | "viseme" | "ssml"> = ["word", "sentence"]
): Promise<SpeechMark[]> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "json",
    SpeechMarkTypes: markTypes,
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }

  const jsonLines = Buffer.concat(chunks).toString("utf-8").trim();
  return jsonLines
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line) as SpeechMark);
}

// Generate both audio and speech marks for synchronized playback
async function synthesizeWithMarks(
  text: string,
  voiceId: string
): Promise<{ audio: Buffer; marks: SpeechMark[] }> {
  const [audio, marks] = await Promise.all([
    synthesizeSpeech(text, voiceId),
    getSpeechMarks(text, voiceId, ["word", "sentence", "viseme"]),
  ]);
  return { audio, marks };
}
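
As an illustration of the subtitle use case, word-level speech marks can be folded into SRT cues. This is a sketch, not part of the Polly API: the helper names (`marksToSrt`, `msToSrtTime`) and the fixed words-per-cue grouping are hypothetical choices, and a cue's end time is approximated by the next group's start time.

```typescript
// Convert Polly word marks (time in ms, value = the word) into SRT subtitles.
interface WordMark {
  time: number;
  value: string;
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function msToSrtTime(ms: number): string {
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

function marksToSrt(
  marks: WordMark[],
  wordsPerCue = 6,
  totalMs?: number
): string {
  const cues: string[] = [];
  for (let i = 0; i < marks.length; i += wordsPerCue) {
    const group = marks.slice(i, i + wordsPerCue);
    const start = group[0].time;
    // End the cue where the next group starts; the last cue falls back to
    // the audio duration (if known) or one second past its first word.
    const end =
      i + wordsPerCue < marks.length
        ? marks[i + wordsPerCue].time
        : totalMs ?? group[group.length - 1].time + 1000;
    cues.push(
      `${cues.length + 1}\n${msToSrtTime(start)} --> ${msToSrtTime(end)}\n` +
        `${group.map((w) => w.value).join(" ")}\n`
    );
  }
  return cues.join("\n");
}
```

Feed it the `word`-type marks from `getSpeechMarks` plus the known audio duration for the tightest final cue.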

Custom Lexicons

Define pronunciation rules for domain-specific terms:

async function createLexicon(
  name: string,
  entries: Array<{ grapheme: string; alias?: string; phoneme?: string }>
): Promise<void> {
  // Note: grapheme/alias/phoneme values are inserted verbatim; escape XML
  // special characters (& < > " ') before passing user-supplied text.
  let content = '<?xml version="1.0" encoding="UTF-8"?>\n';
  content +=
    '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" ';
  content += 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ';
  content +=
    'xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon ';
  content += 'http://www.w3.org/2005/01/pronunciation-lexicon.xsd" ';
  content += 'alphabet="ipa" xml:lang="en-US">\n';

  for (const entry of entries) {
    content += "  <lexeme>\n";
    content += `    <grapheme>${entry.grapheme}</grapheme>\n`;
    if (entry.alias) {
      content += `    <alias>${entry.alias}</alias>\n`;
    }
    if (entry.phoneme) {
      content += `    <phoneme>${entry.phoneme}</phoneme>\n`;
    }
    content += "  </lexeme>\n";
  }

  content += "</lexicon>";

  const command = new PutLexiconCommand({
    Name: name,
    Content: content,
  });

  await polly.send(command);
}

// Usage: define pronunciation for company names
await createLexicon("company-terms", [
  { grapheme: "AWS", alias: "Amazon Web Services" },
  { grapheme: "kubectl", alias: "kube control" },
  { grapheme: "nginx", alias: "engine x" },
]);

// Use the lexicon in synthesis
async function synthesizeWithLexicon(
  text: string,
  lexiconNames: string[]
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: "Joanna",
    Engine: "neural",
    OutputFormat: "mp3",
    LexiconNames: lexiconNames,
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

Async Synthesis for Long Content

Use async tasks with S3 output for text longer than 3000 characters:

async function startLongSynthesis(
  text: string,
  outputBucket: string,
  outputKeyPrefix: string,
  voiceId: string = "Joanna"
): Promise<string> {
  const command = new StartSpeechSynthesisTaskCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "mp3",
    OutputS3BucketName: outputBucket,
    // Polly appends the task ID and file extension to this prefix
    OutputS3KeyPrefix: outputKeyPrefix,
    SampleRate: "24000",
  });

  const response = await polly.send(command);
  const taskId = response.SynthesisTask?.TaskId;
  if (!taskId) {
    throw new Error("No synthesis task ID returned");
  }
  return taskId;
}

async function checkSynthesisStatus(
  taskId: string
): Promise<{ status: string; outputUri?: string }> {
  const command = new GetSpeechSynthesisTaskCommand({ TaskId: taskId });
  const response = await polly.send(command);
  const task = response.SynthesisTask;

  return {
    status: task?.TaskStatus ?? "unknown",
    outputUri: task?.OutputUri,
  };
}
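
A poll-until-done loop pairs naturally with `checkSynthesisStatus`. The sketch below injects the status check as a callback so it can be tested without AWS; the helper name and options are illustrative, not part of the Polly API. Polly reports `TaskStatus` as `scheduled`, `inProgress`, `completed`, or `failed`.

```typescript
// Poll a status callback until the synthesis task completes or fails.
async function waitForSynthesisTask(
  check: () => Promise<{ status: string; outputUri?: string }>,
  {
    intervalMs = 2000,
    maxAttempts = 90,
  }: { intervalMs?: number; maxAttempts?: number } = {}
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status, outputUri } = await check();
    if (status === "completed" && outputUri) {
      return outputUri; // S3 URI of the finished audio file
    }
    if (status === "failed") {
      throw new Error("Speech synthesis task failed");
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for synthesis task");
}

// Usage: const uri = await waitForSynthesisTask(() => checkSynthesisStatus(taskId));
```

For production workloads, consider an EventBridge rule or S3 event notification instead of polling.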

Listing Available Voices

Query available voices with filtering:

interface PollyVoice {
  id: string;
  name: string;
  gender: string;
  languageCode: string;
  languageName: string;
  engine: string[];
}

async function listVoices(languageCode?: string): Promise<PollyVoice[]> {
  const command = new DescribeVoicesCommand({
    LanguageCode: languageCode,
  });

  const response = await polly.send(command);

  return (response.Voices ?? []).map((v) => ({
    id: v.Id ?? "",
    name: v.Name ?? "",
    gender: v.Gender ?? "",
    languageCode: v.LanguageCode ?? "",
    languageName: v.LanguageName ?? "",
    engine: v.SupportedEngines ?? [],
  }));
}

async function findNeuralVoice(
  languageCode: string,
  gender: "Male" | "Female"
): Promise<string | null> {
  const voices = await listVoices(languageCode);
  const match = voices.find(
    (v) => v.engine.includes("neural") && v.gender === gender
  );
  return match?.id ?? null;
}

Best Practices

  • Use neural engine whenever possible — Neural voices sound significantly more natural. Check voice availability with DescribeVoices and filter by SupportedEngines.
  • Use async tasks for long text — The synchronous API accepts at most 3000 billed characters per request. For longer content, use StartSpeechSynthesisTask, which writes directly to S3.
  • Leverage speech marks for multimedia — Speech marks enable word highlighting in karaoke-style apps, lip-sync for avatars, and precise subtitle timing.
  • Create lexicons for your domain — Define pronunciations for brand names, acronyms, and technical terms once and reference them across all synthesis requests.
  • Use Polly-specific SSML — Amazon Polly supports <amazon:domain> for newscaster and conversational styles and <amazon:effect> for whispered speech. These are unique to Polly.
  • Choose output format by use case — Use mp3 for general playback, ogg_vorbis for web streaming, and pcm for audio processing pipelines.
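
The format guidance above can be captured in a small helper. This is a sketch; the function name, use-case labels, and sample rates are hypothetical choices (Polly itself accepts `OutputFormat` and `SampleRate` as plain strings):

```typescript
type AudioUseCase = "playback" | "web-streaming" | "audio-processing";

// Map a use case to Polly synthesis parameters per the guidance above.
function outputParamsFor(
  useCase: AudioUseCase
): { OutputFormat: string; SampleRate: string } {
  switch (useCase) {
    case "playback":
      return { OutputFormat: "mp3", SampleRate: "24000" };
    case "web-streaming":
      return { OutputFormat: "ogg_vorbis", SampleRate: "24000" };
    case "audio-processing":
      // PCM output is raw 16-bit signed little-endian samples
      return { OutputFormat: "pcm", SampleRate: "16000" };
  }
}
```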

Anti-Patterns

  • Using standard engine for new projects — Standard voices exist for backward compatibility. Always start with neural voices and fall back only when a language lacks neural support.
  • Exceeding the synchronous character limit — SynthesizeSpeech accepts at most 3000 billed characters (6000 total characters, counting SSML tags); longer input returns an error. Use async tasks for longer content.
  • Ignoring speech marks — Building manual timing for subtitles or animations when Polly provides precise, free speech mark data is unnecessary work.
  • Creating too many lexicons — Polly limits lexicons per region. Consolidate related entries into fewer, well-organized lexicons.
  • Not handling throttling — Polly has per-account rate limits. Implement exponential backoff and request queuing for high-volume applications.
  • Storing credentials in code — Use IAM roles for EC2/Lambda or environment variables for local development. Never hardcode AWS credentials.
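
The throttling point above can be handled with a retry wrapper. A sketch, with the caveat that the helper name and options are illustrative and that SDK v3 clients already ship a built-in retry strategy (configurable via `maxAttempts`), so an app-level wrapper like this mainly helps when you also want queuing or custom delay policy. The AWS SDK surfaces throttling as errors with `name === "ThrottlingException"`:

```typescript
// Retry a call with capped exponential backoff and full jitter on throttling.
async function withBackoff<T>(
  fn: () => Promise<T>,
  {
    maxRetries = 5,
    baseMs = 1000,
    capMs = 10000,
  }: { maxRetries?: number; baseMs?: number; capMs?: number } = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const throttled =
        err instanceof Error && err.name === "ThrottlingException";
      // Rethrow immediately on non-throttling errors or exhausted retries
      if (!throttled || attempt >= maxRetries) throw err;
      // Full jitter: sleep a random fraction of the capped exponential delay
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const audio = await withBackoff(() => synthesizeSpeech(text));
```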
