Amazon Polly
"Amazon Polly: AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming"
Core Philosophy
Amazon Polly is AWS's text-to-speech service providing neural and standard voices across dozens of languages. It integrates naturally with the AWS ecosystem and offers unique features like custom lexicons, speech marks for lip-sync, and direct S3 output for batch tasks. Build with these principles:
- Neural voices for quality — Always use neural engine voices when available for your language. Standard voices exist for backward compatibility and edge cases.
- Leverage speech marks — Polly uniquely provides word-level timing, viseme (mouth shape), and SSML mark data. Use these for animation, subtitles, and synchronized experiences.
- Use lexicons for domain terminology — Custom pronunciation lexicons let you control how Polly pronounces brand names, acronyms, and technical terms without modifying input text.
- Integrate with AWS services — Store output in S3, trigger synthesis from Lambda, and use SQS for batch queues. Polly fits naturally in AWS architectures.
Setup
Install the AWS SDK v3 Polly client (`npm install @aws-sdk/client-polly`), then import the commands and create a client:
```typescript
import {
  PollyClient,
  SynthesizeSpeechCommand,
  DescribeVoicesCommand,
  PutLexiconCommand,
  GetSpeechSynthesisTaskCommand,
  StartSpeechSynthesisTaskCommand,
} from "@aws-sdk/client-polly";
import { writeFile } from "node:fs/promises";
import { Readable } from "node:stream";

const polly = new PollyClient({
  region: process.env.AWS_REGION ?? "us-east-1",
});
```
Key Techniques
Basic Text-to-Speech
Synthesize speech and save to file:
```typescript
async function synthesizeSpeech(
  text: string,
  voiceId: string = "Joanna",
  engine: "neural" | "standard" = "neural"
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: engine,
    OutputFormat: "mp3",
    SampleRate: "24000",
  });

  const response = await polly.send(command);
  if (!response.AudioStream) {
    throw new Error("No audio stream returned");
  }

  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Usage
const audio = await synthesizeSpeech(
  "Welcome to our application. Your account is ready.",
  "Matthew",
  "neural"
);
await writeFile("welcome.mp3", audio);
```
SSML Synthesis
Use SSML for advanced speech control:
```typescript
async function synthesizeSSML(
  ssml: string,
  voiceId: string = "Joanna"
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: ssml,
    TextType: "ssml",
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "mp3",
    SampleRate: "24000",
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Build SSML with Polly-specific features.
// Note: <amazon:domain> styles require specific neural voices, while the
// whispered <amazon:effect> is supported by standard-engine voices only,
// so match the engine to the tag you use.
function buildPollySSML(options: {
  text: string;
  newscasterStyle?: boolean;
  conversationalStyle?: boolean;
  whisper?: boolean;
}): string {
  let ssml = "<speak>";
  if (options.newscasterStyle) {
    ssml += '<amazon:domain name="news">';
    ssml += options.text;
    ssml += "</amazon:domain>";
  } else if (options.conversationalStyle) {
    ssml += '<amazon:domain name="conversational">';
    ssml += options.text;
    ssml += "</amazon:domain>";
  } else if (options.whisper) {
    ssml += '<amazon:effect name="whispered">';
    ssml += options.text;
    ssml += "</amazon:effect>";
  } else {
    ssml += options.text;
  }
  ssml += "</speak>";
  return ssml;
}

// Newscaster voice
const newsSSML = buildPollySSML({
  text: "Today's top story. Markets reached an all-time high.",
  newscasterStyle: true,
});
```
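Because `buildPollySSML` concatenates raw text into markup, characters like `&` and `<` in user-supplied text would produce invalid SSML and fail synthesis. A minimal escaping helper to run on the text first (the function name is ours, not part of the Polly API):

```typescript
// Escape XML-reserved characters so arbitrary text can be embedded
// safely inside an SSML document. Replace & first so already-escaped
// entities are not double-mangled later in the chain.
function escapeSsmlText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}
```

Call it on `options.text` before passing it to `buildPollySSML` whenever the text comes from users or external data.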
Speech Marks for Synchronization
Generate timing data for word highlighting, subtitles, or lip-sync animation:
```typescript
interface SpeechMark {
  time: number;
  type: "word" | "sentence" | "viseme" | "ssml";
  start?: number;
  end?: number;
  value: string;
}

async function getSpeechMarks(
  text: string,
  voiceId: string = "Joanna",
  markTypes: Array<"word" | "sentence" | "viseme" | "ssml"> = ["word", "sentence"]
): Promise<SpeechMark[]> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "json",
    SpeechMarkTypes: markTypes,
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }

  // Speech marks come back as newline-delimited JSON objects
  const jsonLines = Buffer.concat(chunks).toString("utf-8").trim();
  return jsonLines
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line) as SpeechMark);
}

// Generate both audio and speech marks for synchronized playback
async function synthesizeWithMarks(
  text: string,
  voiceId: string
): Promise<{ audio: Buffer; marks: SpeechMark[] }> {
  const [audio, marks] = await Promise.all([
    synthesizeSpeech(text, voiceId),
    getSpeechMarks(text, voiceId, ["word", "sentence", "viseme"]),
  ]);
  return { audio, marks };
}
```
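Sentence-level marks map directly onto subtitle formats. A sketch that converts them to SRT (the helper names and the `totalDurationMs` parameter are ours; Polly marks carry only start times, so each cue ends where the next begins, and the caller supplies the audio duration for the final cue):

```typescript
// Format a millisecond offset as an SRT timestamp: HH:MM:SS,mmm
function msToSrtTime(ms: number): string {
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build an SRT document from sentence-type speech marks.
// The mark shape matches what getSpeechMarks returns.
function marksToSrt(
  marks: Array<{ time: number; type: string; value: string }>,
  totalDurationMs: number
): string {
  const sentences = marks.filter((m) => m.type === "sentence");
  return sentences
    .map((mark, i) => {
      // Each cue runs until the next sentence starts (or the audio ends)
      const end = sentences[i + 1]?.time ?? totalDurationMs;
      return `${i + 1}\n${msToSrtTime(mark.time)} --> ${msToSrtTime(end)}\n${mark.value}\n`;
    })
    .join("\n");
}
```

Pair this with `synthesizeWithMarks` to emit an audio file and a matching `.srt` in one pass.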
Custom Lexicons
Define pronunciation rules for domain-specific terms:
```typescript
async function createLexicon(
  name: string,
  entries: Array<{ grapheme: string; alias?: string; phoneme?: string }>
): Promise<void> {
  let content = '<?xml version="1.0" encoding="UTF-8"?>\n';
  content +=
    '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" ';
  content += 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ';
  content +=
    'xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon ' +
    'http://www.w3.org/2005/01/pronunciation-lexicon.xsd" ';
  content += 'alphabet="ipa" xml:lang="en-US">\n';

  for (const entry of entries) {
    content += "  <lexeme>\n";
    content += `    <grapheme>${entry.grapheme}</grapheme>\n`;
    if (entry.alias) {
      content += `    <alias>${entry.alias}</alias>\n`;
    }
    if (entry.phoneme) {
      content += `    <phoneme>${entry.phoneme}</phoneme>\n`;
    }
    content += "  </lexeme>\n";
  }
  content += "</lexicon>";

  const command = new PutLexiconCommand({
    Name: name,
    Content: content,
  });
  await polly.send(command);
}

// Usage: define pronunciation for company names
await createLexicon("company-terms", [
  { grapheme: "AWS", alias: "Amazon Web Services" },
  { grapheme: "kubectl", alias: "kube control" },
  { grapheme: "nginx", alias: "engine x" },
]);

// Use the lexicon in synthesis
async function synthesizeWithLexicon(
  text: string,
  lexiconNames: string[]
): Promise<Buffer> {
  const command = new SynthesizeSpeechCommand({
    Text: text,
    VoiceId: "Joanna",
    Engine: "neural",
    OutputFormat: "mp3",
    LexiconNames: lexiconNames,
  });

  const response = await polly.send(command);
  const chunks: Buffer[] = [];
  const readable = response.AudioStream as Readable;
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
```
Async Synthesis for Long Content
Use async tasks with S3 output for text longer than 3000 characters:
```typescript
async function startLongSynthesis(
  text: string,
  outputBucket: string,
  outputKey: string,
  voiceId: string = "Joanna"
): Promise<string> {
  const command = new StartSpeechSynthesisTaskCommand({
    Text: text,
    VoiceId: voiceId,
    Engine: "neural",
    OutputFormat: "mp3",
    OutputS3BucketName: outputBucket,
    OutputS3KeyPrefix: outputKey,
    SampleRate: "24000",
  });

  const response = await polly.send(command);
  return response.SynthesisTask?.TaskId ?? "";
}

async function checkSynthesisStatus(
  taskId: string
): Promise<{ status: string; outputUri?: string }> {
  const command = new GetSpeechSynthesisTaskCommand({ TaskId: taskId });
  const response = await polly.send(command);
  const task = response.SynthesisTask;
  return {
    status: task?.TaskStatus ?? "unknown",
    outputUri: task?.OutputUri,
  };
}
```
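`StartSpeechSynthesisTask` returns immediately, so callers typically poll until the task finishes. A polling sketch built on `checkSynthesisStatus` above (the injected check function and the default interval are our choices; Polly reports `TaskStatus` values including `inProgress`, `completed`, and `failed`):

```typescript
// Poll a synthesis task until it completes or fails. The status check
// is injected so the loop can wrap checkSynthesisStatus in production
// and a stub in tests.
type StatusCheck = (taskId: string) => Promise<{ status: string; outputUri?: string }>;

async function waitForSynthesis(
  taskId: string,
  check: StatusCheck,
  { intervalMs = 5000, maxAttempts = 60 } = {}
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status, outputUri } = await check(taskId);
    if (status === "completed" && outputUri) {
      return outputUri; // S3 URI of the finished audio file
    }
    if (status === "failed") {
      throw new Error(`Synthesis task ${taskId} failed`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Synthesis task ${taskId} did not finish in time`);
}
```

Usage: `const uri = await waitForSynthesis(taskId, checkSynthesisStatus);`. For high-volume batch work, an SNS notification on task completion avoids polling entirely.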
Listing Available Voices
Query available voices with filtering:
```typescript
interface PollyVoice {
  id: string;
  name: string;
  gender: string;
  languageCode: string;
  languageName: string;
  engine: string[];
}

async function listVoices(languageCode?: string): Promise<PollyVoice[]> {
  const command = new DescribeVoicesCommand({
    LanguageCode: languageCode,
  });
  const response = await polly.send(command);
  return (response.Voices ?? []).map((v) => ({
    id: v.Id ?? "",
    name: v.Name ?? "",
    gender: v.Gender ?? "",
    languageCode: v.LanguageCode ?? "",
    languageName: v.LanguageName ?? "",
    engine: v.SupportedEngines ?? [],
  }));
}

async function findNeuralVoice(
  languageCode: string,
  gender: "Male" | "Female"
): Promise<string | null> {
  const voices = await listVoices(languageCode);
  const match = voices.find(
    (v) => v.engine.includes("neural") && v.gender === gender
  );
  return match?.id ?? null;
}
```
Best Practices
- Use neural engine whenever possible — Neural voices sound significantly more natural. Check voice availability with `DescribeVoices` and filter by `SupportedEngines`.
- Use async tasks for long text — The synchronous API has a 3000-character limit. For longer content, use `StartSpeechSynthesisTask`, which writes directly to S3.
- Leverage speech marks for multimedia — Speech marks enable word highlighting in karaoke-style apps, lip-sync for avatars, and precise subtitle timing.
- Create lexicons for your domain — Define pronunciations for brand names, acronyms, and technical terms once and reference them across all synthesis requests.
- Use Polly-specific SSML — Amazon Polly supports `<amazon:domain>` for newscaster and conversational styles and `<amazon:effect>` for whispered speech. These are unique to Polly.
- Choose output format by use case — Use `mp3` for general playback, `ogg_vorbis` for web streaming, and `pcm` for audio processing pipelines.
Anti-Patterns
- Using standard engine for new projects — Standard voices exist for backward compatibility. Always start with neural voices and fall back only when a language lacks neural support.
- Exceeding the synchronous character limit — Sending more than 3000 characters (6000 for SSML) to `SynthesizeSpeech` returns an error. Use async tasks for longer content.
- Ignoring speech marks — Building manual timing for subtitles or animations when Polly provides precise, free speech mark data is unnecessary work.
- Creating too many lexicons — Polly limits lexicons per region. Consolidate related entries into fewer, well-organized lexicons.
- Not handling throttling — Polly has per-account rate limits. Implement exponential backoff and request queuing for high-volume applications.
- Storing credentials in code — Use IAM roles for EC2/Lambda or environment variables for local development. Never hardcode AWS credentials.
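The throttling point above can be addressed with a small retry wrapper. A generic exponential-backoff sketch (the error names checked are an assumption about how the SDK surfaces throttling; adjust them to what your account actually returns):

```typescript
// Retry an async operation with exponential backoff and jitter.
// Only throttling-style errors are retried; everything else rethrows.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 200
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const name = (err as Error).name;
      const throttled =
        name === "ThrottlingException" || name === "TooManyRequestsException";
      if (!throttled || attempt >= maxRetries) throw err;
      // Double the delay each attempt, with jitter to spread retries out
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `const audio = await withBackoff(() => synthesizeSpeech("Hello"));`. The AWS SDK v3 also has built-in retry configuration on the client (`maxAttempts`), which may be enough for lighter workloads.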
Related Skills
AssemblyAI
"AssemblyAI: speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis"
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
"Deepgram: speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming"
ElevenLabs
"ElevenLabs: AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming"
Google Cloud Text to Speech
"Google Cloud Text-to-Speech: WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual"
OpenAI TTS
"OpenAI TTS: text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats"