AssemblyAI
"AssemblyAI: speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis"
Core Philosophy
AssemblyAI provides speech-to-text with built-in audio intelligence features like summarization, sentiment analysis, content moderation, and topic detection. The API is designed around a simple submit-and-poll model for pre-recorded audio and WebSocket streaming for real-time use cases. Build with these principles:
- Audio intelligence as a first-class feature — Do not build summarization or sentiment analysis yourself. AssemblyAI provides these as native features on every transcription.
- Submit and poll for batch, stream for live — Use the async transcription API for pre-recorded audio. Use real-time WebSocket streaming only for live audio that must be transcribed immediately.
- Enable only what you need — Each audio intelligence feature adds processing time and cost. Enable features selectively per request.
- Use LeMUR for AI-powered analysis — AssemblyAI's LeMUR framework lets you ask questions about transcripts using an LLM, eliminating custom analysis pipelines.
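As a sketch of the "enable only what you need" principle, a request builder can translate an explicit feature list into request flags so nothing is switched on by accident. The `Feature` union and `buildTranscriptParams` helper below are illustrative, not part of the SDK:

```typescript
// Hypothetical helper: map the features a caller actually wants onto
// AssemblyAI transcript request flags, leaving everything else off.
type Feature = "summarization" | "sentiment" | "topics" | "content_safety";

function buildTranscriptParams(audioUrl: string, features: Feature[]) {
  return {
    audio: audioUrl,
    summarization: features.includes("summarization"),
    sentiment_analysis: features.includes("sentiment"),
    iab_categories: features.includes("topics"),
    content_safety: features.includes("content_safety"),
  };
}
```

The resulting object can be passed straight to `client.transcripts.transcribe`, and adding a new intelligence feature becomes a one-line change at the call site.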
Setup
Install the AssemblyAI SDK (`npm install assemblyai`), then create a client:

```typescript
import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
});
```
Key Techniques
Basic Transcription
Transcribe an audio file by uploading it or providing a URL:
```typescript
async function transcribeFile(filePath: string): Promise<string> {
  // The SDK's `audio` parameter accepts a local file path and uploads it.
  const transcript = await client.transcripts.transcribe({
    audio: filePath,
    language_code: "en",
    punctuate: true,
    format_text: true,
  });
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }
  return transcript.text ?? "";
}

async function transcribeUrl(audioUrl: string): Promise<string> {
  // The same `audio` parameter also accepts a publicly reachable URL.
  const transcript = await client.transcripts.transcribe({
    audio: audioUrl,
    language_code: "en",
  });
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }
  return transcript.text ?? "";
}
```
Speaker Diarization
Identify and label different speakers in a conversation:
```typescript
interface SpeakerUtterance {
  speaker: string;
  text: string;
  start: number;
  end: number;
  confidence: number;
}

async function transcribeWithSpeakers(
  audioUrl: string,
  expectedSpeakers?: number
): Promise<SpeakerUtterance[]> {
  const transcript = await client.transcripts.transcribe({
    audio: audioUrl,
    speaker_labels: true,
    speakers_expected: expectedSpeakers,
  });
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }
  return (transcript.utterances ?? []).map((u) => ({
    speaker: u.speaker,
    text: u.text,
    start: u.start,
    end: u.end,
    confidence: u.confidence,
  }));
}
```
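The utterances returned above can be rendered as a readable script. This small formatter is a sketch, not part of the SDK; it labels each line with the speaker and an mm:ss timestamp derived from the millisecond `start` value:

```typescript
interface SpeakerUtterance {
  speaker: string;
  text: string;
  start: number; // milliseconds
  end: number;
  confidence: number;
}

// Render diarized utterances as "Speaker A [01:05] ..." lines.
function formatDiarizedTranscript(utterances: SpeakerUtterance[]): string {
  const mmss = (ms: number) => {
    const totalSeconds = Math.floor(ms / 1000);
    const m = Math.floor(totalSeconds / 60);
    const s = totalSeconds % 60;
    return `${String(m).padStart(2, "0")}:${String(s).padStart(2, "0")}`;
  };
  return utterances
    .map((u) => `Speaker ${u.speaker} [${mmss(u.start)}] ${u.text}`)
    .join("\n");
}
```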
Audio Intelligence Features
Enable summarization, sentiment analysis, topic detection, and content moderation in a single request:
```typescript
interface AudioIntelligence {
  transcript: string;
  summary: string;
  sentiments: Array<{
    text: string;
    sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL";
    confidence: number;
  }>;
  topics: Array<{
    text: string;
    labels: Array<{ label: string; relevance: number }>;
  }>;
  contentSafety: Array<{
    text: string;
    labels: Array<{ label: string; confidence: number; severity: number }>;
  }>;
}

async function transcribeWithIntelligence(
  audioUrl: string
): Promise<AudioIntelligence> {
  const transcript = await client.transcripts.transcribe({
    audio: audioUrl,
    summarization: true,
    summary_model: "informative",
    summary_type: "bullets",
    sentiment_analysis: true,
    iab_categories: true,
    content_safety: true,
  });
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }
  return {
    transcript: transcript.text ?? "",
    summary: transcript.summary ?? "",
    sentiments: (transcript.sentiment_analysis_results ?? []).map((s) => ({
      text: s.text,
      sentiment: s.sentiment,
      confidence: s.confidence,
    })),
    topics: (transcript.iab_categories_result?.results ?? []).map((t) => ({
      text: t.text,
      labels: t.labels.map((l) => ({
        label: l.label,
        relevance: l.relevance,
      })),
    })),
    contentSafety: (transcript.content_safety_labels?.results ?? []).map(
      (c) => ({
        text: c.text,
        labels: c.labels.map((l) => ({
          label: l.label,
          confidence: l.confidence,
          severity: l.severity,
        })),
      })
    ),
  };
}
```
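One way to act on the content safety labels, rather than merely collecting them, is to flag segments where any label clears both a confidence and a severity threshold. The filter below is an illustrative sketch; the default thresholds are assumptions, not AssemblyAI recommendations:

```typescript
interface SafetySegment {
  text: string;
  labels: Array<{ label: string; confidence: number; severity: number }>;
}

// Return segments containing at least one label that clears both thresholds.
function flagUnsafeSegments(
  segments: SafetySegment[],
  minConfidence = 0.5,
  minSeverity = 0.5
): SafetySegment[] {
  return segments.filter((seg) =>
    seg.labels.some(
      (l) => l.confidence >= minConfidence && l.severity >= minSeverity
    )
  );
}
```

Flagged segments can then be routed to review queues or redaction, depending on the application's moderation policy.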
Real-Time Transcription
Stream live audio via WebSocket for real-time transcription:
```typescript
interface RealtimeCallbacks {
  onPartialTranscript: (text: string) => void;
  onFinalTranscript: (text: string) => void;
  onError: (error: Error) => void;
}

async function createRealtimeSession(
  callbacks: RealtimeCallbacks
): Promise<{
  sendAudio: (chunk: Buffer) => void;
  close: () => Promise<void>;
}> {
  const transcriber = client.realtime.transcriber({
    sampleRate: 16000,
    encoding: "pcm_s16le",
  });
  transcriber.on("transcript", (msg) => {
    if (msg.message_type === "PartialTranscript" && msg.text) {
      callbacks.onPartialTranscript(msg.text);
    }
    if (msg.message_type === "FinalTranscript" && msg.text) {
      callbacks.onFinalTranscript(msg.text);
    }
  });
  transcriber.on("error", (err) => {
    callbacks.onError(err instanceof Error ? err : new Error(String(err)));
  });
  await transcriber.connect();
  return {
    sendAudio: (chunk: Buffer) => transcriber.sendAudio(chunk),
    close: async () => {
      await transcriber.close();
    },
  };
}
```
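Live audio should be sent in small, even frames. For the 16 kHz, 16-bit mono PCM configured above, 100 ms of audio is 3,200 bytes (16,000 samples/s × 2 bytes × 0.1 s). This chunking helper is a sketch for feeding `sendAudio` from a larger buffer; the byte math assumes the `pcm_s16le`/16000 settings used above:

```typescript
// Split an s16le mono PCM buffer into fixed-duration frames for streaming.
// Bytes per frame = sampleRate * 2 * (frameMs / 1000).
function chunkPcm(audio: Buffer, sampleRate: number, frameMs: number): Buffer[] {
  const bytesPerFrame = Math.floor((sampleRate * 2 * frameMs) / 1000);
  const frames: Buffer[] = [];
  for (let offset = 0; offset < audio.length; offset += bytesPerFrame) {
    frames.push(audio.subarray(offset, offset + bytesPerFrame));
  }
  return frames;
}
```

Each frame can then be passed to the `sendAudio` function returned by `createRealtimeSession`, paced at roughly real-time intervals.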
LeMUR Analysis
Use the LeMUR framework to ask questions about transcripts:
```typescript
async function askAboutTranscript(
  transcriptId: string,
  question: string
): Promise<string> {
  const response = await client.lemur.questionAnswer({
    transcript_ids: [transcriptId],
    questions: [{ question, answer_format: "short" }],
    final_model: "anthropic/claude-3-5-sonnet",
  });
  return response.response[0]?.answer ?? "";
}

async function summarizeTranscript(
  transcriptId: string,
  context: string
): Promise<string> {
  const response = await client.lemur.summary({
    transcript_ids: [transcriptId],
    context,
    final_model: "anthropic/claude-3-5-sonnet",
    answer_format: "bullet points",
  });
  return response.response;
}

async function extractActionItems(
  transcriptIds: string[]
): Promise<string> {
  const response = await client.lemur.task({
    transcript_ids: transcriptIds,
    prompt:
      "Extract all action items from this meeting. For each action item, " +
      "include who is responsible and any deadline mentioned. Format as a " +
      "numbered list.",
    final_model: "anthropic/claude-3-5-sonnet",
  });
  return response.response;
}
```
Word-Level Timestamps and Search
Get precise word timestamps and search within transcripts:
```typescript
async function getWordTimestamps(
  audioUrl: string
): Promise<Array<{ word: string; start: number; end: number }>> {
  const transcript = await client.transcripts.transcribe({
    audio: audioUrl,
  });
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }
  return (transcript.words ?? []).map((w) => ({
    word: w.text,
    start: w.start,
    end: w.end,
  }));
}

async function searchTranscript(
  transcriptId: string,
  words: string[]
): Promise<
  Array<{
    text: string;
    count: number;
    timestamps: Array<{ start: number; end: number }>;
  }>
> {
  const results = await client.transcripts.wordSearch(transcriptId, words);
  // Each match reports the matched text, its hit count, and a
  // [start, end] millisecond tuple for every occurrence.
  return results.matches.map((m) => ({
    text: m.text,
    count: m.count,
    timestamps: m.timestamps.map(([start, end]) => ({ start, end })),
  }));
}
```
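Word-level timestamps also make it possible to build captions without extra API calls: AssemblyAI's millisecond `start`/`end` values map directly onto the SRT `HH:MM:SS,mmm` format. The cue grouping below is a sketch, and the words-per-cue choice is arbitrary:

```typescript
interface TimedWord {
  word: string;
  start: number; // milliseconds
  end: number;   // milliseconds
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(ms: number): string {
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const frac = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(frac, 3)}`;
}

// Group words into fixed-size cues and emit an SRT document.
function wordsToSrt(words: TimedWord[], wordsPerCue = 8): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const index = cues.length + 1;
    const text = group.map((w) => w.word).join(" ");
    cues.push(
      `${index}\n${srtTime(group[0].start)} --> ` +
        `${srtTime(group[group.length - 1].end)}\n${text}`
    );
  }
  return cues.join("\n\n");
}
```

For production captioning, also check whether AssemblyAI's own subtitle export endpoint covers your needs before rolling your own formatter.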
Best Practices
- Use the SDK's built-in polling — The `transcribe` method handles polling automatically. Do not implement your own poll loop.
- Set language_code explicitly — Auto-detection works, but specifying the language improves accuracy and reduces processing time.
- Use summary_model wisely — Choose `informative` for factual summaries, `conversational` for meeting notes, and `catchy` for headlines.
- Batch related transcripts for LeMUR — Pass multiple transcript IDs to a single LeMUR request to analyze conversations across multiple recordings.
- Store transcript IDs — Keep transcript IDs so you can retrieve results later, run LeMUR queries, or search without re-transcribing.
- Handle status errors explicitly — Always check `transcript.status` after transcription. A completed request can still have an error status.
Anti-Patterns
- Enabling all intelligence features by default — Each feature adds processing time and cost. Enable only the features your application actually uses.
- Polling manually instead of using the SDK — The SDK's `transcribe` method handles polling with proper backoff. Manual polling wastes resources and may hit rate limits.
- Using real-time streaming for batch processing — Real-time streaming is designed for live audio. For pre-recorded files, the async transcription API is faster and more cost-effective.
- Ignoring content safety results — If your application handles user-generated audio, always enable and act on content safety labels.
- Re-transcribing to ask new questions — Use LeMUR on existing transcript IDs. There is no need to re-transcribe audio to perform new analysis.
- Not setting speakers_expected — When you know the number of speakers, set this parameter. It significantly improves diarization accuracy.
Related Skills
Amazon Polly
"Amazon Polly: AWS text-to-speech, neural/standard voices, SSML, lexicons, speech marks, streaming"
Cartesia
Integrate Cartesia's ultra-low-latency voice API for real-time text-to-speech and voice cloning
Deepgram
"Deepgram: speech-to-text, real-time transcription, pre-recorded audio, diarization, sentiment analysis, WebSocket streaming"
ElevenLabs
"ElevenLabs: AI voice synthesis, text-to-speech, voice cloning, streaming audio, voice design, multilingual, WebSocket streaming"
Google Cloud Text to Speech
"Google Cloud Text-to-Speech: WaveNet/Neural2 voices, SSML, audio profiles, streaming, multilingual"
OpenAI TTS
"OpenAI TTS: text-to-speech API, voice selection (alloy/echo/fable/onyx/nova/shimmer), streaming, HD voices, audio formats"