AssemblyAI

"AssemblyAI: speech-to-text, real-time transcription, speaker diarization, content moderation, summarization, sentiment analysis"

AssemblyAI Skill

Core Philosophy

AssemblyAI provides speech-to-text with built-in audio intelligence features like summarization, sentiment analysis, content moderation, and topic detection. The API is designed around a simple submit-and-poll model for pre-recorded audio and WebSocket streaming for real-time use cases. Build with these principles:

  • Audio intelligence as a first-class feature — Do not build summarization or sentiment analysis yourself. AssemblyAI provides these as native features on every transcription.
  • Submit and poll for batch, stream for live — Use the async transcription API for pre-recorded audio. Use real-time WebSocket streaming only for live audio that must be transcribed immediately.
  • Enable only what you need — Each audio intelligence feature adds processing time and cost. Enable features selectively per request.
  • Use LeMUR for AI-powered analysis — AssemblyAI's LeMUR framework lets you ask questions about transcripts using an LLM, eliminating custom analysis pipelines.

Setup

Install the AssemblyAI SDK with npm install assemblyai, then create a client. The SDK exposes AssemblyAI as a named export:

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: process.env.ASSEMBLYAI_API_KEY!,
});

Key Techniques

Basic Transcription

Transcribe an audio file by uploading it or providing a URL:

async function transcribeFile(filePath: string): Promise<string> {
  const transcript = await client.transcripts.transcribe({
    audio: filePath,
    language_code: "en",
    punctuate: true,
    format_text: true,
  });

  // status is a string union: "queued" | "processing" | "completed" | "error"
  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }

  return transcript.text ?? "";
}

async function transcribeUrl(audioUrl: string): Promise<string> {
  const transcript = await client.transcripts.transcribe({
    audio_url: audioUrl,
    language_code: "en",
  });

  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }

  return transcript.text ?? "";
}

Speaker Diarization

Identify and label different speakers in a conversation:

interface SpeakerUtterance {
  speaker: string;
  text: string;
  start: number;
  end: number;
  confidence: number;
}

async function transcribeWithSpeakers(
  audioUrl: string,
  expectedSpeakers?: number
): Promise<SpeakerUtterance[]> {
  const transcript = await client.transcripts.transcribe({
    audio_url: audioUrl,
    speaker_labels: true,
    speakers_expected: expectedSpeakers,
  });

  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }

  return (transcript.utterances ?? []).map((u) => ({
    speaker: u.speaker,
    text: u.text,
    start: u.start,
    end: u.end,
    confidence: u.confidence,
  }));
}
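
The utterances returned above are easy to render as a readable dialogue. A minimal sketch of such a formatter (the SpeakerUtterance shape mirrors the interface defined above; AssemblyAI reports start and end in milliseconds, and diarization labels speakers "A", "B", and so on):

```typescript
interface SpeakerUtterance {
  speaker: string;
  text: string;
  start: number; // milliseconds
  end: number;   // milliseconds
  confidence: number;
}

// Render "[MM:SS] Speaker A: text" lines from diarized utterances.
function formatDialogue(utterances: SpeakerUtterance[]): string {
  const mmss = (ms: number): string => {
    const totalSeconds = Math.floor(ms / 1000);
    const minutes = Math.floor(totalSeconds / 60);
    const seconds = totalSeconds % 60;
    return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
  };
  return utterances
    .map((u) => `[${mmss(u.start)}] Speaker ${u.speaker}: ${u.text}`)
    .join("\n");
}
```
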

Audio Intelligence Features

Enable summarization, sentiment analysis, topic detection, and content moderation in a single request:

interface AudioIntelligence {
  transcript: string;
  summary: string;
  sentiments: Array<{
    text: string;
    sentiment: "POSITIVE" | "NEGATIVE" | "NEUTRAL";
    confidence: number;
  }>;
  topics: Array<{
    text: string;
    labels: Array<{ label: string; relevance: number }>;
  }>;
  contentSafety: Array<{
    text: string;
    labels: Array<{ label: string; confidence: number; severity: number }>;
  }>;
}

async function transcribeWithIntelligence(
  audioUrl: string
): Promise<AudioIntelligence> {
  const transcript = await client.transcripts.transcribe({
    audio_url: audioUrl,
    summarization: true,
    summary_model: "informative",
    summary_type: "bullets",
    sentiment_analysis: true,
    iab_categories: true,
    content_safety: true,
  });

  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }

  return {
    transcript: transcript.text ?? "",
    summary: transcript.summary ?? "",
    sentiments: (transcript.sentiment_analysis_results ?? []).map((s) => ({
      text: s.text,
      sentiment: s.sentiment,
      confidence: s.confidence,
    })),
    topics: (transcript.iab_categories_result?.results ?? []).map((t) => ({
      text: t.text,
      labels: t.labels.map((l) => ({
        label: l.label,
        relevance: l.relevance,
      })),
    })),
    contentSafety: (transcript.content_safety_labels?.results ?? []).map(
      (c) => ({
        text: c.text,
        labels: c.labels.map((l) => ({
          label: l.label,
          confidence: l.confidence,
          severity: l.severity,
        })),
      })
    ),
  };
}
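
When acting on content safety results, you usually care only about segments above a severity cutoff. A hedged sketch of post-filtering the contentSafety array built above (the threshold is an application choice, not an API parameter; severity is reported on a 0-1 scale):

```typescript
interface SafetyHit {
  text: string;
  labels: Array<{ label: string; confidence: number; severity: number }>;
}

// Keep only segments with at least one label at or above the cutoff,
// dropping the labels that fall below it.
function filterBySeverity(results: SafetyHit[], minSeverity: number): SafetyHit[] {
  return results
    .map((r) => ({
      text: r.text,
      labels: r.labels.filter((l) => l.severity >= minSeverity),
    }))
    .filter((r) => r.labels.length > 0);
}
```
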

Real-Time Transcription

Stream live audio via WebSocket for real-time transcription:

interface RealtimeCallbacks {
  onPartialTranscript: (text: string) => void;
  onFinalTranscript: (text: string) => void;
  onError: (error: Error) => void;
}

async function createRealtimeSession(
  callbacks: RealtimeCallbacks
): Promise<{
  sendAudio: (chunk: Buffer) => void;
  close: () => Promise<void>;
}> {
  const transcriber = client.realtime.transcriber({
    sampleRate: 16000,
    encoding: "pcm_s16le",
  });

  transcriber.on("transcript", (msg) => {
    if (msg.message_type === "PartialTranscript" && msg.text) {
      callbacks.onPartialTranscript(msg.text);
    }
    if (msg.message_type === "FinalTranscript" && msg.text) {
      callbacks.onFinalTranscript(msg.text);
    }
  });

  transcriber.on("error", (err) => {
    callbacks.onError(err instanceof Error ? err : new Error(String(err)));
  });

  await transcriber.connect();

  return {
    sendAudio: (chunk: Buffer) => transcriber.sendAudio(chunk),
    close: async () => {
      await transcriber.close();
    },
  };
}
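
Realtime sessions work best with small, fixed-size PCM chunks rather than one large buffer. A minimal chunking sketch (1600 bytes corresponds to 50 ms of 16 kHz, 16-bit mono audio, matching the sampleRate and encoding above; the 50 ms duration itself is an illustrative choice, not an SDK requirement):

```typescript
// Split raw PCM audio into fixed-size chunks suitable for sendAudio().
// 16 kHz * 2 bytes/sample * 0.05 s = 1600 bytes per 50 ms chunk.
function chunkPcm(audio: Buffer, chunkBytes = 1600): Buffer[] {
  const chunks: Buffer[] = [];
  for (let offset = 0; offset < audio.length; offset += chunkBytes) {
    chunks.push(audio.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}

// Usage with the session above:
// const session = await createRealtimeSession(callbacks);
// for (const chunk of chunkPcm(rawPcm)) session.sendAudio(chunk);
```
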

LeMUR Analysis

Use the LeMUR framework to ask questions about transcripts:

async function askAboutTranscript(
  transcriptId: string,
  question: string
): Promise<string> {
  const response = await client.lemur.questionAnswer({
    transcript_ids: [transcriptId],
    questions: [{ question, answer_format: "short" }],
    final_model: "anthropic/claude-3-5-sonnet",
  });

  return response.response[0]?.answer ?? "";
}

async function summarizeTranscript(
  transcriptId: string,
  context: string
): Promise<string> {
  const response = await client.lemur.summary({
    transcript_ids: [transcriptId],
    context,
    final_model: "anthropic/claude-3-5-sonnet",
    answer_format: "bullet points",
  });

  return response.response;
}

async function extractActionItems(
  transcriptIds: string[]
): Promise<string> {
  const response = await client.lemur.task({
    transcript_ids: transcriptIds,
    prompt:
      "Extract all action items from this meeting. For each action item, " +
      "include who is responsible and any deadline mentioned. Format as a " +
      "numbered list.",
    final_model: "anthropic/claude-3-5-sonnet",
  });

  return response.response;
}

Word-Level Timestamps and Search

Get precise word timestamps and search within transcripts:

async function getWordTimestamps(
  audioUrl: string
): Promise<Array<{ word: string; start: number; end: number }>> {
  const transcript = await client.transcripts.transcribe({
    audio_url: audioUrl,
  });

  if (transcript.status === "error") {
    throw new Error(`Transcription failed: ${transcript.error}`);
  }

  return (transcript.words ?? []).map((w) => ({
    word: w.text,
    start: w.start,
    end: w.end,
  }));
}
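
Word timestamps make caption generation straightforward. A sketch that groups the words returned above into fixed-size SRT cues (the words-per-cue count is an arbitrary illustration; AssemblyAI reports start and end in milliseconds):

```typescript
interface TimedWord {
  word: string;
  start: number; // milliseconds
  end: number;   // milliseconds
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(ms: number): string {
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Build numbered SRT cues of wordsPerCue words each.
function wordsToSrt(words: TimedWord[], wordsPerCue = 8): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    cues.push(
      `${cues.length + 1}\n${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n` +
        group.map((w) => w.word).join(" ")
    );
  }
  return cues.join("\n\n");
}
```
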

async function searchTranscript(
  transcriptId: string,
  words: string[]
): Promise<
  Array<{ text: string; count: number; timestamps: Array<{ start: number; end: number }> }>
> {
  const results = await client.transcripts.wordSearch(transcriptId, words);
  // The API returns each match's timestamps as [start, end] millisecond pairs.
  return results.matches.map((m) => ({
    text: m.text,
    count: m.count,
    timestamps: m.timestamps.map(([start, end]) => ({ start, end })),
  }));
}

Best Practices

  • Use the SDK's built-in polling — The transcribe method handles polling automatically. Do not implement your own poll loop.
  • Set language_code explicitly — Auto-detection works but specifying the language improves accuracy and reduces processing time.
  • Use summary_model wisely — Choose informative for factual summaries, conversational for meeting notes, and catchy for headlines.
  • Batch related transcripts for LeMUR — Pass multiple transcript IDs to a single LeMUR request to analyze conversations across multiple recordings.
  • Store transcript IDs — Keep transcript IDs so you can retrieve results later, run LeMUR queries, or search without re-transcribing.
  • Handle status errors explicitly — Always check transcript.status after transcription. A completed request can still have an error status.
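
The "store transcript IDs" practice can be as simple as an index keyed by audio source. A minimal in-memory sketch (names are illustrative; persist the mapping however your application stores state):

```typescript
// Map an audio source (URL or file path) to its AssemblyAI transcript ID,
// so later LeMUR queries or word searches can reuse it without re-transcribing.
class TranscriptIndex {
  private bySource = new Map<string, string>();

  record(source: string, transcriptId: string): void {
    this.bySource.set(source, transcriptId);
  }

  // Returns the stored ID, or undefined if this source was never transcribed.
  lookup(source: string): string | undefined {
    return this.bySource.get(source);
  }
}
```
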

Anti-Patterns

  • Enabling all intelligence features by default — Each feature adds processing time and cost. Enable only the features your application actually uses.
  • Polling manually instead of using the SDK — The SDK's transcribe method handles polling with proper backoff. Manual polling wastes resources and may hit rate limits.
  • Using real-time streaming for batch processing — Real-time streaming is designed for live audio. For pre-recorded files, the async transcription API is faster and more cost-effective.
  • Ignoring content safety results — If your application handles user-generated audio, always enable and act on content safety labels.
  • Re-transcribing to ask new questions — Use LeMUR on existing transcript IDs. There is no need to re-transcribe audio to perform new analysis.
  • Not setting speakers_expected — When you know the number of speakers, set this parameter. It significantly improves diarization accuracy.
