# Deepgram Skill
## Core Philosophy

Deepgram provides fast, accurate speech-to-text using deep learning models optimized for different domains. It excels at real-time transcription via WebSocket streaming and offers rich post-processing features. Build with these principles:

- **Real-time by default** — Use WebSocket streaming for live audio. Reserve REST endpoints for pre-recorded files only.
- **Choose the right model** — Deepgram offers domain-specific models (general, meeting, phonecall, finance, etc.). Select the model that matches your audio source for best accuracy.
- **Leverage smart formatting** — Enable features like punctuation, paragraphs, and numerals to get production-ready transcripts without post-processing.
- **Use callbacks for async processing** — For large pre-recorded files, use callback URLs instead of waiting synchronously.
## Setup

Install the SDK (`npm install @deepgram/sdk`) and create a client:

```typescript
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { readFile } from "node:fs/promises";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
```
## Key Techniques
### Pre-Recorded Audio Transcription

Transcribe an audio file from disk:

```typescript
interface TranscriptionResult {
  transcript: string;
  confidence: number;
  words: Array<{
    word: string;
    start: number;
    end: number;
    confidence: number;
  }>;
}

async function transcribeFile(filePath: string): Promise<TranscriptionResult> {
  const audioBuffer = await readFile(filePath);
  const { result } = await deepgram.listen.prerecorded.transcribeFile(
    audioBuffer,
    {
      model: "nova-2",
      smart_format: true,
      punctuate: true,
      paragraphs: true,
      diarize: true,
      language: "en",
    }
  );
  const channel = result.results.channels[0];
  const alternative = channel.alternatives[0];
  return {
    transcript: alternative.transcript,
    confidence: alternative.confidence,
    words: alternative.words.map((w) => ({
      word: w.word,
      start: w.start,
      end: w.end,
      confidence: w.confidence,
    })),
  };
}
```
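Word-level confidence scores make it easy to flag passages for human review. A minimal sketch of such a filter, reusing the word shape from the result above (the `0.8` threshold is an arbitrary example, not a Deepgram recommendation):

```typescript
interface Word {
  word: string;
  start: number;
  end: number;
  confidence: number;
}

// Return the words whose confidence falls below a threshold,
// e.g. to highlight them for human review in a transcript UI.
function lowConfidenceWords(words: Word[], threshold = 0.8): Word[] {
  return words.filter((w) => w.confidence < threshold);
}
```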
### Transcribe from URL

Transcribe audio hosted at a URL without downloading it first:

```typescript
async function transcribeUrl(audioUrl: string): Promise<string> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      punctuate: true,
      summarize: "v2",
      topics: true,
      intents: true,
    }
  );
  return result.results.channels[0].alternatives[0].transcript;
}
```
### Real-Time WebSocket Streaming

Stream live audio from a microphone or audio source for real-time transcription:

```typescript
interface LiveTranscript {
  text: string;
  isFinal: boolean;
  speechFinal: boolean;
  start: number;
  duration: number;
}

function createLiveTranscription(
  onTranscript: (transcript: LiveTranscript) => void,
  onError: (error: Error) => void
): {
  sendAudio: (chunk: Buffer) => void;
  close: () => void;
} {
  const connection = deepgram.listen.live({
    model: "nova-2",
    language: "en",
    smart_format: true,
    punctuate: true,
    interim_results: true,
    utterance_end_ms: 1000,
    vad_events: true,
    encoding: "linear16",
    sample_rate: 16000,
    channels: 1,
  });
  connection.on(LiveTranscriptionEvents.Open, () => {
    console.log("Deepgram connection established");
  });
  connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const alt = data.channel.alternatives[0];
    if (alt && alt.transcript) {
      onTranscript({
        text: alt.transcript,
        isFinal: data.is_final,
        speechFinal: data.speech_final,
        start: data.start,
        duration: data.duration,
      });
    }
  });
  connection.on(LiveTranscriptionEvents.Error, (err) => {
    onError(new Error(String(err)));
  });
  return {
    sendAudio: (chunk: Buffer) => connection.send(chunk),
    close: () => connection.requestClose(),
  };
}
```
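Because interim results may be revised, a consumer of `createLiveTranscription` should commit only final segments. A minimal accumulator sketch, redeclaring the `LiveTranscript` shape from above for self-containment:

```typescript
interface LiveTranscript {
  text: string;
  isFinal: boolean;
  speechFinal: boolean;
  start: number;
  duration: number;
}

// Collects committed transcript text; interim (revisable) results are
// kept only as a display tail and replaced on every update.
class TranscriptAccumulator {
  private finalParts: string[] = [];
  private interim = "";

  handle(t: LiveTranscript): void {
    if (t.isFinal) {
      this.finalParts.push(t.text);
      this.interim = "";
    } else {
      this.interim = t.text; // may still change; show but don't store
    }
  }

  // Committed text plus the current unstable interim tail.
  get display(): string {
    return [...this.finalParts, this.interim].filter(Boolean).join(" ");
  }
}
```

An instance's `handle` method (bound, e.g. `(t) => acc.handle(t)`) can be passed as the `onTranscript` callback.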
### Speaker Diarization

Identify distinct speakers in audio:

```typescript
interface DiarizedSegment {
  speaker: number;
  text: string;
  start: number;
  end: number;
}

async function transcribeWithSpeakers(
  filePath: string
): Promise<DiarizedSegment[]> {
  const audioBuffer = await readFile(filePath);
  const { result } = await deepgram.listen.prerecorded.transcribeFile(
    audioBuffer,
    {
      model: "nova-2",
      smart_format: true,
      diarize: true,
      punctuate: true,
    }
  );
  const words = result.results.channels[0].alternatives[0].words;
  const segments: DiarizedSegment[] = [];
  let currentSegment: DiarizedSegment | null = null;
  for (const word of words) {
    if (!currentSegment || currentSegment.speaker !== word.speaker) {
      if (currentSegment) segments.push(currentSegment);
      currentSegment = {
        speaker: word.speaker ?? 0,
        text: word.punctuated_word ?? word.word,
        start: word.start,
        end: word.end,
      };
    } else {
      currentSegment.text += ` ${word.punctuated_word ?? word.word}`;
      currentSegment.end = word.end;
    }
  }
  if (currentSegment) segments.push(currentSegment);
  return segments;
}
```
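The segments produced above can be rendered as a readable, speaker-labeled transcript. A small formatting sketch (the `[0.0s] Speaker 0:` layout is just one choice):

```typescript
interface DiarizedSegment {
  speaker: number;
  text: string;
  start: number;
  end: number;
}

// Render diarized segments one per line, prefixed with start time
// and speaker label.
function formatDiarized(segments: DiarizedSegment[]): string {
  return segments
    .map((s) => `[${s.start.toFixed(1)}s] Speaker ${s.speaker}: ${s.text}`)
    .join("\n");
}
```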
### Sentiment and Topic Detection

Extract sentiment and topics from transcribed audio:

```typescript
interface AnalysisResult {
  transcript: string;
  summary: string;
  topics: Array<{ topic: string; confidence: number }>;
  sentiments: Array<{
    text: string;
    sentiment: string;
    confidence: number;
  }>;
}

async function analyzeAudio(audioUrl: string): Promise<AnalysisResult> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      summarize: "v2",
      topics: true,
      sentiment: true,
      intents: true,
    }
  );
  const channel = result.results.channels[0];
  const alt = channel.alternatives[0];
  return {
    transcript: alt.transcript,
    summary: result.results.summary?.short ?? "",
    topics:
      result.results.topics?.segments.flatMap((s) =>
        s.topics.map((t) => ({
          topic: t.topic,
          confidence: t.confidence_score,
        }))
      ) ?? [],
    sentiments:
      result.results.sentiments?.segments.map((s) => ({
        text: s.text,
        sentiment: s.sentiment,
        confidence: s.confidence,
      })) ?? [],
  };
}
```
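To summarize the per-segment sentiments returned above into an overall picture, a simple tally works. A sketch, assuming the segment shape used in `AnalysisResult`:

```typescript
// Count how many segments carry each sentiment label
// (e.g. "positive" / "neutral" / "negative").
function sentimentCounts(
  sentiments: Array<{ sentiment: string }>
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const s of sentiments) {
    counts[s.sentiment] = (counts[s.sentiment] ?? 0) + 1;
  }
  return counts;
}
```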
### Callback-Based Processing

Use callbacks for processing large files asynchronously:

```typescript
async function transcribeWithCallback(
  audioUrl: string,
  callbackUrl: string
): Promise<{ requestId: string }> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      callback: callbackUrl,
      callback_method: "post",
    }
  );
  return { requestId: result.metadata?.request_id ?? "" };
}
```
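When processing completes, Deepgram POSTs the result JSON to the callback URL. A defensive sketch of extracting the primary transcript from that payload, assuming it mirrors the synchronous response shape used elsewhere in this skill (validate against the payloads your account actually receives):

```typescript
// Pull the primary transcript out of a callback payload.
// Returns null if the payload doesn't have the expected shape.
function transcriptFromCallback(payload: unknown): string | null {
  const p = payload as {
    results?: {
      channels?: Array<{ alternatives?: Array<{ transcript?: string }> }>;
    };
  };
  return p?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? null;
}
```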
## Best Practices

- **Use Nova-2 as your default model** — It provides the best accuracy across most use cases. Only switch to domain-specific models if you have tested and confirmed improvement.
- **Enable `smart_format`** — It handles punctuation, capitalization, numerals, and formatting automatically. This removes the need for most post-processing.
- **Set appropriate `utterance_end_ms`** — For real-time transcription, tune this value. Lower values (500ms) give faster responses but may split sentences. Higher values (1500ms) produce more complete utterances.
- **Send keep-alive messages** — For long-running WebSocket connections, send periodic keep-alive data to prevent timeout disconnections.
- **Use linear16 encoding for live audio** — PCM 16-bit at 16kHz is the most reliable format for real-time streaming. Avoid compressed formats for live input.
- **Handle interim results properly** — Interim results may change. Only treat `is_final: true` results as committed transcript. Use `speech_final` to detect end of utterance.
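The keep-alive advice can be sketched as a small helper that pings on an interval. Here `send` is whatever transmits the keep-alive to the socket (the v3 SDK exposes a `keepAlive()` method on the live connection; verify against your SDK version):

```typescript
// Call `send` every intervalMs; returns a function that stops the timer.
// Call the returned function when the connection closes to avoid leaks.
function startKeepAlive(send: () => void, intervalMs = 5000): () => void {
  const timer = setInterval(send, intervalMs);
  return () => clearInterval(timer);
}
```

Typical wiring: `const stop = startKeepAlive(() => connection.keepAlive());` on open, and `stop()` on close or error.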
## Anti-Patterns

- **Sending compressed audio over WebSocket** — WebSocket streaming expects raw audio. Sending MP3 or AAC adds decode latency and can cause errors. Use linear16 PCM.
- **Ignoring diarization confidence** — Speaker labels can be unreliable for overlapping speech. Validate speaker assignments in multi-speaker scenarios.
- **Polling for async results** — Use callback URLs instead of polling. Polling wastes resources and adds latency.
- **Using one model for all domains** — A meeting transcription model performs poorly on phone calls and vice versa. Test model variants against your actual audio.
- **Buffering entire audio before sending** — For live streaming, send audio chunks as they arrive (every 100-250ms). Buffering defeats the purpose of real-time transcription.
- **Skipping error handling on WebSocket** — WebSocket connections can drop. Always implement reconnection logic with exponential backoff.
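The exponential-backoff reconnection advice reduces to a delay schedule. A minimal sketch of the delay calculation (the base and cap values are illustrative defaults, not Deepgram requirements):

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to maxMs.
// Call with the current reconnection attempt number (0-based) and
// sleep for the returned duration before re-opening the connection.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

A reconnect loop would wrap the `deepgram.listen.live(...)` setup, resetting `attempt` to 0 after a successful open; adding random jitter to the delay is also common to avoid thundering-herd reconnects.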