# Deepgram Skill
## Core Philosophy

Deepgram provides fast, accurate speech-to-text using deep learning models optimized for different domains. It excels at real-time transcription via WebSocket streaming and offers rich post-processing features. Build with these principles:

- **Real-time by default** — Use WebSocket streaming for live audio. Reserve REST endpoints for pre-recorded files only.
- **Choose the right model** — Deepgram offers domain-specific models (general, meeting, phonecall, finance, etc.). Select the model that matches your audio source for best accuracy.
- **Leverage smart formatting** — Enable features like punctuation, paragraphs, and numerals to get production-ready transcripts without post-processing.
- **Use callbacks for async processing** — For large pre-recorded files, use callback URLs instead of waiting synchronously.
## Setup

Install the SDK (`npm install @deepgram/sdk`) and create a client:

```typescript
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
import { readFile } from "node:fs/promises";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
```
## Key Techniques
### Pre-Recorded Audio Transcription

Transcribe an audio file from disk:

```typescript
interface TranscriptionResult {
  transcript: string;
  confidence: number;
  words: Array<{
    word: string;
    start: number;
    end: number;
    confidence: number;
  }>;
}

async function transcribeFile(filePath: string): Promise<TranscriptionResult> {
  const audioBuffer = await readFile(filePath);
  const { result } = await deepgram.listen.prerecorded.transcribeFile(
    audioBuffer,
    {
      model: "nova-2",
      smart_format: true,
      punctuate: true,
      paragraphs: true,
      diarize: true,
      language: "en",
    }
  );
  const channel = result.results.channels[0];
  const alternative = channel.alternatives[0];
  return {
    transcript: alternative.transcript,
    confidence: alternative.confidence,
    words: alternative.words.map((w) => ({
      word: w.word,
      start: w.start,
      end: w.end,
      confidence: w.confidence,
    })),
  };
}
```
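Word-level confidence scores make it easy to flag passages for human review. A minimal sketch of such a filter, reusing the word shape from the result above (the `0.8` threshold is an arbitrary example, not a Deepgram recommendation):

```typescript
interface Word {
  word: string;
  start: number;
  end: number;
  confidence: number;
}

// Return the words whose confidence falls below a threshold,
// e.g. to highlight them for human review in a transcript UI.
function lowConfidenceWords(words: Word[], threshold = 0.8): Word[] {
  return words.filter((w) => w.confidence < threshold);
}
```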
### Transcribe from URL

Transcribe audio hosted at a URL without downloading it first:

```typescript
async function transcribeUrl(audioUrl: string): Promise<string> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      punctuate: true,
      summarize: "v2",
      topics: true,
      intents: true,
    }
  );
  return result.results.channels[0].alternatives[0].transcript;
}
```
### Real-Time WebSocket Streaming

Stream live audio from a microphone or audio source for real-time transcription:

```typescript
interface LiveTranscript {
  text: string;
  isFinal: boolean;
  speechFinal: boolean;
  start: number;
  duration: number;
}

function createLiveTranscription(
  onTranscript: (transcript: LiveTranscript) => void,
  onError: (error: Error) => void
): {
  sendAudio: (chunk: Buffer) => void;
  close: () => void;
} {
  const connection = deepgram.listen.live({
    model: "nova-2",
    language: "en",
    smart_format: true,
    punctuate: true,
    interim_results: true,
    utterance_end_ms: 1000,
    vad_events: true,
    encoding: "linear16",
    sample_rate: 16000,
    channels: 1,
  });
  connection.on(LiveTranscriptionEvents.Open, () => {
    console.log("Deepgram connection established");
  });
  connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const alt = data.channel.alternatives[0];
    if (alt && alt.transcript) {
      onTranscript({
        text: alt.transcript,
        isFinal: data.is_final,
        speechFinal: data.speech_final,
        start: data.start,
        duration: data.duration,
      });
    }
  });
  connection.on(LiveTranscriptionEvents.Error, (err) => {
    onError(new Error(String(err)));
  });
  return {
    sendAudio: (chunk: Buffer) => connection.send(chunk),
    close: () => connection.requestClose(),
  };
}
```
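Because interim results may be revised, a consumer of `createLiveTranscription` should commit only final segments. A minimal accumulator sketch, redeclaring the `LiveTranscript` shape from above for self-containment:

```typescript
interface LiveTranscript {
  text: string;
  isFinal: boolean;
  speechFinal: boolean;
  start: number;
  duration: number;
}

// Collects committed transcript text; interim (revisable) results are
// kept only as a display tail and replaced on every update.
class TranscriptAccumulator {
  private finalParts: string[] = [];
  private interim = "";

  handle(t: LiveTranscript): void {
    if (t.isFinal) {
      this.finalParts.push(t.text);
      this.interim = "";
    } else {
      this.interim = t.text; // may still change; show but don't store
    }
  }

  // Committed text plus the current unstable interim tail.
  get display(): string {
    return [...this.finalParts, this.interim].filter(Boolean).join(" ");
  }
}
```

An instance's `handle` method (bound, e.g. `(t) => acc.handle(t)`) can be passed as the `onTranscript` callback.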
### Speaker Diarization

Identify distinct speakers in audio:

```typescript
interface DiarizedSegment {
  speaker: number;
  text: string;
  start: number;
  end: number;
}

async function transcribeWithSpeakers(
  filePath: string
): Promise<DiarizedSegment[]> {
  const audioBuffer = await readFile(filePath);
  const { result } = await deepgram.listen.prerecorded.transcribeFile(
    audioBuffer,
    {
      model: "nova-2",
      smart_format: true,
      diarize: true,
      punctuate: true,
    }
  );
  const words = result.results.channels[0].alternatives[0].words;
  const segments: DiarizedSegment[] = [];
  let currentSegment: DiarizedSegment | null = null;
  for (const word of words) {
    if (!currentSegment || currentSegment.speaker !== word.speaker) {
      if (currentSegment) segments.push(currentSegment);
      currentSegment = {
        speaker: word.speaker ?? 0,
        text: word.punctuated_word ?? word.word,
        start: word.start,
        end: word.end,
      };
    } else {
      currentSegment.text += ` ${word.punctuated_word ?? word.word}`;
      currentSegment.end = word.end;
    }
  }
  if (currentSegment) segments.push(currentSegment);
  return segments;
}
```
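The segments produced above can be rendered as a readable, speaker-labeled transcript. A small formatting sketch (the `[0.0s] Speaker 0:` layout is just one choice):

```typescript
interface DiarizedSegment {
  speaker: number;
  text: string;
  start: number;
  end: number;
}

// Render diarized segments one per line, prefixed with start time
// and speaker label.
function formatDiarized(segments: DiarizedSegment[]): string {
  return segments
    .map((s) => `[${s.start.toFixed(1)}s] Speaker ${s.speaker}: ${s.text}`)
    .join("\n");
}
```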
### Sentiment and Topic Detection

Extract sentiment and topics from transcribed audio:

```typescript
interface AnalysisResult {
  transcript: string;
  summary: string;
  topics: Array<{ topic: string; confidence: number }>;
  sentiments: Array<{
    text: string;
    sentiment: string;
    confidence: number;
  }>;
}

async function analyzeAudio(audioUrl: string): Promise<AnalysisResult> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      summarize: "v2",
      topics: true,
      sentiment: true,
      intents: true,
    }
  );
  const channel = result.results.channels[0];
  const alt = channel.alternatives[0];
  return {
    transcript: alt.transcript,
    summary: result.results.summary?.short ?? "",
    topics:
      result.results.topics?.segments.flatMap((s) =>
        s.topics.map((t) => ({
          topic: t.topic,
          confidence: t.confidence_score,
        }))
      ) ?? [],
    sentiments:
      result.results.sentiments?.segments.map((s) => ({
        text: s.text,
        sentiment: s.sentiment,
        confidence: s.confidence,
      })) ?? [],
  };
}
```
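To summarize the per-segment sentiments returned above into an overall picture, a simple tally works. A sketch, assuming the segment shape used in `AnalysisResult`:

```typescript
// Count how many segments carry each sentiment label
// (e.g. "positive" / "neutral" / "negative").
function sentimentCounts(
  sentiments: Array<{ sentiment: string }>
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const s of sentiments) {
    counts[s.sentiment] = (counts[s.sentiment] ?? 0) + 1;
  }
  return counts;
}
```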
### Callback-Based Processing

Use callbacks for processing large files asynchronously:

```typescript
async function transcribeWithCallback(
  audioUrl: string,
  callbackUrl: string
): Promise<{ requestId: string }> {
  const { result } = await deepgram.listen.prerecorded.transcribeUrl(
    { url: audioUrl },
    {
      model: "nova-2",
      smart_format: true,
      callback: callbackUrl,
      callback_method: "post",
    }
  );
  return { requestId: result.metadata?.request_id ?? "" };
}
```
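When processing completes, Deepgram POSTs the result JSON to the callback URL. A defensive sketch of extracting the primary transcript from that payload, assuming it mirrors the synchronous response shape used elsewhere in this skill (validate against the payloads your account actually receives):

```typescript
// Pull the primary transcript out of a callback payload.
// Returns null if the payload doesn't have the expected shape.
function transcriptFromCallback(payload: unknown): string | null {
  const p = payload as {
    results?: {
      channels?: Array<{ alternatives?: Array<{ transcript?: string }> }>;
    };
  };
  return p?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? null;
}
```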
## Best Practices

- **Use Nova-2 as your default model** — It provides the best accuracy across most use cases. Only switch to domain-specific models if you have tested and confirmed improvement.
- **Enable `smart_format`** — It handles punctuation, capitalization, numerals, and formatting automatically. This removes the need for most post-processing.
- **Set appropriate `utterance_end_ms`** — For real-time transcription, tune this value. Lower values (500ms) give faster responses but may split sentences. Higher values (1500ms) produce more complete utterances.
- **Send keep-alive messages** — For long-running WebSocket connections, send periodic keep-alive data to prevent timeout disconnections.
- **Use linear16 encoding for live audio** — PCM 16-bit at 16kHz is the most reliable format for real-time streaming. Avoid compressed formats for live input.
- **Handle interim results properly** — Interim results may change. Only treat `is_final: true` results as committed transcript. Use `speech_final` to detect end of utterance.
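The keep-alive advice can be sketched as a small helper that pings on an interval. Here `send` is whatever transmits the keep-alive to the socket (the v3 SDK exposes a `keepAlive()` method on the live connection; verify against your SDK version):

```typescript
// Call `send` every intervalMs; returns a function that stops the timer.
// Call the returned function when the connection closes to avoid leaks.
function startKeepAlive(send: () => void, intervalMs = 5000): () => void {
  const timer = setInterval(send, intervalMs);
  return () => clearInterval(timer);
}
```

Typical wiring: `const stop = startKeepAlive(() => connection.keepAlive());` on open, and `stop()` on close or error.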
## Anti-Patterns

- **Sending compressed audio over WebSocket** — WebSocket streaming expects raw audio. Sending MP3 or AAC adds decode latency and can cause errors. Use linear16 PCM.
- **Ignoring diarization confidence** — Speaker labels can be unreliable for overlapping speech. Validate speaker assignments in multi-speaker scenarios.
- **Polling for async results** — Use callback URLs instead of polling. Polling wastes resources and adds latency.
- **Using one model for all domains** — A meeting transcription model performs poorly on phone calls and vice versa. Test model variants against your actual audio.
- **Buffering entire audio before sending** — For live streaming, send audio chunks as they arrive (every 100-250ms). Buffering defeats the purpose of real-time transcription.
- **Skipping error handling on WebSocket** — WebSocket connections can drop. Always implement reconnection logic with exponential backoff.
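The exponential-backoff reconnection advice reduces to a delay schedule. A minimal sketch of the delay calculation (the base and cap values are illustrative defaults, not Deepgram requirements):

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to maxMs.
// Call with the current reconnection attempt number (0-based) and
// sleep for the returned duration before re-opening the connection.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

A reconnect loop would wrap the `deepgram.listen.live(...)` setup, resetting `attempt` to 0 after a successful open; adding random jitter to the delay is also common to avoid thundering-herd reconnects.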