# Embeddings
Text embeddings and semantic search with vector databases for LLM applications
You are an expert in text embeddings and semantic search for building LLM-powered applications.
## Overview
Text embeddings convert text into dense numerical vectors that capture semantic meaning. Similar texts produce vectors that are close together in vector space, enabling semantic search, clustering, classification, and retrieval-augmented generation (RAG). Embedding models from OpenAI, Cohere, and open-source projects (e.g., Sentence Transformers) power these capabilities, paired with vector databases for efficient similarity search at scale.
## Core Concepts

### Generating Embeddings with OpenAI
```typescript
import OpenAI from "openai";

const openai = new OpenAI();

async function embed(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

// Single text
const [vector] = await embed(["How do I reset my password?"]);
// vector is a float array of 1536 dimensions
```
### Cosine Similarity
```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score ranges from -1 (opposite) to 1 (identical meaning)
const score = cosineSimilarity(vectorA, vectorB);
```
### Embedding Models Comparison

| Model | Dimensions | Use Case |
|---|---|---|
| `text-embedding-3-small` | 1536 | Cost-effective general purpose |
| `text-embedding-3-large` | 3072 | Higher accuracy, larger index |
| Cohere `embed-english-v3.0` | 1024 | English-optimized, search/classify modes |
| `all-MiniLM-L6-v2` | 384 | Open-source, runs locally |
### Dimension Reduction
OpenAI's v3 models support native dimension reduction:
```typescript
const response = await openai.embeddings.create({
  model: "text-embedding-3-large",
  input: "Some text to embed",
  dimensions: 512, // Reduce from 3072 to 512
});
```
## Implementation Patterns

### Semantic Search with Pinecone
```typescript
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("knowledge-base");

// Index documents
async function indexDocuments(docs: { id: string; text: string; metadata: any }[]) {
  const texts = docs.map((d) => d.text);
  const embeddings = await embed(texts);
  const vectors = docs.map((doc, i) => ({
    id: doc.id,
    values: embeddings[i],
    metadata: { ...doc.metadata, text: doc.text },
  }));
  // Upsert in batches of 100
  for (let i = 0; i < vectors.length; i += 100) {
    await index.upsert(vectors.slice(i, i + 100));
  }
}

// Query
async function search(query: string, topK = 5) {
  const [queryVector] = await embed([query]);
  const results = await index.query({
    vector: queryVector,
    topK,
    includeMetadata: true,
  });
  return results.matches.map((m) => ({
    text: m.metadata?.text as string,
    score: m.score,
  }));
}
```
### Semantic Search with pgvector (PostgreSQL)
```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Setup (run once)
async function setupVectorTable() {
  await pool.query("CREATE EXTENSION IF NOT EXISTS vector");
  await pool.query(`
    CREATE TABLE IF NOT EXISTS documents (
      id SERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      embedding vector(1536),
      metadata JSONB DEFAULT '{}'
    )
  `);
  await pool.query(`
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `);
}

// Insert
async function insertDocument(content: string, metadata: any) {
  const [vector] = await embed([content]);
  await pool.query(
    "INSERT INTO documents (content, embedding, metadata) VALUES ($1, $2, $3)",
    [content, JSON.stringify(vector), JSON.stringify(metadata)]
  );
}

// Search
async function searchDocuments(query: string, limit = 5) {
  const [queryVector] = await embed([query]);
  const result = await pool.query(
    `SELECT content, metadata, 1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryVector), limit]
  );
  return result.rows;
}
```
### Chunking Strategy for Long Documents
```typescript
interface Chunk {
  text: string;
  index: number;
  metadata: Record<string, any>;
}

function chunkText(
  text: string,
  chunkSize = 500,
  overlap = 100
): Chunk[] {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: Chunk[] = [];
  let current = "";
  let chunkIndex = 0;
  for (const sentence of sentences) {
    if ((current + sentence).length > chunkSize && current.length > 0) {
      chunks.push({ text: current.trim(), index: chunkIndex++, metadata: {} });
      // Keep overlap by retaining the last portion
      const words = current.split(" ");
      const overlapWords = words.slice(-Math.floor(overlap / 5));
      current = overlapWords.join(" ") + " ";
    }
    current += sentence;
  }
  if (current.trim()) {
    chunks.push({ text: current.trim(), index: chunkIndex, metadata: {} });
  }
  return chunks;
}
```
### Batch Embedding with Rate Limiting
```typescript
async function embedBatched(
  texts: string[],
  batchSize = 100,
  delayMs = 200
): Promise<number[][]> {
  const allEmbeddings: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const embeddings = await embed(batch);
    allEmbeddings.push(...embeddings);
    if (i + batchSize < texts.length) {
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  return allEmbeddings;
}
```
## Best Practices
- Use the same embedding model for both indexing and querying; mixing models produces incompatible vector spaces.
- Chunk documents into 200-800 token segments with overlap for best retrieval quality.
- Store the original text alongside the vector so you can return it without a separate lookup.
- Use `text-embedding-3-small` as the default; only upgrade to `large` if retrieval quality is measurably insufficient.
- Create appropriate vector indexes (IVFFlat, HNSW) for datasets over 10,000 vectors.
- Normalize embeddings before storing if your database does not do it automatically.
- Include metadata (source, page number, date) with each vector for filtering at query time.
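On the normalization point: once vectors are unit length, cosine similarity reduces to a plain dot product, which many vector stores exploit for faster scoring. A minimal sketch:

```typescript
// Normalize a vector to unit length (L2 norm = 1).
// After this, cosine similarity between two vectors is just their dot product.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("Cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Example: after normalization, dot(a, b) equals cosineSimilarity(a, b).
const a = normalize([3, 4]); // [0.6, 0.8]
const b = normalize([4, 3]); // [0.8, 0.6]
const similarity = dot(a, b); // 0.6*0.8 + 0.8*0.6 = 0.96
```

Normalize once at indexing time and once per query; never mix normalized and raw vectors in the same index.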
## Core Philosophy
Embeddings are the bridge between human language and machine computation. By projecting text into a continuous vector space where semantic similarity corresponds to geometric proximity, embeddings enable operations that are impossible on raw text: nearest-neighbor search, clustering, classification, and retrieval at scale. The embedding model is not a supporting cast member in a RAG system -- it is the foundation. If the embeddings do not capture the semantic distinctions that matter for your use case, no amount of prompt engineering on the generation side will compensate.
Consistency is the cardinal rule. The embedding model used to encode documents at indexing time must be the same model used to encode queries at search time. Mixing models produces vectors in incompatible spaces, and the resulting similarity scores are meaningless. This constraint extends to preprocessing: if you strip HTML, lowercase text, or truncate to a token limit during indexing, the same transformations must be applied to queries.
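One way to enforce this consistency is to route every embedding call, for both indexing and querying, through a single module that pins the model name and the preprocessing steps. A sketch; the `EmbedFn` callback standing in for the real embeddings API call is an assumption:

```typescript
// Sketch: pin the embedding model and preprocessing in one module so that
// the indexing path and the query path can never drift apart.
// `EmbedFn` stands in for the actual embeddings API call (an assumption).
type EmbedFn = (model: string, inputs: string[]) => Promise<number[][]>;

const EMBEDDING_MODEL = "text-embedding-3-small";

function preprocess(text: string): string {
  // Whatever steps you choose, they must be identical on both paths.
  return text.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim().toLowerCase();
}

async function embedForIndex(texts: string[], embedTexts: EmbedFn): Promise<number[][]> {
  return embedTexts(EMBEDDING_MODEL, texts.map(preprocess));
}

async function embedForQuery(query: string, embedTexts: EmbedFn): Promise<number[]> {
  const [vector] = await embedTexts(EMBEDDING_MODEL, [preprocess(query)]);
  return vector;
}
```

If the model or preprocessing ever changes, the entire corpus must be re-embedded, which is exactly why it pays to keep both behind one constant.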
Chunking strategy determines retrieval quality more than any other design decision. Chunks that are too large dilute the semantic signal with irrelevant context, making it hard for the query embedding to find a close match. Chunks that are too small lose the surrounding context needed to understand the content. The right chunk size depends on the nature of the documents and the specificity of expected queries. There is no universal answer; the only way to find the right balance is to evaluate retrieval quality empirically on representative queries.
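The empirical evaluation mentioned above can be as simple as recall@k over a small hand-labeled query set, re-run for each candidate chunk size. A sketch; the evaluation cases and the `search` function are assumptions:

```typescript
// Sketch: measure recall@k for one chunking configuration against a small
// hand-labeled evaluation set. `search` is assumed to return chunk ids.
interface EvalCase {
  query: string;
  relevantChunkIds: string[]; // ids a human judged relevant for this query
}

async function recallAtK(
  cases: EvalCase[],
  search: (query: string, topK: number) => Promise<string[]>,
  k = 5
): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    const retrieved = await search(c.query, k);
    if (c.relevantChunkIds.some((id) => retrieved.includes(id))) hits++;
  }
  // Fraction of queries with at least one relevant chunk in the top k.
  return hits / cases.length;
}
```

Index the same corpus at two or three chunk sizes, run the same cases against each, and keep the configuration with the best recall.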
## Anti-Patterns
- **Mixing embedding models between indexing and querying**: Using one model (e.g., `text-embedding-3-small`) to embed documents and a different model (e.g., `all-MiniLM-L6-v2`) to embed queries. The resulting vectors exist in different spaces, making similarity scores meaningless.
- **Embedding entire documents without chunking**: Passing a 50-page document as a single string to the embedding model. The model truncates it silently (at its token limit), and the resulting vector represents only the beginning of the document. The rest is invisible to search.
- **Storing vectors without the original text**: Saving only the embedding vector in the database without the source text or a reference to it. When a search returns a match, there is no way to display the content to the user or pass it to an LLM without a costly secondary lookup.
- **No vector index on the database**: Storing embeddings in a standard database column and performing brute-force cosine similarity over all rows. This is O(n) per query and becomes unusable beyond a few thousand documents. Use IVFFlat or HNSW indexes for approximate nearest neighbor search.
- **Re-embedding the entire corpus on every update**: Deleting and re-embedding all documents whenever any document changes, instead of upserting only the modified documents. This wastes API credits and time, and risks introducing inconsistencies if the embedding model has been updated between runs.
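The last anti-pattern suggests its own remedy: record a content hash per document and re-embed only what actually changed. A minimal sketch; the shape of the hash store is an assumption:

```typescript
import { createHash } from "crypto";

// Sketch: skip re-embedding documents whose content hash is unchanged.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// `storedHashes` maps document id -> hash recorded at the last indexing run.
function docsNeedingReembedding(
  docs: { id: string; text: string }[],
  storedHashes: Map<string, string>
): { id: string; text: string }[] {
  return docs.filter((d) => storedHashes.get(d.id) !== contentHash(d.text));
}
```

Unchanged documents keep their existing vectors; only the filtered subset goes back through the embedding API before being upserted.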
## Common Pitfalls
- Using different embedding models for indexing and querying, which produces meaningless similarity scores.
- Embedding entire documents without chunking, which dilutes the semantic signal and exceeds token limits.
- Not batching embedding requests, leading to rate limit errors on large datasets.
- Storing embeddings in a regular database column without a vector index, making search O(n) instead of approximate nearest neighbor.
- Ignoring the token limit of embedding models (8191 tokens for OpenAI v3 models); oversized inputs are silently truncated.
- Using Euclidean distance instead of cosine similarity with non-normalized vectors, producing skewed rankings.
- Re-embedding the entire corpus on every update instead of upserting only changed documents.
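The Euclidean-vs-cosine pitfall follows from a simple identity: for unit vectors, squared Euclidean distance equals 2 - 2 * cosine, so the two metrics agree on ranking only after normalization. A quick sketch:

```typescript
// Sketch: for unit-length vectors, squared Euclidean distance equals
// 2 - 2 * cosine, so the two metrics rank identically ONLY after normalization.
function l2Distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

function unit(v: number[]): number[] {
  const n = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / n);
}

const query = [1, 0];
const sameDirectionButLong = [10, 0]; // same direction (same "meaning"), larger magnitude
const differentDirection = [0.5, 0.5];

// Raw Euclidean distance ranks the same-direction vector as the WORSE match:
const rawSame = l2Distance(query, sameDirectionButLong); // 9
const rawDiff = l2Distance(query, differentDirection);   // ~0.707

// After normalization the ranking is correct:
const dSame = l2Distance(unit(query), unit(sameDirectionButLong)); // 0
const dDiff = l2Distance(unit(query), unit(differentDirection));   // ~0.765
```

Embedding magnitude is rarely meaningful, so either normalize before using Euclidean distance or use cosine similarity directly.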
## Related Skills
- **Anthropic API**: Anthropic Claude API integration for messages, streaming, and tool use
- **Function Calling**: Function/tool calling patterns for connecting LLMs to external APIs and data sources
- **LangChain**: LangChain orchestration for chains, agents, memory, and retrieval workflows
- **OpenAI API**: OpenAI API integration patterns for chat completions, embeddings, and assistants
- **RAG Pipeline**: Building retrieval-augmented generation pipelines with document ingestion, retrieval, and synthesis
- **Streaming**: Streaming LLM responses with SSE, WebSockets, and backpressure handling