RAG Pipeline
Building retrieval-augmented generation pipelines with document ingestion, retrieval, and synthesis
You are an expert in building retrieval-augmented generation (RAG) pipelines for grounding LLM responses in external knowledge.
Overview
RAG combines information retrieval with language model generation. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base and includes them as context in the prompt. This produces more accurate, up-to-date, and verifiable answers. A RAG pipeline involves document ingestion, chunking, embedding, indexing, retrieval, and augmented generation.
Core Concepts
RAG Pipeline Architecture
User Query
|
v
[1. Embed Query] --> Vector DB --> [2. Retrieve Top-K Chunks]
| |
v v
[3. Build Prompt with Context] --> [4. Generate Answer with LLM]
|
v
Response (with citations)
Document Ingestion Pipeline
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as fs from "fs";
import pdf from "pdf-parse"; // pdf-parse exports a callable function as its default export
// Load documents from various sources
async function loadDocument(filePath: string): Promise<string> {
if (filePath.endsWith(".pdf")) {
const buffer = fs.readFileSync(filePath);
const data = await pdf(buffer);
return data.text;
}
if (filePath.endsWith(".md") || filePath.endsWith(".txt")) {
return fs.readFileSync(filePath, "utf-8");
}
throw new Error(`Unsupported file type: ${filePath}`);
}
// Chunk with metadata
interface DocumentChunk {
content: string;
metadata: {
source: string;
chunkIndex: number;
totalChunks: number;
};
}
async function chunkDocument(text: string, source: string): Promise<DocumentChunk[]> {
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 800,
chunkOverlap: 200,
separators: ["\n\n", "\n", ". ", " ", ""],
});
// splitText is async in LangChain JS, so it must be awaited
const chunks = await splitter.splitText(text);
return chunks.map((content, i) => ({
content,
metadata: {
source,
chunkIndex: i,
totalChunks: chunks.length,
},
}));
}
Embedding and Indexing
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("knowledge-base");
async function embedTexts(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return response.data.map((d) => d.embedding);
}
async function ingestDocuments(chunks: DocumentChunk[]): Promise<void> {
const batchSize = 100;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const texts = batch.map((c) => c.content);
const embeddings = await embedTexts(texts);
const vectors = batch.map((chunk, j) => ({
id: `${chunk.metadata.source}-${chunk.metadata.chunkIndex}`,
values: embeddings[j],
metadata: {
content: chunk.content,
source: chunk.metadata.source,
chunkIndex: chunk.metadata.chunkIndex,
},
}));
await index.upsert(vectors);
}
}
Retrieval
interface RetrievedChunk {
content: string;
source: string;
score: number;
}
async function retrieve(query: string, topK = 5): Promise<RetrievedChunk[]> {
const [queryEmbedding] = await embedTexts([query]);
const results = await index.query({
vector: queryEmbedding,
topK,
includeMetadata: true,
});
return results.matches.map((match) => ({
content: match.metadata?.content as string,
source: match.metadata?.source as string,
score: match.score ?? 0,
}));
}
Augmented Generation
async function ragAnswer(question: string): Promise<{ answer: string; sources: string[] }> {
const chunks = await retrieve(question, 5);
const context = chunks
.map((c, i) => `[${i + 1}] (Source: ${c.source})\n${c.content}`)
.join("\n\n");
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Answer the user's question based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Cite sources using [1], [2], etc.
Context:
${context}`,
},
{ role: "user", content: question },
],
temperature: 0,
});
return {
answer: response.choices[0].message.content ?? "",
sources: [...new Set(chunks.map((c) => c.source))],
};
}
Implementation Patterns
Hybrid Search (Vector + Keyword)
async function hybridSearch(
query: string,
topK = 5
): Promise<RetrievedChunk[]> {
// Vector search
const vectorResults = await retrieve(query, topK);
// Keyword search (BM25-style ranking via Postgres full-text search);
// `db` is assumed to be a pg Pool or Client configured elsewhere
const keywordResults = await db.query(
`SELECT content, source, ts_rank(to_tsvector(content), plainto_tsquery($1)) AS score
FROM documents
WHERE to_tsvector(content) @@ plainto_tsquery($1)
ORDER BY score DESC
LIMIT $2`,
[query, topK]
);
// Reciprocal Rank Fusion
const scores = new Map<string, number>();
vectorResults.forEach((r, i) => {
const key = `${r.source}-${r.content.slice(0, 50)}`;
scores.set(key, (scores.get(key) ?? 0) + 1 / (i + 60));
});
keywordResults.rows.forEach((r: any, i: number) => {
const key = `${r.source}-${r.content.slice(0, 50)}`;
scores.set(key, (scores.get(key) ?? 0) + 1 / (i + 60));
});
// Merge and re-rank
const allChunks = [...vectorResults, ...keywordResults.rows.map((r: any) => ({
content: r.content,
source: r.source,
score: r.score,
}))];
const unique = new Map<string, RetrievedChunk>();
for (const chunk of allChunks) {
const key = `${chunk.source}-${chunk.content.slice(0, 50)}`;
if (!unique.has(key)) unique.set(key, chunk);
}
return [...unique.values()]
.sort((a, b) => {
const keyA = `${a.source}-${a.content.slice(0, 50)}`;
const keyB = `${b.source}-${b.content.slice(0, 50)}`;
return (scores.get(keyB) ?? 0) - (scores.get(keyA) ?? 0);
})
.slice(0, topK);
}
Query Transformation
async function expandQuery(originalQuery: string): Promise<string[]> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content:
"Generate 3 alternative search queries for the given question. Return one per line, no numbering.",
},
{ role: "user", content: originalQuery },
],
temperature: 0.7,
});
const alternatives = response.choices[0].message.content?.split("\n").filter(Boolean) ?? [];
return [originalQuery, ...alternatives];
}
async function multiQueryRetrieve(question: string, topK = 5): Promise<RetrievedChunk[]> {
const queries = await expandQuery(question);
const allResults = await Promise.all(queries.map((q) => retrieve(q, topK)));
const flat = allResults.flat();
// Deduplicate by content
const seen = new Set<string>();
return flat.filter((chunk) => {
const key = chunk.content.slice(0, 100);
if (seen.has(key)) return false;
seen.add(key);
return true;
}).slice(0, topK);
}
Contextual Compression / Re-Ranking
async function rerankChunks(
query: string,
chunks: RetrievedChunk[],
topK = 3
): Promise<RetrievedChunk[]> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Given the query and the following passages, rank them by relevance.
Return a JSON array of passage indices (0-based) from most to least relevant.
Only include passages that are actually relevant to the query.`,
},
{
role: "user",
content: `Query: ${query}\n\nPassages:\n${chunks
.map((c, i) => `[${i}]: ${c.content}`)
.join("\n\n")}`,
},
],
response_format: { type: "json_object" },
temperature: 0,
});
const parsed = JSON.parse(response.choices[0].message.content!);
const indices: number[] = parsed.rankings ?? parsed.indices ?? [];
return indices.slice(0, topK).map((i) => chunks[i]).filter(Boolean);
}
Evaluation Metrics
interface RAGEvaluation {
faithfulness: number; // Does the answer stick to the context?
relevance: number; // Is the retrieved context relevant to the question?
correctness: number; // Is the answer factually correct?
}
async function evaluateRAGResponse(
question: string,
context: string,
answer: string,
groundTruth?: string
): Promise<RAGEvaluation> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: `Evaluate the RAG response on three criteria (0-1 scale):
1. Faithfulness: Does the answer only use information from the context?
2. Relevance: Is the context relevant to the question?
3. Correctness: Is the answer factually correct?${groundTruth ? ` Ground truth: ${groundTruth}` : ""}
Return JSON: {"faithfulness": float, "relevance": float, "correctness": float}`,
},
{
role: "user",
content: `Question: ${question}\nContext: ${context}\nAnswer: ${answer}`,
},
],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content!);
}
Best Practices
- Chunk documents at 500-1000 tokens with 10-20% overlap to preserve context across boundaries.
- Include source metadata (filename, page number, URL) with every chunk for citation.
- Use the same embedding model for ingestion and query; never mix models.
- Set `temperature: 0` for the generation step to reduce hallucination.
- Instruct the model to say "I don't know" when the retrieved context is insufficient.
- Implement hybrid search (vector + keyword) for better recall on technical or domain-specific queries.
- Re-rank retrieved chunks before feeding them to the LLM to maximize context quality.
- Version your vector index alongside your document corpus so you can reproduce results.
- Monitor retrieval quality separately from generation quality during evaluation.
Core Philosophy
A RAG pipeline is only as good as its retrieval. The most sophisticated generation prompt cannot compensate for retrieving the wrong documents. If the retrieved chunks do not contain the answer, the model will either hallucinate one or correctly say "I don't know." Both outcomes represent a retrieval failure, not a generation failure. Invest the majority of your RAG engineering effort in chunking strategy, embedding model selection, index tuning, and retrieval evaluation before optimizing the generation prompt.
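Retrieval can be evaluated on its own, before any generation is involved. As a minimal sketch, a recall@k helper scores how many of the known-relevant source documents appear in the top-k retrieved sources (the gold-label format here is an assumption; adapt it to your own eval set):

```typescript
// Fraction of gold source documents that appear among the top-k retrieved sources.
// Returns 1 for an empty gold set (nothing was required, nothing was missed).
function recallAtK(retrievedSources: string[], goldSources: string[], k: number): number {
  if (goldSources.length === 0) return 1;
  const topK = new Set(retrievedSources.slice(0, k));
  const hits = goldSources.filter((s) => topK.has(s)).length;
  return hits / goldSources.length;
}
```

Running this over a held-out question set gives a retrieval score that moves independently of prompt changes, which is exactly what makes retrieval-first debugging possible.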
RAG is not a monolithic feature -- it is a pipeline with distinct stages, and each stage can fail independently. Document ingestion can produce bad chunks. Embedding can silently truncate long inputs. The vector index can return irrelevant results. The generation prompt can ignore the context or hallucinate beyond it. Treating RAG as a pipeline means monitoring and evaluating each stage separately, not just measuring end-to-end answer quality.
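One cheap per-stage check is guarding against silent embedding truncation before calling the embeddings API. The ~4-characters-per-token estimate is a rough heuristic, and the 8191-token limit matches OpenAI's text-embedding-3 models; a real tokenizer such as tiktoken would be more precise:

```typescript
// Reject chunks that would be silently truncated by the embedding model.
// ~4 chars/token is a rough English-text heuristic; 8191 is the input
// limit for OpenAI's text-embedding-3 models (adjust per model).
function assertEmbeddable(texts: string[], maxTokens = 8191): void {
  for (const [i, text] of texts.entries()) {
    const approxTokens = Math.ceil(text.length / 4);
    if (approxTokens > maxTokens) {
      throw new Error(
        `Chunk ${i} is ~${approxTokens} tokens; it would be truncated at ${maxTokens}.`
      );
    }
  }
}
```

Calling this at the top of the ingestion batch loop turns a silent data-quality failure into a loud, attributable one.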
Ground the model explicitly and verifiably. The generation prompt must instruct the model to base its answer only on the provided context and to cite sources. Without these instructions, the model freely mixes retrieved content with parametric knowledge, producing answers that are partially grounded and partially hallucinated -- the worst of both worlds because the user cannot distinguish which parts are supported by evidence. Requiring citations forces the model to anchor each claim to a specific source, making verification possible.
Anti-Patterns
- Optimizing the generation prompt before fixing retrieval: Spending hours refining the system prompt while retrieval consistently returns irrelevant documents. If the top-5 retrieved chunks do not contain the answer, no prompt will fix the output. Measure retrieval recall and precision first.
- No fallback for empty or low-relevance retrieval: Proceeding with generation even when the retrieval step returns no results or results with very low similarity scores. The model will hallucinate an answer with full confidence. Implement a similarity threshold below which the system returns "I don't have information about this."
- One chunk size for all document types: Using the same 500-token chunk size for API reference docs, legal contracts, and conversational transcripts. Different document types have different information density and structure; chunk sizes and splitting strategies should vary accordingly.
- Never evaluating retrieval quality separately: Measuring only end-to-end answer quality without instrumenting whether the retrieval step returned relevant documents. A correct answer might come from the model's parametric knowledge despite bad retrieval, masking a pipeline weakness that will fail on the next query.
- Treating RAG as a set-and-forget system: Building a RAG pipeline, deploying it, and never updating the vector index when source documents change. Stale indexes serve outdated information and erode user trust. Implement incremental index updates triggered by document changes.
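The low-relevance fallback can be sketched as a small gate in front of generation. The 0.75 threshold below is illustrative only; the right value depends on your embedding model and similarity metric and should be tuned against labeled queries:

```typescript
interface RetrievedChunk {
  content: string;
  source: string;
  score: number;
}

// Return usable chunks, or null when nothing clears the similarity bar,
// so the caller can answer "I don't have information about this" instead
// of letting the model hallucinate from weak context.
function gateRetrieval(chunks: RetrievedChunk[], minScore = 0.75): RetrievedChunk[] | null {
  const usable = chunks.filter((c) => c.score >= minScore);
  return usable.length > 0 ? usable : null;
}
```

The generation step then runs only when the gate returns a non-null list, making the "no information" path an explicit branch rather than an accident.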
Common Pitfalls
- Using chunks that are too large, diluting the relevant signal with irrelevant text.
- Using chunks that are too small, losing context needed to understand the content.
- Not including the source document in the prompt, so the model cannot cite its sources.
- Stuffing too many chunks into the context window, exceeding token limits or drowning the signal.
- Not handling the case where retrieval returns zero relevant results, causing the model to hallucinate.
- Embedding queries and documents with different models, producing incomparable vectors.
- Skipping evaluation; without measuring faithfulness and relevance, you cannot improve the pipeline.
- Not updating the vector index when source documents change, serving stale information.
- Treating RAG as a silver bullet; some questions require reasoning beyond what retrieval can provide.
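The context-stuffing pitfall can be addressed with a simple budget trimmer that keeps only the highest-ranked chunks fitting a rough token estimate. Both the ~4-characters-per-token heuristic and the default budget are illustrative assumptions, not model-specific limits:

```typescript
interface RetrievedChunk {
  content: string;
  source: string;
  score: number;
}

// Keep the highest-ranked chunks (input assumed sorted by relevance) that
// fit within an approximate token budget (~4 chars/token heuristic).
function fitContextBudget(chunks: RetrievedChunk[], maxTokens = 4000): RetrievedChunk[] {
  const kept: RetrievedChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = Math.ceil(chunk.content.length / 4);
    if (used + cost > maxTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```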
Related Skills
Anthropic API
Anthropic Claude API integration for messages, streaming, and tool use
Embeddings
Text embeddings and semantic search with vector databases for LLM applications
Function Calling
Function/tool calling patterns for connecting LLMs to external APIs and data sources
LangChain
LangChain orchestration for chains, agents, memory, and retrieval workflows
OpenAI API
OpenAI API integration patterns for chat completions, embeddings, and assistants
Streaming
Streaming LLM responses with SSE, WebSockets, and backpressure handling