
RAG Pipeline

Building retrieval-augmented generation pipelines with document ingestion, retrieval, and synthesis

Quick Summary
You are an expert in building retrieval-augmented generation (RAG) pipelines for grounding LLM responses in external knowledge.

## Key Points

- Chunk documents at 500-1000 tokens with 10-20% overlap to preserve context across boundaries.
- Include source metadata (filename, page number, URL) with every chunk for citation.
- Use the same embedding model for ingestion and query; never mix models.
- Set `temperature: 0` for the generation step to reduce hallucination.
- Instruct the model to say "I don't know" when the retrieved context is insufficient.
- Implement hybrid search (vector + keyword) for better recall on technical or domain-specific queries.
- Re-rank retrieved chunks before feeding them to the LLM to maximize context quality.
- Version your vector index alongside your document corpus so you can reproduce results.
- Monitor retrieval quality separately from generation quality during evaluation.

RAG Pipeline — LLM Integration

You are an expert in building retrieval-augmented generation (RAG) pipelines for grounding LLM responses in external knowledge.

Overview

RAG combines information retrieval with language model generation. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base and includes them as context in the prompt. This produces more accurate, up-to-date, and verifiable answers. A RAG pipeline involves document ingestion, chunking, embedding, indexing, retrieval, and augmented generation.

Core Concepts

RAG Pipeline Architecture

User Query
    |
    v
[1. Embed Query] --> Vector DB --> [2. Retrieve Top-K Chunks]
    |                                       |
    v                                       v
[3. Build Prompt with Context] --> [4. Generate Answer with LLM]
    |
    v
Response (with citations)

Document Ingestion Pipeline

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as fs from "fs";
import pdf from "pdf-parse"; // pdf-parse exports a single function as its default export

// Load documents from various sources
async function loadDocument(filePath: string): Promise<string> {
  if (filePath.endsWith(".pdf")) {
    const buffer = fs.readFileSync(filePath);
    const data = await pdf(buffer);
    return data.text;
  }
  if (filePath.endsWith(".md") || filePath.endsWith(".txt")) {
    return fs.readFileSync(filePath, "utf-8");
  }
  throw new Error(`Unsupported file type: ${filePath}`);
}

// Chunk with metadata
interface DocumentChunk {
  content: string;
  metadata: {
    source: string;
    chunkIndex: number;
    totalChunks: number;
  };
}

async function chunkDocument(text: string, source: string): Promise<DocumentChunk[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 800,
    chunkOverlap: 200,
    separators: ["\n\n", "\n", ". ", " ", ""],
  });

  const chunks = await splitter.splitText(text); // splitText returns a Promise
  return chunks.map((content, i) => ({
    content,
    metadata: {
      source,
      chunkIndex: i,
      totalChunks: chunks.length,
    },
  }));
}
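Note that `RecursiveCharacterTextSplitter` counts `chunkSize` and `chunkOverlap` in characters, not tokens. To keep chunks within the 500-1000 token guideline without pulling in a tokenizer dependency, a rough character-to-token conversion is enough (a sketch; the ~4 characters-per-token ratio is a heuristic for English prose, and both helper names are hypothetical):

```typescript
// Heuristic: English prose averages roughly 4 characters per token,
// so an 800-character chunk is on the order of 200 tokens.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Character budget for a given token target, e.g. a 750-token
// chunk target suggests chunkSize: 3000 in the splitter config.
function charsForTokens(targetTokens: number): number {
  return targetTokens * CHARS_PER_TOKEN;
}
```

For precise budgets (e.g. staying under an embedding model's input limit), use a real tokenizer such as tiktoken instead of this approximation.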

Embedding and Indexing

import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("knowledge-base");

async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

async function ingestDocuments(chunks: DocumentChunk[]): Promise<void> {
  const batchSize = 100;

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map((c) => c.content);
    const embeddings = await embedTexts(texts);

    const vectors = batch.map((chunk, j) => ({
      id: `${chunk.metadata.source}-${chunk.metadata.chunkIndex}`,
      values: embeddings[j],
      metadata: {
        content: chunk.content,
        source: chunk.metadata.source,
        chunkIndex: chunk.metadata.chunkIndex,
      },
    }));

    await index.upsert(vectors);
  }
}

Retrieval

interface RetrievedChunk {
  content: string;
  source: string;
  score: number;
}

async function retrieve(query: string, topK = 5): Promise<RetrievedChunk[]> {
  const [queryEmbedding] = await embedTexts([query]);

  const results = await index.query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
  });

  return results.matches.map((match) => ({
    content: match.metadata?.content as string,
    source: match.metadata?.source as string,
    score: match.score ?? 0,
  }));
}

Augmented Generation

async function ragAnswer(question: string): Promise<{ answer: string; sources: string[] }> {
  const chunks = await retrieve(question, 5);

  const context = chunks
    .map((c, i) => `[${i + 1}] (Source: ${c.source})\n${c.content}`)
    .join("\n\n");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer the user's question based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Cite sources using [1], [2], etc.

Context:
${context}`,
      },
      { role: "user", content: question },
    ],
    temperature: 0,
  });

  return {
    answer: response.choices[0].message.content ?? "",
    sources: [...new Set(chunks.map((c) => c.source))],
  };
}

Implementation Patterns

Hybrid Search (Vector + Keyword)

async function hybridSearch(
  query: string,
  topK = 5
): Promise<RetrievedChunk[]> {
  // Vector search
  const vectorResults = await retrieve(query, topK);

  // Keyword search (BM25-style via Postgres full-text search).
  // `db` is assumed to be an already-configured Postgres client (e.g. a node-postgres Pool).
  const keywordResults = await db.query(
    `SELECT content, source, ts_rank(to_tsvector(content), plainto_tsquery($1)) AS score
     FROM documents
     WHERE to_tsvector(content) @@ plainto_tsquery($1)
     ORDER BY score DESC
     LIMIT $2`,
    [query, topK]
  );

  // Reciprocal Rank Fusion
  const scores = new Map<string, number>();

  vectorResults.forEach((r, i) => {
    const key = `${r.source}-${r.content.slice(0, 50)}`;
    scores.set(key, (scores.get(key) ?? 0) + 1 / (i + 60));
  });

  keywordResults.rows.forEach((r: any, i: number) => {
    const key = `${r.source}-${r.content.slice(0, 50)}`;
    scores.set(key, (scores.get(key) ?? 0) + 1 / (i + 60));
  });

  // Merge and re-rank
  const allChunks = [...vectorResults, ...keywordResults.rows.map((r: any) => ({
    content: r.content,
    source: r.source,
    score: r.score,
  }))];

  const unique = new Map<string, RetrievedChunk>();
  for (const chunk of allChunks) {
    const key = `${chunk.source}-${chunk.content.slice(0, 50)}`;
    if (!unique.has(key)) unique.set(key, chunk);
  }

  return [...unique.values()]
    .sort((a, b) => {
      const keyA = `${a.source}-${a.content.slice(0, 50)}`;
      const keyB = `${b.source}-${b.content.slice(0, 50)}`;
      return (scores.get(keyB) ?? 0) - (scores.get(keyA) ?? 0);
    })
    .slice(0, topK);
}

Query Transformation

async function expandQuery(originalQuery: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "Generate 3 alternative search queries for the given question. Return one per line, no numbering.",
      },
      { role: "user", content: originalQuery },
    ],
    temperature: 0.7,
  });

  const alternatives = response.choices[0].message.content?.split("\n").filter(Boolean) ?? [];
  return [originalQuery, ...alternatives];
}

async function multiQueryRetrieve(question: string, topK = 5): Promise<RetrievedChunk[]> {
  const queries = await expandQuery(question);

  const allResults = await Promise.all(queries.map((q) => retrieve(q, topK)));
  const flat = allResults.flat();

  // Deduplicate by content
  const seen = new Set<string>();
  return flat.filter((chunk) => {
    const key = chunk.content.slice(0, 100);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  }).slice(0, topK);
}

Contextual Compression / Re-Ranking

async function rerankChunks(
  query: string,
  chunks: RetrievedChunk[],
  topK = 3
): Promise<RetrievedChunk[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Given the query and the following passages, rank them by relevance.
Return JSON: {"rankings": [passage indices (0-based) ordered from most to least relevant]}.
Only include passages that are actually relevant to the query.`,
      },
      {
        role: "user",
        content: `Query: ${query}\n\nPassages:\n${chunks
          .map((c, i) => `[${i}]: ${c.content}`)
          .join("\n\n")}`,
      },
    ],
    response_format: { type: "json_object" },
    temperature: 0,
  });

  const parsed = JSON.parse(response.choices[0].message.content!);
  const indices: number[] = parsed.rankings ?? parsed.indices ?? [];

  return indices.slice(0, topK).map((i) => chunks[i]).filter(Boolean);
}

Evaluation Metrics

interface RAGEvaluation {
  faithfulness: number; // Does the answer stick to the context?
  relevance: number;    // Is the retrieved context relevant to the question?
  correctness: number;  // Is the answer factually correct?
}

async function evaluateRAGResponse(
  question: string,
  context: string,
  answer: string,
  groundTruth?: string
): Promise<RAGEvaluation> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Evaluate the RAG response on three criteria (0-1 scale):
1. Faithfulness: Does the answer only use information from the context?
2. Relevance: Is the context relevant to the question?
3. Correctness: Is the answer factually correct?${groundTruth ? ` Ground truth: ${groundTruth}` : ""}
Return JSON: {"faithfulness": float, "relevance": float, "correctness": float}`,
      },
      {
        role: "user",
        content: `Question: ${question}\nContext: ${context}\nAnswer: ${answer}`,
      },
    ],
    response_format: { type: "json_object" },
    temperature: 0,
  });

  return JSON.parse(response.choices[0].message.content!);
}

Best Practices

  • Chunk documents at 500-1000 tokens with 10-20% overlap to preserve context across boundaries.
  • Include source metadata (filename, page number, URL) with every chunk for citation.
  • Use the same embedding model for ingestion and query; never mix models.
  • Set temperature: 0 for the generation step to reduce hallucination.
  • Instruct the model to say "I don't know" when the retrieved context is insufficient.
  • Implement hybrid search (vector + keyword) for better recall on technical or domain-specific queries.
  • Re-rank retrieved chunks before feeding them to the LLM to maximize context quality.
  • Version your vector index alongside your document corpus so you can reproduce results.
  • Monitor retrieval quality separately from generation quality during evaluation.
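The index-versioning practice above can be sketched by deriving a version tag from a hash of the corpus contents, then ingesting into a namespace named after that tag (a minimal sketch using Node's crypto module; the namespace scheme is an assumption, not part of any particular vector database's API):

```typescript
import { createHash } from "crypto";

// Derive a deterministic version tag from the corpus contents.
// Ingesting into a namespace named after this tag lets you
// reproduce retrieval results against that exact corpus snapshot.
function corpusVersion(documents: string[]): string {
  const hash = createHash("sha256");
  // Sort so the tag is independent of ingestion order.
  for (const doc of [...documents].sort()) {
    hash.update(doc);
    hash.update("\0"); // separator to avoid boundary collisions
  }
  return hash.digest("hex").slice(0, 12);
}

// e.g. write the tag into your index/namespace name:
//   `knowledge-base-v-${corpusVersion(allDocumentTexts)}`
```

Storing the tag alongside the document corpus (e.g. in version control) makes "which index answered this query?" answerable after the fact.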

Core Philosophy

A RAG pipeline is only as good as its retrieval. The most sophisticated generation prompt cannot compensate for retrieving the wrong documents. If the retrieved chunks do not contain the answer, the model will either hallucinate one or correctly say "I don't know." Both outcomes represent a retrieval failure, not a generation failure. Invest the majority of your RAG engineering effort in chunking strategy, embedding model selection, index tuning, and retrieval evaluation before optimizing the generation prompt.

RAG is not a monolithic feature -- it is a pipeline with distinct stages, and each stage can fail independently. Document ingestion can produce bad chunks. Embedding can silently truncate long inputs. The vector index can return irrelevant results. The generation prompt can ignore the context or hallucinate beyond it. Treating RAG as a pipeline means monitoring and evaluating each stage separately, not just measuring end-to-end answer quality.

Ground the model explicitly and verifiably. The generation prompt must instruct the model to base its answer only on the provided context and to cite sources. Without these instructions, the model freely mixes retrieved content with parametric knowledge, producing answers that are partially grounded and partially hallucinated -- the worst of both worlds because the user cannot distinguish which parts are supported by evidence. Requiring citations forces the model to anchor each claim to a specific source, making verification possible.

Anti-Patterns

  • Optimizing the generation prompt before fixing retrieval: Spending hours refining the system prompt while retrieval consistently returns irrelevant documents. If the top-5 retrieved chunks do not contain the answer, no prompt will fix the output. Measure retrieval recall and precision first.

  • No fallback for empty or low-relevance retrieval: Proceeding with generation even when the retrieval step returns no results or results with very low similarity scores. The model will hallucinate an answer with full confidence. Implement a similarity threshold below which the system returns "I don't have information about this."

  • One chunk size for all document types: Using the same 500-token chunk size for API reference docs, legal contracts, and conversational transcripts. Different document types have different information density and structure; chunk sizes and splitting strategies should vary accordingly.

  • Never evaluating retrieval quality separately: Measuring only end-to-end answer quality without instrumenting whether the retrieval step returned relevant documents. A correct answer might come from the model's parametric knowledge despite bad retrieval, masking a pipeline weakness that will fail on the next query.

  • Treating RAG as a set-and-forget system: Building a RAG pipeline, deploying it, and never updating the vector index when source documents change. Stale indexes serve outdated information and erode user trust. Implement incremental index updates triggered by document changes.
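The low-relevance fallback described above is a small guard placed between retrieval and generation (a sketch; the 0.3 threshold is an assumption to tune against your own embedding model's score distribution, and the function name is hypothetical):

```typescript
interface RetrievedChunk {
  content: string;
  source: string;
  score: number;
}

const MIN_SIMILARITY = 0.3; // tune per embedding model and distance metric

// Returns null when retrieval is too weak to ground an answer, so the
// caller can respond "I don't have information about this" instead of
// letting the model hallucinate from irrelevant context.
function groundedChunks(
  chunks: RetrievedChunk[],
  threshold = MIN_SIMILARITY
): RetrievedChunk[] | null {
  const relevant = chunks.filter((c) => c.score >= threshold);
  return relevant.length > 0 ? relevant : null;
}
```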

Common Pitfalls

  • Using chunks that are too large, diluting the relevant signal with irrelevant text.
  • Using chunks that are too small, losing context needed to understand the content.
  • Not including the source document in the prompt, so the model cannot cite its sources.
  • Stuffing too many chunks into the context window, exceeding token limits or drowning the signal.
  • Not handling the case where retrieval returns zero relevant results, causing the model to hallucinate.
  • Embedding queries and documents with different models, producing incomparable vectors.
  • Skipping evaluation; without measuring faithfulness and relevance, you cannot improve the pipeline.
  • Not updating the vector index when source documents change, serving stale information.
  • Treating RAG as a silver bullet; some questions require reasoning beyond what retrieval can provide.
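Because the ingestion code assigns deterministic vector IDs (`${source}-${chunkIndex}`), the stale-index pitfall above has a cheap fix: when a document is re-chunked, delete exactly the vectors a shrunken document leaves behind before upserting the new chunks (a sketch; it assumes you persist each document's previous chunk count somewhere):

```typescript
// IDs of vectors that become stale when a re-chunked document
// shrinks from oldCount to newCount chunks. Upserts overwrite
// IDs 0..newCount-1; these leftover IDs must be deleted explicitly,
// then newCount recorded for the next update.
function staleChunkIds(
  source: string,
  oldCount: number,
  newCount: number
): string[] {
  const stale: string[] = [];
  for (let i = newCount; i < oldCount; i++) {
    stale.push(`${source}-${i}`);
  }
  return stale;
}

// e.g. await index.deleteMany(staleChunkIds("guide.pdf", 12, 9));
```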
