
rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.


RAG Fundamentals

Build knowledge-grounded LLM applications by combining retrieval with generation.


Why RAG Over Fine-Tuning

| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Swap documents, no retraining | Retrain on every update |
| Hallucination control | Citable sources in context | Model may still hallucinate |
| Cost | Embedding + vector DB | GPU hours for training |
| Latency | Retrieval adds ~100-300ms | No retrieval overhead |
| Domain breadth | Scales to millions of docs | Limited by training data size |
| Transparency | Retrieved chunks are auditable | Black-box parametric memory |

When to use fine-tuning instead: style/tone alignment, structured output formatting, latency-critical paths where retrieval overhead is unacceptable, or when the knowledge is small and static.

When to combine both: Fine-tune for output format and reasoning style, then RAG for factual grounding. This hybrid approach is often the production sweet spot.


Core Architecture

User Query
    |
    v
[Query Processing] --> query rewriting, expansion, HyDE
    |
    v
[Retriever] --> vector search, BM25, hybrid
    |
    v
[Re-Ranker] --> cross-encoder scoring (optional)
    |
    v
[Context Assembly] --> chunk selection, token budget
    |
    v
[Generator (LLM)] --> answer with citations
    |
    v
[Post-Processing] --> fact verification, formatting
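
The query-processing step above mentions HyDE (Hypothetical Document Embeddings) without showing it. A minimal sketch of the idea follows, where `generate` and `embed` are hypothetical stand-ins for real LLM and embedding calls:

```python
# HyDE sketch: embed a hypothetical *answer* instead of the raw query,
# because answer-like text sits closer to answer chunks in embedding space.

def generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call.
    return "Authentication uses signed session tokens validated per request."

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for an embedding call; returns a dummy vector.
    return [float(ord(c) % 7) for c in text[:8]]

def hyde_query_vector(query: str) -> list[float]:
    # 1. Ask the LLM to draft a hypothetical answer to the query.
    hypothetical_doc = generate(f"Write a short passage answering: {query}")
    # 2. Embed that draft and search the vector store with it.
    return embed(hypothetical_doc)

vec = hyde_query_vector("How does authentication work?")
```

In a real pipeline `vec` would then be passed to the retriever in place of the plain query embedding.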

Stage 1: Indexing (Offline)

# Conceptual indexing pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load documents
documents = load_documents("./knowledge_base/")

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

Stage 2: Retrieval (Online)

# Dense retrieval with metadata filtering
results = vectorstore.similarity_search(
    query="How does authentication work?",
    k=5,
    filter={"source": "auth-docs"}
)

# Hybrid retrieval: combine dense + sparse
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

bm25 = BM25Retriever.from_documents(chunks, k=5)
dense = vectorstore.as_retriever(search_kwargs={"k": 5})

hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.3, 0.7]
)
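
The optional re-ranking stage scores each (query, chunk) pair with a cross-encoder and keeps only the best candidates. The sketch below is runnable with a token-overlap `score` as a hypothetical stand-in for a real cross-encoder model:

```python
# Re-ranking sketch: score every candidate against the query, sort
# descending, keep the top_n. Swap `score` for a cross-encoder in practice.

def score(query: str, chunk: str) -> float:
    # Hypothetical stand-in scorer: fraction of query tokens found in chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_n]

candidates = [
    "Billing is handled monthly.",
    "Password reset requires a verified email.",
    "To reset your password, open account settings.",
]
top = rerank("how do I reset my password", candidates, top_n=2)
```

The point of the pattern: retrieve generously (e.g. k=20) with the fast bi-encoder, then let the slower, more accurate scorer pick the final few.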

Stage 3: Generation

from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Provide your answer with citations in [Source: filename] format.
""")

# Assemble and generate (llm is any chat model; results comes from Stage 2)
context = "\n\n".join(doc.page_content for doc in results)
response = llm.invoke(RAG_PROMPT.format(context=context, question=query))
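
The plain join above ignores the token budget named in the architecture diagram. A greedy packing sketch, using whitespace tokens as a rough stand-in for a real tokenizer such as tiktoken, might look like:

```python
# Context assembly under a token budget: pack highest-ranked chunks first,
# stop before the budget is exceeded rather than truncating mid-chunk.

def assemble_context(chunks: list[str], max_tokens: int = 100) -> str:
    selected, used = [], 0
    for chunk in chunks:  # chunks assumed ordered by retrieval score
        cost = len(chunk.split())  # approximation; use a real tokenizer
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

context = assemble_context(["alpha beta gamma", "delta epsilon", "zeta"],
                           max_tokens=4)
```

Stopping at the budget keeps the prompt cost bounded even when retrieval returns many long chunks.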

Component Selection Guide

Embedding Models

  • OpenAI text-embedding-3-small: Good default, 1536 dims, cheap
  • Cohere embed-v3: Strong multilingual, supports input types
  • BGE-large-en-v1.5: Best open-source, self-hostable
  • Choose based on: language support, latency, cost, self-hosting needs

Vector Databases

  • ChromaDB: Local dev, simple API, in-memory option
  • Pinecone: Managed, serverless option, good for production
  • Qdrant: Self-hosted or cloud, rich filtering, fast
  • pgvector: Already using Postgres? Add vector search to it

LLMs for Generation

  • Claude 3.5 Sonnet / Opus: Strong at following context, low hallucination
  • GPT-4o: Reliable, good at citations
  • Llama 3.1 70B: Self-hosted option, good quality
  • Match model to: context window needs, cost, latency, privacy requirements

Latency Budget

Typical end-to-end RAG latency breakdown:

Query embedding:      20-50ms   (API call)
Vector search:        10-50ms   (depends on DB and index size)
Re-ranking:           50-200ms  (cross-encoder, optional)
Context assembly:     5-10ms    (local processing)
LLM generation:       500-3000ms (depends on model and output length)
────────────────────────────────
Total:                600-3300ms

Optimization levers:

  • Cache frequent query embeddings
  • Use approximate nearest neighbor (ANN) for large indexes
  • Skip re-ranking for latency-critical paths
  • Stream LLM output to reduce perceived latency
  • Pre-compute embeddings for known query patterns

Evaluation Metrics

Retrieval Quality

  • Context Precision: Are the retrieved chunks relevant to the query?
  • Context Recall: Did retrieval find all the information needed to answer?
  • Mean Reciprocal Rank (MRR): How high does the first relevant chunk rank?
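
MRR is simple to compute directly: for each query, take 1/rank of the first relevant chunk in the ranked results (contributing 0 if none is relevant), then average over queries. A minimal sketch:

```python
# Mean Reciprocal Rank over a batch of queries.
# results[i] is a relevance flag per ranked chunk for query i.

def mrr(results: list[list[bool]]) -> float:
    total = 0.0
    for ranking in results:
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)

# Query 1: first relevant chunk at rank 1; query 2: at rank 3.
score = mrr([[True, False], [False, False, True]])  # (1 + 1/3) / 2
```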

Generation Quality

  • Faithfulness: Does the answer only use information from retrieved context?
  • Answer Relevance: Does the answer actually address the query?
  • Hallucination Rate: Percentage of claims not grounded in context

End-to-End

  • Answer Correctness: Compared against ground truth answers
  • Latency p50/p95/p99: Response time distribution
  • Cost per Query: Embedding + retrieval + generation costs

# Quick evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}

Common Anti-Patterns

  1. Stuffing the entire document as context -- Wastes tokens, dilutes relevant information, increases cost. Always chunk and retrieve selectively.

  2. Ignoring chunk boundaries -- Splitting mid-sentence or mid-code-block produces incoherent chunks that embed poorly and confuse the LLM.

  3. No metadata filtering -- Searching the entire corpus when you know the document category wastes retrieval quality. Always attach and use metadata.

  4. Skipping evaluation -- "It seems to work" is not a metric. Build an eval set of 50-100 question-answer pairs from day one.

  5. One-size-fits-all chunking -- Code, prose, tables, and lists all need different chunking strategies. Use content-aware splitters.

  6. Forgetting to handle "I don't know" -- RAG systems must gracefully decline when retrieval returns no relevant context rather than hallucinating.
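
Anti-pattern 6 has a mechanical fix: gate generation on retrieval quality. A sketch, with illustrative similarity scores and threshold:

```python
# Refuse to answer when nothing retrieved clears a similarity threshold,
# instead of handing the LLM an empty or irrelevant context to guess from.

REFUSAL = "I don't have enough information to answer that."

def answer(query: str, scored_chunks: list[tuple[str, float]],
           threshold: float = 0.75) -> str:
    relevant = [chunk for chunk, sim in scored_chunks if sim >= threshold]
    if not relevant:
        return REFUSAL  # decline up front; cheaper and safer than hallucinating
    # Otherwise hand the relevant chunks to the generation prompt.
    return f"Answering from {len(relevant)} retrieved chunk(s)."

out = answer("What is our refund policy?", [("Pricing page text", 0.41)])
```

Tune the threshold against your eval set: too high and the system declines answerable questions, too low and ungrounded context leaks through.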


Minimal End-to-End Example

import os
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Index
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Query
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa.invoke({"query": "How do I reset my password?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata['source']}")

Decision Checklist for New RAG Projects

  • Define the knowledge corpus and update frequency
  • Choose chunking strategy based on document types
  • Select embedding model (cost vs. quality vs. self-hosting)
  • Pick vector database (managed vs. self-hosted, scale needs)
  • Design the retrieval strategy (dense, sparse, hybrid)
  • Set the generation prompt with grounding instructions
  • Build an evaluation dataset (minimum 50 Q&A pairs)
  • Establish latency and cost budgets
  • Plan for incremental index updates
  • Implement monitoring and observability

Install this skill directly: skilldb add rag-pipeline-skills


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.


chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.


rag-with-langchain

Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.
