rag-fundamentals
Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
# RAG Fundamentals
Build knowledge-grounded LLM applications by combining retrieval with generation.
## Why RAG Over Fine-Tuning
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Swap documents, no retraining | Retrain on every update |
| Hallucination control | Citable sources in context | Model may still hallucinate |
| Cost | Embedding + vector DB | GPU hours for training |
| Latency | Retrieval adds ~100-300ms | No retrieval overhead |
| Domain breadth | Scales to millions of docs | Limited by training data size |
| Transparency | Retrieved chunks are auditable | Black-box parametric memory |
When to use fine-tuning instead: style/tone alignment, structured output formatting, latency-critical paths where retrieval overhead is unacceptable, or when the knowledge is small and static.
When to combine both: Fine-tune for output format and reasoning style, then RAG for factual grounding. This hybrid approach is often the production sweet spot.
## Core Architecture

```
User Query
    |
    v
[Query Processing] --> query rewriting, expansion, HyDE
    |
    v
[Retriever] --> vector search, BM25, hybrid
    |
    v
[Re-Ranker] --> cross-encoder scoring (optional)
    |
    v
[Context Assembly] --> chunk selection, token budget
    |
    v
[Generator (LLM)] --> answer with citations
    |
    v
[Post-Processing] --> fact verification, formatting
```
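The stages above can be sketched as a thin pipeline of pluggable callables. This is purely illustrative -- the hook names (`rewrite_query`, `rerank`, and so on) are invented for this sketch, not a real framework API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGPipeline:
    """Each stage is a pluggable callable; defaults are pass-throughs."""
    rewrite_query: Callable = lambda q: q                  # query processing
    retrieve: Callable = lambda q: []                      # vector / BM25 / hybrid
    rerank: Callable = lambda q, docs: docs                # optional cross-encoder
    assemble: Callable = lambda docs: "\n\n".join(docs)    # token-budgeted context
    generate: Callable = lambda q, ctx: ""                 # LLM call
    postprocess: Callable = lambda ans: ans                # verification, formatting

    def answer(self, query: str) -> str:
        q = self.rewrite_query(query)
        docs = self.rerank(q, self.retrieve(q))
        context = self.assemble(docs)
        return self.postprocess(self.generate(q, context))

# Stubbed stages to show the data flow end to end:
pipe = RAGPipeline(
    retrieve=lambda q: ["Auth uses OAuth 2.0.", "Tokens expire after 1 hour."],
    generate=lambda q, ctx: "Answer based on: " + ctx.splitlines()[0],
)
print(pipe.answer("How does authentication work?"))
# Answer based on: Auth uses OAuth 2.0.
```

Swapping any stage (e.g. replacing `retrieve` with a hybrid retriever) leaves the rest of the flow untouched, which is the main payoff of keeping the stages decoupled.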
### Stage 1: Indexing (Offline)

```python
# Conceptual indexing pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load documents
documents = load_documents("./knowledge_base/")

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
```
### Stage 2: Retrieval (Online)

```python
# Dense retrieval with metadata filtering
results = vectorstore.similarity_search(
    query="How does authentication work?",
    k=5,
    filter={"source": "auth-docs"}
)

# Hybrid retrieval: combine dense + sparse
from langchain.retrievers import BM25Retriever, EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks, k=5)
dense = vectorstore.as_retriever(search_kwargs={"k": 5})
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.3, 0.7]
)
```
### Stage 3: Generation

```python
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Provide your answer with citations in [Source: filename] format.
""")

# Assemble and generate (llm and query defined elsewhere)
context = "\n\n".join([doc.page_content for doc in results])
response = llm.invoke(RAG_PROMPT.format(context=context, question=query))
```
## Component Selection Guide

### Embedding Models
- OpenAI text-embedding-3-small: Good default, 1536 dims, cheap
- Cohere embed-v3: Strong multilingual, supports input types
- BGE-large-en-v1.5: Best open-source, self-hostable
- Choose based on: language support, latency, cost, self-hosting needs
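Whichever model you choose, dense retrieval ultimately ranks documents by vector similarity, usually cosine. A minimal sketch with toy 4-dimensional vectors (real models return hundreds to thousands of dimensions, e.g. 1536 for text-embedding-3-small; the filenames here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: the standard relevance score for dense retrieval."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real embedding-model output
query_vec = [0.1, 0.9, 0.0, 0.2]
doc_vecs = {
    "auth-guide.md":  [0.2, 0.8, 0.1, 0.1],
    "billing-faq.md": [0.9, 0.0, 0.3, 0.0],
}
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked[0])  # auth-guide.md
```

Because cosine similarity ignores vector magnitude, it compares direction only, which is why most embedding APIs return (or recommend) normalized vectors.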
### Vector Databases
- ChromaDB: Local dev, simple API, in-memory option
- Pinecone: Managed, serverless option, good for production
- Qdrant: Self-hosted or cloud, rich filtering, fast
- pgvector: Already using Postgres? Add vector search to it
### LLMs for Generation
- Claude 3.5 Sonnet / Opus: Strong at following context, low hallucination
- GPT-4o: Reliable, good at citations
- Llama 3.1 70B: Self-hosted option, good quality
- Match model to: context window needs, cost, latency, privacy requirements
## Latency Budget
Typical end-to-end RAG latency breakdown:
```
Query embedding:    20-50ms     (API call)
Vector search:      10-50ms     (depends on DB and index size)
Re-ranking:         50-200ms    (cross-encoder, optional)
Context assembly:   5-10ms      (local processing)
LLM generation:     500-3000ms  (depends on model and output length)
────────────────────────────────
Total:              600-3300ms
```
Optimization levers:
- Cache frequent query embeddings
- Use approximate nearest neighbor (ANN) for large indexes
- Skip re-ranking for latency-critical paths
- Stream LLM output to reduce perceived latency
- Pre-compute embeddings for known query patterns
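The first lever, caching query embeddings, can be as simple as an `lru_cache` wrapper. `embed_fn` here is a hypothetical stand-in for whatever embedding client you use; normalizing the query before caching raises the hit rate:

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding callable with an in-process LRU cache."""
    @lru_cache(maxsize=maxsize)
    def embed(query: str):
        # Normalize so trivially different strings share a cache entry;
        # tuples (unlike lists) are hashable and safe to cache.
        return tuple(embed_fn(query.strip().lower()))
    return embed

# Stub embed_fn that records each "API call" it makes
calls = []
embedder = make_cached_embedder(lambda q: (calls.append(q) or [len(q), 0.5]))
embedder("How does auth work?")
embedder("How does auth work?")   # cache hit: no second call
print(len(calls))  # 1
```

In production you would typically back this with a shared cache (e.g. Redis) rather than a per-process one, so hits survive restarts and are shared across replicas.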
## Evaluation Metrics

### Retrieval Quality
- Context Precision: Are the retrieved chunks relevant to the query?
- Context Recall: Did retrieval find all the information needed to answer?
- Mean Reciprocal Rank (MRR): How high does the first relevant chunk rank?
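MRR is straightforward to compute by hand. A small sketch (the query and chunk IDs are made up):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant chunk per query (0 if none found)."""
    total = 0.0
    for query, ranking in ranked_results.items():
        reciprocal = 0.0
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in relevant[query]:
                reciprocal = 1.0 / rank
                break
        total += reciprocal
    return total / len(ranked_results)

# q1's first relevant chunk is at rank 2, q2's at rank 1
ranked = {"q1": ["c3", "c7", "c1"], "q2": ["c2", "c9"]}
relevant = {"q1": {"c7"}, "q2": {"c2"}}
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1) / 2 = 0.75
```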
### Generation Quality
- Faithfulness: Does the answer only use information from retrieved context?
- Answer Relevance: Does the answer actually address the query?
- Hallucination Rate: Percentage of claims not grounded in context
### End-to-End
- Answer Correctness: Compared against ground truth answers
- Latency p50/p95/p99: Response time distribution
- Cost per Query: Embedding + retrieval + generation costs
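Cost per query is just token accounting across the embedding and generation calls. A back-of-envelope sketch; all prices below are placeholders, not current provider rates:

```python
# Placeholder prices -- substitute your providers' current rates.
EMBED_PRICE_PER_1K = 0.00002   # embedding, per 1K tokens
LLM_IN_PRICE_PER_1K = 0.0025   # LLM input, per 1K tokens
LLM_OUT_PRICE_PER_1K = 0.01    # LLM output, per 1K tokens

def cost_per_query(query_tokens, context_tokens, output_tokens):
    """Embedding cost + LLM input cost (query + retrieved context) + output cost."""
    embed = query_tokens / 1000 * EMBED_PRICE_PER_1K
    generate = ((query_tokens + context_tokens) / 1000 * LLM_IN_PRICE_PER_1K
                + output_tokens / 1000 * LLM_OUT_PRICE_PER_1K)
    return embed + generate

# A 20-token query, 2000 tokens of retrieved context, 300-token answer
print(round(cost_per_query(20, 2000, 300), 5))
```

Note how the retrieved context dominates input cost: trimming context from 2000 to 1000 tokens roughly halves the generation input spend, which is why the token budget in context assembly matters.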
```python
# Quick evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}
```
## Common Anti-Patterns
- **Stuffing the entire document as context** -- Wastes tokens, dilutes relevant information, increases cost. Always chunk and retrieve selectively.
- **Ignoring chunk boundaries** -- Splitting mid-sentence or mid-code-block produces incoherent chunks that embed poorly and confuse the LLM.
- **No metadata filtering** -- Searching the entire corpus when you know the document category wastes retrieval quality. Always attach and use metadata.
- **Skipping evaluation** -- "It seems to work" is not a metric. Build an eval set of 50-100 question-answer pairs from day one.
- **One-size-fits-all chunking** -- Code, prose, tables, and lists all need different chunking strategies. Use content-aware splitters.
- **Forgetting to handle "I don't know"** -- RAG systems must gracefully decline when retrieval returns no relevant context rather than hallucinating.
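The last anti-pattern deserves a concrete guard. One minimal approach is a retrieval-score threshold before the LLM is ever called; the 0.75 cutoff and the `retrieve`/`generate` callables are assumptions for this sketch, to be tuned against your eval set:

```python
NO_ANSWER = "I don't have enough information to answer that."

def guarded_answer(query, retrieve, generate, min_score=0.75):
    """Only call the LLM when at least one chunk clears the relevance threshold."""
    scored = retrieve(query)                    # expected: list of (chunk_text, score)
    kept = [chunk for chunk, score in scored if score >= min_score]
    if not kept:
        return NO_ANSWER                        # decline instead of hallucinating
    return generate(query, "\n\n".join(kept))

# Stubbed retrieval returning nothing relevant enough:
print(guarded_answer(
    "What is the refund policy?",
    retrieve=lambda q: [("Unrelated chunk about logging", 0.31)],
    generate=lambda q, ctx: "grounded answer",
))
# I don't have enough information to answer that.
```

A score threshold alone is a blunt instrument; pairing it with the prompt-level instruction from Stage 3 ("say 'I don't have enough information'") gives two independent layers of refusal.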
## Minimal End-to-End Example

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Index
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Query
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
result = qa.invoke({"query": "How do I reset my password?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata['source']}")
```
## Decision Checklist for New RAG Projects
- Define the knowledge corpus and update frequency
- Choose chunking strategy based on document types
- Select embedding model (cost vs. quality vs. self-hosting)
- Pick vector database (managed vs. self-hosted, scale needs)
- Design the retrieval strategy (dense, sparse, hybrid)
- Set the generation prompt with grounding instructions
- Build an evaluation dataset (minimum 50 Q&A pairs)
- Establish latency and cost budgets
- Plan for incremental index updates
- Implement monitoring and observability
Install this skill directly: `skilldb add rag-pipeline-skills`
## Related Skills

### advanced-rag
Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.

### chunking-strategies
Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.

### embedding-models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.

### rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.

### rag-production
Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.

### rag-with-langchain
Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.