
rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.


RAG Fundamentals

Build knowledge-grounded LLM applications by combining retrieval with generation.


Why RAG Over Fine-Tuning

| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Swap documents, no retraining | Retrain on every update |
| Hallucination control | Citable sources in context | Model may still hallucinate |
| Cost | Embedding + vector DB | GPU hours for training |
| Latency | Retrieval adds ~100-300ms | No retrieval overhead |
| Domain breadth | Scales to millions of docs | Limited by training data size |
| Transparency | Retrieved chunks are auditable | Black-box parametric memory |

When to use fine-tuning instead: style/tone alignment, structured output formatting, latency-critical paths where retrieval overhead is unacceptable, or when the knowledge is small and static.

When to combine both: Fine-tune for output format and reasoning style, then RAG for factual grounding. This hybrid approach is often the production sweet spot.


Core Architecture

User Query
    |
    v
[Query Processing] --> query rewriting, expansion, HyDE
    |
    v
[Retriever] --> vector search, BM25, hybrid
    |
    v
[Re-Ranker] --> cross-encoder scoring (optional)
    |
    v
[Context Assembly] --> chunk selection, token budget
    |
    v
[Generator (LLM)] --> answer with citations
    |
    v
[Post-Processing] --> fact verification, formatting
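
The query-processing step above mentions HyDE (Hypothetical Document Embeddings) without showing it. A minimal sketch of the idea follows, where `generate` and `embed` are hypothetical stand-ins for real LLM and embedding calls:

```python
# HyDE sketch: embed a hypothetical *answer* instead of the raw query,
# because answer-like text sits closer to answer chunks in embedding space.

def generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call.
    return "Authentication uses signed session tokens validated per request."

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for an embedding call; returns a dummy vector.
    return [float(ord(c) % 7) for c in text[:8]]

def hyde_query_vector(query: str) -> list[float]:
    # 1. Ask the LLM to draft a hypothetical answer to the query.
    hypothetical_doc = generate(f"Write a short passage answering: {query}")
    # 2. Embed that draft and search the vector store with it.
    return embed(hypothetical_doc)

vec = hyde_query_vector("How does authentication work?")
```

In a real pipeline `vec` would then be passed to the retriever in place of the plain query embedding.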

Stage 1: Indexing (Offline)

# Conceptual indexing pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load documents
documents = load_documents("./knowledge_base/")

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

Stage 2: Retrieval (Online)

# Dense retrieval with metadata filtering
results = vectorstore.similarity_search(
    query="How does authentication work?",
    k=5,
    filter={"source": "auth-docs"}
)

# Hybrid retrieval: combine dense + sparse
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers import BM25Retriever

bm25 = BM25Retriever.from_documents(chunks, k=5)
dense = vectorstore.as_retriever(search_kwargs={"k": 5})

hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.3, 0.7]
)
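
The optional re-ranking stage scores each (query, chunk) pair with a cross-encoder and keeps only the best candidates. The sketch below is runnable with a token-overlap `score` as a hypothetical stand-in for a real cross-encoder model:

```python
# Re-ranking sketch: score every candidate against the query, sort
# descending, keep the top_n. Swap `score` for a cross-encoder in practice.

def score(query: str, chunk: str) -> float:
    # Hypothetical stand-in scorer: fraction of query tokens found in chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_n]

candidates = [
    "Billing is handled monthly.",
    "Password reset requires a verified email.",
    "To reset your password, open account settings.",
]
top = rerank("how do I reset my password", candidates, top_n=2)
```

The point of the pattern: retrieve generously (e.g. k=20) with the fast bi-encoder, then let the slower, more accurate scorer pick the final few.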

Stage 3: Generation

from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Provide your answer with citations in [Source: filename] format.
""")

# Assemble and generate (llm is any chat model; results comes from Stage 2)
context = "\n\n".join(doc.page_content for doc in results)
response = llm.invoke(RAG_PROMPT.format(context=context, question=query))
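
The plain join above ignores the token budget named in the architecture diagram. A greedy packing sketch, using whitespace tokens as a rough stand-in for a real tokenizer such as tiktoken, might look like:

```python
# Context assembly under a token budget: pack highest-ranked chunks first,
# stop before the budget is exceeded rather than truncating mid-chunk.

def assemble_context(chunks: list[str], max_tokens: int = 100) -> str:
    selected, used = [], 0
    for chunk in chunks:  # chunks assumed ordered by retrieval score
        cost = len(chunk.split())  # approximation; use a real tokenizer
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

context = assemble_context(["alpha beta gamma", "delta epsilon", "zeta"],
                           max_tokens=4)
```

Stopping at the budget keeps the prompt cost bounded even when retrieval returns many long chunks.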

Component Selection Guide

Embedding Models

  • OpenAI text-embedding-3-small: Good default, 1536 dims, cheap
  • Cohere embed-v3: Strong multilingual, supports input types
  • BGE-large-en-v1.5: Best open-source, self-hostable
  • Choose based on: language support, latency, cost, self-hosting needs

Vector Databases

  • ChromaDB: Local dev, simple API, in-memory option
  • Pinecone: Managed, serverless option, good for production
  • Qdrant: Self-hosted or cloud, rich filtering, fast
  • pgvector: Already using Postgres? Add vector search to it

LLMs for Generation

  • Claude 3.5 Sonnet / Opus: Strong at following context, low hallucination
  • GPT-4o: Reliable, good at citations
  • Llama 3.1 70B: Self-hosted option, good quality
  • Match model to: context window needs, cost, latency, privacy requirements

Latency Budget

Typical end-to-end RAG latency breakdown:

Query embedding:      20-50ms   (API call)
Vector search:        10-50ms   (depends on DB and index size)
Re-ranking:           50-200ms  (cross-encoder, optional)
Context assembly:     5-10ms    (local processing)
LLM generation:       500-3000ms (depends on model and output length)
────────────────────────────────
Total:                600-3300ms

Optimization levers:

  • Cache frequent query embeddings
  • Use approximate nearest neighbor (ANN) for large indexes
  • Skip re-ranking for latency-critical paths
  • Stream LLM output to reduce perceived latency
  • Pre-compute embeddings for known query patterns

Evaluation Metrics

Retrieval Quality

  • Context Precision: Are the retrieved chunks relevant to the query?
  • Context Recall: Did retrieval find all the information needed to answer?
  • Mean Reciprocal Rank (MRR): How high does the first relevant chunk rank?
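
MRR is simple to compute directly: for each query, take 1/rank of the first relevant chunk in the ranked results (contributing 0 if none is relevant), then average over queries. A minimal sketch:

```python
# Mean Reciprocal Rank over a batch of queries.
# results[i] is a relevance flag per ranked chunk for query i.

def mrr(results: list[list[bool]]) -> float:
    total = 0.0
    for ranking in results:
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)

# Query 1: first relevant chunk at rank 1; query 2: at rank 3.
score = mrr([[True, False], [False, False, True]])  # (1 + 1/3) / 2
```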

Generation Quality

  • Faithfulness: Does the answer only use information from retrieved context?
  • Answer Relevance: Does the answer actually address the query?
  • Hallucination Rate: Percentage of claims not grounded in context

End-to-End

  • Answer Correctness: Compared against ground truth answers
  • Latency p50/p95/p99: Response time distribution
  • Cost per Query: Embedding + retrieval + generation costs

# Quick evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}

Common Anti-Patterns

  1. Stuffing the entire document as context -- Wastes tokens, dilutes relevant information, increases cost. Always chunk and retrieve selectively.

  2. Ignoring chunk boundaries -- Splitting mid-sentence or mid-code-block produces incoherent chunks that embed poorly and confuse the LLM.

  3. No metadata filtering -- Searching the entire corpus when you know the document category wastes retrieval quality. Always attach and use metadata.

  4. Skipping evaluation -- "It seems to work" is not a metric. Build an eval set of 50-100 question-answer pairs from day one.

  5. One-size-fits-all chunking -- Code, prose, tables, and lists all need different chunking strategies. Use content-aware splitters.

  6. Forgetting to handle "I don't know" -- RAG systems must gracefully decline when retrieval returns no relevant context rather than hallucinating.
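
Anti-pattern 6 has a mechanical fix: gate generation on retrieval quality. A sketch, with illustrative similarity scores and threshold:

```python
# Refuse to answer when nothing retrieved clears a similarity threshold,
# instead of handing the LLM an empty or irrelevant context to guess from.

REFUSAL = "I don't have enough information to answer that."

def answer(query: str, scored_chunks: list[tuple[str, float]],
           threshold: float = 0.75) -> str:
    relevant = [chunk for chunk, sim in scored_chunks if sim >= threshold]
    if not relevant:
        return REFUSAL  # decline up front; cheaper and safer than hallucinating
    # Otherwise hand the relevant chunks to the generation prompt.
    return f"Answering from {len(relevant)} retrieved chunk(s)."

out = answer("What is our refund policy?", [("Pricing page text", 0.41)])
```

Tune the threshold against your eval set: too high and the system declines answerable questions, too low and ungrounded context leaks through.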


Minimal End-to-End Example

import os
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Index
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Query
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa.invoke({"query": "How do I reset my password?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata['source']}")

Decision Checklist for New RAG Projects

  • Define the knowledge corpus and update frequency
  • Choose chunking strategy based on document types
  • Select embedding model (cost vs. quality vs. self-hosting)
  • Pick vector database (managed vs. self-hosted, scale needs)
  • Design the retrieval strategy (dense, sparse, hybrid)
  • Set the generation prompt with grounding instructions
  • Build an evaluation dataset (minimum 50 Q&A pairs)
  • Establish latency and cost budgets
  • Plan for incremental index updates
  • Implement monitoring and observability

Install this skill directly: skilldb add rag-pipeline-skills


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.


chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.


rag-with-langchain

Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.
