rag-with-llamaindex
Building RAG systems with LlamaIndex (formerly GPT Index). Covers data connectors, node parsers, index types (vector, keyword, knowledge graph, summary), query engines, response synthesizers, and advanced patterns like sub-question queries and recursive retrieval. Practical code for production LlamaIndex RAG pipelines.
Build structured RAG applications using LlamaIndex's data framework for LLM apps.
## Key Points
- **Documents** -- raw source data
- **Nodes** -- chunked pieces of documents with metadata
- **Indexes** -- data structures for retrieval (vector, keyword, graph)
- **Query Engines** -- end-to-end query interface
- **Response Synthesizers** -- how to combine retrieved nodes into a response
## Quick Example
```bash
pip install llama-index
pip install llama-index-embeddings-openai llama-index-llms-openai
pip install llama-index-vector-stores-chroma # Or qdrant, pinecone, etc.
pip install llama-index-readers-file # PDF, DOCX, etc.
```
## Core Concepts

LlamaIndex organizes RAG into:

- **Documents** -- raw source data
- **Nodes** -- chunked pieces of documents with metadata
- **Indexes** -- data structures for retrieval (vector, keyword, graph)
- **Query Engines** -- end-to-end query interface
- **Response Synthesizers** -- how to combine retrieved nodes into a response
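As a mental model, the Document-to-Node relationship can be sketched with plain dataclasses (the `Doc`, `Node`, and `to_nodes` names here are hypothetical stand-ins, not the real LlamaIndex classes):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """A raw source document (stands in for llama_index's Document)."""
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Node:
    """A chunk of a document, carrying its parent's metadata."""
    text: str
    metadata: dict

def to_nodes(doc: Doc, chunk_size: int = 20) -> list[Node]:
    """Naive fixed-size chunking: each node inherits the doc's metadata."""
    words = doc.text.split()
    return [
        Node(text=" ".join(words[i:i + chunk_size]), metadata=dict(doc.metadata))
        for i in range(0, len(words), chunk_size)
    ]

doc = Doc(text="word " * 50, metadata={"source": "auth.md"})
nodes = to_nodes(doc)
print(len(nodes), nodes[0].metadata)  # 3 chunks, metadata propagated to each
```

The key takeaway is that metadata attached at the Document level travels with every Node, which is why metadata hygiene (covered below) matters so much.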
## Data Connectors (Readers)

```python
from llama_index.core import SimpleDirectoryReader

# Load from directory (auto-detects file types)
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True,
    required_exts=[".md", ".txt", ".pdf"],
    filename_as_id=True,
).load_data()

# Load specific files
documents = SimpleDirectoryReader(
    input_files=["./docs/auth.md", "./docs/api.pdf"]
).load_data()

# From LlamaHub (community connectors)
# pip install llama-index-readers-notion
from llama_index.readers.notion import NotionPageReader

reader = NotionPageReader(integration_token="secret_xxx")
documents = reader.load_data(page_ids=["page-id-1", "page-id-2"])

# Web pages
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader().load_data(
    urls=["https://docs.example.com/auth"]
)

# Database
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="postgresql://user:pass@host/db")
documents = reader.load_data(query="SELECT id, content, title FROM articles")
```
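Under the hood, `SimpleDirectoryReader`'s discovery step is essentially a filtered directory walk. A standalone sketch of the `recursive` + `required_exts` behavior (the `collect_files` helper is hypothetical, for illustration only):

```python
import tempfile
from pathlib import Path

def collect_files(input_dir: str, required_exts: list[str], recursive: bool = True) -> list[Path]:
    """Mimic SimpleDirectoryReader's file discovery: walk the tree,
    keep only files whose extension is in the allowlist."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(input_dir).glob(pattern)
        if p.is_file() and p.suffix in required_exts
    )

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "sub").mkdir()
    (root / "a.md").write_text("# a")
    (root / "sub" / "b.txt").write_text("b")
    (root / "skip.py").write_text("pass")  # filtered out by required_exts
    files = collect_files(d, required_exts=[".md", ".txt"])
    print([p.name for p in files])  # ['a.md', 'b.txt']
```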
## Node Parsers (Chunking)

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
    HierarchicalNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Sentence-based splitting (recommended default)
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
    paragraph_separator="\n\n",
)
nodes = parser.get_nodes_from_documents(documents)

# Semantic splitting (groups by embedding similarity)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    breakpoint_percentile_threshold=95,
    buffer_size=1,  # Sentences to group together
)
nodes = parser.get_nodes_from_documents(documents)

# Markdown-aware
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Code-aware
parser = CodeSplitter(
    language="python",
    max_chars=1500,
    chunk_lines=40,
)
nodes = parser.get_nodes_from_documents(documents)

# Hierarchical (parent-child)
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 512, 256]  # Large -> medium -> small
)
nodes = parser.get_nodes_from_documents(documents)
```
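The `chunk_size`/`chunk_overlap` interaction in `SentenceSplitter` is a sliding window. A minimal word-based sketch (a toy stand-in: the real splitter counts tokens and respects sentence boundaries):

```python
def split_with_overlap(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Sliding-window splitting: each chunk repeats the last
    `chunk_overlap` words of the previous chunk."""
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = split_with_overlap(text)
print(len(chunks))  # 3 chunks; consecutive chunks share 2 words
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from either side.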
## Index Types

### Vector Store Index (Most Common)

```python
from llama_index.core import (
    Settings,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure global settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o", temperature=0)

# Build index from documents
index = VectorStoreIndex.from_documents(documents)

# Build from pre-parsed nodes
index = VectorStoreIndex(nodes)

# With external vector store (wired in via a StorageContext)
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Persist and reload (for the default local stores)
index.storage_context.persist(persist_dir="./storage")

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```
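At query time, a vector index boils down to similarity ranking over stored embeddings. A toy sketch of the `similarity_top_k` step, using 2-D vectors and cosine similarity for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], stored: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(stored)), key=lambda i: cosine(query_vec, stored[i]), reverse=True)
    return ranked[:k]

stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], stored))  # [0, 2]
```

Real vector stores use approximate nearest-neighbor indexes rather than this brute-force scan, but the ranking semantics are the same.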
### Keyword Table Index

```python
from llama_index.core import KeywordTableIndex

# Keyword-based retrieval (matches query keywords against keywords
# extracted from each node; no embeddings involved)
keyword_index = KeywordTableIndex.from_documents(documents)
query_engine = keyword_index.as_query_engine()
response = query_engine.query("JWT token expiration")
```
### Summary Index

```python
from llama_index.core import SummaryIndex

# Summarizes all nodes (good for "tell me about everything" queries)
summary_index = SummaryIndex.from_documents(documents)
query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)
response = query_engine.query("Give me an overview of the authentication system")
```
### Knowledge Graph Index

```python
from llama_index.core import KnowledgeGraphIndex

# Extract entities and relationships
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
)
query_engine = kg_index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
)
response = query_engine.query("How are users and roles related?")
```
## Query Engines

```python
# Basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",  # compact, refine, tree_summarize, simple_summarize, ...
)
response = query_engine.query("How does authentication work?")
print(response.response)      # The answer
print(response.source_nodes)  # Retrieved nodes with scores

# Streaming
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = query_engine.query("How does authentication work?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
```
## Response Modes

```python
# compact: Stuff as many nodes as possible into one LLM call
query_engine = index.as_query_engine(response_mode="compact")

# refine: Process nodes one by one, refining the answer iteratively
query_engine = index.as_query_engine(response_mode="refine")

# tree_summarize: Build a tree of summaries bottom-up
query_engine = index.as_query_engine(response_mode="tree_summarize")

# no_text: Return nodes only, no LLM generation
query_engine = index.as_query_engine(response_mode="no_text")

# accumulate: Generate a response per node, then concatenate
query_engine = index.as_query_engine(response_mode="accumulate")
```

| Mode | LLM Calls | Best For |
|---|---|---|
| `compact` | 1-2 | Default, most queries |
| `refine` | N (one per node) | Detailed answers, many sources |
| `tree_summarize` | ~log(N) | Summarization over many docs |
| `no_text` | 0 | Retrieval-only, custom generation |
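The call counts in the table can be made concrete with a rough cost model. The `fan_in` and the "compact fits in one call" assumptions are simplifications; real counts depend on how many node texts fit in the context window:

```python
import math

def llm_calls(mode: str, n_nodes: int, fan_in: int = 4) -> int:
    """Rough estimate of LLM calls per response mode (toy model)."""
    if mode == "no_text":
        return 0
    if mode == "compact":
        return 1  # assumes all retrieved nodes pack into one prompt
    if mode in ("refine", "accumulate"):
        return n_nodes  # one call per node
    if mode == "tree_summarize":
        # Summarize fan_in-sized groups level by level until one summary remains
        calls, level = 0, n_nodes
        while level > 1:
            level = math.ceil(level / fan_in)
            calls += level
        return calls
    raise ValueError(f"unknown mode: {mode}")

for mode in ("compact", "refine", "tree_summarize", "no_text"):
    print(mode, llm_calls(mode, n_nodes=16))
```

With 16 nodes, `refine` costs 16 calls while `tree_summarize` costs 5, which is why mode choice dominates latency and spend at high `similarity_top_k`.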
## Retrievers

```python
from llama_index.core.retrievers import VectorIndexRetriever

# Basic vector retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
nodes = retriever.retrieve("How does authentication work?")

# Auto-merging retriever (for hierarchical nodes)
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context=index.storage_context,
    simple_ratio_thresh=0.4,  # Merge into parent if 40%+ of its children retrieved
)

# BM25 retriever
# pip install llama-index-retrievers-bm25
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=10,
)

# Fusion retriever (hybrid)
from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    retrievers=[
        index.as_retriever(similarity_top_k=10),
        bm25_retriever,
    ],
    num_queries=3,       # Generate query variations
    similarity_top_k=5,  # Final top-k after fusion
    use_async=True,
)
```
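`QueryFusionRetriever` merges the per-retriever rankings into one list; one of its fusion strategies is reciprocal rank fusion, which a few lines of plain Python can illustrate (toy document IDs):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: score(doc) = sum over lists of 1 / (k + rank).
    Documents ranked highly by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it appears near the top of both lists, even though neither retriever ranked it first.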
## Sub-Question Query Engine

Decomposes complex questions into sub-questions, queries each, and synthesizes.

```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools from different indexes
auth_tool = QueryEngineTool(
    query_engine=auth_index.as_query_engine(),
    metadata=ToolMetadata(
        name="auth_docs",
        description="Documentation about authentication and authorization"
    ),
)
api_tool = QueryEngineTool(
    query_engine=api_index.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference documentation"
    ),
)

# Sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[auth_tool, api_tool],
)

# Complex query gets decomposed
response = sub_question_engine.query(
    "Compare the authentication methods used in the REST API vs GraphQL API"
)
# Internally generates:
# 1. "What authentication methods does the REST API use?" -> auth_docs
# 2. "What authentication methods does the GraphQL API use?" -> api_docs
# Then synthesizes both answers
```
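The decompose-route-synthesize loop can be sketched with stub query engines. Everything here (tool names, canned answers, the lambda "LLM") is a hypothetical stand-in for the real components:

```python
def sub_question_query(question: str, tools: dict, decompose, synthesize) -> str:
    """Decompose a question, route each sub-question to the named tool,
    then merge the sub-answers into one response."""
    answers = [tools[tool_name](sub_q) for tool_name, sub_q in decompose(question)]
    return synthesize(answers)

# Stub tools standing in for the per-index query engines
tools = {
    "auth_docs": lambda q: "REST uses OAuth2",
    "api_docs": lambda q: "GraphQL uses API keys",
}
# Stub decomposer/synthesizer standing in for LLM calls
decompose = lambda q: [("auth_docs", "REST auth?"), ("api_docs", "GraphQL auth?")]
synthesize = lambda answers: "; ".join(answers)

result = sub_question_query("Compare REST vs GraphQL auth", tools, decompose, synthesize)
print(result)  # REST uses OAuth2; GraphQL uses API keys
```

In the real engine both `decompose` and `synthesize` are LLM calls, and tool `description` strings drive the routing.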
## Chat Engine (Conversational)

```python
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # Best for RAG conversations
    similarity_top_k=5,
    system_prompt="You are a helpful assistant that answers questions about our documentation.",
)

# First question
response = chat_engine.chat("What auth methods do you support?")
print(response.response)

# Follow-up (automatically contextualizes)
response = chat_engine.chat("How do I configure the first one?")
print(response.response)

# Reset conversation
chat_engine.reset()

# Chat modes:
# "condense_plus_context" - Rewrites question with history, then retrieves (recommended)
# "context"               - Retrieves for each message, includes history in prompt
# "condense_question"     - Condenses history into standalone question
# "simple"                - No retrieval, just chat with history
```
## Metadata Filtering

```python
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition,
)

# Add metadata during document creation
from llama_index.core import Document

doc = Document(
    text="OAuth2 authorization code flow...",
    metadata={
        "source": "auth.md",
        "category": "security",
        "version": "2.0",
        "updated": "2024-06-01",
    },
    excluded_llm_metadata_keys=["source"],    # Don't send to LLM
    excluded_embed_metadata_keys=["updated"], # Don't embed this field
)

# Query with filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="category", operator=FilterOperator.EQ, value="security"),
        MetadataFilter(key="version", operator=FilterOperator.GTE, value="2.0"),
    ],
    condition=FilterCondition.AND,
)
retriever = index.as_retriever(
    similarity_top_k=5,
    filters=filters,
)
```
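Filter semantics are plain predicate evaluation over node metadata. A sketch assuming comparable values (real vector stores translate these filters into their native query language rather than scanning in Python):

```python
from operator import eq, ge

OPS = {"EQ": eq, "GTE": ge}

def matches(metadata: dict, filters: list[tuple[str, str, object]], condition: str = "AND") -> bool:
    """Evaluate (key, operator, value) filters against one node's metadata."""
    results = (OPS[op](metadata.get(key), value) for key, op, value in filters)
    return all(results) if condition == "AND" else any(results)

flt = [("category", "EQ", "security"), ("version", "GTE", "2.0")]
print(matches({"category": "security", "version": "2.0"}, flt))  # True
print(matches({"category": "billing", "version": "2.1"}, flt))   # False under AND
```

Note that `"version"` here is compared as a string; for real version ordering you would store a numeric or normalized field.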
## Ingestion Pipeline

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor, SummaryExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=3),               # Extract title from first N nodes
        SummaryExtractor(summaries=["self"]),  # Generate summary per node
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    docstore=SimpleDocumentStore(),  # Required for document-hash deduplication
    vector_store=vector_store,
)

# Run pipeline (handles deduplication via document hashing)
nodes = pipeline.run(documents=documents, show_progress=True)

# Incremental updates: only re-processes changed documents
pipeline.run(documents=updated_documents)
```
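The deduplication step amounts to comparing content hashes against what the docstore has already seen. A standalone sketch (the `run_incremental` helper is hypothetical):

```python
import hashlib

def run_incremental(documents: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Hash-based dedup: re-process a doc only when its content hash
    differs from the one stored for its id."""
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            seen_hashes[doc_id] = digest
            changed.append(doc_id)
    return changed

store: dict[str, str] = {}
first = run_incremental({"auth.md": "v1", "api.md": "v1"}, store)
second = run_incremental({"auth.md": "v2", "api.md": "v1"}, store)
print(first)   # both new on the first run
print(second)  # only auth.md changed
```

This is why stable document IDs (e.g. `filename_as_id=True` in the reader) matter: the hash comparison is keyed by ID.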
## Anti-Patterns

1. **Not using Settings** -- Global settings avoid repeating `embed_model` and `llm` everywhere. Configure once at startup.
2. **Ignoring response modes** -- `compact` is not always best. Use `tree_summarize` for broad summarization, `refine` for detailed multi-source answers.
3. **Building indexes without persistence** -- Always persist to disk or an external vector store. Rebuilding indexes on every restart wastes time and money.
4. **Skipping metadata exclusions** -- Metadata like file paths clutters LLM prompts and embeddings. Use `excluded_llm_metadata_keys` and `excluded_embed_metadata_keys`.
5. **Not using IngestionPipeline for updates** -- Manual re-indexing is error-prone. The pipeline handles deduplication and incremental processing.
6. **Ignoring LlamaIndex observability** -- Enable callbacks for debugging: `from llama_index.core import set_global_handler; set_global_handler("simple")`.
## Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate: multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG, with guidance on when each technique is warranted.
- **chunking-strategies** -- Document chunking strategies for RAG pipelines: fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking, plus chunk size optimization, overlap strategies, and practical benchmarks.
- **embedding-models** -- Selecting, using, and optimizing text embedding models: commercial options (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source ones (BGE, E5, Nomic Embed), dimensionality selection, batch processing, caching, fine-tuning, and cost analysis.
- **rag-evaluation** -- Evaluating RAG systems end-to-end: retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation, A/B testing, evaluation datasets, and production monitoring.
- **rag-fundamentals** -- Foundational RAG architecture: why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate.
- **rag-production** -- Production-grade RAG deployment: caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, retrieval-quality monitoring, cost optimization, incremental indexing, and multi-tenancy.