
rag-with-llamaindex

Building RAG systems with LlamaIndex (formerly GPT Index). Covers data connectors, node parsers, index types (vector, keyword, knowledge graph, summary), query engines, response synthesizers, and advanced patterns like sub-question queries and recursive retrieval. Practical code for production LlamaIndex RAG pipelines.

## Quick Summary
Build structured RAG applications using LlamaIndex's data framework for LLM apps.

## Key Points

- **Documents** -- raw source data
- **Nodes** -- chunked pieces of documents with metadata
- **Indexes** -- data structures for retrieval (vector, keyword, graph)
- **Query Engines** -- end-to-end query interface
- **Response Synthesizers** -- how to combine retrieved nodes into a response

## Anti-Patterns

1. **Not using Settings** -- Global settings avoid repeating `embed_model` and `llm` everywhere. Configure once at startup.
2. **Ignoring response modes** -- `compact` is not always best. Use `tree_summarize` for broad summarization, `refine` for detailed multi-source answers.
3. **Building indexes without persistence** -- Always persist to disk or external vector store. Rebuilding indexes on every restart wastes time and money.
4. **Skipping metadata exclusions** -- Metadata like file paths clutter LLM prompts and embeddings. Use `excluded_llm_metadata_keys` and `excluded_embed_metadata_keys`.
5. **Not using IngestionPipeline for updates** -- Manual re-indexing is error-prone. The pipeline handles deduplication and incremental processing.
6. **Ignoring LlamaIndex observability** -- Enable callbacks for debugging: `from llama_index.core import set_global_handler; set_global_handler("simple")`.

## Quick Example

```bash
pip install llama-index
pip install llama-index-embeddings-openai llama-index-llms-openai
pip install llama-index-vector-stores-chroma  # Or qdrant, pinecone, etc.
pip install llama-index-readers-file  # PDF, DOCX, etc.
```

RAG with LlamaIndex

Build structured RAG applications using LlamaIndex's data framework for LLM apps.


Installation

pip install llama-index
pip install llama-index-embeddings-openai llama-index-llms-openai
pip install llama-index-vector-stores-chroma  # Or qdrant, pinecone, etc.
pip install llama-index-readers-file  # PDF, DOCX, etc.

Core Concepts

LlamaIndex organizes RAG into:

  • Documents -- raw source data
  • Nodes -- chunked pieces of documents with metadata
  • Indexes -- data structures for retrieval (vector, keyword, graph)
  • Query Engines -- end-to-end query interface
  • Response Synthesizers -- how to combine retrieved nodes into a response

Data Connectors (Readers)

from llama_index.core import SimpleDirectoryReader

# Load from directory (auto-detects file types)
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True,
    required_exts=[".md", ".txt", ".pdf"],
    filename_as_id=True,
).load_data()

# Load specific files
documents = SimpleDirectoryReader(
    input_files=["./docs/auth.md", "./docs/api.pdf"]
).load_data()

# From LlamaHub (community connectors)
# pip install llama-index-readers-notion
from llama_index.readers.notion import NotionPageReader

reader = NotionPageReader(integration_token="secret_xxx")
documents = reader.load_data(page_ids=["page-id-1", "page-id-2"])

# Web pages
# pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader().load_data(
    urls=["https://docs.example.com/auth"]
)

# Database
# pip install llama-index-readers-database
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="postgresql://user:pass@host/db")
documents = reader.load_data(query="SELECT id, content, title FROM articles")

Node Parsers (Chunking)

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
    HierarchicalNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Sentence-based splitting (recommended default)
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
    paragraph_separator="\n\n",
)
nodes = parser.get_nodes_from_documents(documents)

# Semantic splitting (groups by embedding similarity)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    breakpoint_percentile_threshold=95,
    buffer_size=1,  # Sentences to group together
)
nodes = parser.get_nodes_from_documents(documents)

# Markdown-aware
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Code-aware
parser = CodeSplitter(
    language="python",
    max_chars=1500,
    chunk_lines=40,
)
nodes = parser.get_nodes_from_documents(documents)

# Hierarchical (parent-child)
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 512, 256]  # Large -> medium -> small
)
nodes = parser.get_nodes_from_documents(documents)

Index Types

Vector Store Index (Most Common)

from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure global settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o", temperature=0)

# Build index from documents
index = VectorStoreIndex.from_documents(documents)

# Build from pre-parsed nodes
index = VectorStoreIndex(nodes)

# With external vector store
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Reload later straight from the vector store (no re-indexing)
index = VectorStoreIndex.from_vector_store(vector_store)

# Persist and reload
from llama_index.core import StorageContext, load_index_from_storage

# Save
index.storage_context.persist(persist_dir="./storage")

# Load
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

Keyword Table Index

from llama_index.core import KeywordTableIndex

# Keyword-based retrieval (keywords extracted per node; no embeddings)
keyword_index = KeywordTableIndex.from_documents(documents)
query_engine = keyword_index.as_query_engine()
response = query_engine.query("JWT token expiration")

Summary Index

from llama_index.core import SummaryIndex

# Iterates over all nodes at query time (good for "tell me about everything" queries)
summary_index = SummaryIndex.from_documents(documents)
query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)
response = query_engine.query("Give me an overview of the authentication system")

Knowledge Graph Index

from llama_index.core import KnowledgeGraphIndex

# Extract entities and relationships
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
)

query_engine = kg_index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
)
response = query_engine.query("How are users and roles related?")

Query Engines

# Basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",  # compact, refine, tree_summarize, simple_summarize, no_text, accumulate
)

response = query_engine.query("How does authentication work?")
print(response.response)  # The answer
print(response.source_nodes)  # Retrieved nodes with scores

# Streaming
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = query_engine.query("How does authentication work?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)

Response Modes

# compact: Stuff as many nodes as possible into one LLM call
query_engine = index.as_query_engine(response_mode="compact")

# refine: Process nodes one by one, refining the answer iteratively
query_engine = index.as_query_engine(response_mode="refine")

# tree_summarize: Build a tree of summaries bottom-up
query_engine = index.as_query_engine(response_mode="tree_summarize")

# no_text: Return nodes only, no LLM generation
query_engine = index.as_query_engine(response_mode="no_text")

# accumulate: Generate a response per node, then concatenate
query_engine = index.as_query_engine(response_mode="accumulate")
| Mode             | LLM Calls        | Best For                          |
|------------------|------------------|-----------------------------------|
| `compact`        | 1-2              | Default, most queries             |
| `refine`         | N (one per node) | Detailed answers, many sources    |
| `tree_summarize` | ~log(N)          | Summarization over many docs      |
| `no_text`        | 0                | Retrieval-only, custom generation |

Retrievers

from llama_index.core.retrievers import VectorIndexRetriever

# Basic vector retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
nodes = retriever.retrieve("How does authentication work?")

# Auto-merging retriever (for hierarchical nodes)
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context=index.storage_context,
    simple_ratio_thresh=0.4,  # Merge if 40%+ children retrieved
)

# BM25 retriever
# pip install llama-index-retrievers-bm25
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=10,
)

# Fusion retriever (hybrid)
from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    retrievers=[
        index.as_retriever(similarity_top_k=10),
        bm25_retriever,
    ],
    num_queries=3,       # Generate query variations
    similarity_top_k=5,  # Final top-k after fusion
    use_async=True,
)

Sub-Question Query Engine

Decomposes complex questions into sub-questions, queries each, and synthesizes.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools from different indexes
auth_tool = QueryEngineTool(
    query_engine=auth_index.as_query_engine(),
    metadata=ToolMetadata(
        name="auth_docs",
        description="Documentation about authentication and authorization"
    ),
)

api_tool = QueryEngineTool(
    query_engine=api_index.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference documentation"
    ),
)

# Sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[auth_tool, api_tool],
)

# Complex query gets decomposed
response = sub_question_engine.query(
    "Compare the authentication methods used in the REST API vs GraphQL API"
)
# Internally generates:
# 1. "What authentication methods does the REST API use?" -> auth_docs
# 2. "What authentication methods does the GraphQL API use?" -> api_docs
# Then synthesizes both answers

Chat Engine (Conversational)

from llama_index.core.chat_engine import CondensePlusContextChatEngine  # class behind chat_mode="condense_plus_context"

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # Best for RAG conversations
    similarity_top_k=5,
    system_prompt="You are a helpful assistant that answers questions about our documentation.",
)

# First question
response = chat_engine.chat("What auth methods do you support?")
print(response.response)

# Follow-up (automatically contextualizes)
response = chat_engine.chat("How do I configure the first one?")
print(response.response)

# Reset conversation
chat_engine.reset()

# Chat modes:
# "condense_plus_context" - Rewrites question with history, then retrieves (recommended)
# "context" - Retrieves for each message, includes history in prompt
# "condense_question" - Condenses history into standalone question
# "simple" - No retrieval, just chat with history

Metadata Filtering

from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition,
)

# Add metadata during document creation
from llama_index.core import Document

doc = Document(
    text="OAuth2 authorization code flow...",
    metadata={
        "source": "auth.md",
        "category": "security",
        "version": "2.0",
        "updated": "2024-06-01",
    },
    excluded_llm_metadata_keys=["source"],  # Don't send to LLM
    excluded_embed_metadata_keys=["updated"],  # Don't embed this field
)

# Query with filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="category", operator=FilterOperator.EQ, value="security"),
        MetadataFilter(key="version", operator=FilterOperator.GTE, value="2.0"),
    ],
    condition=FilterCondition.AND,
)

retriever = index.as_retriever(
    similarity_top_k=5,
    filters=filters,
)

Ingestion Pipeline

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.extractors import TitleExtractor, SummaryExtractor

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=3),       # Extract title from first N nodes
        SummaryExtractor(summaries=["self"]),  # Generate summary per node
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    vector_store=vector_store,
)

# Run pipeline (deduplicates via document hashing when a docstore is attached)
nodes = pipeline.run(documents=documents, show_progress=True)

# Incremental updates: only re-processes changed documents
pipeline.run(documents=updated_documents)

Anti-Patterns

  1. Not using Settings -- Global settings avoid repeating embed_model and llm everywhere. Configure once at startup.

  2. Ignoring response modes -- compact is not always best. Use tree_summarize for broad summarization, refine for detailed multi-source answers.

  3. Building indexes without persistence -- Always persist to disk or external vector store. Rebuilding indexes on every restart wastes time and money.

  4. Skipping metadata exclusions -- Metadata like file paths clutter LLM prompts and embeddings. Use excluded_llm_metadata_keys and excluded_embed_metadata_keys.

  5. Not using IngestionPipeline for updates -- Manual re-indexing is error-prone. The pipeline handles deduplication and incremental processing.

  6. Ignoring LlamaIndex observability -- Enable callbacks for debugging: from llama_index.core import set_global_handler; set_global_handler("simple").


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.


chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.


rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
