
rag-with-llamaindex

Building RAG systems with LlamaIndex (formerly GPT Index). Covers data connectors, node parsers, index types (vector, keyword, knowledge graph, summary), query engines, response synthesizers, and advanced patterns like sub-question queries and recursive retrieval. Practical code for production LlamaIndex RAG pipelines.

## Quick Summary
Build structured RAG applications using LlamaIndex's data framework for LLM apps.

## Key Points

- **Documents** -- raw source data
- **Nodes** -- chunked pieces of documents with metadata
- **Indexes** -- data structures for retrieval (vector, keyword, graph)
- **Query Engines** -- end-to-end query interface
- **Response Synthesizers** -- how to combine retrieved nodes into a response

## Anti-Patterns

1. **Not using Settings** -- Global settings avoid repeating `embed_model` and `llm` everywhere. Configure once at startup.
2. **Ignoring response modes** -- `compact` is not always best. Use `tree_summarize` for broad summarization, `refine` for detailed multi-source answers.
3. **Building indexes without persistence** -- Always persist to disk or external vector store. Rebuilding indexes on every restart wastes time and money.
4. **Skipping metadata exclusions** -- Metadata like file paths clutter LLM prompts and embeddings. Use `excluded_llm_metadata_keys` and `excluded_embed_metadata_keys`.
5. **Not using IngestionPipeline for updates** -- Manual re-indexing is error-prone. The pipeline handles deduplication and incremental processing.
6. **Ignoring LlamaIndex observability** -- Enable callbacks for debugging: `from llama_index.core import set_global_handler; set_global_handler("simple")`.

## Quick Example

```bash
pip install llama-index
pip install llama-index-embeddings-openai llama-index-llms-openai
pip install llama-index-vector-stores-chroma  # Or qdrant, pinecone, etc.
pip install llama-index-readers-file  # PDF, DOCX, etc.
```

RAG with LlamaIndex

Build structured RAG applications using LlamaIndex's data framework for LLM apps.


Installation

pip install llama-index
pip install llama-index-embeddings-openai llama-index-llms-openai
pip install llama-index-vector-stores-chroma  # Or qdrant, pinecone, etc.
pip install llama-index-readers-file  # PDF, DOCX, etc.

Core Concepts

LlamaIndex organizes RAG into:

  • Documents -- raw source data
  • Nodes -- chunked pieces of documents with metadata
  • Indexes -- data structures for retrieval (vector, keyword, graph)
  • Query Engines -- end-to-end query interface
  • Response Synthesizers -- how to combine retrieved nodes into a response

Data Connectors (Readers)

from llama_index.core import SimpleDirectoryReader

# Load from directory (auto-detects file types)
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True,
    required_exts=[".md", ".txt", ".pdf"],
    filename_as_id=True,
).load_data()

# Load specific files
documents = SimpleDirectoryReader(
    input_files=["./docs/auth.md", "./docs/api.pdf"]
).load_data()

# From LlamaHub (community connectors)
# pip install llama-index-readers-notion
from llama_index.readers.notion import NotionPageReader

reader = NotionPageReader(integration_token="secret_xxx")
documents = reader.load_data(page_ids=["page-id-1", "page-id-2"])

# Web pages
# pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader().load_data(
    urls=["https://docs.example.com/auth"]
)

# Database
# pip install llama-index-readers-database
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="postgresql://user:pass@host/db")
documents = reader.load_data(query="SELECT id, content, title FROM articles")

Node Parsers (Chunking)

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
    HierarchicalNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Sentence-based splitting (recommended default)
parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64,
    paragraph_separator="\n\n",
)
nodes = parser.get_nodes_from_documents(documents)

# Semantic splitting (groups by embedding similarity)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
parser = SemanticSplitterNodeParser(
    embed_model=embed_model,
    breakpoint_percentile_threshold=95,
    buffer_size=1,  # Sentences to group together
)
nodes = parser.get_nodes_from_documents(documents)

# Markdown-aware
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Code-aware
parser = CodeSplitter(
    language="python",
    max_chars=1500,
    chunk_lines=40,
)
nodes = parser.get_nodes_from_documents(documents)

# Hierarchical (parent-child)
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[1024, 512, 256]  # Large -> medium -> small
)
nodes = parser.get_nodes_from_documents(documents)

Index Types

Vector Store Index (Most Common)

from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure global settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o", temperature=0)

# Build index from documents
index = VectorStoreIndex.from_documents(documents)

# Build from pre-parsed nodes
index = VectorStoreIndex(nodes)

# With external vector store
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

# Reload later straight from the vector store (no re-indexing)
index = VectorStoreIndex.from_vector_store(vector_store)

# Persist and reload
from llama_index.core import StorageContext, load_index_from_storage

# Save
index.storage_context.persist(persist_dir="./storage")

# Load
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

Keyword Table Index

from llama_index.core import KeywordTableIndex

# Keyword-based retrieval (keywords extracted per node; no embeddings)
keyword_index = KeywordTableIndex.from_documents(documents)
query_engine = keyword_index.as_query_engine()
response = query_engine.query("JWT token expiration")

Summary Index

from llama_index.core import SummaryIndex

# Iterates over all nodes at query time (good for "tell me about everything" queries)
summary_index = SummaryIndex.from_documents(documents)
query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)
response = query_engine.query("Give me an overview of the authentication system")

Knowledge Graph Index

from llama_index.core import KnowledgeGraphIndex

# Extract entities and relationships
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
)

query_engine = kg_index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
)
response = query_engine.query("How are users and roles related?")

Query Engines

# Basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",  # compact, refine, tree_summarize, simple_summarize, no_text, accumulate
)

response = query_engine.query("How does authentication work?")
print(response.response)  # The answer
print(response.source_nodes)  # Retrieved nodes with scores

# Streaming
query_engine = index.as_query_engine(streaming=True, similarity_top_k=5)
streaming_response = query_engine.query("How does authentication work?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)

Response Modes

# compact: Stuff as many nodes as possible into one LLM call
query_engine = index.as_query_engine(response_mode="compact")

# refine: Process nodes one by one, refining the answer iteratively
query_engine = index.as_query_engine(response_mode="refine")

# tree_summarize: Build a tree of summaries bottom-up
query_engine = index.as_query_engine(response_mode="tree_summarize")

# no_text: Return nodes only, no LLM generation
query_engine = index.as_query_engine(response_mode="no_text")

# accumulate: Generate a response per node, then concatenate
query_engine = index.as_query_engine(response_mode="accumulate")
| Mode             | LLM Calls        | Best For                          |
|------------------|------------------|-----------------------------------|
| `compact`        | 1-2              | Default, most queries             |
| `refine`         | N (one per node) | Detailed answers, many sources    |
| `tree_summarize` | ~log(N)          | Summarization over many docs      |
| `no_text`        | 0                | Retrieval-only, custom generation |

Retrievers

from llama_index.core.retrievers import VectorIndexRetriever

# Basic vector retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
nodes = retriever.retrieve("How does authentication work?")

# Auto-merging retriever (for hierarchical nodes)
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context=index.storage_context,
    simple_ratio_thresh=0.4,  # Merge if 40%+ children retrieved
)

# BM25 retriever
# pip install llama-index-retrievers-bm25
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=10,
)

# Fusion retriever (hybrid)
from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    retrievers=[
        index.as_retriever(similarity_top_k=10),
        bm25_retriever,
    ],
    num_queries=3,       # Generate query variations
    similarity_top_k=5,  # Final top-k after fusion
    use_async=True,
)

Sub-Question Query Engine

Decomposes complex questions into sub-questions, queries each, and synthesizes.

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools from different indexes
auth_tool = QueryEngineTool(
    query_engine=auth_index.as_query_engine(),
    metadata=ToolMetadata(
        name="auth_docs",
        description="Documentation about authentication and authorization"
    ),
)

api_tool = QueryEngineTool(
    query_engine=api_index.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference documentation"
    ),
)

# Sub-question engine
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[auth_tool, api_tool],
)

# Complex query gets decomposed
response = sub_question_engine.query(
    "Compare the authentication methods used in the REST API vs GraphQL API"
)
# Internally generates:
# 1. "What authentication methods does the REST API use?" -> auth_docs
# 2. "What authentication methods does the GraphQL API use?" -> api_docs
# Then synthesizes both answers

Chat Engine (Conversational)

from llama_index.core.chat_engine import CondensePlusContextChatEngine  # class behind chat_mode="condense_plus_context"

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # Best for RAG conversations
    similarity_top_k=5,
    system_prompt="You are a helpful assistant that answers questions about our documentation.",
)

# First question
response = chat_engine.chat("What auth methods do you support?")
print(response.response)

# Follow-up (automatically contextualizes)
response = chat_engine.chat("How do I configure the first one?")
print(response.response)

# Reset conversation
chat_engine.reset()

# Chat modes:
# "condense_plus_context" - Rewrites question with history, then retrieves (recommended)
# "context" - Retrieves for each message, includes history in prompt
# "condense_question" - Condenses history into standalone question
# "simple" - No retrieval, just chat with history

Metadata Filtering

from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition,
)

# Add metadata during document creation
from llama_index.core import Document

doc = Document(
    text="OAuth2 authorization code flow...",
    metadata={
        "source": "auth.md",
        "category": "security",
        "version": "2.0",
        "updated": "2024-06-01",
    },
    excluded_llm_metadata_keys=["source"],  # Don't send to LLM
    excluded_embed_metadata_keys=["updated"],  # Don't embed this field
)

# Query with filters
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="category", operator=FilterOperator.EQ, value="security"),
        MetadataFilter(key="version", operator=FilterOperator.GTE, value="2.0"),
    ],
    condition=FilterCondition.AND,
)

retriever = index.as_retriever(
    similarity_top_k=5,
    filters=filters,
)

Ingestion Pipeline

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.extractors import TitleExtractor, SummaryExtractor

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(nodes=3),       # Extract title from first N nodes
        SummaryExtractor(summaries=["self"]),  # Generate summary per node
        OpenAIEmbedding(model="text-embedding-3-small"),
    ],
    vector_store=vector_store,
)

# Run pipeline (deduplicates via document hashing when a docstore is attached)
nodes = pipeline.run(documents=documents, show_progress=True)

# Incremental updates: only re-processes changed documents
pipeline.run(documents=updated_documents)

Anti-Patterns

  1. Not using Settings -- Global settings avoid repeating embed_model and llm everywhere. Configure once at startup.

  2. Ignoring response modes -- compact is not always best. Use tree_summarize for broad summarization, refine for detailed multi-source answers.

  3. Building indexes without persistence -- Always persist to disk or external vector store. Rebuilding indexes on every restart wastes time and money.

  4. Skipping metadata exclusions -- Metadata like file paths clutter LLM prompts and embeddings. Use excluded_llm_metadata_keys and excluded_embed_metadata_keys.

  5. Not using IngestionPipeline for updates -- Manual re-indexing is error-prone. The pipeline handles deduplication and incremental processing.

  6. Ignoring LlamaIndex observability -- Enable callbacks for debugging: from llama_index.core import set_global_handler; set_global_handler("simple").


Related Skills

advanced-rag

Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.


chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


embedding-models

Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.


rag-evaluation

Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.


rag-fundamentals

Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.


rag-production

Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
