advanced-rag
Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
Move beyond naive RAG with patterns that handle complex queries, improve accuracy, and add structure. ## Key Points - Questions require synthesizing information across multiple documents - Queries are vague or need decomposition into sub-queries - The corpus has complex relationships between entities - Users need verifiable citations, not just answers - Retrieved context is sometimes irrelevant (low precision) - Questions require multi-step reasoning 1. **Over-engineering from the start** -- Start with basic RAG, measure where it fails, then selectively add advanced patterns. Each adds complexity and latency. 2. **Graph RAG for simple corpora** -- Knowledge graphs are costly to build and maintain. Only use when entity relationships are central to your queries. 3. **Unlimited multi-hop** -- Cap hop count at 2-3. More hops compound errors and increase latency exponentially. 4. **Query decomposition for simple questions** -- Decomposing "What is OAuth2?" into sub-questions wastes LLM calls. Classify query complexity first. 5. **No evaluation between iterations** -- Every advanced pattern should measurably improve your eval metrics. If it does not, remove it. 6. **Ignoring latency costs** -- Agentic RAG with multiple tool calls can take 5-15 seconds. Make sure your users accept that latency for the quality gain.
Advanced RAG
Move beyond naive RAG with patterns that handle complex queries, improve accuracy, and add structure.
When You Need Advanced RAG
Basic RAG (retrieve top-k, stuff into prompt) fails when:
- Questions require synthesizing information across multiple documents
- Queries are vague or need decomposition into sub-queries
- The corpus has complex relationships between entities
- Users need verifiable citations, not just answers
- Retrieved context is sometimes irrelevant (low precision)
- Questions require multi-step reasoning
Multi-Hop RAG
Answer questions that require chaining multiple retrieval steps.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def multi_hop_rag(question, retriever, max_hops=3):
    """Iteratively retrieve and reason to answer complex questions."""
    context_so_far = []
    current_query = question
    for hop in range(max_hops):
        # Retrieve for the current query
        docs = retriever.invoke(current_query)
        context_so_far.extend(docs)

        # Check whether the accumulated context answers the question
        combined_context = "\n\n".join(d.page_content for d in context_so_far)
        check_prompt = f"""Given this context, can you fully answer the question?
Context: {combined_context[:4000]}
Question: {question}
Reply ANSWERABLE or NEED_MORE_INFO with a follow-up query."""
        check = llm.invoke(check_prompt).content
        if "ANSWERABLE" in check:
            break
        # Extract the follow-up query for the next hop
        current_query = check.replace("NEED_MORE_INFO", "").strip()

    # Generate the final answer from all gathered context
    answer_prompt = f"""Answer this question using ONLY the provided context.
Context: {combined_context[:6000]}
Question: {question}
Cite sources for each claim."""
    return llm.invoke(answer_prompt).content

# Example: "What team does the CEO's brother work for?"
# Hop 1: Retrieves CEO info -> finds brother's name
# Hop 2: Retrieves info about the brother -> finds team
```
Agentic RAG
Give the LLM tools including retrieval, letting it decide when and how to search.
```python
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_core.tools import tool
from typing import TypedDict, Annotated
import operator

# Assumes `retriever`, `api_retriever`, and `changelog_retriever` are
# already-configured retrievers, one per corpus.

@tool
def search_docs(query: str) -> str:
    """Search the documentation knowledge base."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:5])

@tool
def search_api_reference(query: str) -> str:
    """Search the API reference for endpoint details."""
    docs = api_retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:5])

@tool
def search_changelog(query: str) -> str:
    """Search the changelog for recent changes and updates."""
    docs = changelog_retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:3])

tools = [search_docs, search_api_reference, search_changelog]

# LLM with tool binding
llm_with_tools = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]

def agent(state: AgentState):
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState):
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "end"

# Build the agent graph: the agent node loops through tools until it
# produces a response with no further tool calls
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent)
workflow.add_node("tools", ToolNode(tools))
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
workflow.add_edge("tools", "agent")
app = workflow.compile()

# The agent decides which sources to search and when
result = app.invoke({
    "messages": [
        ("system", "You are a helpful assistant. Use the search tools to find information before answering."),
        ("human", "What changed in the auth API in the last release?"),
    ]
})
```
Graph RAG
Combine knowledge graphs with vector retrieval for relationship-aware answers.
```python
# Step 1: Extract entities and relationships from chunks
def extract_triplets(text, llm):
    """Extract (subject, predicate, object) triplets from text."""
    prompt = f"""Extract all entity relationships from this text as triplets.
Format each as: (subject, predicate, object)
Text: {text}
Triplets:"""
    response = llm.invoke(prompt).content
    triplets = []
    for line in response.strip().split("\n"):
        line = line.strip("()- ")
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets

# Step 2: Build the graph (assumes `chunks` and `llm` are defined)
import networkx as nx

graph = nx.DiGraph()
for chunk in chunks:
    triplets = extract_triplets(chunk.page_content, llm)
    for subj, pred, obj in triplets:
        graph.add_edge(subj, obj, relation=pred, source=chunk.metadata.get("source"))

# Step 3: Graph-enhanced retrieval
def graph_rag_retrieve(query, vectorstore, graph, k=5, hops=1):
    """Retrieve via vector search, then expand using graph relationships.
    (This sketch expands a single hop regardless of `hops`.)"""
    # Vector retrieval
    vector_results = vectorstore.similarity_search(query, k=k)

    # Extract entities from the retrieved chunks
    entities = set()
    for doc in vector_results:
        for subj, _, obj in extract_triplets(doc.page_content, llm):
            entities.add(subj)
            entities.add(obj)

    # Graph expansion: find entities one hop away in either direction
    expanded_entities = set()
    for entity in entities:
        if entity in graph:
            expanded_entities.update(graph.neighbors(entity))
            expanded_entities.update(graph.predecessors(entity))

    # Retrieve chunks mentioning the expanded entities
    expanded_results = []
    for entity in expanded_entities:
        expanded_results.extend(vectorstore.similarity_search(entity, k=2))

    # Deduplicate and combine
    seen = set()
    all_results = []
    for doc in vector_results + expanded_results:
        doc_id = doc.page_content[:100]
        if doc_id not in seen:
            seen.add(doc_id)
            all_results.append(doc)
    return all_results[:k * 2]

# Microsoft GraphRAG approach (community detection + summarization)
# pip install graphrag
# Uses community detection on the entity graph, summarizes communities,
# and uses those summaries for global queries
```
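The community-summarization idea behind GraphRAG can be illustrated without the `graphrag` package. Below is a minimal, dependency-free sketch: plain connected components stand in for the Leiden community detection that GraphRAG actually uses, and `llm` is assumed to be any callable mapping a prompt string to a summary string.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Group entities into clusters via BFS over an undirected view
    of the (subject, predicate, object) triplet edges."""
    adj = defaultdict(set)
    for subj, _, obj in edges:
        adj[subj].add(obj)
        adj[obj].add(subj)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = [], deque([node])
        seen.add(node)
        while queue:
            cur = queue.popleft()
            comp.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

def summarize_communities(edges, llm):
    """One summary per community. At query time, 'global' questions are
    answered from these summaries instead of raw chunks."""
    summaries = []
    for comp in connected_components(edges):
        facts = [f"{s} {p} {o}" for s, p, o in edges if s in comp and o in comp]
        summaries.append(llm("Summarize these facts:\n" + "\n".join(facts)))
    return summaries
```

Real community detection (Louvain or Leiden) splits large connected clusters into cohesive sub-communities, which keeps each summary focused; the BFS version above only captures the overall structure of the idea.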
Recursive Retrieval
Retrieve at different granularities, drilling down for detail.
```python
from llama_index.core import VectorStoreIndex
import numpy as np

# Level 1: document-level summaries (assumes `documents`, `llm`, and a
# text `splitter` are already defined)
doc_summaries = {}
for doc in documents:
    summary = llm.invoke(
        f"Summarize this document in 2-3 sentences:\n{doc.page_content[:3000]}"
    ).content
    doc_summaries[doc.metadata["source"]] = {"summary": summary, "doc": doc}

# Level 2: a chunk-level index per document (the chunks must be
# llama_index Document objects for VectorStoreIndex.from_documents)
chunk_indexes = {}
for source, info in doc_summaries.items():
    chunks = splitter.split_documents([info["doc"]])
    chunk_indexes[source] = VectorStoreIndex.from_documents(chunks)

# Recursive retrieval: first find relevant docs, then search within them
def recursive_retrieve(query, top_docs=3, chunks_per_doc=3):
    # Step 1: rank documents by summary similarity (`embed_texts` is an
    # assumed helper returning one embedding vector per input string)
    summary_texts = [(src, info["summary"]) for src, info in doc_summaries.items()]
    summary_embeddings = embed_texts([s[1] for s in summary_texts])
    query_embedding = embed_texts([query])[0]
    scores = [np.dot(query_embedding, se) for se in summary_embeddings]
    top_indices = np.argsort(scores)[-top_docs:][::-1]
    relevant_sources = [summary_texts[i][0] for i in top_indices]

    # Step 2: retrieve chunks only from the relevant documents
    all_chunks = []
    for source in relevant_sources:
        retriever = chunk_indexes[source].as_retriever(similarity_top_k=chunks_per_doc)
        all_chunks.extend(retriever.retrieve(query))
    return all_chunks
```
Self-Querying Retriever
Let the LLM generate metadata filters from natural language queries.
```python
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="source", description="The source document filename", type="string"),
    AttributeInfo(name="category", description="Document category: security, api, infrastructure, tutorial", type="string"),
    AttributeInfo(name="version", description="API version number", type="float"),
    AttributeInfo(name="updated_date", description="When the document was last updated (YYYY-MM-DD)", type="string"),
]

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    vectorstore=vectorstore,
    document_contents="Technical documentation for a SaaS platform",
    metadata_field_info=metadata_field_info,
)

# Natural language query -> automatic metadata filtering
results = self_query_retriever.invoke("What security features were added in version 3.0?")
# Internally generates a structured filter along the lines of
# category == "security" AND version == 3.0, then performs the vector
# search with that filter applied

results = self_query_retriever.invoke("Show me tutorials updated after 2024-06-01")
# Internally generates: category == "tutorial" AND updated_date > "2024-06-01"
```
Query Decomposition
Break complex queries into simpler sub-queries.
```python
def decompose_query(query, llm):
    """Decompose a complex query into sub-queries."""
    prompt = f"""Break this complex question into 2-4 simpler sub-questions
that can each be answered independently. Return one question per line.
Complex question: {query}
Sub-questions:"""
    response = llm.invoke(prompt).content
    return [
        q.strip().lstrip("0123456789.-) ")
        for q in response.strip().split("\n")
        if q.strip()
    ]

def decomposed_rag(query, retriever, llm):
    """Answer a complex query by decomposing, retrieving, and synthesizing."""
    sub_queries = decompose_query(query, llm)
    sub_answers = []
    for sub_q in sub_queries:
        docs = retriever.invoke(sub_q)
        context = "\n".join(d.page_content for d in docs[:3])
        answer = llm.invoke(
            f"Answer based on context:\nContext: {context}\nQuestion: {sub_q}"
        ).content
        sub_answers.append({"question": sub_q, "answer": answer})

    # Synthesize the sub-answers into one response
    synthesis_context = "\n\n".join(
        f"Q: {sa['question']}\nA: {sa['answer']}" for sa in sub_answers
    )
    return llm.invoke(
        f"""Synthesize these sub-answers into a comprehensive response to the original question.
Sub-answers:
{synthesis_context}
Original question: {query}
Comprehensive answer:"""
    ).content

# Example:
# "Compare OAuth2 and API keys for auth, and which is better for mobile apps?"
# Decomposed into:
# 1. "How does OAuth2 authentication work?"
# 2. "How does API key authentication work?"
# 3. "What are the pros and cons of each for mobile applications?"
```
Citation Extraction
Ground every claim in a specific source.
```python
def rag_with_citations(query, retriever, llm):
    """Generate an answer with inline citations."""
    docs = retriever.invoke(query)

    # Number the sources
    numbered_context = ""
    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", f"doc_{i}")
        numbered_context += f"\n\n[{i+1}] (Source: {source})\n{doc.page_content}"

    prompt = f"""Answer the question using ONLY the numbered sources below.
For each claim, cite the source number in brackets, e.g., [1].
If no source supports a claim, do not make that claim.
Sources:
{numbered_context}
Question: {query}
Answer with citations:"""
    answer = llm.invoke(prompt).content
    return {
        "answer": answer,
        "sources": [
            {"index": i + 1, "source": doc.metadata.get("source"), "content": doc.page_content[:200]}
            for i, doc in enumerate(docs)
        ],
    }
```
Corrective RAG (CRAG)
Evaluate retrieval quality and take corrective action when results are poor.
```python
def corrective_rag(query, retriever, llm):
    """Retrieve, evaluate, correct if needed, then generate."""
    docs = retriever.invoke(query)

    # Grade each document. Note: "IRRELEVANT" contains the substring
    # "RELEVANT", so check for the negative label, not the positive one.
    relevant_docs = []
    for doc in docs:
        grade = llm.invoke(
            f"Is this document relevant to the query '{query}'?\n"
            f"Document: {doc.page_content[:300]}\n"
            f"Answer RELEVANT or IRRELEVANT only."
        ).content.strip().upper()
        if "IRRELEVANT" not in grade:
            relevant_docs.append(doc)

    # Corrective actions based on how much relevant context survived
    if len(relevant_docs) >= 2:
        # Good retrieval - proceed normally
        context_docs = relevant_docs
    elif len(relevant_docs) == 1:
        # Partial - supplement with a query rewrite (or web search)
        rewritten = llm.invoke(
            f"Rewrite this query to find more relevant results: {query}"
        ).content
        extra_docs = retriever.invoke(rewritten)
        context_docs = relevant_docs + extra_docs[:3]
    else:
        # Poor retrieval - try a completely different approach:
        # Option 1: Web search fallback
        # Option 2: Query decomposition (shown here, using decompose_query
        # from the Query Decomposition section)
        sub_queries = decompose_query(query, llm)
        context_docs = []
        for sq in sub_queries:
            context_docs.extend(retriever.invoke(sq)[:2])

    # Generate with the corrected context
    context = "\n\n".join(d.page_content for d in context_docs)
    answer = llm.invoke(
        f"Answer based on context. If context is insufficient, say so.\n"
        f"Context: {context}\nQuestion: {query}"
    ).content
    return answer
```
Anti-Patterns
- **Over-engineering from the start** -- Start with basic RAG, measure where it fails, then selectively add advanced patterns. Each adds complexity and latency.
- **Graph RAG for simple corpora** -- Knowledge graphs are costly to build and maintain. Only use them when entity relationships are central to your queries.
- **Unlimited multi-hop** -- Cap hop count at 2-3. More hops compound errors and increase latency exponentially.
- **Query decomposition for simple questions** -- Decomposing "What is OAuth2?" into sub-questions wastes LLM calls. Classify query complexity first.
- **No evaluation between iterations** -- Every advanced pattern should measurably improve your eval metrics. If it does not, remove it.
- **Ignoring latency costs** -- Agentic RAG with multiple tool calls can take 5-15 seconds. Make sure your users accept that latency for the quality gain.
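Classifying query complexity before choosing a strategy can be as cheap as a heuristic gate. The sketch below is illustrative only: the keyword lists and thresholds are assumptions, and a production router would typically use a small LLM call or a trained classifier instead.

```python
def classify_query_complexity(query: str) -> str:
    """Cheap triage before choosing a RAG strategy. Routes simple
    lookups to basic top-k retrieval and reserves decomposition and
    multi-hop for queries that plausibly need them."""
    q = query.lower()
    multi_part = any(tok in q for tok in ("compare ", " and ", " versus ", " vs ", "difference between"))
    relational = any(tok in q for tok in ("'s ", " of the ", " related to "))
    if multi_part or len(q.split()) > 20:
        return "decompose"   # break into sub-queries, answer, synthesize
    if relational and len(q.split()) >= 6:
        return "multi_hop"   # likely needs chained retrieval
    return "basic"           # plain top-k retrieval is enough
```

A router like this would sit in front of `multi_hop_rag`, `decomposed_rag`, and the basic pipeline, so that "What is OAuth2?" never pays the cost of decomposition.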
Install this skill directly: skilldb add rag-pipeline-skills
Related Skills
chunking-strategies
Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
embedding-models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
rag-fundamentals
Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
rag-production
Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
rag-with-langchain
Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.