advanced-rag
Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
Move beyond naive RAG with patterns that handle complex queries, improve accuracy, and add structure. ## Key Points - Questions require synthesizing information across multiple documents - Queries are vague or need decomposition into sub-queries - The corpus has complex relationships between entities - Users need verifiable citations, not just answers - Retrieved context is sometimes irrelevant (low precision) - Questions require multi-step reasoning 1. **Over-engineering from the start** -- Start with basic RAG, measure where it fails, then selectively add advanced patterns. Each adds complexity and latency. 2. **Graph RAG for simple corpora** -- Knowledge graphs are costly to build and maintain. Only use when entity relationships are central to your queries. 3. **Unlimited multi-hop** -- Cap hop count at 2-3. More hops compound errors and increase latency exponentially. 4. **Query decomposition for simple questions** -- Decomposing "What is OAuth2?" into sub-questions wastes LLM calls. Classify query complexity first. 5. **No evaluation between iterations** -- Every advanced pattern should measurably improve your eval metrics. If it does not, remove it. 6. **Ignoring latency costs** -- Agentic RAG with multiple tool calls can take 5-15 seconds. Make sure your users accept that latency for the quality gain.
Advanced RAG
Move beyond naive RAG with patterns that handle complex queries, improve accuracy, and add structure.
When You Need Advanced RAG
Basic RAG (retrieve top-k, stuff into prompt) fails when:
- Questions require synthesizing information across multiple documents
- Queries are vague or need decomposition into sub-queries
- The corpus has complex relationships between entities
- Users need verifiable citations, not just answers
- Retrieved context is sometimes irrelevant (low precision)
- Questions require multi-step reasoning
Multi-Hop RAG
Answer questions that require chaining multiple retrieval steps.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def multi_hop_rag(question, retriever, max_hops=3):
    """Iteratively retrieve and reason to answer complex questions."""
    context_so_far = []
    current_query = question
    for hop in range(max_hops):
        # Retrieve for the current query
        docs = retriever.invoke(current_query)
        context_so_far.extend(docs)

        # Check whether the accumulated context answers the question
        combined_context = "\n\n".join(d.page_content for d in context_so_far)
        check_prompt = f"""Given this context, can you fully answer the question?
Context: {combined_context[:4000]}
Question: {question}
Reply ANSWERABLE or NEED_MORE_INFO with a follow-up query."""
        check = llm.invoke(check_prompt).content
        if "ANSWERABLE" in check:
            break
        # Extract the follow-up query for the next hop
        current_query = check.replace("NEED_MORE_INFO", "").strip()

    # Generate the final answer from all gathered context
    answer_prompt = f"""Answer this question using ONLY the provided context.
Context: {combined_context[:6000]}
Question: {question}
Cite sources for each claim."""
    return llm.invoke(answer_prompt).content

# Example: "What team does the CEO's brother work for?"
# Hop 1: Retrieves CEO info -> finds brother's name
# Hop 2: Retrieves info about the brother -> finds team
```
Agentic RAG
Give the LLM tools including retrieval, letting it decide when and how to search.
```python
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_core.tools import tool
from typing import TypedDict, Annotated
import operator

# Assumes `retriever`, `api_retriever`, and `changelog_retriever` are
# already-configured retrievers, one per corpus.

@tool
def search_docs(query: str) -> str:
    """Search the documentation knowledge base."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:5])

@tool
def search_api_reference(query: str) -> str:
    """Search the API reference for endpoint details."""
    docs = api_retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:5])

@tool
def search_changelog(query: str) -> str:
    """Search the changelog for recent changes and updates."""
    docs = changelog_retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs[:3])

tools = [search_docs, search_api_reference, search_changelog]

# LLM with tool binding
llm_with_tools = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]

def agent(state: AgentState):
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState):
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "end"

# Build the agent graph: the agent node loops through tools until it
# produces a response with no further tool calls
workflow = StateGraph(AgentState)
workflow.add_node("agent", agent)
workflow.add_node("tools", ToolNode(tools))
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
workflow.add_edge("tools", "agent")
app = workflow.compile()

# The agent decides which sources to search and when
result = app.invoke({
    "messages": [
        ("system", "You are a helpful assistant. Use the search tools to find information before answering."),
        ("human", "What changed in the auth API in the last release?"),
    ]
})
```
Graph RAG
Combine knowledge graphs with vector retrieval for relationship-aware answers.
```python
# Step 1: Extract entities and relationships from chunks
def extract_triplets(text, llm):
    """Extract (subject, predicate, object) triplets from text."""
    prompt = f"""Extract all entity relationships from this text as triplets.
Format each as: (subject, predicate, object)
Text: {text}
Triplets:"""
    response = llm.invoke(prompt).content
    triplets = []
    for line in response.strip().split("\n"):
        line = line.strip("()- ")
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets

# Step 2: Build the graph (assumes `chunks` and `llm` are defined)
import networkx as nx

graph = nx.DiGraph()
for chunk in chunks:
    triplets = extract_triplets(chunk.page_content, llm)
    for subj, pred, obj in triplets:
        graph.add_edge(subj, obj, relation=pred, source=chunk.metadata.get("source"))

# Step 3: Graph-enhanced retrieval
def graph_rag_retrieve(query, vectorstore, graph, k=5, hops=1):
    """Retrieve via vector search, then expand using graph relationships.
    (This sketch expands a single hop regardless of `hops`.)"""
    # Vector retrieval
    vector_results = vectorstore.similarity_search(query, k=k)

    # Extract entities from the retrieved chunks
    entities = set()
    for doc in vector_results:
        for subj, _, obj in extract_triplets(doc.page_content, llm):
            entities.add(subj)
            entities.add(obj)

    # Graph expansion: find entities one hop away in either direction
    expanded_entities = set()
    for entity in entities:
        if entity in graph:
            expanded_entities.update(graph.neighbors(entity))
            expanded_entities.update(graph.predecessors(entity))

    # Retrieve chunks mentioning the expanded entities
    expanded_results = []
    for entity in expanded_entities:
        expanded_results.extend(vectorstore.similarity_search(entity, k=2))

    # Deduplicate and combine
    seen = set()
    all_results = []
    for doc in vector_results + expanded_results:
        doc_id = doc.page_content[:100]
        if doc_id not in seen:
            seen.add(doc_id)
            all_results.append(doc)
    return all_results[:k * 2]

# Microsoft GraphRAG approach (community detection + summarization)
# pip install graphrag
# Uses community detection on the entity graph, summarizes communities,
# and uses those summaries for global queries
```
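The community-summarization idea behind GraphRAG can be illustrated without the `graphrag` package. Below is a minimal, dependency-free sketch: plain connected components stand in for the Leiden community detection that GraphRAG actually uses, and `llm` is assumed to be any callable mapping a prompt string to a summary string.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Group entities into clusters via BFS over an undirected view
    of the (subject, predicate, object) triplet edges."""
    adj = defaultdict(set)
    for subj, _, obj in edges:
        adj[subj].add(obj)
        adj[obj].add(subj)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = [], deque([node])
        seen.add(node)
        while queue:
            cur = queue.popleft()
            comp.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

def summarize_communities(edges, llm):
    """One summary per community. At query time, 'global' questions are
    answered from these summaries instead of raw chunks."""
    summaries = []
    for comp in connected_components(edges):
        facts = [f"{s} {p} {o}" for s, p, o in edges if s in comp and o in comp]
        summaries.append(llm("Summarize these facts:\n" + "\n".join(facts)))
    return summaries
```

Real community detection (Louvain or Leiden) splits large connected clusters into cohesive sub-communities, which keeps each summary focused; the BFS version above only captures the overall structure of the idea.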
Recursive Retrieval
Retrieve at different granularities, drilling down for detail.
```python
from llama_index.core import VectorStoreIndex
import numpy as np

# Level 1: document-level summaries (assumes `documents`, `llm`, and a
# text `splitter` are already defined)
doc_summaries = {}
for doc in documents:
    summary = llm.invoke(
        f"Summarize this document in 2-3 sentences:\n{doc.page_content[:3000]}"
    ).content
    doc_summaries[doc.metadata["source"]] = {"summary": summary, "doc": doc}

# Level 2: a chunk-level index per document (the chunks must be
# llama_index Document objects for VectorStoreIndex.from_documents)
chunk_indexes = {}
for source, info in doc_summaries.items():
    chunks = splitter.split_documents([info["doc"]])
    chunk_indexes[source] = VectorStoreIndex.from_documents(chunks)

# Recursive retrieval: first find relevant docs, then search within them
def recursive_retrieve(query, top_docs=3, chunks_per_doc=3):
    # Step 1: rank documents by summary similarity (`embed_texts` is an
    # assumed helper returning one embedding vector per input string)
    summary_texts = [(src, info["summary"]) for src, info in doc_summaries.items()]
    summary_embeddings = embed_texts([s[1] for s in summary_texts])
    query_embedding = embed_texts([query])[0]
    scores = [np.dot(query_embedding, se) for se in summary_embeddings]
    top_indices = np.argsort(scores)[-top_docs:][::-1]
    relevant_sources = [summary_texts[i][0] for i in top_indices]

    # Step 2: retrieve chunks only from the relevant documents
    all_chunks = []
    for source in relevant_sources:
        retriever = chunk_indexes[source].as_retriever(similarity_top_k=chunks_per_doc)
        all_chunks.extend(retriever.retrieve(query))
    return all_chunks
```
Self-Querying Retriever
Let the LLM generate metadata filters from natural language queries.
```python
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="source", description="The source document filename", type="string"),
    AttributeInfo(name="category", description="Document category: security, api, infrastructure, tutorial", type="string"),
    AttributeInfo(name="version", description="API version number", type="float"),
    AttributeInfo(name="updated_date", description="When the document was last updated (YYYY-MM-DD)", type="string"),
]

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    vectorstore=vectorstore,
    document_contents="Technical documentation for a SaaS platform",
    metadata_field_info=metadata_field_info,
)

# Natural language query -> automatic metadata filtering
results = self_query_retriever.invoke("What security features were added in version 3.0?")
# Internally generates a structured filter along the lines of
# category == "security" AND version == 3.0, then performs the vector
# search with that filter applied

results = self_query_retriever.invoke("Show me tutorials updated after 2024-06-01")
# Internally generates: category == "tutorial" AND updated_date > "2024-06-01"
```
Query Decomposition
Break complex queries into simpler sub-queries.
```python
def decompose_query(query, llm):
    """Decompose a complex query into sub-queries."""
    prompt = f"""Break this complex question into 2-4 simpler sub-questions
that can each be answered independently. Return one question per line.
Complex question: {query}
Sub-questions:"""
    response = llm.invoke(prompt).content
    return [
        q.strip().lstrip("0123456789.-) ")
        for q in response.strip().split("\n")
        if q.strip()
    ]

def decomposed_rag(query, retriever, llm):
    """Answer a complex query by decomposing, retrieving, and synthesizing."""
    sub_queries = decompose_query(query, llm)
    sub_answers = []
    for sub_q in sub_queries:
        docs = retriever.invoke(sub_q)
        context = "\n".join(d.page_content for d in docs[:3])
        answer = llm.invoke(
            f"Answer based on context:\nContext: {context}\nQuestion: {sub_q}"
        ).content
        sub_answers.append({"question": sub_q, "answer": answer})

    # Synthesize the sub-answers into one response
    synthesis_context = "\n\n".join(
        f"Q: {sa['question']}\nA: {sa['answer']}" for sa in sub_answers
    )
    return llm.invoke(
        f"""Synthesize these sub-answers into a comprehensive response to the original question.
Sub-answers:
{synthesis_context}
Original question: {query}
Comprehensive answer:"""
    ).content

# Example:
# "Compare OAuth2 and API keys for auth, and which is better for mobile apps?"
# Decomposed into:
# 1. "How does OAuth2 authentication work?"
# 2. "How does API key authentication work?"
# 3. "What are the pros and cons of each for mobile applications?"
```
Citation Extraction
Ground every claim in a specific source.
```python
def rag_with_citations(query, retriever, llm):
    """Generate an answer with inline citations."""
    docs = retriever.invoke(query)

    # Number the sources
    numbered_context = ""
    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", f"doc_{i}")
        numbered_context += f"\n\n[{i+1}] (Source: {source})\n{doc.page_content}"

    prompt = f"""Answer the question using ONLY the numbered sources below.
For each claim, cite the source number in brackets, e.g., [1].
If no source supports a claim, do not make that claim.
Sources:
{numbered_context}
Question: {query}
Answer with citations:"""
    answer = llm.invoke(prompt).content
    return {
        "answer": answer,
        "sources": [
            {"index": i + 1, "source": doc.metadata.get("source"), "content": doc.page_content[:200]}
            for i, doc in enumerate(docs)
        ],
    }
```
Corrective RAG (CRAG)
Evaluate retrieval quality and take corrective action when results are poor.
```python
def corrective_rag(query, retriever, llm):
    """Retrieve, evaluate, correct if needed, then generate."""
    docs = retriever.invoke(query)

    # Grade each document. Note: "IRRELEVANT" contains the substring
    # "RELEVANT", so check for the negative label, not the positive one.
    relevant_docs = []
    for doc in docs:
        grade = llm.invoke(
            f"Is this document relevant to the query '{query}'?\n"
            f"Document: {doc.page_content[:300]}\n"
            f"Answer RELEVANT or IRRELEVANT only."
        ).content.strip().upper()
        if "IRRELEVANT" not in grade:
            relevant_docs.append(doc)

    # Corrective actions based on how much relevant context survived
    if len(relevant_docs) >= 2:
        # Good retrieval - proceed normally
        context_docs = relevant_docs
    elif len(relevant_docs) == 1:
        # Partial - supplement with a query rewrite (or web search)
        rewritten = llm.invoke(
            f"Rewrite this query to find more relevant results: {query}"
        ).content
        extra_docs = retriever.invoke(rewritten)
        context_docs = relevant_docs + extra_docs[:3]
    else:
        # Poor retrieval - try a completely different approach:
        # Option 1: Web search fallback
        # Option 2: Query decomposition (shown here, using decompose_query
        # from the Query Decomposition section)
        sub_queries = decompose_query(query, llm)
        context_docs = []
        for sq in sub_queries:
            context_docs.extend(retriever.invoke(sq)[:2])

    # Generate with the corrected context
    context = "\n\n".join(d.page_content for d in context_docs)
    answer = llm.invoke(
        f"Answer based on context. If context is insufficient, say so.\n"
        f"Context: {context}\nQuestion: {query}"
    ).content
    return answer
```
Anti-Patterns
- **Over-engineering from the start** -- Start with basic RAG, measure where it fails, then selectively add advanced patterns. Each adds complexity and latency.
- **Graph RAG for simple corpora** -- Knowledge graphs are costly to build and maintain. Only use them when entity relationships are central to your queries.
- **Unlimited multi-hop** -- Cap hop count at 2-3. More hops compound errors and increase latency exponentially.
- **Query decomposition for simple questions** -- Decomposing "What is OAuth2?" into sub-questions wastes LLM calls. Classify query complexity first.
- **No evaluation between iterations** -- Every advanced pattern should measurably improve your eval metrics. If it does not, remove it.
- **Ignoring latency costs** -- Agentic RAG with multiple tool calls can take 5-15 seconds. Make sure your users accept that latency for the quality gain.
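Classifying query complexity before choosing a strategy can be as cheap as a heuristic gate. The sketch below is illustrative only: the keyword lists and thresholds are assumptions, and a production router would typically use a small LLM call or a trained classifier instead.

```python
def classify_query_complexity(query: str) -> str:
    """Cheap triage before choosing a RAG strategy. Routes simple
    lookups to basic top-k retrieval and reserves decomposition and
    multi-hop for queries that plausibly need them."""
    q = query.lower()
    multi_part = any(tok in q for tok in ("compare ", " and ", " versus ", " vs ", "difference between"))
    relational = any(tok in q for tok in ("'s ", " of the ", " related to "))
    if multi_part or len(q.split()) > 20:
        return "decompose"   # break into sub-queries, answer, synthesize
    if relational and len(q.split()) >= 6:
        return "multi_hop"   # likely needs chained retrieval
    return "basic"           # plain top-k retrieval is enough
```

A router like this would sit in front of `multi_hop_rag`, `decomposed_rag`, and the basic pipeline, so that "What is OAuth2?" never pays the cost of decomposition.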
Install this skill directly: skilldb add rag-pipeline-skills
Related Skills
chunking-strategies
Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
embedding-models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
rag-fundamentals
Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
rag-production
Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
rag-with-langchain
Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.