
chunking-strategies

Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.


Chunking Strategies

Split documents into retrieval-friendly chunks that preserve meaning and optimize search quality.


Why Chunking Matters

Chunking is the single highest-leverage decision in a RAG pipeline. Poor chunking leads to:

  • Irrelevant retrieval results (chunks too large or too small)
  • Lost context (splitting mid-thought)
  • Embedding quality degradation (noisy, incoherent text)
  • Wasted tokens in the generation context window

The goal: Each chunk should be a self-contained unit of meaning that is independently useful when retrieved.


Strategy 1: Fixed-Size Chunking

Split text into chunks of N characters/tokens with optional overlap.

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n"
)
chunks = splitter.split_text(document_text)

Pros: Simple, predictable chunk sizes, fast. Cons: Splits mid-sentence, ignores document structure. Use when: Quick prototyping, uniform plain text, latency-critical indexing.
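
The sizes above are counted in characters. For token-based sizing, the same splitter family can wrap a tiktoken encoder; a minimal sketch, assuming the tiktoken package is installed (cl100k_base matches OpenAI's newer embedding models):

from langchain_text_splitters import CharacterTextSplitter

token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,   # now counted in tokens, not characters
    chunk_overlap=50,
)
chunks = token_splitter.split_text(document_text)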


Strategy 2: Recursive Character Splitting

Tries a hierarchy of separators, falling back to smaller units only when needed.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=[
        "\n\n",   # Paragraph breaks first
        "\n",     # Then line breaks
        ". ",     # Then sentences
        ", ",     # Then clauses
        " ",      # Then words
        ""        # Then characters
    ]
)
chunks = splitter.split_text(document_text)

Pros: Respects natural text boundaries, good default for most text. Cons: Still character-count based, not truly semantic. Use when: General-purpose RAG over prose documents. This is the recommended default.


Strategy 3: Sentence-Based Chunking

Group sentences together up to a size limit.

# Using spaCy for sentence detection (assumes en_core_web_sm is downloaded)
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text, max_sentences=5, overlap_sentences=1):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    step = max_sentences - overlap_sentences  # window advances, re-using the overlap
    assert step >= 1, "overlap_sentences must be smaller than max_sentences"
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + max_sentences]))
    return chunks

# Alternative: NLTK (requires the punkt tokenizer data: nltk.download("punkt"))
import nltk
sentences = nltk.sent_tokenize(text)

Pros: Never splits mid-sentence, linguistically aware. Cons: Sentence detection can fail on technical text, variable chunk sizes. Use when: Well-formed prose (articles, documentation, legal text).


Strategy 4: Semantic Chunking

Split based on embedding similarity -- group sentences that are semantically related.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Breakpoint methods: percentile, standard_deviation, interquartile
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where consecutive-sentence distance exceeds the 95th percentile
)
chunks = splitter.split_text(document_text)

How it works (a minimal sketch follows the steps):

  1. Split text into sentences
  2. Embed each sentence
  3. Compare cosine similarity between consecutive sentences
  4. Split where similarity drops significantly
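
A sketch of those four steps, assuming an embed() helper (list of strings in, one vector per string out) and sentences already split upstream, e.g. with spaCy as in Strategy 3; the 95th-percentile threshold mirrors the SemanticChunker config above:

import numpy as np

def semantic_split(sentences, embed, percentile=95):
    if len(sentences) < 2:
        return [" ".join(sentences)]
    # Step 2: embed each sentence (embed is an assumed helper, not a library call)
    vecs = np.asarray(embed(sentences), dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # Step 3: cosine similarity between consecutive sentences; distance = 1 - similarity
    distances = 1 - (unit[:-1] * unit[1:]).sum(axis=1)
    # Step 4: split wherever the distance exceeds the chosen percentile
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks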

Pros: Chunks are semantically coherent, adapts to content. Cons: Expensive (requires embedding every sentence), variable chunk sizes, slower indexing. Use when: High-value corpora where retrieval quality is paramount and indexing cost is acceptable.


Strategy 5: Markdown-Aware Chunking

Respect markdown structure: headers, code blocks, lists.

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_text)

# Each chunk gets header metadata automatically
for chunk in chunks:
    print(chunk.metadata)  # {'h1': 'Main Title', 'h2': 'Section Name'}
    print(chunk.page_content[:100])

# Combine with size-based splitting for long sections
from langchain_text_splitters import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)
final_chunks = child_splitter.split_documents(chunks)

Pros: Preserves document hierarchy, automatic section metadata, keeps code blocks intact. Cons: Only works with markdown, headers may not reflect semantic boundaries. Use when: Technical documentation, wikis, README files, any markdown corpus.


Strategy 6: Code-Aware Chunking

Split source code by logical units (functions, classes, modules).

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-specific separators
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
chunks = python_splitter.split_text(python_code)

# Supported languages: PYTHON, JS, TS, GO, RUST, JAVA, CPP, etc.

# Custom approach: split by AST nodes
import ast

def chunk_python_by_functions(source_code, file_path=""):
    tree = ast.parse(source_code)
    chunks = []
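    # Note: ast.walk also visits nested definitions, so class methods appear
    # both inside their class chunk and again as standalone chunks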
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            end = node.end_lineno
            lines = source_code.split("\n")[start:end]
            chunk_text = "\n".join(lines)
            chunks.append({
                "content": chunk_text,
                "metadata": {
                    "type": type(node).__name__,
                    "name": node.name,
                    "file": file_path,
                    "line_start": start + 1,
                    "line_end": end,
                }
            })
    return chunks

Pros: Functions/classes stay intact, meaningful code units, rich metadata. Cons: Single large functions may exceed chunk size, language-specific logic needed. Use when: Code search, documentation generation, code Q&A systems.


Strategy 7: Parent-Child (Hierarchical) Chunking

Index small chunks for precise retrieval, but return larger parent chunks for context.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise embedding matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Larger chunks returned as context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

store = InMemoryStore()  # Use Redis/SQL in production
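# Assumes vectorstore is an already-initialized LangChain vector store (e.g. Chroma)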

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents -- indexes child chunks, stores parent mapping
retriever.add_documents(documents)

# Retrieval: matches on child chunks, returns parent chunks
results = retriever.invoke("authentication flow")
# Returns ~1000 char chunks even though matching was on ~200 char chunks

Pros: Best of both worlds -- precise matching with rich context. Cons: More complex infrastructure, two storage layers needed. Use when: Production RAG where both retrieval precision and generation context quality matter.


Chunk Size Optimization

General Guidelines

Document Type                       | Recommended Size     | Overlap
Short-form prose (FAQ, support)     | 200-400 chars        | 20-40
Long-form prose (articles, books)   | 500-800 chars        | 50-100
Technical documentation             | 400-600 chars        | 50-80
Source code                         | 800-1500 chars       | 100-200
Legal / regulatory                  | 300-500 chars        | 50-80
Tabular data                        | Row-level or section | 0

How to Benchmark Chunk Size

# Assumes `embeddings` is already defined (see Strategy 4) and that
# eval_questions is a list of (query, expected_source) pairs.
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def evaluate_chunk_sizes(documents, eval_questions, sizes=(256, 512, 768, 1024)):
    results = {}
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size, chunk_overlap=size // 8
        )
        chunks = splitter.split_documents(documents)

        # Build index
        db = Chroma.from_documents(chunks, embeddings)
        retriever = db.as_retriever(search_kwargs={"k": 5})

        # Evaluate retrieval quality
        hits = 0
        for q, expected_source in eval_questions:
            results_docs = retriever.invoke(q)
            sources = [d.metadata["source"] for d in results_docs]
            if expected_source in sources:
                hits += 1

        results[size] = hits / len(eval_questions)
        print(f"Chunk size {size}: recall@5 = {results[size]:.2%}")

    return results

Overlap Strategy

  • Rule of thumb: overlap = 10-15% of chunk size (helper sketch after this list)
  • Why overlap: Prevents losing context at chunk boundaries
  • Too much overlap: Redundant chunks waste storage and retrieval slots
  • Zero overlap: Acceptable for self-contained units (code functions, FAQ pairs)
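
The rule of thumb is easy to encode; a tiny helper (the 12.5% default just sits mid-band):

def pick_overlap(chunk_size, fraction=0.125):
    # 10-15% band; 12.5% = 1/8, matching the 512/64 pairs used above
    return max(1, int(chunk_size * fraction))

pick_overlap(512) returns 64 and pick_overlap(1000) returns 125, consistent with the table above.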

Anti-Patterns

  1. Using one strategy for mixed content -- A corpus with markdown, code, and prose needs different splitters per content type. Route documents to the appropriate splitter (see the routing sketch after this list).

  2. Chunks too small (< 100 chars) -- Embeddings of very short text are noisy and unstable. Small chunks also lack context for the LLM.

  3. Chunks too large (> 2000 chars) -- Large chunks dilute the embedding signal. Retrieval returns broadly relevant but imprecise results.

  4. Ignoring metadata -- Every chunk should carry source file, section header, page number, and content type. This enables filtering at retrieval time.

  5. Not preserving tables -- Splitting a table row by row destroys meaning. Detect tables and keep them as single chunks or use structured extraction.

  6. Chunking before cleaning -- Remove boilerplate (headers, footers, navigation) before chunking. Otherwise noise gets embedded.
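
For anti-pattern 1, a minimal per-type routing sketch; the content_type metadata key is an assumption (set it however your loaders tag documents):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

SPLITTERS = {
    "markdown": RecursiveCharacterTextSplitter.from_language(
        language=Language.MARKDOWN, chunk_size=512, chunk_overlap=64
    ),
    "code": RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
    ),
    "prose": RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64),
}

def split_document(doc):
    # Fall back to the prose splitter when the type is missing or unknown
    kind = doc.metadata.get("content_type", "prose")
    splitter = SPLITTERS.get(kind, SPLITTERS["prose"])
    return splitter.split_text(doc.page_content)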


Decision Flowchart

Is the content markdown/HTML?
  YES --> Markdown/HTML-aware splitter + size-based sub-splitting
  NO -->
    Is the content source code?
      YES --> Language-aware splitter (by function/class)
      NO -->
        Is retrieval precision critical?
          YES --> Semantic chunking or parent-child
          NO --> Recursive character splitting (default)

Always validate with an evaluation set. The "best" strategy is the one that maximizes retrieval recall on your actual queries.

