Chunking Strategies
Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
Split documents into retrieval-friendly chunks that preserve meaning and optimize search quality.
Why Chunking Matters
Chunking is the single highest-leverage decision in a RAG pipeline. Poor chunking leads to:
- Irrelevant retrieval results (chunks too large or too small)
- Lost context (splitting mid-thought)
- Embedding quality degradation (noisy, incoherent text)
- Wasted tokens in the generation context window
The goal: Each chunk should be a self-contained unit of meaning that is independently useful when retrieved.
Strategy 1: Fixed-Size Chunking
Split text into chunks of N characters/tokens with optional overlap.
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n"
)
chunks = splitter.split_text(document_text)
Pros: Simple, predictable chunk sizes, fast. Cons: Splits mid-sentence, ignores document structure. Use when: Quick prototyping, uniform plain text, latency-critical indexing.
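The same strategy is a few lines without a library dependency, as a minimal sketch (character-based, with overlap):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1200 chars with step 450 -> chunks start at 0, 450, 900
chunks = fixed_size_chunks("a" * 1200, chunk_size=500, overlap=50)
```

The last chunk may be shorter than `chunk_size`; production code often merges a tiny trailing chunk into its predecessor.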
Strategy 2: Recursive Character Splitting
Tries a hierarchy of separators, falling back to smaller units only when needed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=[
        "\n\n",  # Paragraph breaks first
        "\n",    # Then line breaks
        ". ",    # Then sentences
        ", ",    # Then clauses
        " ",     # Then words
        "",      # Then characters
    ]
)
chunks = splitter.split_text(document_text)
Pros: Respects natural text boundaries, good default for most text. Cons: Still character-count based, not truly semantic. Use when: General-purpose RAG over prose documents. This is the recommended default.
Strategy 3: Sentence-Based Chunking
Group sentences together up to a size limit.
# Using spaCy for sentence detection
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text, max_sentences=5, overlap_sentences=1):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    chunks = []
    step = max(1, max_sentences - overlap_sentences)  # guard against a zero step
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + max_sentences]))
    return chunks

# Alternative: NLTK
import nltk
sentences = nltk.sent_tokenize(text)
Pros: Never splits mid-sentence, linguistically aware. Cons: Sentence detection can fail on technical text, variable chunk sizes. Use when: Well-formed prose (articles, documentation, legal text).
Strategy 4: Semantic Chunking
Split based on embedding similarity -- group sentences that are semantically related.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Breakpoint methods: percentile, standard_deviation, interquartile
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # Split where neighbor distance exceeds the 95th percentile
)
chunks = splitter.split_text(document_text)
How it works:
1. Split text into sentences
2. Embed each sentence
3. Compare cosine similarity between consecutive sentences
4. Split where similarity drops significantly
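The steps above can be sketched without LangChain. This assumes you supply an `embed(sentences)` function returning one vector per sentence (any sentence-embedding model works); everything else is plain NumPy:

```python
import numpy as np

def semantic_split(sentences, embed, percentile=95):
    """Split a sentence list where similarity between neighbors drops sharply.

    `embed` is assumed to map a list of sentences to a list of vectors.
    """
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # normalize for cosine
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)             # cosine sim of neighbors
    distances = 1 - sims
    threshold = np.percentile(distances, percentile)      # split at the biggest drops
    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

This mirrors the percentile breakpoint method: only the largest inter-sentence distance jumps become chunk boundaries.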
Pros: Chunks are semantically coherent, adapts to content. Cons: Expensive (requires embedding every sentence), variable chunk sizes, slower indexing. Use when: High-value corpora where retrieval quality is paramount and indexing cost is acceptable.
Strategy 5: Markdown-Aware Chunking
Respect markdown structure: headers, code blocks, lists.
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_text)

# Each chunk gets header metadata automatically
for chunk in chunks:
    print(chunk.metadata)  # {'h1': 'Main Title', 'h2': 'Section Name'}
    print(chunk.page_content[:100])

# Combine with size-based splitting for long sections
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)
final_chunks = child_splitter.split_documents(chunks)
Pros: Preserves document hierarchy, automatic section metadata, keeps code blocks intact. Cons: Only works with markdown, headers may not reflect semantic boundaries. Use when: Technical documentation, wikis, README files, any markdown corpus.
Strategy 6: Code-Aware Chunking
Split source code by logical units (functions, classes, modules).
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Language-specific separators
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
chunks = python_splitter.split_text(python_code)
# Supported languages: PYTHON, JS, TS, GO, RUST, JAVA, CPP, etc.

# Custom approach: split by AST nodes
import ast

def chunk_python_by_functions(source_code, file_path=""):
    tree = ast.parse(source_code)
    lines = source_code.split("\n")
    chunks = []
    # Note: ast.walk also visits nested functions and methods, so class
    # chunks overlap with their method chunks; filter if that matters.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            end = node.end_lineno  # requires Python 3.8+
            chunks.append({
                "content": "\n".join(lines[start:end]),
                "metadata": {
                    "type": type(node).__name__,
                    "name": node.name,
                    "file": file_path,
                    "line_start": start + 1,
                    "line_end": end,
                },
            })
    return chunks
Pros: Functions/classes stay intact, meaningful code units, rich metadata. Cons: Single large functions may exceed chunk size, language-specific logic needed. Use when: Code search, documentation generation, code Q&A systems.
Strategy 7: Parent-Child (Hierarchical) Chunking
Index small chunks for precise retrieval, but return larger parent chunks for context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise embedding matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
# Larger chunks returned as context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

store = InMemoryStore()  # Use Redis/SQL in production
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any initialized vector store (e.g. Chroma)
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents -- indexes child chunks, stores parent mapping
retriever.add_documents(documents)

# Retrieval: matches on child chunks, returns parent chunks
results = retriever.invoke("authentication flow")
# Returns ~1000 char chunks even though matching was on ~200 char chunks
Pros: Best of both worlds -- precise matching with rich context. Cons: More complex infrastructure, two storage layers needed. Use when: Production RAG where both retrieval precision and generation context quality matter.
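The core mechanic can be sketched without the retriever class. Names here are illustrative, not a LangChain API, and keyword overlap stands in for embedding similarity:

```python
def build_parent_child_index(document, parent_size=1000, child_size=200):
    """Index small child chunks, each pointing back to its parent chunk."""
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    children = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({"text": parent[j:j + child_size], "parent_id": pid})
    return parents, children

def retrieve_parent(query, parents, children):
    """Match on child chunks (keyword overlap), return the parent chunk."""
    q = set(query.lower().split())
    best = max(children, key=lambda c: len(q & set(c["text"].lower().split())))
    return parents[best["parent_id"]]
```

The two storage layers are just the `parents` list and the `parent_id` field; in production the parents live in a docstore and the children in the vector index.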
Chunk Size Optimization
General Guidelines
| Document Type | Recommended Size | Overlap |
|---|---|---|
| Short-form prose (FAQ, support) | 200-400 chars | 20-40 |
| Long-form prose (articles, books) | 500-800 chars | 50-100 |
| Technical documentation | 400-600 chars | 50-80 |
| Source code | 800-1500 chars | 100-200 |
| Legal / regulatory | 300-500 chars | 50-80 |
| Tabular data | Row-level or section | 0 |
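The table above can be encoded as a simple lookup, with overlap following the 10-15% rule of thumb. The category keys are illustrative, not a standard taxonomy:

```python
# Midpoints of the recommended ranges from the table above
CHUNK_CONFIG = {
    "short_prose": {"chunk_size": 300, "chunk_overlap": 30},
    "long_prose": {"chunk_size": 650, "chunk_overlap": 80},
    "tech_docs": {"chunk_size": 500, "chunk_overlap": 64},
    "code": {"chunk_size": 1200, "chunk_overlap": 150},
    "legal": {"chunk_size": 400, "chunk_overlap": 64},
}

def splitter_config(doc_type):
    # Fall back to the general-purpose default for unknown types
    return CHUNK_CONFIG.get(doc_type, {"chunk_size": 512, "chunk_overlap": 64})
```

The returned dict can be passed straight to a splitter, e.g. `RecursiveCharacterTextSplitter(**splitter_config("tech_docs"))`.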
How to Benchmark Chunk Size
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

def evaluate_chunk_sizes(documents, eval_questions, sizes=(256, 512, 768, 1024)):
    results = {}
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size, chunk_overlap=size // 8
        )
        chunks = splitter.split_documents(documents)
        # Build index (assumes an `embeddings` model is already defined)
        db = Chroma.from_documents(chunks, embeddings)
        retriever = db.as_retriever(search_kwargs={"k": 5})
        # Evaluate retrieval quality
        hits = 0
        for q, expected_source in eval_questions:
            result_docs = retriever.invoke(q)
            sources = [d.metadata["source"] for d in result_docs]
            if expected_source in sources:
                hits += 1
        results[size] = hits / len(eval_questions)
        print(f"Chunk size {size}: recall@5 = {results[size]:.2%}")
    return results
Overlap Strategy
- Rule of thumb: overlap = 10-15% of chunk size
- Why overlap: Prevents losing context at chunk boundaries
- Too much overlap: Redundant chunks waste storage and retrieval slots
- Zero overlap: Acceptable for self-contained units (code functions, FAQ pairs)
Anti-Patterns
- Using one strategy for mixed content -- A corpus with markdown, code, and prose needs different splitters per content type. Route documents to the appropriate splitter.
- Chunks too small (< 100 chars) -- Embeddings of very short text are noisy and unstable. Small chunks also lack context for the LLM.
- Chunks too large (> 2000 chars) -- Large chunks dilute the embedding signal. Retrieval returns broadly relevant but imprecise results.
- Ignoring metadata -- Every chunk should carry source file, section header, page number, and content type. This enables filtering at retrieval time.
- Not preserving tables -- Splitting a table row by row destroys meaning. Detect tables and keep them as single chunks or use structured extraction.
- Chunking before cleaning -- Remove boilerplate (headers, footers, navigation) before chunking. Otherwise noise gets embedded.
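The first anti-pattern implies a router in front of the splitters. A minimal sketch keyed on file extension (the extension map is illustrative, not exhaustive):

```python
def route_splitter(path):
    """Pick a chunking strategy name based on file extension."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext in ("md", "markdown"):
        return "markdown_header"       # markdown-aware splitting (Strategy 5)
    if ext in ("py", "js", "ts", "go", "rs", "java"):
        return "code_aware"            # language-aware splitting (Strategy 6)
    return "recursive_character"       # general-purpose default (Strategy 2)
```

The string returned here would map to an actual splitter instance in your pipeline; content sniffing (e.g. detecting fenced code in `.txt` files) can refine the routing further.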
Decision Flowchart
Is the content markdown/HTML?
YES --> Markdown/HTML-aware splitter + size-based sub-splitting
NO -->
Is the content source code?
YES --> Language-aware splitter (by function/class)
NO -->
Is retrieval precision critical?
YES --> Semantic chunking or parent-child
NO --> Recursive character splitting (default)
Always validate with an evaluation set. The "best" strategy is the one that maximizes retrieval recall on your actual queries.
Install this skill directly: skilldb add rag-pipeline-skills
Related Skills
advanced-rag
Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
embedding-models
Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
rag-evaluation
Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
rag-fundamentals
Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
rag-production
Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.
rag-with-langchain
Building RAG pipelines with LangChain and LangGraph. Covers document loaders, text splitters, vector stores, retrievers, chains, and agents. Includes practical patterns for conversational RAG, multi-source retrieval, streaming, and LangGraph-based agentic RAG workflows.