
Vector-Backed Agent Memory with RAG

Implement an agent memory system using a vector database with retrieval-augmented generation (RAG).


Vector retrieval is the workhorse of modern agent memory. Past conversations, documents, knowledge bases — all become searchable through embeddings, retrieved on demand, fed back into the model's context.

The naive implementation is straightforward: chunk text, embed, store in a vector DB, retrieve top-k. The naive implementation also produces noisy retrieval that wastes tokens and confuses the model. Production retrieval requires more thought.

Chunking

Documents need to be chunked. The chunk is the unit of retrieval; what gets retrieved is what gets used. Get this wrong and retrieval is noisy.

Naive Chunking: Fixed-Size Splits

Split text into chunks of N tokens each. Simple.

Problems:

  • Sentences and paragraphs split mid-thought. The chunk lacks coherence.
  • Important context (a heading, a table label) ends up in a different chunk than the content it labels.
  • Retrieval returns fragments that don't make sense in isolation.

Use only when content is uniform and small chunks are inherently meaningful.

Recursive Chunking

Split on natural boundaries — sections, paragraphs, sentences — recursively. Keep splitting until each chunk is under the size limit.

Better. Each chunk is a logical unit. Headings stay with their sections; paragraphs aren't split.

Implementation: LangChain's RecursiveCharacterTextSplitter with separators like ["\n\n", "\n", ". ", " "]. Splits on the strongest boundary that keeps chunks within the limit.
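A minimal sketch of that splitter, assuming the langchain-text-splitters package and a `document_text` string you already hold; note that `chunk_size` counts characters unless you configure a token-based length function:

```python
# Recursive chunking: split on the strongest boundary that keeps chunks under the limit.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],  # strongest boundary first
    chunk_size=1000,       # characters by default, not tokens
    chunk_overlap=150,     # see "Chunk Overlap" below
)

chunks = splitter.split_text(document_text)  # document_text: your raw document string
```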

Semantic Chunking

Use embeddings to identify semantic boundaries. Chunks correspond to topical shifts in the text.

Better still for unstructured documents. The chunks are coherent topically; retrieval is about meaning, not just keywords.

Higher implementation cost. Each document needs an embedding pass to identify boundaries before chunking.
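A rough sketch of the idea, assuming an `embed()` helper that returns one vector per sentence; the similarity threshold is illustrative and should be tuned on your corpus:

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    if not sentences:
        return []
    vectors = embed(sentences)  # assumed helper: one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:  # topical shift -> chunk boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```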

Structure-Aware Chunking

For structured documents (PDFs with sections, code with functions, markdown with headings), use the structure to chunk.

A code chunk is a function. A documentation chunk is a section. A meeting transcript chunk is a topic block.

This is usually the best for well-structured input. Implementation depends on the document type.
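For markdown, a minimal sketch that keeps each heading with the section it labels; the regex and the lack of size handling are deliberate simplifications:

```python
import re

def markdown_section_chunks(text):
    """Split on markdown headings so each chunk is a heading plus its body."""
    # Split before any line that starts with one to six '#' characters.
    sections = re.split(r"\n(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]
```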

Chunk Size

The size of chunks affects retrieval quality:

  • Small chunks (200-500 tokens). Precise; each retrieval returns just the relevant content. Risk: lose surrounding context.
  • Medium chunks (500-1500 tokens). Balance between precision and context.
  • Large chunks (1500-3000 tokens). More context per chunk; retrieval is broader. Risk: irrelevant content alongside relevant.

Use medium chunks as default. Adjust based on whether retrieval is too noisy (chunks are too big) or too fragmented (too small).

Chunk Overlap

Adjacent chunks can share overlapping tokens, which helps when relevant content sits at a chunk boundary.

10-20% overlap is typical. Too much overlap costs storage and produces near-duplicates in retrieval.

Embeddings

The embedding model converts chunks to vectors for similarity search. The model choice matters:

  • Dimension. 768, 1024, 1536, 3072. Higher dimensions are more expressive but cost more (storage, compute).
  • Quality. Newer models retrieve better than older ones for the same query distribution.
  • Domain match. Some embeddings are tuned for specific domains (code, legal, medical). Use domain-tuned models for specific corpora.
  • Cost. Per-token embedding cost varies; bulk-embedding millions of chunks adds up.

OpenAI's text-embedding-3-large and Cohere's embed-v4 are strong hosted options; BGE-large is a common open-source choice. Test on your specific corpus; benchmark retrieval quality.
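A minimal bulk-embedding sketch against OpenAI's embeddings endpoint; the model name and batch size are illustrative, and the same loop shape works for other providers:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks, model="text-embedding-3-large", batch_size=100):
    """Embed chunks in batches; returns one vector per chunk, in order."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```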

Hybrid Retrieval

Pure vector search has weaknesses. It can miss exact-match keywords (vector search treats "PostgreSQL" and "postgres" as related but not identical). It can over-retrieve thematically related content.

Hybrid retrieval combines vector search with keyword (BM25) search:

  1. Vector search returns top-k by semantic similarity.
  2. BM25 returns top-k by keyword match.
  3. Merge and rerank.
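One common way to handle the merge step is reciprocal rank fusion (RRF). A minimal sketch, where each input is a list of chunk IDs in ranked order:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of chunk IDs; higher fused score means better."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# merged = reciprocal_rank_fusion([vector_ids, bm25_ids])[:50]
```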

Implementations: Pinecone hybrid, Weaviate hybrid, Vespa, OpenSearch with neural plugin. Or roll your own: vector store + Postgres full-text search + reranker.

Hybrid significantly improves retrieval quality, especially for queries with specific entities, code, or technical jargon.

Reranking

Retrieval returns more candidates than you'll use. A reranker scores each candidate against the query more precisely than the initial retrieval.

```
Query → Vector + BM25 retrieval → top-50 candidates
→ Cross-encoder reranker → top-10 → context
```

Cross-encoder rerankers (Cohere Rerank, BGE Reranker) score query-document pairs jointly. More expensive than embedding similarity but much more accurate.
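A minimal reranking sketch using sentence-transformers' CrossEncoder; the BGE reranker checkpoint name is an assumption, so substitute whichever reranker you actually deploy:

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; any cross-encoder reranker works the same way.
reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query, candidates, top_n=10):
    """Score (query, chunk) pairs jointly and keep the best top_n chunks."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```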

Reranking transforms mediocre retrieval into useful retrieval. Worth the cost for production systems.

Metadata Filtering

Vector search alone returns content from anywhere. Often you need filtering:

  • Documents from this user only (multi-tenant).
  • Content from the last 30 days.
  • Documents tagged as "policy" not "draft."

Most vector DBs support metadata filters alongside vector search. Apply at query time:

```python
# Restrict search to one tenant's recent policy documents, then rank by similarity.
results = vector_store.similarity_search(
    query="onboarding policy",
    k=10,
    filter={
        "tenant_id": "acme",
        "tags": ["policy"],
        "updated_at": {"$gte": "2026-04-01"},
    },
)
```

Filtering before search is much more efficient than search-then-filter. Use the DB's native capability.

Retrieval Strategy

How retrieval fits into the agent loop matters.

Always-Retrieve

Every turn, retrieve top-k from memory. The retrieved context is part of every prompt.

Simple but wasteful. Many turns don't benefit from memory; you're paying retrieval cost and consuming context for nothing.

Conditional Retrieve

Decide whether to retrieve based on the turn. If the user's query is conversational ("hi", "thanks"), skip retrieval. If it's substantive, retrieve.

Implementation: a small model or heuristic decides. Often an LLM call that says "should we retrieve?"
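A crude heuristic version of that decision; the small-talk list and length threshold are illustrative, and a small classifier or an LLM call can replace them:

```python
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "bye"}

def should_retrieve(user_message: str) -> bool:
    """Skip retrieval for short conversational turns; retrieve for substantive ones."""
    text = user_message.strip().lower()
    return text not in SMALL_TALK and len(text.split()) > 3
```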

Agent-Driven Retrieve

The agent decides when to retrieve, calling a search_memory tool. The agent might call it 0, 1, or many times depending on the task.

Most flexible; most aligned with how human assistants work. Requires the agent to understand its own memory access.
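A sketch of what the tool might look like as a JSON-schema tool definition; the names and fields are illustrative rather than any specific provider's API:

```python
search_memory_tool = {
    "name": "search_memory",
    "description": (
        "Search long-term memory for past conversations and documents relevant "
        "to the current task. Call only when prior context is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to look for."},
            "k": {"type": "integer", "description": "Number of chunks to return.", "default": 5},
        },
        "required": ["query"],
    },
}
```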

For most production systems, conditional or agent-driven retrieval outperforms always-retrieve.

Retrieval Quality Metrics

How do you know retrieval is working?

  • Recall@k. For a known-correct case, does the right chunk appear in the top k?
  • MRR (mean reciprocal rank). Average of 1 / rank of the correct chunk across cases.
  • NDCG. Normalized discounted cumulative gain — accounts for cases where multiple chunks are relevant.
  • End-to-end task success. Does the agent answer correctly given retrieval?

Build a labeled eval set: queries paired with the chunks that should be retrieved. Run retrieval; score against the labels. Iterate.
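A minimal sketch of scoring Recall@k and MRR against such a set, assuming each case pairs a query with the IDs of the chunks that should be retrieved and a `retrieve(query, k)` function that returns ranked chunk IDs:

```python
def evaluate_retrieval(cases, retrieve, k=10):
    """cases: list of (query, set_of_relevant_chunk_ids); retrieve(query, k) -> ranked IDs."""
    recall_hits, reciprocal_ranks = 0, []
    for query, relevant in cases:
        results = retrieve(query, k)
        recall_hits += any(chunk_id in relevant for chunk_id in results)
        rank = next((i + 1 for i, cid in enumerate(results) if cid in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return {
        "recall@k": recall_hits / len(cases),
        "mrr": sum(reciprocal_ranks) / len(cases),
    }
```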

Cost and Scale

Vector storage and retrieval are not free:

  • Storage. Per million vectors.
  • Embedding compute. Per token embedded.
  • Retrieval compute. Per query.
  • Reranking. Per pair scored.

For scale planning:

  • 1M chunks × 1024 dim × 4 bytes = 4 GB. Plus index overhead.
  • 10K queries/day × 50 candidates × $0.001/rerank = $500/day for reranking alone.

Budget for these; the unit economics matter at scale.

Anti-Patterns

Naive fixed-size chunking. Sentences split mid-thought; context lost. Use recursive or structure-aware chunking.

Vector-only retrieval. Misses exact keyword matches. Use hybrid.

No reranker. Top-50 from retrieval is good but top-10 from reranker is much better. Add the rerank step.

No metadata filtering. Every query searches the whole corpus. Apply filters at query time.

Always-retrieve regardless of query. Wastes tokens on conversational turns. Conditional retrieve or agent-driven.

No retrieval evaluation. Improvements ship based on developer intuition. Build labeled retrieval evals.

Embedding model from 2022. Newer models are dramatically better. Re-embed when the ROI justifies it.
