
vector-databases

Practical guide to vector databases for RAG systems. Covers Pinecone, Qdrant, Weaviate, ChromaDB, pgvector, and Milvus with setup, indexing, querying, metadata filtering, hybrid search, and scaling considerations. Includes selection criteria, performance benchmarks, and production deployment patterns.

Quick Summary
Store, index, and query embeddings at scale for retrieval-augmented generation.

## Key Points

**HNSW**

- **Best for**: Most use cases, good recall/speed balance
- **Memory**: High (keeps graph in RAM)
- **Parameters**: `M` (connections per node, 16-64), `ef_construction` (build quality, 100-400)

**IVF-Flat / IVF-PQ**

- **Best for**: Large datasets where memory is constrained
- **Memory**: Lower with PQ (product quantization)
- **Parameters**: `nlist` (clusters), `nprobe` (clusters to search)

**Flat (brute force)**

- **Best for**: < 50K vectors, exact results needed
- **Memory**: Low
- **Parameters**: None

**Top anti-patterns**

1. **No metadata filtering** -- Searching all vectors when you know the document category. Always attach filterable metadata during indexing.
2. **Wrong distance metric** -- Using L2 when embeddings are normalized (use cosine or inner product). Check your embedding model's recommendation.
3. **Skipping index tuning** -- Default HNSW parameters work but are rarely optimal. Benchmark `M` and `ef` on your dataset.

## Quick Example

```
Vectors < 50K     --> Flat index (exact, fast enough)
50K - 1M vectors  --> HNSW (best recall/speed)
1M - 100M vectors --> IVF-HNSW or IVF-PQ (memory-efficient)
> 100M vectors    --> Distributed Milvus/Pinecone + IVF-PQ
```

Vector Databases

Store, index, and query embeddings at scale for retrieval-augmented generation.


Quick Comparison

| Database | Hosting | Hybrid Search | Filtering | Best For |
|---|---|---|---|---|
| ChromaDB | Local / embedded | No (dense only) | Basic | Prototyping, small datasets |
| Pinecone | Managed cloud | Yes | Rich | Production SaaS, serverless |
| Qdrant | Self-hosted / cloud | Yes (sparse+dense) | Rich, nested | Production, complex filters |
| Weaviate | Self-hosted / cloud | Yes (BM25+vector) | GraphQL-based | Multi-modal, generative search |
| pgvector | Postgres extension | With pg full-text | SQL WHERE | Already using Postgres |
| Milvus | Self-hosted / Zilliz cloud | Yes | Attribute filtering | Large scale, high throughput |

ChromaDB (Local Development)

```python
import chromadb
from chromadb.utils import embedding_functions

# In-memory for testing
client = chromadb.Client()

# Persistent for development
client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # cosine, l2, or ip
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=["Auth uses JWT tokens.", "OAuth2 is an authorization framework.", "Rate limiting prevents abuse."],
    metadatas=[
        {"source": "auth.md", "category": "security"},
        {"source": "oauth.md", "category": "security"},
        {"source": "api.md", "category": "infrastructure"},
    ]
)

# Query with metadata filtering
results = collection.query(
    query_texts=["How does authentication work?"],
    n_results=3,
    where={"category": "security"},
    include=["documents", "metadatas", "distances"]
)
print(results["documents"][0])
```

Limits: single machine; roughly 1M vectors is comfortable; no built-in hybrid search.


Pinecone (Managed Production)

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create serverless index
pc.create_index(
    name="rag-docs",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("rag-docs")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_vector,  # List[float]
        "metadata": {
            "source": "auth.md",
            "category": "security",
            "token_count": 245,
            "text": "Authentication uses JWT tokens..."  # Store text in metadata
        }
    }
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "category": {"$eq": "security"},
        "token_count": {"$lte": 500}
    },
    include_metadata=True
)

for match in results.matches:
    print(f"{match.id}: {match.score:.3f} - {match.metadata['text'][:80]}")

# Namespace isolation (multi-tenant)
index.upsert(vectors=[...], namespace="tenant-123")
results = index.query(vector=q, top_k=5, namespace="tenant-123")
```
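The upserts above send everything in one request. For larger loads it's common to batch upserts; a figure of roughly 100 vectors per request is often cited, though the right size depends on your payload. A minimal, hypothetical batching helper around the `index` object from the snippet above (`batched` and `upsert_in_batches` are illustrative names, not Pinecone APIs):

```python
from itertools import islice

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def upsert_in_batches(index, vectors, batch_size=100, namespace=None):
    """Upsert `vectors` (dicts with id/values/metadata) in batches.
    ~100 per request is a rule of thumb; tune for your payload size."""
    for chunk in batched(vectors, batch_size):
        index.upsert(vectors=chunk, namespace=namespace)
```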

Qdrant (Self-Hosted / Cloud)

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, Range
)

# Local
client = QdrantClient(path="./qdrant_data")
# Cloud
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="KEY")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert with rich payload
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_vector,
            payload={
                "text": "Authentication uses JWT tokens...",
                "source": "auth.md",
                "category": "security",
                "tags": ["jwt", "auth", "tokens"],
                "created_at": "2024-01-15",
            }
        )
    ]
)

# Query with nested filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="security")),
        ],
        must_not=[
            FieldCondition(key="tags", match=MatchValue(value="deprecated")),
        ]
    ),
)

# Hybrid search (sparse + dense)
from qdrant_client.models import SparseVectorParams, SparseVector, NamedVector

# Requires collection with named vectors configured for both dense and sparse
```

Weaviate

```python
import weaviate
from weaviate.classes.init import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://xxx.weaviate.network",
    auth_credentials=Auth.api_key("KEY"),
)

# Define collection (class)
from weaviate.classes.config import Configure, Property, DataType

collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add objects (auto-vectorized)
docs = client.collections.get("Document")
docs.data.insert({"text": "Auth uses JWT tokens.", "source": "auth.md", "category": "security"})

# Hybrid search (BM25 + vector)
results = docs.query.hybrid(
    query="authentication flow",
    alpha=0.7,  # 0 = pure BM25, 1 = pure vector
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("category").equal("security"),
    return_metadata=weaviate.classes.query.MetadataQuery(score=True)
)

for obj in results.objects:
    print(f"{obj.properties['source']}: {obj.metadata.score:.3f}")
```
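Weaviate's `alpha` blends a keyword score and a vector score into one ranking. A stand-alone sketch of that kind of convex blend, to build intuition for the parameter (min-max normalization is an assumption here; Weaviate's exact normalization may differ):

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_blend(bm25, dense, alpha=0.7):
    """fused = alpha * dense + (1 - alpha) * bm25 over the union of hits.
    alpha=0 -> pure BM25, alpha=1 -> pure vector, as in the query above."""
    b, v = normalize(bm25), normalize(dense)
    docs = set(b) | set(v)
    return sorted(
        ((d, alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)) for d in docs),
        key=lambda x: x[1], reverse=True,
    )
```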

pgvector (PostgreSQL)

```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(255),
    category VARCHAR(50),
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create HNSW index (recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert
INSERT INTO documents (content, source, category, embedding)
VALUES ('Auth uses JWT...', 'auth.md', 'security', '[0.1, 0.2, ...]'::vector);

-- Query: nearest neighbors with filter
SELECT id, content, source,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'security'
ORDER BY embedding <=> $1::vector
LIMIT 5;

-- Hybrid: combine with full-text search
-- (this orders by vector similarity only; fuse text_rank and
--  vector_similarity in the application, e.g. weighted sum or RRF)
SELECT id, content,
       ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'authentication')) AS text_rank,
       1 - (embedding <=> $1::vector) AS vector_similarity
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'authentication')
ORDER BY vector_similarity DESC
LIMIT 5;
```

Python with asyncpg:

```python
import asyncpg

async def search_documents(pool, query_embedding, category=None, limit=5):
    # pgvector accepts a bracketed text literal cast to ::vector
    embedding_str = "[" + ",".join(str(x) for x in query_embedding) + "]"
    query = """
        SELECT id, content, source,
               1 - (embedding <=> $1::vector) AS similarity
        FROM documents
        WHERE ($2::text IS NULL OR category = $2)
        ORDER BY embedding <=> $1::vector
        LIMIT $3
    """
    return await pool.fetch(query, embedding_str, category, limit)
```
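The hybrid SQL above computes both a text rank and a vector similarity but orders by only one. A common way to combine two ranked lists in application code is reciprocal rank fusion (RRF); a minimal sketch, with `k = 60` as the commonly used constant:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each ranked list contributes 1/(k + rank)
    per document; a higher fused score means a better overall rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse the full-text ranking with the vector ranking:
text_hits = ["doc2", "doc1", "doc3"]
vector_hits = ["doc1", "doc2", "doc4"]
print(rrf([text_hits, vector_hits]))  # doc1 and doc2 lead
```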

Milvus

```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=255),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields=fields)

collection = Collection("documents", schema)

# Insert (one list per non-primary field, in schema order)
collection.insert([
    ["Auth uses JWT tokens.", "OAuth2 framework..."],        # text
    ["auth.md", "oauth.md"],                                  # source
    [embedding1, embedding2],                                 # embedding
])

# Build HNSW index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)
collection.load()

# Search
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    expr='source == "auth.md"',
    output_fields=["text", "source"]
)
```

Indexing Strategies

HNSW (Hierarchical Navigable Small World)

  • Best for: Most use cases, good recall/speed balance
  • Memory: High (keeps graph in RAM)
  • Parameters: M (connections per node, 16-64), ef_construction (build quality, 100-400)
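To see why HNSW is memory-hungry, a back-of-envelope estimate helps: RAM is roughly the float32 vectors plus about `2*M` neighbor links per node at layer 0, with upper layers adding a small overhead. The formula below is a rough assumption for illustration, not any particular engine's accounting; real implementations differ.

```python
def hnsw_memory_estimate_mb(n_vectors: int, dim: int, M: int = 16) -> float:
    """Rough HNSW RAM estimate: float32 vectors plus ~2*M 4-byte
    neighbor links per node at layer 0 (+~10% for upper layers).
    Treat as a floor; actual overhead varies by implementation."""
    vector_bytes = n_vectors * dim * 4
    graph_bytes = n_vectors * M * 2 * 4 * 1.1
    return (vector_bytes + graph_bytes) / (1024 ** 2)

# 1M vectors at 1536 dims with M=16 already needs ~6 GB of RAM:
print(f"{hnsw_memory_estimate_mb(1_000_000, 1536):.0f} MB")
```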

IVF-Flat / IVF-PQ

  • Best for: Large datasets where memory is constrained
  • Memory: Lower with PQ (product quantization)
  • Parameters: nlist (clusters), nprobe (clusters to search)

Flat (Brute Force)

  • Best for: < 50K vectors, exact results needed
  • Memory: Low
  • Parameters: None
```
Vectors < 50K     --> Flat index (exact, fast enough)
50K - 1M vectors  --> HNSW (best recall/speed)
1M - 100M vectors --> IVF-HNSW or IVF-PQ (memory-efficient)
> 100M vectors    --> Distributed Milvus/Pinecone + IVF-PQ
```
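The ladder above as a trivial helper, if you want the thresholds encoded in code (the cutoffs are rules of thumb from this guide, not hard limits):

```python
def choose_index(n_vectors: int) -> str:
    """Map corpus size to an index family per the decision ladder.
    Thresholds are rough guidance; benchmark on your own data."""
    if n_vectors < 50_000:
        return "flat"
    if n_vectors < 1_000_000:
        return "hnsw"
    if n_vectors < 100_000_000:
        return "ivf-pq"
    return "distributed + ivf-pq"
```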

Anti-Patterns

  1. No metadata filtering -- Searching all vectors when you know the document category. Always attach filterable metadata during indexing.

  2. Wrong distance metric -- Using L2 when embeddings are normalized (use cosine or inner product). Check your embedding model's recommendation.

  3. Skipping index tuning -- Default HNSW parameters work but are rarely optimal. Benchmark M and ef on your dataset.

  4. Storing text outside the vector DB -- Requiring a separate lookup for chunk text adds latency. Most vector DBs support payload/metadata storage -- use it.

  5. Not planning for updates -- If documents change, you need upsert/delete strategies. Design your ID scheme to support incremental updates from day one.

  6. Single-node for production -- ChromaDB and local Qdrant are not designed for high-availability production. Use managed services or proper cluster deployments.
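Anti-pattern 2, checked numerically: once embeddings are L2-normalized, cosine similarity and inner product are the same number, and squared L2 distance reduces to `2 - 2*cosine`. A quick pure-Python check with illustrative 2-D vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

cosine = dot(a, b)   # cosine similarity of unit vectors
inner = dot(a, b)    # inner product -- identical for unit vectors
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(l2_sq - (2 - 2 * cosine)) < 1e-9
```

The takeaway: pick the metric your embedding model recommends and normalize consistently, rather than mixing metrics across indexing and querying.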


Selection Guide

  • Just prototyping? ChromaDB (zero config, in-process)
  • Already on Postgres? pgvector (no new infrastructure)
  • Need managed + serverless? Pinecone (lowest ops burden)
  • Need rich filtering + self-hosted? Qdrant
  • Need multi-modal or auto-vectorization? Weaviate
  • Need massive scale (100M+ vectors)? Milvus / Zilliz


Related Skills

  • advanced-rag -- Advanced RAG patterns beyond basic retrieve-and-generate: multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG, with guidance on when each technique is warranted.
  • chunking-strategies -- Document chunking strategies for RAG pipelines: fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking, plus chunk size optimization and overlap strategies.
  • embedding-models -- Selecting, using, and optimizing text embedding models: commercial (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source (BGE, E5, Nomic Embed) options, dimensionality selection, batch processing, caching, fine-tuning, and cost analysis.
  • rag-evaluation -- Evaluating RAG systems end-to-end: retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation, A/B testing, and continuous monitoring in production.
  • rag-fundamentals -- Foundational RAG architecture: why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics.
  • rag-production -- Production-grade RAG deployment: caching strategies (semantic and exact), streaming responses, token budget management, retrieval fallbacks, quality monitoring, cost optimization, incremental indexing, and multi-tenancy.