
vector-databases

Practical guide to vector databases for RAG systems. Covers Pinecone, Qdrant, Weaviate, ChromaDB, pgvector, and Milvus with setup, indexing, querying, metadata filtering, hybrid search, and scaling considerations. Includes selection criteria, performance benchmarks, and production deployment patterns.

Quick Summary
Store, index, and query embeddings at scale for retrieval-augmented generation.

## Key Points

**HNSW**

- **Best for**: Most use cases, good recall/speed balance
- **Memory**: High (keeps graph in RAM)
- **Parameters**: `M` (connections per node, 16-64), `ef_construction` (build quality, 100-400)

**IVF-Flat / IVF-PQ**

- **Best for**: Large datasets where memory is constrained
- **Memory**: Lower with PQ (product quantization)
- **Parameters**: `nlist` (clusters), `nprobe` (clusters to search)

**Flat (brute force)**

- **Best for**: < 50K vectors, exact results needed
- **Memory**: Low
- **Parameters**: None

**Top anti-patterns**

1. **No metadata filtering** -- Searching all vectors when you know the document category. Always attach filterable metadata during indexing.
2. **Wrong distance metric** -- Using L2 when embeddings are normalized (use cosine or inner product). Check your embedding model's recommendation.
3. **Skipping index tuning** -- Default HNSW parameters work but are rarely optimal. Benchmark `M` and `ef` on your dataset.

## Quick Example

```
Vectors < 50K     --> Flat index (exact, fast enough)
50K - 1M vectors  --> HNSW (best recall/speed)
1M - 100M vectors --> IVF-HNSW or IVF-PQ (memory-efficient)
> 100M vectors    --> Distributed Milvus/Pinecone + IVF-PQ
```

Vector Databases

Store, index, and query embeddings at scale for retrieval-augmented generation.


Quick Comparison

| Database | Hosting | Hybrid Search | Filtering | Best For |
|---|---|---|---|---|
| ChromaDB | Local / embedded | No (dense only) | Basic | Prototyping, small datasets |
| Pinecone | Managed cloud | Yes | Rich | Production SaaS, serverless |
| Qdrant | Self-hosted / cloud | Yes (sparse+dense) | Rich, nested | Production, complex filters |
| Weaviate | Self-hosted / cloud | Yes (BM25+vector) | GraphQL-based | Multi-modal, generative search |
| pgvector | Postgres extension | With pg full-text | SQL WHERE | Already using Postgres |
| Milvus | Self-hosted / Zilliz cloud | Yes | Attribute filtering | Large scale, high throughput |

ChromaDB (Local Development)

```python
import chromadb
from chromadb.utils import embedding_functions

# In-memory for testing
client = chromadb.Client()

# Persistent for development
client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # cosine, l2, or ip
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=["Auth uses JWT tokens.", "OAuth2 is an authorization framework.", "Rate limiting prevents abuse."],
    metadatas=[
        {"source": "auth.md", "category": "security"},
        {"source": "oauth.md", "category": "security"},
        {"source": "api.md", "category": "infrastructure"},
    ]
)

# Query with metadata filtering
results = collection.query(
    query_texts=["How does authentication work?"],
    n_results=3,
    where={"category": "security"},
    include=["documents", "metadatas", "distances"]
)
print(results["documents"][0])
```

Limits: single machine; roughly 1M vectors is comfortable; no built-in hybrid search.


Pinecone (Managed Production)

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create serverless index
pc.create_index(
    name="rag-docs",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("rag-docs")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_vector,  # List[float]
        "metadata": {
            "source": "auth.md",
            "category": "security",
            "token_count": 245,
            "text": "Authentication uses JWT tokens..."  # Store text in metadata
        }
    }
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "category": {"$eq": "security"},
        "token_count": {"$lte": 500}
    },
    include_metadata=True
)

for match in results.matches:
    print(f"{match.id}: {match.score:.3f} - {match.metadata['text'][:80]}")

# Namespace isolation (multi-tenant)
index.upsert(vectors=[...], namespace="tenant-123")
results = index.query(vector=q, top_k=5, namespace="tenant-123")
```
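The upserts above send everything in one request. For larger loads it's common to batch upserts; a figure of roughly 100 vectors per request is often cited, though the right size depends on your payload. A minimal, hypothetical batching helper around the `index` object from the snippet above (`batched` and `upsert_in_batches` are illustrative names, not Pinecone APIs):

```python
from itertools import islice

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def upsert_in_batches(index, vectors, batch_size=100, namespace=None):
    """Upsert `vectors` (dicts with id/values/metadata) in batches.
    ~100 per request is a rule of thumb; tune for your payload size."""
    for chunk in batched(vectors, batch_size):
        index.upsert(vectors=chunk, namespace=namespace)
```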

Qdrant (Self-Hosted / Cloud)

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, Range
)

# Local
client = QdrantClient(path="./qdrant_data")
# Cloud
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="KEY")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert with rich payload
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_vector,
            payload={
                "text": "Authentication uses JWT tokens...",
                "source": "auth.md",
                "category": "security",
                "tags": ["jwt", "auth", "tokens"],
                "created_at": "2024-01-15",
            }
        )
    ]
)

# Query with nested filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="security")),
        ],
        must_not=[
            FieldCondition(key="tags", match=MatchValue(value="deprecated")),
        ]
    ),
)

# Hybrid search (sparse + dense)
from qdrant_client.models import SparseVectorParams, SparseVector, NamedVector

# Requires collection with named vectors configured for both dense and sparse
```

Weaviate

```python
import weaviate
from weaviate.classes.init import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://xxx.weaviate.network",
    auth_credentials=Auth.api_key("KEY"),
)

# Define collection (class)
from weaviate.classes.config import Configure, Property, DataType

collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add objects (auto-vectorized)
docs = client.collections.get("Document")
docs.data.insert({"text": "Auth uses JWT tokens.", "source": "auth.md", "category": "security"})

# Hybrid search (BM25 + vector)
results = docs.query.hybrid(
    query="authentication flow",
    alpha=0.7,  # 0 = pure BM25, 1 = pure vector
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("category").equal("security"),
    return_metadata=weaviate.classes.query.MetadataQuery(score=True)
)

for obj in results.objects:
    print(f"{obj.properties['source']}: {obj.metadata.score:.3f}")
```
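Weaviate's `alpha` blends a keyword score and a vector score into one ranking. A stand-alone sketch of that kind of convex blend, to build intuition for the parameter (min-max normalization is an assumption here; Weaviate's exact normalization may differ):

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_blend(bm25, dense, alpha=0.7):
    """fused = alpha * dense + (1 - alpha) * bm25 over the union of hits.
    alpha=0 -> pure BM25, alpha=1 -> pure vector, as in the query above."""
    b, v = normalize(bm25), normalize(dense)
    docs = set(b) | set(v)
    return sorted(
        ((d, alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)) for d in docs),
        key=lambda x: x[1], reverse=True,
    )
```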

pgvector (PostgreSQL)

```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(255),
    category VARCHAR(50),
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create HNSW index (recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert
INSERT INTO documents (content, source, category, embedding)
VALUES ('Auth uses JWT...', 'auth.md', 'security', '[0.1, 0.2, ...]'::vector);

-- Query: nearest neighbors with filter
SELECT id, content, source,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'security'
ORDER BY embedding <=> $1::vector
LIMIT 5;

-- Hybrid: combine with full-text search
-- (this orders by vector similarity only; fuse text_rank and
--  vector_similarity in the application, e.g. weighted sum or RRF)
SELECT id, content,
       ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'authentication')) AS text_rank,
       1 - (embedding <=> $1::vector) AS vector_similarity
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'authentication')
ORDER BY vector_similarity DESC
LIMIT 5;
```

Python with asyncpg:

```python
import asyncpg

async def search_documents(pool, query_embedding, category=None, limit=5):
    # pgvector accepts a bracketed text literal cast to ::vector
    embedding_str = "[" + ",".join(str(x) for x in query_embedding) + "]"
    query = """
        SELECT id, content, source,
               1 - (embedding <=> $1::vector) AS similarity
        FROM documents
        WHERE ($2::text IS NULL OR category = $2)
        ORDER BY embedding <=> $1::vector
        LIMIT $3
    """
    return await pool.fetch(query, embedding_str, category, limit)
```
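The hybrid SQL above computes both a text rank and a vector similarity but orders by only one. A common way to combine two ranked lists in application code is reciprocal rank fusion (RRF); a minimal sketch, with `k = 60` as the commonly used constant:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each ranked list contributes 1/(k + rank)
    per document; a higher fused score means a better overall rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse the full-text ranking with the vector ranking:
text_hits = ["doc2", "doc1", "doc3"]
vector_hits = ["doc1", "doc2", "doc4"]
print(rrf([text_hits, vector_hits]))  # doc1 and doc2 lead
```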

Milvus

```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=255),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields=fields)

collection = Collection("documents", schema)

# Insert (one list per non-primary field, in schema order)
collection.insert([
    ["Auth uses JWT tokens.", "OAuth2 framework..."],        # text
    ["auth.md", "oauth.md"],                                  # source
    [embedding1, embedding2],                                 # embedding
])

# Build HNSW index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)
collection.load()

# Search
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    expr='source == "auth.md"',
    output_fields=["text", "source"]
)
```

Indexing Strategies

HNSW (Hierarchical Navigable Small World)

  • Best for: Most use cases, good recall/speed balance
  • Memory: High (keeps graph in RAM)
  • Parameters: M (connections per node, 16-64), ef_construction (build quality, 100-400)
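To see why HNSW is memory-hungry, a back-of-envelope estimate helps: RAM is roughly the float32 vectors plus about `2*M` neighbor links per node at layer 0, with upper layers adding a small overhead. The formula below is a rough assumption for illustration, not any particular engine's accounting; real implementations differ.

```python
def hnsw_memory_estimate_mb(n_vectors: int, dim: int, M: int = 16) -> float:
    """Rough HNSW RAM estimate: float32 vectors plus ~2*M 4-byte
    neighbor links per node at layer 0 (+~10% for upper layers).
    Treat as a floor; actual overhead varies by implementation."""
    vector_bytes = n_vectors * dim * 4
    graph_bytes = n_vectors * M * 2 * 4 * 1.1
    return (vector_bytes + graph_bytes) / (1024 ** 2)

# 1M vectors at 1536 dims with M=16 already needs ~6 GB of RAM:
print(f"{hnsw_memory_estimate_mb(1_000_000, 1536):.0f} MB")
```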

IVF-Flat / IVF-PQ

  • Best for: Large datasets where memory is constrained
  • Memory: Lower with PQ (product quantization)
  • Parameters: nlist (clusters), nprobe (clusters to search)

Flat (Brute Force)

  • Best for: < 50K vectors, exact results needed
  • Memory: Low
  • Parameters: None
```
Vectors < 50K     --> Flat index (exact, fast enough)
50K - 1M vectors  --> HNSW (best recall/speed)
1M - 100M vectors --> IVF-HNSW or IVF-PQ (memory-efficient)
> 100M vectors    --> Distributed Milvus/Pinecone + IVF-PQ
```
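The ladder above as a trivial helper, if you want the thresholds encoded in code (the cutoffs are rules of thumb from this guide, not hard limits):

```python
def choose_index(n_vectors: int) -> str:
    """Map corpus size to an index family per the decision ladder.
    Thresholds are rough guidance; benchmark on your own data."""
    if n_vectors < 50_000:
        return "flat"
    if n_vectors < 1_000_000:
        return "hnsw"
    if n_vectors < 100_000_000:
        return "ivf-pq"
    return "distributed + ivf-pq"
```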

Anti-Patterns

  1. No metadata filtering -- Searching all vectors when you know the document category. Always attach filterable metadata during indexing.

  2. Wrong distance metric -- Using L2 when embeddings are normalized (use cosine or inner product). Check your embedding model's recommendation.

  3. Skipping index tuning -- Default HNSW parameters work but are rarely optimal. Benchmark M and ef on your dataset.

  4. Storing text outside the vector DB -- Requiring a separate lookup for chunk text adds latency. Most vector DBs support payload/metadata storage -- use it.

  5. Not planning for updates -- If documents change, you need upsert/delete strategies. Design your ID scheme to support incremental updates from day one.

  6. Single-node for production -- ChromaDB and local Qdrant are not designed for high-availability production. Use managed services or proper cluster deployments.
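Anti-pattern 2, checked numerically: once embeddings are L2-normalized, cosine similarity and inner product are the same number, and squared L2 distance reduces to `2 - 2*cosine`. A quick pure-Python check with illustrative 2-D vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

cosine = dot(a, b)   # cosine similarity of unit vectors
inner = dot(a, b)    # inner product -- identical for unit vectors
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert abs(l2_sq - (2 - 2 * cosine)) < 1e-9
```

The takeaway: pick the metric your embedding model recommends and normalize consistently, rather than mixing metrics across indexing and querying.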


Selection Guide

  • Just prototyping? ChromaDB (zero config, in-process)
  • Already on Postgres? pgvector (no new infrastructure)
  • Need managed + serverless? Pinecone (lowest ops burden)
  • Need rich filtering + self-hosted? Qdrant
  • Need multi-modal or auto-vectorization? Weaviate
  • Need massive scale (100M+ vectors)? Milvus / Zilliz


Related Skills

  • advanced-rag -- Advanced RAG patterns beyond basic retrieve-and-generate: multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG, with guidance on when each technique is warranted.
  • chunking-strategies -- Document chunking strategies for RAG pipelines: fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking, plus chunk size optimization and overlap strategies.
  • embedding-models -- Selecting, using, and optimizing text embedding models: commercial (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source (BGE, E5, Nomic Embed) options, dimensionality selection, batch processing, caching, fine-tuning, and cost analysis.
  • rag-evaluation -- Evaluating RAG systems end-to-end: retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation, A/B testing, and continuous monitoring in production.
  • rag-fundamentals -- Foundational RAG architecture: why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics.
  • rag-production -- Production-grade RAG deployment: caching strategies (semantic and exact), streaming responses, token budget management, retrieval fallbacks, quality monitoring, cost optimization, incremental indexing, and multi-tenancy.