vector-databases
Practical guide to vector databases for RAG systems. Covers Pinecone, Qdrant, Weaviate, ChromaDB, pgvector, and Milvus with setup, indexing, querying, metadata filtering, hybrid search, and scaling considerations. Includes selection criteria, performance benchmarks, and production deployment patterns.
# Vector Databases
Store, index, and query embeddings at scale for retrieval-augmented generation.
## Quick Comparison
| Database | Hosting | Hybrid Search | Filtering | Best For |
|---|---|---|---|---|
| ChromaDB | Local / embedded | No (dense only) | Basic | Prototyping, small datasets |
| Pinecone | Managed cloud | Yes | Rich | Production SaaS, serverless |
| Qdrant | Self-hosted / cloud | Yes (sparse+dense) | Rich, nested | Production, complex filters |
| Weaviate | Self-hosted / cloud | Yes (BM25+vector) | GraphQL-based | Multi-modal, generative search |
| pgvector | Postgres extension | With pg full-text | SQL WHERE | Already using Postgres |
| Milvus | Self-hosted / Zilliz cloud | Yes | Attribute filtering | Large scale, high throughput |
## ChromaDB (Local Development)

```python
import chromadb
from chromadb.utils import embedding_functions

# In-memory for testing
client = chromadb.Client()
# Persistent for development
client = chromadb.PersistentClient(path="./chroma_db")

# Use OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_KEY",  # or set the OPENAI_API_KEY env var
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # cosine, l2, or ip
)

# Add documents
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Auth uses JWT tokens.",
        "OAuth2 is an authorization framework.",
        "Rate limiting prevents abuse.",
    ],
    metadatas=[
        {"source": "auth.md", "category": "security"},
        {"source": "oauth.md", "category": "security"},
        {"source": "api.md", "category": "infrastructure"},
    ]
)

# Query with metadata filtering
results = collection.query(
    query_texts=["How does authentication work?"],
    n_results=3,
    where={"category": "security"},
    include=["documents", "metadatas", "distances"]
)
print(results["documents"][0])
```

**Limits:** single machine, ~1M vectors comfortable, no built-in hybrid search.
## Pinecone (Managed Production)

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create serverless index
pc.create_index(
    name="rag-docs",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("rag-docs")

# Upsert vectors with metadata
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_vector,  # list[float]
        "metadata": {
            "source": "auth.md",
            "category": "security",
            "token_count": 245,
            "text": "Authentication uses JWT tokens..."  # store chunk text in metadata
        }
    }
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={
        "category": {"$eq": "security"},
        "token_count": {"$lte": 500}
    },
    include_metadata=True
)
for match in results.matches:
    print(f"{match.id}: {match.score:.3f} - {match.metadata['text'][:80]}")

# Namespace isolation (multi-tenant)
index.upsert(vectors=[...], namespace="tenant-123")
results = index.query(vector=q, top_k=5, namespace="tenant-123")
```
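The filter syntax above uses Mongo-style operators (`$eq`, `$lte`, etc.). As an illustration of the matching semantics only (not Pinecone's implementation), a metadata filter can be evaluated in plain Python like this:

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Evaluate a Mongo-style metadata filter against one record's metadata."""
    ops = {
        "$eq":  lambda v, t: v == t,
        "$ne":  lambda v, t: v != t,
        "$lt":  lambda v, t: v < t,
        "$lte": lambda v, t: v <= t,
        "$gt":  lambda v, t: v > t,
        "$gte": lambda v, t: v >= t,
        "$in":  lambda v, t: v in t,
    }
    for field, cond in flt.items():
        value = metadata.get(field)
        if not isinstance(cond, dict):
            cond = {"$eq": cond}  # bare values are shorthand for $eq
        for op, target in cond.items():
            if value is None or not ops[op](value, target):
                return False
    return True
```

Every condition must hold (an implicit AND), which mirrors how the `filter` dict in the query above combines `category` and `token_count`.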
## Qdrant (Self-Hosted / Cloud)

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# Local
client = QdrantClient(path="./qdrant_data")
# Cloud
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="KEY")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert with rich payload
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_vector,
            payload={
                "text": "Authentication uses JWT tokens...",
                "source": "auth.md",
                "category": "security",
                "tags": ["jwt", "auth", "tokens"],
                "created_at": "2024-01-15",
            }
        )
    ]
)

# Query with nested filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="security")),
        ],
        must_not=[
            FieldCondition(key="tags", match=MatchValue(value="deprecated")),
        ]
    ),
)

# Hybrid search (sparse + dense) requires a collection configured with
# named vectors for both dense and sparse representations:
# from qdrant_client.models import SparseVectorParams, SparseVector
```
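However the sparse and dense legs are executed, their ranked result lists have to be merged. A common fusion method is reciprocal rank fusion (RRF); here is a client-side sketch of the idea (illustrative, not Qdrant's internal implementation):

```python
def rrf_fuse(dense_ids: list, sparse_ids: list, k: int = 60) -> list:
    """Merge two ranked ID lists: each hit contributes 1 / (k + rank)."""
    scores: dict = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both retrievers float to the top; `k` damps the influence of any single list's top ranks.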
## Weaviate

```python
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import Filter, MetadataQuery

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://xxx.weaviate.network",
    auth_credentials=Auth.api_key("KEY"),
)

# Define collection (class)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add objects (auto-vectorized)
docs = client.collections.get("Document")
docs.data.insert({"text": "Auth uses JWT tokens.", "source": "auth.md", "category": "security"})

# Hybrid search (BM25 + vector)
results = docs.query.hybrid(
    query="authentication flow",
    alpha=0.7,  # 0 = pure BM25, 1 = pure vector
    limit=5,
    filters=Filter.by_property("category").equal("security"),
    return_metadata=MetadataQuery(score=True)
)
for obj in results.objects:
    print(f"{obj.properties['source']}: {obj.metadata.score:.3f}")
```
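The `alpha` parameter blends the two result sets' scores after normalization. Conceptually, relative-score fusion works like this sketch (a simplified illustration, not Weaviate's exact internals):

```python
def blend(vector_scores: dict, bm25_scores: dict, alpha: float = 0.7) -> dict:
    """Alpha-blend min-max-normalized score dicts: alpha=1 pure vector, alpha=0 pure BM25."""
    def norm(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {k: (v - lo) / span for k, v in scores.items()}

    v, b = norm(vector_scores), norm(bm25_scores)
    ids = set(v) | set(b)
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in ids}
```

Normalizing first matters because raw BM25 scores and cosine similarities live on different scales; without it, one leg silently dominates regardless of `alpha`.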
## pgvector (PostgreSQL)

```sql
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    source VARCHAR(255),
    category VARCHAR(50),
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create HNSW index (recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Insert
INSERT INTO documents (content, source, category, embedding)
VALUES ('Auth uses JWT...', 'auth.md', 'security', '[0.1, 0.2, ...]'::vector);

-- Query: nearest neighbors with filter
SELECT id, content, source,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'security'
ORDER BY embedding <=> $1::vector
LIMIT 5;

-- Hybrid: combine with full-text search
SELECT id, content,
       ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'authentication')) AS text_rank,
       1 - (embedding <=> $1::vector) AS vector_similarity
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'authentication')
ORDER BY vector_similarity DESC
LIMIT 5;
```

```python
# Python with asyncpg
import asyncpg

async def search_documents(pool, query_embedding, category=None, limit=5):
    embedding_str = "[" + ",".join(str(x) for x in query_embedding) + "]"
    query = """
        SELECT id, content, source,
               1 - (embedding <=> $1::vector) AS similarity
        FROM documents
        WHERE ($2::text IS NULL OR category = $2)
        ORDER BY embedding <=> $1::vector
        LIMIT $3
    """
    return await pool.fetch(query, embedding_str, category, limit)
```
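The `<=>` operator is pgvector's cosine *distance*, which is why the queries compute `1 - (embedding <=> $1::vector)` to get a similarity. In plain Python, the operator corresponds to:

```python
import math

def cosine_distance(a: list, b: list) -> float:
    """pgvector's <=> operator: 1 - cos(a, b). Similarity is 1 - distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Identical directions give distance 0 (similarity 1); orthogonal vectors give distance 1 (similarity 0).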
## Milvus

```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=255),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields=fields)
collection = Collection("documents", schema)

# Insert (column-oriented: one list per non-auto field)
collection.insert([
    ["Auth uses JWT tokens.", "OAuth2 framework..."],  # text
    ["auth.md", "oauth.md"],                           # source
    [embedding1, embedding2],                          # embedding
])

# Build HNSW index
index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)
collection.load()

# Search
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    expr='source == "auth.md"',
    output_fields=["text", "source"]
)
```
## Indexing Strategies

### HNSW (Hierarchical Navigable Small World)

- **Best for**: Most use cases, good recall/speed balance
- **Memory**: High (keeps graph in RAM)
- **Parameters**: `M` (connections per node, 16-64), `ef_construction` (build quality, 100-400)

### IVF-Flat / IVF-PQ

- **Best for**: Large datasets where memory is constrained
- **Memory**: Lower with PQ (product quantization)
- **Parameters**: `nlist` (clusters), `nprobe` (clusters to search)

### Flat (Brute Force)

- **Best for**: < 50K vectors, exact results needed
- **Memory**: Low
- **Parameters**: None

```
Vectors < 50K       --> Flat index (exact, fast enough)
50K - 1M vectors    --> HNSW (best recall/speed)
1M - 100M vectors   --> IVF-HNSW or IVF-PQ (memory-efficient)
> 100M vectors      --> Distributed Milvus/Pinecone + IVF-PQ
```
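A Flat index is simply an exhaustive scan. This pure-Python sketch shows what it computes, which is also the exact baseline that HNSW and IVF approximate:

```python
import heapq
import math

def flat_search(query: list, vectors: list, top_k: int = 5) -> list:
    """Exact top-k neighbors by cosine similarity: score every vector, keep the best."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    # O(n * d) per query -- fine below ~50K vectors, too slow beyond that
    return heapq.nlargest(top_k, ((cos(query, v), i) for i, v in enumerate(vectors)))
```

The linear cost per query is exactly why the ladder above switches to approximate indexes as the collection grows.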
## Anti-Patterns

1. **No metadata filtering** -- Searching all vectors when you know the document category. Always attach filterable metadata during indexing.
2. **Wrong distance metric** -- Using L2 when embeddings are normalized (use cosine or inner product). Check your embedding model's recommendation.
3. **Skipping index tuning** -- Default HNSW parameters work but are rarely optimal. Benchmark `M` and `ef` on your dataset.
4. **Storing text outside the vector DB** -- Requiring a separate lookup for chunk text adds latency. Most vector DBs support payload/metadata storage -- use it.
5. **Not planning for updates** -- If documents change, you need upsert/delete strategies. Design your ID scheme to support incremental updates from day one.
6. **Single-node for production** -- ChromaDB and local Qdrant are not designed for high-availability production. Use managed services or proper cluster deployments.
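One update-friendly ID scheme (an illustrative sketch, not a prescribed convention): derive a deterministic ID from the source document and chunk position, so re-indexing a changed document upserts over its old chunks instead of creating duplicates:

```python
import hashlib

def chunk_id(source: str, chunk_index: int) -> str:
    """Deterministic ID per (document, chunk position) -- re-indexing upserts in place."""
    return hashlib.sha256(f"{source}#{chunk_index}".encode()).hexdigest()[:16]
```

If a document shrinks, you still need a delete pass for trailing chunk IDs that no longer exist; tracking the previous chunk count per source covers that.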
## Selection Guide

- **Just prototyping?** ChromaDB (zero config, in-process)
- **Already on Postgres?** pgvector (no new infrastructure)
- **Need managed + serverless?** Pinecone (lowest ops burden)
- **Need rich filtering + self-hosted?** Qdrant
- **Need multi-modal or auto-vectorization?** Weaviate
- **Need massive scale (100M+ vectors)?** Milvus / Zilliz
Install this skill directly: `skilldb add rag-pipeline-skills`
## Related Skills

- **advanced-rag** -- Advanced RAG patterns beyond basic retrieve-and-generate. Covers multi-hop RAG, agentic RAG with tool use, graph RAG (knowledge graphs + vector retrieval), recursive retrieval, self-querying retrievers, query decomposition, citation extraction, and corrective RAG. Includes implementation patterns and guidance on when each advanced technique is warranted.
- **chunking-strategies** -- Comprehensive guide to document chunking strategies for RAG pipelines. Covers fixed-size, semantic, recursive character, sentence-based, parent-child, markdown-aware, and code-aware chunking. Includes chunk size optimization, overlap strategies, and practical benchmarks for choosing the right approach based on document type and retrieval quality.
- **embedding-models** -- Guide to selecting, using, and optimizing text embedding models for RAG pipelines. Covers commercial models (OpenAI text-embedding-3, Cohere embed-v3, Voyage AI) and open-source options (BGE, E5, Nomic Embed). Includes dimensionality selection, batch processing, embedding caching, fine-tuning for domain-specific retrieval, and cost analysis.
- **rag-evaluation** -- Evaluating RAG systems end-to-end. Covers retrieval metrics (context precision, context recall, MRR), generation metrics (faithfulness, answer relevance, hallucination detection), the RAGAS framework, human evaluation protocols, A/B testing retrieval strategies, building evaluation datasets, and continuous monitoring in production.
- **rag-fundamentals** -- Teaches the foundational architecture of Retrieval-Augmented Generation (RAG) systems. Covers why RAG outperforms fine-tuning for most knowledge-grounding use cases, the three core stages (indexing, retrieval, generation), component design, latency budgets, and evaluation metrics including faithfulness, relevance, and hallucination rate. Use when building or explaining any RAG system from scratch.
- **rag-production** -- Production-grade RAG deployment patterns. Covers caching strategies (semantic and exact), streaming responses, token budget management, fallback strategies for retrieval failures, monitoring retrieval quality, cost optimization, incremental indexing, multi-tenancy, and operational best practices for running RAG systems at scale.