# Indexing Strategies

Index types, design strategies, and maintenance for optimal query performance in relational databases.
You are an expert in database indexing strategies: designing, creating, and maintaining indexes that maximize query performance across PostgreSQL, MySQL, and SQL Server.
## Overview
An index is an auxiliary data structure that allows the database engine to locate rows without scanning every page in a table. Choosing the right index type, column order, and coverage is the single highest-leverage performance optimization available. Poor indexing leads to full table scans; over-indexing wastes storage and slows writes.
## Core Concepts

### Index Types
| Type | Engine Support | Use Case |
|---|---|---|
| B-tree | All major engines | Equality, range, sorting, prefix LIKE |
| Hash | PostgreSQL, MySQL (Memory engine) | Equality only, no range support |
| GIN (Generalized Inverted) | PostgreSQL | Full-text search, JSONB, arrays, tsvector |
| GiST (Generalized Search Tree) | PostgreSQL | Geometric, range types, full-text |
| BRIN (Block Range Index) | PostgreSQL | Very large naturally-ordered tables (timestamps, sequences) |
| Partial / Filtered | PostgreSQL, SQL Server | Index a subset of rows matching a predicate |
| Covering / INCLUDE | PostgreSQL 11+, SQL Server | Store extra columns in the index leaf to enable index-only scans |
| Clustered | SQL Server, MySQL (InnoDB primary key) | Physically orders table data by the index key |
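As a quick illustration of choosing a non-default type, a hash index handles equality lookups but no range or sort operations. A sketch, assuming a hypothetical `sessions` table with a `token` column:

```sql
-- PostgreSQL 10+: hash indexes are WAL-logged and crash-safe
CREATE INDEX idx_sessions_token ON sessions USING hash (token);

-- Can use the hash index (equality predicate):
SELECT * FROM sessions WHERE token = 'abc123';

-- Cannot use it (range predicate) — a B-tree would be needed instead:
SELECT * FROM sessions WHERE token > 'abc';
```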
### B-tree Internals (Simplified)

```
        [30 | 70]                 <- root
       /    |    \
[10|20] [40|50|60] [80|90]        <- internal nodes
  / | \    / | \ \    / | \
 leaf pages with row pointers     <- leaf level (doubly linked)
```

The leaf level is a doubly-linked list, enabling efficient range scans. Lookups are O(log N).
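Because the leaf pages are linked, a range predicate descends the tree once and then walks sideways along the leaf chain. A sketch against a hypothetical `events` table:

```sql
CREATE INDEX idx_events_id ON events (id);

-- One descent to the leaf containing 30, then a sequential walk
-- along the linked leaves up to 70; the rows come back already
-- ordered, so no separate sort step is required:
SELECT id FROM events WHERE id BETWEEN 30 AND 70 ORDER BY id;
```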
### Key Terminology

- **Selectivity** — the fraction of rows a predicate matches. Lower is better for indexing (1/N is ideal).
- **Cardinality** — the number of distinct values in a column.
- **Index-only scan** — the query is answered entirely from the index without touching the heap.
- **Covering index** — an index that includes all columns needed by a query.
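In PostgreSQL, the cardinality and NULL estimates that drive these decisions can be inspected directly in the planner statistics. A sketch (`orders.status` is a hypothetical table/column):

```sql
-- n_distinct > 0: estimated count of distinct values;
-- n_distinct < 0: negated fraction of rows that are distinct
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
```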
## Implementation Patterns

### Composite Index Column Order

The leftmost prefix rule: a composite index on `(a, b, c)` supports queries on `(a)`, `(a, b)`, and `(a, b, c)`, but not `(b)` or `(c)` alone.

```sql
-- Supports WHERE tenant_id = ? AND status = ? AND created_at > ?
CREATE INDEX idx_orders_tenant_status_date
ON orders (tenant_id, status, created_at);
```

Rule of thumb for column order:

1. Equality columns first (most selective first among them)
2. Range/sort column last
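Applying the leftmost prefix rule to `idx_orders_tenant_status_date`, a sketch of which predicates can and cannot use it:

```sql
-- Can use the index: tenant_id is a leftmost prefix
SELECT * FROM orders WHERE tenant_id = 7;

-- Cannot use it: status alone is not a leftmost prefix,
-- so this likely degrades to a sequential scan
SELECT * FROM orders WHERE status = 'active';
```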
### Partial (Filtered) Indexes

```sql
-- Only index active orders — much smaller than a full index
CREATE INDEX idx_orders_active
ON orders (customer_id, created_at)
WHERE status = 'active';

-- Query must include the matching predicate to use this index
SELECT * FROM orders
WHERE status = 'active'
  AND customer_id = 42
ORDER BY created_at DESC;
```
### Covering Indexes with INCLUDE

```sql
-- PostgreSQL: cover the SELECT columns to avoid heap fetches
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (total_amount, status);

-- This query can now be an index-only scan
SELECT order_date, total_amount, status
FROM orders
WHERE customer_id = 123
ORDER BY order_date DESC;
```
### Expression Indexes

```sql
-- Index on a computed expression
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));

-- Query must use the same expression
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
```
### BRIN for Time-Series Data

```sql
-- Extremely compact index for naturally ordered data
CREATE INDEX idx_events_ts_brin
ON events USING brin (created_at)
WITH (pages_per_range = 32);

-- Works well when rows are inserted roughly in created_at order
```
### GIN for JSONB

```sql
-- jsonb_path_ops builds a smaller GIN index specialized
-- for @> containment queries on the JSONB column
CREATE INDEX idx_metadata_gin
ON products USING gin (metadata jsonb_path_ops);

-- Supports containment queries
SELECT * FROM products WHERE metadata @> '{"color": "red"}';
```
## Core Philosophy

Indexing is the art of giving the query optimizer viable access paths for the queries your application actually runs. An index is not a generic performance booster — it is a data structure tailored to a specific access pattern. The right index turns a 30-second full table scan into a 2-millisecond lookup. The wrong index wastes disk space, slows writes, and is never used by the optimizer. Effective indexing starts from the query workload, not from the table definition.

The fundamental tradeoff in indexing is read performance versus write overhead. Every index must be maintained on every INSERT, UPDATE, and DELETE that touches its columns. A table with fifteen indexes pays that maintenance cost fifteen times per write. The goal is not to index every column but to index the minimum set of columns required to cover the critical query paths. This requires measuring — checking slow query logs, examining `pg_stat_statements`, and running `EXPLAIN ANALYZE` — not guessing.

Column order in composite indexes is the most common source of indexing mistakes. The leftmost prefix rule means an index on `(a, b, c)` supports queries on `(a)`, `(a, b)`, and `(a, b, c)`, but not `(b)` or `(c)` alone. The general principle is: equality columns first (most selective among them), then range or sort columns last. Getting this order wrong means the index exists but the optimizer cannot use it, which is worse than having no index at all because it creates a false sense of coverage.
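Measurement, not guesswork, closes the loop. A minimal verification sketch in PostgreSQL, using the composite index defined earlier:

```sql
-- Confirm the planner actually chooses the index for the real query shape
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE tenant_id = 7
  AND status = 'active'
  AND created_at > '2024-01-01';
-- Look for an Index Scan on idx_orders_tenant_status_date
-- in the plan output rather than a Seq Scan.
```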
## Anti-Patterns

- **Indexing every column "just in case"** — over-indexing slows every write operation, inflates storage, and makes autovacuum and maintenance slower. Each index should be justified by a specific query pattern visible in the workload.
- **Creating indexes without checking if they are used** — indexes that the optimizer never selects waste resources silently. Monitor `pg_stat_user_indexes` (PostgreSQL) or `sys.dm_db_index_usage_stats` (SQL Server) and drop indexes with zero scans.
- **Putting range columns before equality columns in composite indexes** — an index on `(created_at, status)` cannot efficiently handle `WHERE status = 'active' AND created_at > '2024-01-01'`. The range scan on `created_at` breaks the ability to filter on `status`. Reverse the order.
- **Using function-wrapped columns in WHERE without a matching expression index** — `WHERE LOWER(email) = 'alice@example.com'` cannot use a B-tree index on `email`. Either create an expression index on `LOWER(email)` or rewrite the query to match the indexed expression.
- **Creating indexes on large tables without CONCURRENTLY** — `CREATE INDEX` without `CONCURRENTLY` (PostgreSQL) or equivalent takes an exclusive lock that blocks all writes for the duration of the build. On large tables, this can mean minutes of downtime.
## Best Practices

- **Start from slow query logs** — index what is actually queried, not what might be queried. Use `pg_stat_statements`, the MySQL slow query log, or SQL Server Query Store.
- **Check usage statistics** — monitor `pg_stat_user_indexes` or `sys.dm_db_index_usage_stats` to find unused indexes and drop them.
- **Aim for index-only scans** — use INCLUDE columns or widen the index to cover SELECT and WHERE columns.
- **Keep indexes narrow** — every additional column increases index size and write overhead.
- **Use partial indexes** for common filter conditions that exclude most rows (e.g., `WHERE deleted_at IS NULL`).
- **Rebuild or reindex periodically** — B-tree indexes can bloat after heavy UPDATE/DELETE workloads.

```sql
-- PostgreSQL 12+: concurrent reindex (no table lock)
REINDEX INDEX CONCURRENTLY idx_orders_tenant_status_date;

-- Check index sizes (largest first)
SELECT
  schemaname, relname, indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
```

- **Use `CONCURRENTLY`** for production index creation to avoid locking the table.

```sql
CREATE INDEX CONCURRENTLY idx_new ON big_table (col);
```
## Common Pitfalls

- **Indexing low-cardinality columns alone** — a B-tree on a boolean column is almost never useful. Combine it with other columns or use a partial index.
- **Wrong column order in composite indexes** — putting the range column before equality columns prevents the index from being fully utilized.
- **Over-indexing** — every index slows INSERT, UPDATE, and DELETE. Tables with 15+ indexes are a red flag.
- **Forgetting expression match** — an index on `LOWER(email)` is not used by `WHERE email = 'Alice@example.com'`. The query expression must match the index expression.
- **Not accounting for NULLs** — in most engines, B-tree indexes include NULLs, but `IS NOT NULL` filters may still cause scans. Partial indexes can help.
- **Assuming the optimizer always uses your index** — the planner may choose a sequential scan if it estimates the index will touch too many rows. Check `EXPLAIN ANALYZE` to verify.
- **Ignoring TOAST and wide columns** — indexing a column that is TOASTed (compressed/out-of-line) can be counterproductive. Index a hash or prefix instead.
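For the NULL pitfall above, a partial index is one mitigation: it stores only the rows the `IS NOT NULL` query actually wants. A sketch, assuming a hypothetical `shipped_at` column on `orders`:

```sql
-- Skips the (possibly numerous) NULL rows entirely,
-- keeping the index small and the scan targeted:
CREATE INDEX idx_orders_shipped
ON orders (shipped_at)
WHERE shipped_at IS NOT NULL;
```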
## Related Skills

- **CTEs Recursive** — Common Table Expressions and recursive queries for hierarchical data, graph traversal, and complex query composition
- **JSON Operations** — JSON storage, querying, indexing, and manipulation in PostgreSQL and MySQL for semi-structured data
- **Migration Patterns** — Database schema migration strategies for safe, reversible, and zero-downtime changes to production databases
- **Query Optimization** — Techniques for analyzing and optimizing SQL query performance using execution plans, statistics, and query rewrites
- **Stored Procedures** — Stored procedures, functions, and triggers for encapsulating business logic and automating actions within the database
- **Transactions Isolation** — Transaction management, ACID properties, isolation levels, and concurrency control in relational databases