
Query Optimization

Techniques for analyzing and optimizing SQL query performance using execution plans, statistics, and query rewrites

Quick Summary
You are an expert in SQL query optimization: diagnosing slow queries and applying systematic techniques to improve performance across PostgreSQL, MySQL, and SQL Server.

## Key Points

**Query processing pipeline:**

1. **Parse** — SQL text to abstract syntax tree
2. **Analyze/Bind** — resolve table and column names, check permissions
3. **Rewrite** — apply rules (view expansion, predicate pushdown)
4. **Optimize** — generate candidate plans, estimate costs, pick cheapest
5. **Execute** — run the chosen plan, return results

**Cost model basics (PostgreSQL):**

- `seq_page_cost` = 1.0 (baseline)
- `random_page_cost` = 4.0 (a random page fetch is assumed ~4x the cost of a sequential read)
- `cpu_tuple_cost` = 0.01
- Row estimates drive join order and method choice

**Top practices:**

1. **Always use EXPLAIN ANALYZE** — estimated plans can mislead. Actual execution reveals true row counts, timing, and buffer usage.
2. **Keep statistics fresh** — run ANALYZE after bulk loads. Consider `autovacuum` tuning in PostgreSQL.
3. **Index for your WHERE, JOIN, ORDER BY, and GROUP BY** — these are the clauses the optimizer needs access paths for.

## Quick Example

```sql
-- BAD: selecting all columns when only a few are needed
SELECT * FROM orders WHERE customer_id = 5;

-- GOOD: select only what you need; enables index-only scans
SELECT id, total_amount FROM orders WHERE customer_id = 5;
```

Query Optimization — SQL

You are an expert in SQL query optimization: diagnosing slow queries and applying systematic techniques to improve performance across PostgreSQL, MySQL, and SQL Server.

Overview

Query optimization is the practice of reducing query execution time and resource consumption by understanding how the database engine plans and executes SQL. It involves reading execution plans, ensuring accurate statistics, rewriting queries, and structuring schemas to align with access patterns. The optimizer makes cost-based decisions; your job is to give it accurate information and viable access paths.

Core Concepts

The Query Processing Pipeline

  1. Parse — SQL text to abstract syntax tree
  2. Analyze/Bind — resolve table and column names, check permissions
  3. Rewrite — apply rules (view expansion, predicate pushdown)
  4. Optimize — generate candidate plans, estimate costs, pick cheapest
  5. Execute — run the chosen plan, return results
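
The rewrite stage is easy to observe with views: a predicate applied to a view gets pushed down into the view's underlying query. A sketch (the `recent_orders` view is hypothetical, built on the document's `orders` table):

```sql
-- Hypothetical view over the orders table
CREATE VIEW recent_orders AS
SELECT id, customer_id, total_amount, created_at
FROM orders
WHERE created_at >= '2024-01-01';

-- The rewriter expands the view and pushes the predicate down, so this...
SELECT * FROM recent_orders WHERE customer_id = 5;

-- ...is planned as if you had written:
SELECT id, customer_id, total_amount, created_at
FROM orders
WHERE created_at >= '2024-01-01' AND customer_id = 5;
```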

Reading Execution Plans

```sql
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- MySQL
EXPLAIN ANALYZE SELECT ...;

-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Or use actual execution plan in SSMS
```

Key plan nodes to understand:

| Node | Meaning |
| --- | --- |
| Seq Scan / Full Table Scan | Reading every row — often a problem |
| Index Scan | B-tree traversal then heap fetch |
| Index Only Scan | Answered entirely from the index |
| Bitmap Index Scan | Build bitmap from index, then scan heap |
| Nested Loop | For each outer row, scan inner — good for small outer sets |
| Hash Join | Build hash table from smaller set, probe with larger |
| Merge Join | Both inputs sorted, merge — efficient for large sorted sets |
| Sort | Explicit sort, may spill to disk |
| Materialize | Cache a subplan's output for reuse |

Cost Model Basics

PostgreSQL costs are in arbitrary units. Key factors:

  • `seq_page_cost` = 1.0 (baseline)
  • `random_page_cost` = 4.0 (a random page fetch is assumed ~4x the cost of a sequential read)
  • `cpu_tuple_cost` = 0.01
  • Row estimates drive join order and method choice

If row estimates are wrong, the plan is wrong. Always check estimated vs actual rows.
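
Concretely, compare the two row figures on each plan node: in EXPLAIN ANALYZE output, `rows=` in the cost section is the estimate and `rows=` in the actual section is what really happened (the numbers below are illustrative, not real output):

```sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at >= '2024-01-01';

-- Illustrative output line:
--   Seq Scan on orders  (cost=0.00..1693.00 rows=120 width=44)
--                       (actual time=0.015..21.300 rows=48210 loops=1)
-- Estimated 120 rows, got 48,210 — a ~400x miss. Run ANALYZE and re-check.
```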

Implementation Patterns

Diagnosing a Slow Query

```sql
-- Step 1: Get the actual plan
EXPLAIN (ANALYZE, BUFFERS, COSTS, TIMING)
SELECT o.id, c.name, SUM(oi.quantity * oi.price) AS total
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at >= '2024-01-01'
GROUP BY o.id, c.name;

-- Step 2: Look for
--   - Large gaps between estimated and actual rows
--   - Seq Scans on large tables
--   - Sort nodes with "Sort Method: external merge" (disk spill)
--   - Nested loops with large outer row counts
```

Fixing Bad Row Estimates

```sql
-- Update statistics
ANALYZE orders;
ANALYZE order_items;

-- For correlated columns, use extended statistics (PostgreSQL 10+)
CREATE STATISTICS stat_orders_customer_date (dependencies)
  ON customer_id, created_at FROM orders;
ANALYZE orders;
```

Eliminating Unnecessary Work

```sql
-- BAD: selecting all columns when only a few are needed
SELECT * FROM orders WHERE customer_id = 5;

-- GOOD: select only what you need; enables index-only scans
SELECT id, total_amount FROM orders WHERE customer_id = 5;

-- BAD: DISTINCT as a band-aid for a bad join
SELECT DISTINCT o.id, o.total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id;

-- GOOD: fix the query logic
SELECT o.id, o.total FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items oi WHERE oi.order_id = o.id);
```

Rewriting Correlated Subqueries

```sql
-- SLOW: correlated subquery executes per row
SELECT id, name,
  (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) AS order_count
FROM customers c;

-- FASTER: lateral join or plain join with aggregation
SELECT c.id, c.name, COALESCE(o.order_count, 0) AS order_count
FROM customers c
LEFT JOIN (
  SELECT customer_id, COUNT(*) AS order_count
  FROM orders GROUP BY customer_id
) o ON o.customer_id = c.id;
```
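
The lateral-join variant mentioned above keeps the per-customer shape while letting the planner treat the subquery as a parameterized scan (PostgreSQL syntax; an aggregate subquery always yields one row, so no COALESCE is needed):

```sql
SELECT c.id, c.name, o.order_count
FROM customers c
LEFT JOIN LATERAL (
  SELECT COUNT(*) AS order_count
  FROM orders
  WHERE orders.customer_id = c.id
) o ON TRUE;
```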

Pagination Done Right

```sql
-- BAD: OFFSET-based pagination degrades with page depth
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 100000;

-- GOOD: keyset (cursor) pagination — constant performance
SELECT * FROM products
WHERE id > :last_seen_id
ORDER BY id
LIMIT 20;
```
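
When sorting by a non-unique column, keyset pagination needs a tie-breaker. A row-value comparison keeps the predicate index-friendly (PostgreSQL supports row-value comparisons directly; `:last_created_at` and `:last_id` are placeholder parameters, and the index name is illustrative):

```sql
-- Order by created_at, with id as the tie-breaker
SELECT id, created_at FROM products
WHERE (created_at, id) > (:last_created_at, :last_id)
ORDER BY created_at, id
LIMIT 20;

-- A composite index supports both the filter and the sort
CREATE INDEX idx_products_created_id ON products (created_at, id);
```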

Batch Operations

```sql
-- BAD: 10,000 individual INSERTs in a loop
INSERT INTO events (type, data) VALUES ('click', '...');

-- GOOD: batch insert
INSERT INTO events (type, data)
VALUES
  ('click', '...'),
  ('view', '...'),
  ('click', '...');
-- Or use COPY (PostgreSQL) / LOAD DATA INFILE (MySQL) for bulk loads
```
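
For genuinely large loads, the bulk interfaces bypass per-statement overhead entirely. A PostgreSQL sketch (the file path is illustrative; server-side COPY reads from the database server's filesystem, while psql's `\copy` streams from the client):

```sql
-- Server-side bulk load (path is illustrative)
COPY events (type, data) FROM '/tmp/events.csv' WITH (FORMAT csv);

-- Client-side equivalent via psql:
-- \copy events (type, data) FROM 'events.csv' WITH (FORMAT csv)
```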

Join Order Hints (When Needed)

```sql
-- PostgreSQL: force join order
-- (run inside a transaction; SET LOCAL has no effect outside one)
BEGIN;
SET LOCAL join_collapse_limit = 1;
SELECT ...
FROM small_table s
JOIN medium_table m ON m.id = s.mid
JOIN large_table l ON l.id = m.lid;
COMMIT;

-- MySQL: STRAIGHT_JOIN forces left-to-right join order
SELECT STRAIGHT_JOIN ...
FROM small_table s
JOIN large_table l ON l.id = s.lid;
```

Core Philosophy

Query optimization is not about writing clever SQL — it is about giving the optimizer accurate information and viable access paths so it can choose the best execution strategy. The optimizer is a cost-based decision engine that estimates the expense of different plans and picks the cheapest one. When its cost estimates are accurate (because statistics are fresh and the schema provides appropriate indexes), it almost always makes the right choice. When the estimates are wrong — usually because statistics are stale or row correlations are unmeasured — even simple queries can produce catastrophic plans.

The most important diagnostic skill is reading execution plans and comparing estimated rows to actual rows. A plan that estimates 10 rows but produces 100,000 will choose a nested loop join where a hash join was needed, or an index scan where a sequential scan was cheaper. The root cause is almost always stale statistics, correlated columns without extended statistics, or a data distribution that changed since the last ANALYZE. Fixing the estimates fixes the plan, and fixing the plan fixes the performance.

Optimization should follow measurement, never intuition. Adding an index, rewriting a join, or denormalizing a table based on a guess is as likely to hurt performance as to help it. The correct workflow is: identify the slow query from monitoring, run EXPLAIN ANALYZE to understand the current plan, diagnose the bottleneck (sequential scan, disk sort, nested loop on a large outer set), apply a targeted fix (add an index, rewrite the predicate, update statistics), and verify with EXPLAIN ANALYZE again. This measure-diagnose-fix-verify cycle is the only reliable path to better query performance.
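
The measure-diagnose-fix-verify cycle, sketched against the earlier `orders` example (the index name is illustrative; `INCLUDE` is PostgreSQL 11+ syntax for covering indexes):

```sql
-- Measure: get the actual plan
EXPLAIN ANALYZE SELECT id, total_amount FROM orders WHERE customer_id = 5;
-- Diagnose: suppose it shows a Seq Scan on orders

-- Fix: give the optimizer an access path; covering the selected
-- columns enables an index-only scan
CREATE INDEX idx_orders_customer_covering
  ON orders (customer_id) INCLUDE (id, total_amount);

-- Verify: re-run and confirm an Index Only Scan with lower actual time
EXPLAIN ANALYZE SELECT id, total_amount FROM orders WHERE customer_id = 5;
```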

Anti-Patterns

  • Adding indexes without measuring first — creating an index because a column appears in a WHERE clause, without checking whether the optimizer would use it or whether the existing plan is actually slow, wastes write overhead and storage.

  • Using SELECT * in production queries — selecting all columns prevents index-only scans, pulls unnecessary data across the network, and makes it impossible for the optimizer to satisfy the query from a covering index.

  • Wrapping indexed columns in functions — `WHERE YEAR(created_at) = 2024` forces a sequential scan because the B-tree index on created_at cannot be used through the function. Rewrite as a range predicate: `WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'`.

  • Using OFFSET for deep pagination — `OFFSET 100000` forces the database to scan and discard 100,000 rows before returning the next page. Performance degrades linearly with page depth. Use keyset (cursor) pagination for stable, constant-time page fetches.

  • Relying on query hints instead of fixing the root cause — hints override the optimizer's cost model. They may help today, but when data distribution changes tomorrow, the forced plan becomes the worst plan. Fix statistics, add indexes, or rewrite the query instead.
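
To act on the first anti-pattern, measure index usage before adding or keeping indexes. In PostgreSQL, `pg_stat_user_indexes` tracks how often each index has been scanned:

```sql
-- Indexes never used since statistics were last reset:
-- candidates for removal (after checking they don't back constraints)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
```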

Best Practices

  1. Always use EXPLAIN ANALYZE — estimated plans can mislead. Actual execution reveals true row counts, timing, and buffer usage.
  2. Keep statistics fresh — run ANALYZE after bulk loads. Consider autovacuum tuning in PostgreSQL.
  3. Index for your WHERE, JOIN, ORDER BY, and GROUP BY — these are the clauses the optimizer needs access paths for.
  4. Avoid functions on indexed columns in WHERE — `WHERE YEAR(created_at) = 2024` cannot use a B-tree index on created_at. Rewrite as a range: `WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'`.
  5. Prefer EXISTS over IN for large subqueries — EXISTS short-circuits; IN may materialize the full subquery.
  6. Test with production-scale data — a query that is fast on 1,000 rows may be catastrophic on 10 million. Plans change with table size.
  7. Monitor query performance continuously — use pg_stat_statements, MySQL Performance Schema, or SQL Server Query Store.
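
Point 5 in practice, using the document's customers/orders schema:

```sql
-- Preferred: EXISTS can stop at the first matching row per customer
SELECT c.id, c.name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Equivalent IN form; some optimizers materialize the full subquery first
SELECT c.id, c.name FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders);
```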

Common Pitfalls

  • Premature optimization — do not add indexes or rewrite queries before measuring. Profile first.
  • Implicit type casts — `WHERE varchar_col = 123` may prevent index use because the engine casts every row. Match types explicitly.
  • OR conditions defeating indexes — `WHERE a = 1 OR b = 2` often results in a sequential scan. Rewrite as UNION ALL of two indexed queries if needed.
  • Over-relying on query hints — hints bypass the optimizer's cost model. They may help today and hurt tomorrow when data distribution changes.
  • N+1 query patterns in application code — the ORM fetches 1 parent query then N child queries. Use eager loading or a single JOIN query.
  • Ignoring lock contention — a query may be fast in isolation but slow under concurrency due to row locks, table locks, or buffer contention. Check wait events.
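
The OR rewrite from the list above, sketched on a hypothetical table `t` with separate indexes on `a` and `b`. The second branch excludes rows already returned by the first, so UNION ALL avoids both duplicates and the deduplication cost of plain UNION:

```sql
-- Instead of: SELECT * FROM t WHERE a = 1 OR b = 2;
SELECT * FROM t WHERE a = 1
UNION ALL
SELECT * FROM t WHERE b = 2 AND a <> 1;
-- Note: a <> 1 assumes a is NOT NULL; use a IS DISTINCT FROM 1
-- (PostgreSQL) if a can be NULL
```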
