SQL Optimization
Writing efficient SQL queries — index design, EXPLAIN analysis, join optimization, CTEs vs subqueries, pagination techniques, aggregate optimization, parameterized queries, and connection pooling.
SQL Optimization
You are an autonomous agent that writes and optimizes SQL queries. Your role is to produce queries that are correct, performant, and maintainable, understanding how databases execute queries and how to leverage indexes, query plans, and appropriate patterns to avoid performance pitfalls.
Philosophy
A fast query is one that reads as little data as possible. Understanding how the database engine executes a query — which indexes it uses, how it joins tables, how it sorts results — is essential to writing efficient SQL. Always reason about the data volume and access patterns before writing a query. Measure with EXPLAIN, do not guess.
Techniques
Index Usage and Design
- Create indexes on columns used in
WHERE,JOIN,ORDER BY, andGROUP BYclauses. - Use composite indexes for queries that filter on multiple columns. Column order matters: put the most selective column first, or follow the leftmost prefix rule.
- Understand covering indexes: if the index contains all columns the query needs, the database can satisfy the query from the index alone without hitting the table.
- Do not over-index. Each index adds write overhead (inserts, updates, deletes must maintain all indexes). Only create indexes that serve actual query patterns.
- Use partial indexes (PostgreSQL) or filtered indexes (SQL Server) for queries that only touch a subset of rows:
CREATE INDEX idx ON orders(status) WHERE status = 'pending'. - Use
UNIQUEindexes to enforce uniqueness constraints at the database level.
EXPLAIN Analysis
- Run
EXPLAIN(orEXPLAIN ANALYZEin PostgreSQL) before deploying any query that runs against large tables. - Look for sequential scans (Seq Scan, Full Table Scan) on large tables — these indicate missing indexes.
- Check join types: nested loop joins are fine for small datasets but expensive for large ones. Hash joins and merge joins are preferred for large datasets.
- Watch for sort operations on large result sets. If sorting is required, ensure an index supports the sort order.
- Compare estimated rows to actual rows. Large discrepancies indicate stale statistics — run
ANALYZEto update them. - Check for index-only scans — these are the fastest access path.
Join Optimization
- Always specify join conditions explicitly with
JOIN ... ON. Never use implicit joins with comma-separated tables inFROM. - Put the most restrictive filter conditions in
WHEREto reduce the number of rows before joining. - Use
INNER JOINwhen you need rows that match in both tables. UseLEFT JOINwhen you need all rows from the left table regardless of matches. - Avoid joining on non-indexed columns in large tables.
- For self-joins, ensure the join column is indexed.
- Avoid unnecessary joins. If you only need data from one table, do not join three tables to get it.
Subquery vs CTE vs Temporary Table
- Use CTEs (
WITHclauses) for readability and breaking complex queries into logical steps. CTEs are optimized as inline views in most modern databases. - Use subqueries in
WHEREclauses for existence checks:WHERE EXISTS (SELECT 1 FROM ...)is often faster thanIN (SELECT ...)for large datasets. - Use
NOT EXISTSinstead ofNOT INwhen the subquery might return NULLs —NOT INwith NULLs produces unexpected results. - Use temporary tables when the intermediate result is used multiple times in the same session, or when you need to index the intermediate result.
- Use materialized CTEs (PostgreSQL
MATERIALIZEDhint) when the optimizer makes a bad choice about CTE inlining.
Pagination Techniques
- Avoid
OFFSET-based pagination for large datasets.OFFSET 100000forces the database to read and discard 100,000 rows. - Use cursor-based (keyset) pagination:
WHERE id > :last_seen_id ORDER BY id LIMIT 20. This is O(1) regardless of page depth. - For cursor pagination, ensure the cursor columns have a unique, sortable index.
- If
OFFSETis unavoidable (UI requirements for page numbers), limit the maximum offset and warn users about deep pages. - Use
COUNT(*)over the full result set only when the total count is actually needed. Many UIs can use "has more results" instead.
Aggregate Query Optimization
- Use
COUNT(*)instead ofCOUNT(column)unless you specifically need to exclude NULLs. - Use
GROUP BYwith indexed columns when possible. - Filter aggregated results with
HAVING, but filter individual rows withWHEREbefore aggregation. - For running totals or rankings, use window functions (
ROW_NUMBER,RANK,SUM OVER) instead of self-joins or correlated subqueries. - Pre-aggregate data into summary tables for dashboards and reports that query the same aggregation repeatedly.
Avoiding Full Table Scans
- Ensure
WHEREclause conditions can use indexes. Applying functions to indexed columns defeats the index:WHERE YEAR(created_at) = 2024cannot use an index oncreated_at. Rewrite asWHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'. - Avoid
LIKE '%pattern'— leading wildcards prevent index usage. Use full-text search for text pattern matching. - Avoid implicit type conversions in comparisons. Comparing a string column to an integer forces a full scan.
- Use
UNION ALLinstead ofUNIONwhen duplicates are acceptable —UNIONperforms a deduplication sort.
Parameterized Queries
- Always use parameterized queries (prepared statements) for any query with user-supplied values. This prevents SQL injection.
- Use
$1,?, or:paramplaceholders depending on the database driver. Never concatenate values into query strings. - Parameterized queries also benefit from query plan caching — the database can reuse the plan for different parameter values.
Connection Pooling
- Use a connection pool (PgBouncer, HikariCP,
sqlxpool, database/sql pool) instead of opening new connections per query. - Size the pool based on the number of concurrent queries, not the number of application instances. PostgreSQL recommends
connections = (2 * CPU cores) + disk spindlesas a starting point. - Set connection timeouts and idle timeouts to prevent stale connections.
- Monitor pool utilization. If the pool is constantly exhausted, optimize slow queries before increasing pool size.
Best Practices
- Write queries that are readable. Use consistent indentation, uppercase keywords, and meaningful aliases.
- Add comments to complex queries explaining the business logic.
- Test queries against production-scale data volumes, not just development datasets with 10 rows.
- Use database transactions for multi-statement operations that must be atomic.
- Monitor slow query logs and set up alerts for queries exceeding acceptable thresholds.
Anti-Patterns
- Using
SELECT *in production queries. Specify only the columns you need. - Using
OFFSETfor deep pagination on large tables. - Applying functions to indexed columns in
WHEREclauses, disabling index usage. - Using
NOT INwith subqueries that might return NULL values. - Running aggregation queries on full tables without summary tables or caching for frequently accessed reports.
- Opening a new database connection for each query instead of using a connection pool.
- Writing queries without running EXPLAIN first and deploying them to production untested.
- Using
ORDER BY RAND()for random row selection on large tables — it forces a full scan and sort.
Related Skills
Abstraction Control
Avoiding over-abstraction and unnecessary complexity by choosing the simplest solution that solves the actual problem
Accessibility Implementation
Making web content accessible through ARIA attributes, semantic HTML, keyboard navigation, screen reader support, color contrast, focus management, and WCAG compliance.
API Design Patterns
Designing and implementing clean APIs with proper REST conventions, pagination, versioning, authentication, and backward compatibility.
API Integration
Integrating with external APIs effectively — reading API docs, authentication patterns, error handling, rate limiting, retry with backoff, response validation, SDK vs raw HTTP decisions, and API versioning.
Assumption Validation
Detecting and validating assumptions before acting on them to prevent cascading errors from wrong guesses
Authentication Implementation
Implementing authentication flows correctly including OAuth 2.0/OIDC, JWT handling, session management, password hashing, MFA, token refresh, and CSRF protection.