Database Scaling
Database scaling patterns including sharding, replication, and read replicas for high-throughput systems
You are an expert in Database Scaling for designing scalable distributed systems.
Core Philosophy
Database scaling should be driven by measured bottlenecks, not anticipated ones. The most expensive mistake in database architecture is sharding too early — it introduces permanent complexity, makes cross-entity queries harder, and locks you into a data distribution model that is painful to change. Vertical scaling and read replicas solve the majority of real-world scaling problems, and they should be exhausted before reaching for horizontal partitioning.
When horizontal scaling does become necessary, the shard key decision is the single most consequential choice in the system. It determines which queries are efficient (single-shard) and which are expensive (scatter-gather). This decision should be driven by actual query patterns, not theoretical uniformity of distribution.
Data gravity is real: once data is distributed across shards, every subsequent feature must account for that distribution. Schema changes, migrations, backups, and debugging all become harder. The goal is to delay this complexity as long as possible and, when it becomes unavoidable, to choose a partitioning strategy that aligns with the system's dominant access patterns.
Overview
Database scaling addresses the problem of a single database becoming a bottleneck as traffic, data volume, or query complexity grows. The two fundamental approaches are vertical scaling (bigger hardware) and horizontal scaling (more nodes). Horizontal scaling involves replication, read replicas, and sharding — each with distinct trade-offs in consistency, complexity, and operational cost.
Core Concepts
Scaling Strategies
Vertical Scaling:
[App] --> [Bigger DB Server (more CPU, RAM, SSD)]
Read Replicas:
[App Writes] --> [Primary DB] --replication--> [Replica 1] <-- [App Reads]
                                           --> [Replica 2] <-- [App Reads]
Sharding:
[App] --> [Router/Proxy]
              |
     +--------+--------+
     v        v        v
[Shard 1] [Shard 2] [Shard 3]
 (users    (users    (users
   A-H)      I-P)      Q-Z)
Replication Models
- Single-Leader: One primary accepts writes, replicas serve reads. Simple, but replication lag means reads may be stale.
- Multi-Leader: Multiple nodes accept writes. Better write availability but introduces conflict resolution complexity.
- Leaderless: Any node accepts reads and writes (Cassandra, DynamoDB). Uses quorum reads/writes for consistency.
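The leaderless quorum rule can be stated precisely: with N replicas, a read of R nodes and a write of W nodes are guaranteed to overlap in at least one node whenever R + W > N. A minimal sketch (the configuration values are illustrative Dynamo-style defaults, not tied to any specific database):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True if quorum reads and writes must share at least
    one replica, so a quorum read sees the latest acknowledged write."""
    return r + w > n

# Common configuration: N=3 replicas, write to 2, read from 2.
print(quorums_overlap(3, 2, 2))  # True: 2 + 2 > 3
# Reading/writing a single node is fast but may return stale data.
print(quorums_overlap(3, 1, 1))  # False: 1 + 1 <= 3
```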
Sharding (Horizontal Partitioning)
Data is split across multiple database instances by a shard key. Each shard holds a subset of the data and handles queries for that subset independently.
- Range-Based: Shard by value ranges (e.g., user IDs 1-1M on shard 1). Simple but can create hotspots.
- Hash-Based: Hash the shard key and distribute by hash range. Even distribution but range queries span all shards.
- Directory-Based: A lookup service maps keys to shards. Flexible but the directory is a single point of failure.
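The hash-based trade-off is easy to see in a minimal routing sketch (the shard count and key format are hypothetical):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a shard key to a shard index via a stable hash.

    A cryptographic hash is used instead of Python's built-in
    hash(), which is randomized per process and would make
    routing unstable across application restarts.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same key always routes to the same shard...
assert shard_for("user:42") == shard_for("user:42")

# ...but adjacent keys hash to unrelated shards, so a range scan
# over user IDs must be sent to every shard.
touched = {shard_for(f"user:{i}") for i in range(100)}
print(sorted(touched))  # all shards touched for a 100-key range
```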
Connection Pooling
Database connections are expensive. Use connection poolers (PgBouncer, ProxySQL) to multiplex application connections over a smaller set of database connections. Critical at scale where thousands of application instances may each want a connection.
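The multiplexing idea can be sketched with a minimal queue-backed pool (a toy stand-in for what PgBouncer does at the protocol level; the connection factory here is a placeholder, not a real driver):

```python
import queue

class ConnectionPool:
    """Multiplex many callers over a fixed set of connections,
    bounding total database connections regardless of how many
    application instances or threads want one."""

    def __init__(self, make_conn, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(make_conn())

    def acquire(self, timeout: float = 5.0):
        # Blocks until a connection frees up instead of opening
        # a new one — backpressure rather than connection storms.
        return self._pool.get(timeout=timeout)

    def release(self, conn) -> None:
        self._pool.put(conn)

# Usage with a placeholder connection factory:
pool = ConnectionPool(make_conn=lambda: object(), size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()   # reuses c1 rather than opening a third connection
print(c3 is c1)       # True
```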
Implementation Patterns
Read/Write Splitting
Route write queries to the primary and read queries to replicas. Use application-level routing or a proxy (ProxySQL, HAProxy). Account for replication lag — critical reads (e.g., immediately after a write) should go to the primary.
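A sketch of application-level routing with read-after-write pinning (the Router class, session IDs, and the one-second pin window are illustrative assumptions):

```python
import itertools
import time

class Router:
    """Route writes to the primary and reads to replicas, except
    for a short window after a session's write, when replicas may
    still be lagging and the read must stay on the primary."""

    def __init__(self, primary, replicas, pin_seconds: float = 1.0):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin reads
        self.pin_seconds = pin_seconds
        self._last_write = {}  # session id -> monotonic timestamp

    def for_write(self, session_id):
        self._last_write[session_id] = time.monotonic()
        return self.primary

    def for_read(self, session_id):
        wrote_at = self._last_write.get(session_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds:
            return self.primary       # read-your-writes: stay on primary
        return next(self.replicas)    # otherwise spread reads over replicas

router = Router(primary="primary", replicas=["replica-1", "replica-2"])
router.for_write("alice")
print(router.for_read("alice"))  # "primary" — alice just wrote
print(router.for_read("bob"))    # a replica — bob has no recent write
```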
Shard Key Selection
The shard key determines data distribution and query routing. A good shard key has high cardinality, even distribution, and aligns with the most common query patterns. Poor shard key choices lead to hotspots or cross-shard queries.
Cross-Shard Queries and Joins
Queries that span multiple shards require scatter-gather: send the query to all relevant shards, collect results, merge. This is expensive. Design the schema and shard key to minimize cross-shard operations.
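The scatter-gather shape can be sketched over in-memory shards (the shard contents and a "top-N by score" query are made-up examples):

```python
import heapq

# Three shards, each holding rows sorted by score descending locally.
shards = [
    [("carol", 91), ("alice", 74)],
    [("dave", 88)],
    [("erin", 95), ("bob", 67)],
]

def top_n(n: int):
    """Scatter the query to every shard, gather each shard's local
    top-n, then merge globally. Cost grows with the shard count even
    when the final answer lives entirely on one shard."""
    partials = [shard[:n] for shard in shards]            # scatter
    merged = heapq.merge(*partials, key=lambda r: -r[1])  # gather + merge
    return list(merged)[:n]

print(top_n(2))  # [('erin', 95), ('carol', 91)]
```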
Online Schema Migration
At scale, ALTER TABLE can lock a table for hours. Tools like gh-ost (GitHub), pt-online-schema-change (Percona), or LHM create a shadow table, copy data incrementally, then swap. Plan schema changes as carefully as code changes.
Vitess / Citus / CockroachDB
Middleware and NewSQL databases add sharding capabilities on top of existing databases (Vitess for MySQL, Citus for PostgreSQL) or build it natively (CockroachDB, TiDB). These reduce the operational burden of manual sharding.
Trade-offs
| Factor | Read Replicas | Sharding | NewSQL |
|---|---|---|---|
| Write scaling | No (single primary) | Yes | Yes |
| Read scaling | Yes | Yes | Yes |
| Complexity | Low | High | Moderate |
| Cross-shard queries | N/A | Expensive | Handled internally |
| Consistency | Eventual (replicas) | Per-shard strong | Configurable |
| Operational burden | Low | High (rebalancing) | Moderate |
Start with vertical scaling and read replicas. Move to sharding only when write throughput or data volume demands it. Consider NewSQL if you want horizontal scaling with less operational overhead.
Best Practices
- Exhaust vertical scaling and read replicas before sharding — sharding introduces permanent architectural complexity that is very difficult to undo.
- Choose the shard key based on query patterns, not just even distribution; the most frequent queries should be servable from a single shard.
- Monitor replication lag and set up alerts; stale reads from lagging replicas are a common source of subtle bugs in read-heavy systems.
Common Pitfalls
- Selecting a shard key with low cardinality (e.g., country code) that creates a few massive shards and many empty ones, defeating the purpose of horizontal scaling.
- Ignoring the need for a resharding strategy — as data grows, the initial shard count becomes insufficient, and rebalancing without downtime requires careful planning from the start.
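Resharding cost is one reason consistent hashing is popular: with a mod-N scheme, changing N remaps nearly every key, while a hash ring moves only the keys claimed by the new node. A minimal ring sketch, without virtual nodes, with hypothetical shard names:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    """Consistent-hash ring: each key maps to the first node at or
    after its hash position (wrapping around), so adding a node only
    takes keys from one neighbor instead of reshuffling every shard."""

    def __init__(self, nodes):
        self._points = sorted((_h(n), n) for n in nodes)
        self._hashes = [p[0] for p in self._points]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

keys = [f"user:{i}" for i in range(1000)]
before = Ring(["shard-1", "shard-2", "shard-3"])
after = Ring(["shard-1", "shard-2", "shard-3", "shard-4"])
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Only the keys now claimed by shard-4 move; a mod-N reshard
# from 3 to 4 shards would move roughly 3/4 of all keys.
print(f"moved {moved} of {len(keys)} keys")
```

A production ring would place many virtual nodes per shard to even out arc sizes; a single point per node (as here) keeps the sketch short but distributes load unevenly.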
Anti-Patterns
- Premature Sharding: Sharding a database before exhausting vertical scaling and read replicas. This introduces permanent operational complexity for a scaling problem that may not yet exist.
- Cross-Shard Joins as Standard Practice: Designing schemas that routinely require scatter-gather queries across all shards. If most queries cannot be satisfied by a single shard, the shard key is wrong.
- Shared Database Between Services: Multiple services reading and writing to the same database tables. This creates hidden coupling and makes independent scaling impossible — one service's query patterns degrade another's performance.
- Ignoring Replication Lag: Reading from replicas immediately after a write and expecting consistent results. Critical read-after-write paths must be routed to the primary, or the application must tolerate staleness explicitly.
- Manual Shard Routing in Application Code: Scattering shard-selection logic throughout the codebase instead of centralizing it in a routing layer or using a sharding middleware. This makes shard rebalancing a codebase-wide change.