Clustering
Cluster module patterns for scaling Node.js applications across multiple CPU cores
You are an expert in Node.js clustering patterns for scaling applications across multiple CPU cores to maximize throughput and reliability.
Core Philosophy
Overview
Node.js runs JavaScript on a single thread. The node:cluster module lets you fork multiple worker processes that share the same server port, distributing incoming connections across CPU cores. This is the simplest way to scale a Node.js server vertically. For production deployments, process managers like PM2 or container orchestrators often handle clustering externally, but understanding the module is essential for designing scalable architectures.
Core Concepts
Primary and Worker Processes
The primary (master) process forks worker processes using cluster.fork(). Each worker is a full Node.js process with its own V8 instance and event loop. The primary does not handle requests — it manages the worker lifecycle.
Load Distribution
Round-robin scheduling (`cluster.schedulingPolicy = cluster.SCHED_RR`) is the default on every platform except Windows, where the OS distributes connections instead (`SCHED_NONE`). Round-robin provides more even distribution in most workloads.
Zero-Downtime Restarts
By forking new workers before killing old ones, you can deploy new code without dropping any requests. The primary orchestrates this rolling restart.
Shared Server Ports
All workers can listen() on the same port. The primary process holds the actual socket and distributes connections. Workers do not need unique ports.
Implementation Patterns
Basic cluster setup
```javascript
const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

if (cluster.isPrimary) {
  const numWorkers = os.availableParallelism();
  console.log(`Primary ${process.pid} forking ${numWorkers} workers`);

  for (let i = 0; i < numWorkers; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} exited (${signal || code})`);
    if (!worker.exitedAfterDisconnect) {
      console.log('Restarting worker...');
      cluster.fork();
    }
  });
} else {
  http.createServer((req, res) => {
    res.writeHead(200);
    res.end(`Handled by worker ${process.pid}\n`);
  }).listen(3000);
  console.log(`Worker ${process.pid} listening`);
}
```
Zero-downtime rolling restart
```javascript
const cluster = require('node:cluster');

function rollingRestart() {
  const workers = Object.values(cluster.workers);
  let index = 0;

  function restartNext() {
    if (index >= workers.length) return;
    const oldWorker = workers[index++];
    const newWorker = cluster.fork();

    newWorker.once('listening', () => {
      // New worker is ready; gracefully shut down the old one
      oldWorker.disconnect();
      oldWorker.once('exit', () => {
        restartNext();
      });
      // Force kill if graceful shutdown takes too long
      setTimeout(() => {
        if (!oldWorker.isDead()) oldWorker.kill('SIGKILL');
      }, 10_000).unref();
    });
  }

  restartNext();
}

// Trigger restart on SIGUSR2
process.on('SIGUSR2', () => {
  console.log('Received SIGUSR2, starting rolling restart');
  rollingRestart();
});
```
Primary-worker communication
```javascript
// Primary
cluster.on('fork', (worker) => {
  worker.send({ type: 'config', data: { logLevel: 'info' } });
  worker.on('message', (msg) => {
    if (msg.type === 'metrics') {
      aggregateMetrics(worker.id, msg.data);
    }
  });
});

// Worker
process.on('message', (msg) => {
  if (msg.type === 'config') {
    applyConfig(msg.data);
  }
});

// Report metrics periodically
setInterval(() => {
  process.send({
    type: 'metrics',
    data: { requestCount, avgLatency },
  });
}, 10_000);
```
Graceful shutdown of the entire cluster
```javascript
const cluster = require('node:cluster');

async function shutdownCluster() {
  const workers = Object.values(cluster.workers);

  // Tell all workers to stop accepting new connections
  workers.forEach((w) => w.send({ type: 'shutdown' }));

  // Wait for all workers to exit
  await Promise.all(
    workers.map(
      (w) =>
        new Promise((resolve) => {
          w.once('exit', resolve);
          w.disconnect();
          setTimeout(() => {
            if (!w.isDead()) w.kill('SIGKILL');
          }, 15_000).unref();
        })
    )
  );

  process.exit(0);
}

if (cluster.isPrimary) {
  process.on('SIGTERM', shutdownCluster);
  process.on('SIGINT', shutdownCluster);
}
```
Best Practices
- Set the number of workers to `os.availableParallelism()` — one per logical CPU core. More workers than cores adds context-switch overhead without throughput gain.
- Automatically restart crashed workers in the primary, but implement a crash-loop backoff to avoid rapid restart storms.
- Use IPC messages for coordination (e.g., broadcasting config changes, collecting metrics) rather than shared files or network calls.
- Store session state externally (Redis, database) — workers do not share memory and a client's next request may go to a different worker.
- Prefer container-based horizontal scaling (Kubernetes, ECS) over in-process clustering for production workloads, as it provides better isolation and orchestration.
Common Pitfalls
- Storing state in worker memory — sticky sessions or in-memory caches are lost when a worker crashes and are invisible to other workers. Use external state stores.
- Not handling worker crashes — if the primary does not re-fork dead workers, the cluster gradually loses capacity.
- Crash-loop storms — a bug that crashes workers immediately after fork creates an infinite restart loop. Track restart frequency and back off or alert after repeated failures.
- Blocking the primary — CPU-intensive work in the primary process delays connection distribution to all workers. The primary should only manage lifecycle.
- Assuming listener order across workers — different workers may start listening at different times; do not assume all workers are ready immediately after forking.
Anti-Patterns
Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.
Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.
Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
Related Skills
Child Processes
Child process management patterns for spawning, communicating with, and controlling external processes
Error Handling
Comprehensive error handling strategies for robust and debuggable Node.js applications
Event Emitter
EventEmitter patterns for building decoupled, event-driven architectures in Node.js
File System
Modern fs/promises patterns for safe, efficient file system operations in Node.js
Native Modules
N-API and native addon patterns for extending Node.js with high-performance C/C++ and Rust modules
Streams
Node.js streams for efficient memory-conscious data processing with backpressure handling