
Clustering

Cluster module patterns for scaling Node.js applications across multiple CPU cores


Clustering — Node.js Patterns

You are an expert in Node.js clustering patterns for scaling applications across multiple CPU cores to maximize throughput and reliability.

Core Philosophy

Overview

Node.js runs JavaScript on a single thread. The node:cluster module lets you fork multiple worker processes that share the same server port, distributing incoming connections across CPU cores. This is the simplest way to scale a Node.js server vertically. For production deployments, process managers like PM2 or container orchestrators often handle clustering externally, but understanding the module is essential for designing scalable architectures.

Core Concepts

Primary and Worker Processes

The primary (master) process forks worker processes using cluster.fork(). Each worker is a full Node.js process with its own V8 instance and event loop. The primary does not handle requests — it manages the worker lifecycle.

Load Distribution

On Linux, the primary uses round-robin scheduling by default (cluster.schedulingPolicy = cluster.SCHED_RR). On Windows, the OS distributes connections. Round-robin provides more even distribution in most workloads.

Zero-Downtime Restarts

By forking new workers before killing old ones, you can deploy new code without dropping any requests. The primary orchestrates this rolling restart.

Shared Server Ports

All workers can listen() on the same port. The primary process holds the actual socket and distributes connections. Workers do not need unique ports.

Implementation Patterns

Basic cluster setup

```javascript
const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

if (cluster.isPrimary) {
  const numWorkers = os.availableParallelism();
  console.log(`Primary ${process.pid} forking ${numWorkers} workers`);

  for (let i = 0; i < numWorkers; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} exited (${signal || code})`);
    if (!worker.exitedAfterDisconnect) {
      console.log('Restarting worker...');
      cluster.fork();
    }
  });
} else {
  http.createServer((req, res) => {
    res.writeHead(200);
    res.end(`Handled by worker ${process.pid}\n`);
  }).listen(3000);

  console.log(`Worker ${process.pid} listening`);
}
```

Zero-downtime rolling restart

```javascript
const cluster = require('node:cluster');

function rollingRestart() {
  const workers = Object.values(cluster.workers);
  let index = 0;

  function restartNext() {
    if (index >= workers.length) return;

    const oldWorker = workers[index++];
    const newWorker = cluster.fork();

    newWorker.once('listening', () => {
      // New worker is ready; gracefully shut down the old one
      oldWorker.disconnect();

      oldWorker.once('exit', () => {
        restartNext();
      });

      // Force kill if graceful shutdown takes too long
      setTimeout(() => {
        if (!oldWorker.isDead()) oldWorker.kill('SIGKILL');
      }, 10_000).unref();
    });
  }

  restartNext();
}

// Trigger restart on SIGUSR2
process.on('SIGUSR2', () => {
  console.log('Received SIGUSR2, starting rolling restart');
  rollingRestart();
});
```

Primary-worker communication

```javascript
// Primary
cluster.on('fork', (worker) => {
  worker.send({ type: 'config', data: { logLevel: 'info' } });

  worker.on('message', (msg) => {
    if (msg.type === 'metrics') {
      aggregateMetrics(worker.id, msg.data); // app-defined aggregation
    }
  });
});

// Worker
process.on('message', (msg) => {
  if (msg.type === 'config') {
    applyConfig(msg.data); // app-defined config handler
  }
});

// Report metrics periodically (requestCount and avgLatency are tracked by the app)
setInterval(() => {
  process.send({
    type: 'metrics',
    data: { requestCount, avgLatency },
  });
}, 10_000);
```

Graceful shutdown of the entire cluster

```javascript
const cluster = require('node:cluster');

async function shutdownCluster() {
  const workers = Object.values(cluster.workers);

  // Tell all workers to stop accepting new connections
  workers.forEach((w) => w.send({ type: 'shutdown' }));

  // Wait for all workers to exit
  await Promise.all(
    workers.map(
      (w) =>
        new Promise((resolve) => {
          w.once('exit', resolve);
          w.disconnect();
          setTimeout(() => {
            if (!w.isDead()) w.kill('SIGKILL');
          }, 15_000).unref();
        })
    )
  );

  process.exit(0);
}

if (cluster.isPrimary) {
  process.on('SIGTERM', shutdownCluster);
  process.on('SIGINT', shutdownCluster);
}
```

Best Practices

  • Set the number of workers to os.availableParallelism() — one per logical CPU core. More workers than cores adds context-switch overhead without throughput gain.
  • Automatically restart crashed workers in the primary, but implement a crash-loop backoff to avoid rapid restart storms.
  • Use IPC messages for coordination (e.g., broadcasting config changes, collecting metrics) rather than shared files or network calls.
  • Store session state externally (Redis, database) — workers do not share memory and a client's next request may go to a different worker.
  • Prefer container-based horizontal scaling (Kubernetes, ECS) over in-process clustering for production workloads, as it provides better isolation and orchestration.

Common Pitfalls

  • Storing state in worker memory — sticky sessions or in-memory caches are lost when a worker crashes and are invisible to other workers. Use external state stores.
  • Not handling worker crashes — if the primary does not re-fork dead workers, the cluster gradually loses capacity.
  • Crash-loop storms — a bug that crashes workers immediately after fork creates an infinite restart loop. Track restart frequency and back off or alert after repeated failures.
  • Blocking the primary — CPU-intensive work in the primary process delays connection distribution to all workers. The primary should only manage lifecycle.
  • Assuming listener order across workers — different workers may start listening at different times; do not assume all workers are ready immediately after forking.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.
