
Reconnection

Reconnection and offline resilience patterns for WebSocket apps including retry strategies and state synchronization


Reconnection & Offline Resilience — WebSockets & Real-Time

You are an expert in building resilient real-time applications that handle disconnections, reconnections, and offline operation gracefully.

Overview

Network connections are unreliable. Users switch between WiFi and cellular, walk through dead zones, close laptop lids, and experience server deployments. A production real-time application must handle all of these gracefully: detect disconnections quickly, reconnect automatically, synchronize missed state, and ideally allow some level of offline operation.

Core Concepts

Disconnection Detection

There are several ways a connection can be lost:

  • Clean close — the server or client sends a close frame (code 1000 or 1001). Both sides know immediately.
  • Network failure — the TCP connection dies silently. Without active probing, neither side knows for seconds to minutes.
  • Idle timeout — proxies, load balancers, or firewalls close idle connections (often after 60-120 seconds of inactivity).

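Heartbeats (ping/pong) are the primary mechanism for detecting silent failures. The browser WebSocket API does not expose protocol-level ping frames, so browser clients typically send application-level ping messages and expect a reply within a timeout.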
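
As a rough illustration of how a client might tell these cases apart, the sketch below maps a close code to a category and a reconnect decision. The function name `classifyClose`, the category labels, and the choice not to reconnect after code 1000 are assumptions made for this example. Codes 1000 (normal closure), 1001 (going away), and 1006 (abnormal closure, no close frame received) are standard WebSocket close codes, and 4000-4999 is the range reserved for application use.

```javascript
// Hypothetical helper: decide how to react to a WebSocket close code.
function classifyClose(code) {
  // Assumption: a normal closure means the peer is done, so do not retry.
  if (code === 1000) return { kind: 'clean', reconnect: false };
  // Server shutting down or page navigating away: worth retrying.
  if (code === 1001) return { kind: 'going-away', reconnect: true };
  // No close frame was received: the connection died underneath us.
  if (code === 1006) return { kind: 'abnormal', reconnect: true };
  // Private-use range: meaning is defined by the application.
  if (code >= 4000 && code <= 4999) return { kind: 'application', reconnect: true };
  return { kind: 'other', reconnect: true };
}
```

A close handler could call `classifyClose(event.code)` and pick both the UI message and the retry behavior from the result.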

Reconnection Strategies

  • Immediate retry — try once immediately in case it was a transient glitch
  • Exponential backoff — increase the delay between attempts (1s, 2s, 4s, 8s, ...) up to a maximum
  • Jitter — add randomness to backoff to prevent thundering herd when many clients reconnect simultaneously after a server restart
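
The three strategies above can be combined into a single delay function. This is a minimal sketch, not a definitive implementation: the name `reconnectDelay`, the option names, and the injectable `random` parameter are assumptions chosen for the example.

```javascript
// Sketch: delay in ms before reconnect attempt number `attempt` (0-based).
// Attempt 0 is the immediate retry; later attempts back off exponentially.
function reconnectDelay(attempt, opts = {}) {
  const baseDelay = opts.baseDelay ?? 1000;
  const maxDelay = opts.maxDelay ?? 30000;
  const jitterFactor = opts.jitterFactor ?? 0.3;
  const random = opts.random ?? Math.random; // injectable for testing

  if (attempt === 0) return 0; // immediate retry for transient glitches

  // 1s, 2s, 4s, 8s, ... capped at maxDelay
  const exponential = Math.min(baseDelay * 2 ** (attempt - 1), maxDelay);
  // Spread each delay by +/- jitterFactor so clients do not retry in lockstep
  const jitter = exponential * jitterFactor * (random() * 2 - 1);
  return Math.max(0, Math.round(exponential + jitter));
}
```

With `random: () => 0.5` the jitter term is zero, so attempts 0 through 4 yield 0, 1000, 2000, 4000, and 8000 ms. With the default `Math.random`, each delay lands within 30% of those values.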

State Synchronization

After reconnecting, the client must reconcile its local state with the server. Strategies:

  • Full refresh — discard local state and fetch everything. Simple but expensive.
  • Delta sync — send the last known event ID or timestamp and receive only missed events.
  • Optimistic queue — buffer local operations during disconnection and replay them after reconnection.
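
A replayed batch can overlap with events the client has already applied, so applying missed events should be idempotent. The sketch below assumes events carry monotonically increasing numeric `id` values and a `payload` field; both are assumptions for illustration, as is the function name `applyMissedEvents`.

```javascript
// Sketch: fold a batch of replayed events into local state, skipping
// anything at or below the last event ID the client has already applied.
function applyMissedEvents(state, events) {
  let lastEventId = state.lastEventId;
  const items = [...state.items]; // do not mutate the caller's state
  for (const event of events) {
    if (lastEventId !== null && event.id <= lastEventId) continue; // duplicate
    items.push(event.payload);
    lastEventId = event.id;
  }
  return { lastEventId, items };
}
```

Because already-seen IDs are skipped, the same batch can be applied twice without changing the result.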

Implementation Patterns

Robust Reconnection Manager

class ReconnectingWebSocket {
  constructor(url, options = {}) {
    this.url = url;
    this.maxRetries = options.maxRetries ?? Infinity;
    this.baseDelay = options.baseDelay ?? 1000;
    this.maxDelay = options.maxDelay ?? 30000;
    this.jitterFactor = options.jitterFactor ?? 0.3;
    this.heartbeatInterval = options.heartbeatInterval ?? 25000;
    this.heartbeatTimeout = options.heartbeatTimeout ?? 10000;

    this.retryCount = 0;
    this.intentionallyClosed = false;
    this.listeners = new Map();
    this.pendingMessages = [];

    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.addEventListener('open', () => {
      // Capture this before resetting the counter; otherwise it is always false
      const wasReconnect = this.retryCount > 0;
      this.retryCount = 0;
      this.startHeartbeat();
      this.flushPendingMessages();
      this.emit('connected', { wasReconnect });
    });

    this.ws.addEventListener('message', (event) => {
      if (event.data === 'pong') {
        this.heartbeatAcknowledged = true;
        return;
      }
      this.emit('message', JSON.parse(event.data));
    });

    this.ws.addEventListener('close', (event) => {
      this.stopHeartbeat();
      if (!this.intentionallyClosed) {
        this.emit('disconnected', { code: event.code, reason: event.reason });
        this.scheduleReconnect();
      }
    });

    this.ws.addEventListener('error', () => {
      // The close event always follows an error event; reconnection is handled there.
    });
  }

  scheduleReconnect() {
    if (this.retryCount >= this.maxRetries) {
      this.emit('failed', { retries: this.retryCount });
      return;
    }

    const delay = this.calculateDelay();
    this.retryCount++;
    this.emit('reconnecting', { attempt: this.retryCount, delay });

    this.reconnectTimer = setTimeout(() => this.connect(), delay);
  }

  calculateDelay() {
    const exponential = Math.min(
      this.baseDelay * Math.pow(2, this.retryCount),
      this.maxDelay
    );
    const jitter = exponential * this.jitterFactor * (Math.random() * 2 - 1);
    return Math.max(0, exponential + jitter);
  }

  startHeartbeat() {
    this.stopHeartbeat();
    this.heartbeatTimer = setInterval(() => {
      this.heartbeatAcknowledged = false;
      this.ws.send('ping');
      // Give the server heartbeatTimeout ms to answer before treating the link as dead
      this.heartbeatTimeoutTimer = setTimeout(() => {
        if (!this.heartbeatAcknowledged) {
          this.ws.close(4000, 'Heartbeat timeout');
        }
      }, this.heartbeatTimeout);
    }, this.heartbeatInterval);
  }

  stopHeartbeat() {
    clearInterval(this.heartbeatTimer);
    clearTimeout(this.heartbeatTimeoutTimer);
  }

  send(data) {
    if (this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(data));
    } else {
      this.pendingMessages.push(data);
    }
  }

  flushPendingMessages() {
    while (this.pendingMessages.length > 0 && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(this.pendingMessages.shift()));
    }
  }

  close() {
    this.intentionallyClosed = true;
    clearTimeout(this.reconnectTimer);
    this.stopHeartbeat();
    this.ws.close(1000, 'Client closed');
  }

  on(event, callback) {
    if (!this.listeners.has(event)) this.listeners.set(event, []);
    this.listeners.get(event).push(callback);
  }

  emit(event, data) {
    (this.listeners.get(event) || []).forEach((cb) => cb(data));
  }
}
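
One thing the class above leaves open is that `pendingMessages` can grow without bound during a long outage. A possible mitigation, sketched here under assumptions, is a capped FIFO that exposes the same `push`, `shift`, and `length` members as an array. The name `BoundedBuffer`, the default limit, and the drop-oldest policy are all hypothetical; an application that cannot afford to lose messages should use the persistent outbox pattern instead and surface `dropped` to the user.

```javascript
// Hypothetical bounded FIFO: evicts the oldest entry once `limit` is exceeded
// and counts evictions so the caller can report them instead of hiding them.
class BoundedBuffer {
  constructor(limit = 500) {
    this.limit = limit;
    this.items = [];
    this.dropped = 0;
  }

  push(item) {
    this.items.push(item);
    if (this.items.length > this.limit) {
      this.items.shift();
      this.dropped++;
    }
  }

  shift() {
    return this.items.shift();
  }

  get length() {
    return this.items.length;
  }
}
```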

Delta Sync on Reconnect

// Client: track the last received event ID
let lastEventId = null;

const ws = new ReconnectingWebSocket('wss://example.com/socket');

ws.on('connected', ({ wasReconnect }) => {
  // Always send the last known event ID; the server decides what to replay
  ws.send({ type: 'sync', lastEventId });
});

ws.on('message', (msg) => {
  if (msg.type === 'sync-batch') {
    // Server sends missed events as a batch
    for (const event of msg.events) {
      processEvent(event);
      lastEventId = event.id;
    }
  } else if (msg.type === 'event') {
    processEvent(msg);
    lastEventId = msg.id;
  }
});

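// Server: replay missed events
socket.on('message', (raw) => {
  const text = raw.toString();
  // The client's heartbeat is a plain 'ping' string, not JSON; answer it first
  if (text === 'ping') {
    socket.send('pong');
    return;
  }
  const msg = JSON.parse(text);
  if (msg.type === 'sync') {
    const missedEvents = getEventsSince(msg.lastEventId);
    socket.send(JSON.stringify({ type: 'sync-batch', events: missedEvents }));
  }
});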
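
The server snippet above calls `getEventsSince` without defining it. One way it might be backed, as a sketch only, is an in-memory log with bounded retention. The class name `EventLog`, the numeric IDs, the capacity, and the convention of returning `null` to signal "too old, do a full refresh" are assumptions for this example; a production system would more likely use a database or a durable stream.

```javascript
// Hypothetical in-memory event log backing getEventsSince().
class EventLog {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.events = []; // oldest first
    this.nextId = 1;
  }

  append(payload) {
    const event = { id: this.nextId++, payload };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift(); // evict oldest
    return event;
  }

  // Returns events with id > lastEventId, or null when the client has no
  // position yet or its position was evicted and it needs a full refresh.
  getEventsSince(lastEventId) {
    if (lastEventId === null || lastEventId === undefined) return null;
    const oldest = this.events.length > 0 ? this.events[0].id : this.nextId;
    if (lastEventId < oldest - 1) return null; // gap: some events are gone
    return this.events.filter((e) => e.id > lastEventId);
  }
}
```

Returning `null` instead of a partial batch lets the caller fall back to a full refresh rather than silently skipping evicted events.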

Offline Operation with Outbox Pattern

// Client: queue operations while offline
class OfflineQueue {
  constructor(storageKey = 'offline-queue') {
    this.storageKey = storageKey;
    this.queue = JSON.parse(localStorage.getItem(storageKey) || '[]');
  }

  enqueue(operation) {
    operation.clientTimestamp = Date.now();
    operation.idempotencyKey = crypto.randomUUID();
    this.queue.push(operation);
    this.persist();
  }

  async flush(socket) {
    const pending = [...this.queue];
    for (const op of pending) {
      try {
        await new Promise((resolve, reject) => {
          socket.timeout(10000).emit('operation', op, (err, response) => {
            if (err) return reject(err);
            if (response?.error) return reject(new Error(response.error));
            resolve(response);
          });
        });
        // Remove from queue after server acknowledges
        this.queue.shift();
        this.persist();
      } catch (err) {
        console.error('Failed to flush operation:', err);
        break; // Stop flushing; retry later to preserve order
      }
    }
  }

  persist() {
    localStorage.setItem(this.storageKey, JSON.stringify(this.queue));
  }

  get length() {
    return this.queue.length;
  }
}

// Usage
const offlineQueue = new OfflineQueue();

function sendMessage(roomId, content) {
  const operation = { type: 'send-message', roomId, content };

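  if (socket.connected) {
    // With socket.timeout(), the first callback argument is the error (including an ack timeout)
    socket.timeout(10000).emit('operation', operation, (err) => {
      if (err) offlineQueue.enqueue(operation);
    });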
  } else {
    offlineQueue.enqueue(operation);
    showPendingIndicator();
  }
}

socket.on('connect', () => {
  offlineQueue.flush(socket);
});
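
On the receiving side, the `idempotencyKey` attached by the queue only helps if the server remembers which keys it has already handled. The sketch below shows one possible shape for that, assuming an in-memory cache; the class name `IdempotentProcessor`, the `handler` callback, and the `maxKeys` limit are hypothetical, and a multi-instance deployment would need a shared store instead.

```javascript
// Hypothetical server-side dedupe keyed by the client's idempotencyKey.
class IdempotentProcessor {
  constructor(handler, maxKeys = 10000) {
    this.handler = handler;
    this.maxKeys = maxKeys;
    this.results = new Map(); // idempotencyKey -> promise of the result
  }

  process(op) {
    const key = op.idempotencyKey;
    // Replay of a known key: return the earlier result, do not re-run
    if (this.results.has(key)) return this.results.get(key);

    // Cache the promise itself so concurrent duplicates share one execution
    const pending = (async () => this.handler(op))();
    this.results.set(key, pending);
    // A failed operation should be retryable, so forget its key
    pending.catch(() => this.results.delete(key));

    // Map iterates in insertion order, so the first key is the oldest
    if (this.results.size > this.maxKeys) {
      this.results.delete(this.results.keys().next().value);
    }
    return pending;
  }
}
```

A server handler for the `operation` event could then call `processor.process(op)` and acknowledge with whatever it resolves to, so a replayed operation gets the original response.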

Network Status Detection

// Combine multiple signals for reliable network detection
class NetworkMonitor {
  constructor() {
    this.online = navigator.onLine;
    this.listeners = [];

    window.addEventListener('online', () => this.update(true));
    window.addEventListener('offline', () => this.update(false));

    // navigator.onLine is unreliable on some platforms;
    // supplement with periodic connectivity checks
    this.checkInterval = setInterval(() => this.check(), 30000);
  }

  async check() {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 5000);
    try {
      await fetch('/api/health', {
        method: 'HEAD',
        cache: 'no-store',
        signal: controller.signal,
      });
      this.update(true);
    } catch {
      this.update(false);
    } finally {
      // Clear the abort timer on both success and failure
      clearTimeout(timeout);
    }
  }

  update(online) {
    if (this.online !== online) {
      this.online = online;
      this.listeners.forEach((cb) => cb(online));
    }
  }

  onChange(callback) {
    this.listeners.push(callback);
  }

  destroy() {
    clearInterval(this.checkInterval);
  }
}

// Usage
const network = new NetworkMonitor();
network.onChange((online) => {
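  if (online) {
    console.log('Back online, reconnecting...');
    // Only reconnect if the socket is actually closed, and cancel any pending
    // backoff timer so two sockets are never opened at once
    if (!ws.intentionallyClosed && ws.ws.readyState === WebSocket.CLOSED) {
      clearTimeout(ws.reconnectTimer);
      ws.connect();
    }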
  } else {
    console.log('Offline, switching to local queue');
    showOfflineBanner();
  }
});

Connection State UI

// React example: connection status indicator
import { useEffect, useState } from 'react';
function useConnectionStatus(socket) {
  const [status, setStatus] = useState('connecting');

  useEffect(() => {
    function onConnect() { setStatus('connected'); }
    function onDisconnect() { setStatus('disconnected'); }
    function onReconnecting(attempt) { setStatus(`reconnecting (${attempt})`); }

    socket.on('connect', onConnect);
    socket.on('disconnect', onDisconnect);
    socket.io.on('reconnect_attempt', onReconnecting);

    return () => {
      socket.off('connect', onConnect);
      socket.off('disconnect', onDisconnect);
      socket.io.off('reconnect_attempt', onReconnecting);
    };
  }, [socket]);

  return status;
}

function ConnectionBanner({ socket }) {
  const status = useConnectionStatus(socket);

  if (status === 'connected') return null;

  return (
    <div className={`connection-banner ${status}`}>
      {status === 'disconnected' && 'Connection lost. Reconnecting...'}
      {status.startsWith('reconnecting') && `Reconnecting... (attempt ${status.match(/\d+/)?.[0]})`}
      {status === 'connecting' && 'Connecting...'}
    </div>
  );
}

Best Practices

  • Always use exponential backoff with jitter — without jitter, all clients reconnect at the same instant after a server restart, causing a thundering herd.
  • Persist the offline queue to localStorage — if the user closes the tab and reopens it, their queued operations should survive.
  • Use idempotency keys — when replaying queued operations, the server may have already processed some of them. Idempotency keys prevent duplicates.
  • Show connection state in the UI — users need to know when they are offline or reconnecting. A subtle banner is less jarring than a modal.
  • Buffer outgoing messages during disconnection — do not silently drop messages. Queue them and flush on reconnect.
  • Test with network simulation — use Chrome DevTools network throttling, tc (traffic control) on Linux, or tools like toxiproxy to simulate latency, drops, and partitions.

Common Pitfalls

  • Reconnecting too aggressively — retrying every 100ms with no backoff overwhelms the server and wastes battery on mobile.
  • Not detecting silent disconnections — without heartbeats, a dead TCP connection can go unnoticed for minutes. Users see a "connected" UI while no messages are being delivered.
  • Replaying operations out of order — if the offline queue contains operations with dependencies (e.g., create room, then send message), they must be replayed in order. Do not parallelize the flush.
  • Full state refresh on every reconnect — for large state, this is wasteful. Delta sync with event IDs is more efficient and scalable.
  • Ignoring the visibilitychange event — when a tab is backgrounded, browsers may throttle timers and connections. Reconnect proactively when the tab becomes visible again.
  • Not handling server-initiated disconnects — when the server closes the connection for maintenance, it should send a close frame with a distinct code (e.g., 4001). The client can use this to show "Server is restarting" instead of "Connection lost".
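
For the visibilitychange pitfall above, a proactive reconnect might look like the following sketch. It assumes a client shaped like the ReconnectingWebSocket class from earlier (a `ws` socket, a `reconnectTimer`, an `intentionallyClosed` flag, and a `connect()` method); the function name `reconnectOnVisible` is hypothetical, and the document object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
const WS_CLOSED = 3; // numeric value of WebSocket.CLOSED

// Hypothetical glue: reconnect as soon as the tab becomes visible again
// instead of waiting out a backoff timer the browser may have throttled.
function reconnectOnVisible(doc, client) {
  const onChange = () => {
    if (doc.visibilityState !== 'visible') return;
    if (client.intentionallyClosed) return; // the app asked to stay closed
    if (client.ws.readyState !== WS_CLOSED) return; // open or still connecting
    clearTimeout(client.reconnectTimer); // skip the remaining backoff wait
    client.connect();
  };
  doc.addEventListener('visibilitychange', onChange);
  // Return a cleanup function so the listener can be removed
  return () => doc.removeEventListener('visibilitychange', onChange);
}
```

In a browser this would be called as `reconnectOnVisible(document, ws)`, keeping the returned function for teardown.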

Core Philosophy

Network unreliability is the default state, not an exception. Design your real-time application to handle disconnections as a normal, frequent event rather than an error condition. Users walk through dead zones, close laptop lids, switch between WiFi and cellular, and experience server deployments. Every one of these scenarios should be handled gracefully, automatically, and invisibly to the user.

The reconnection strategy must balance speed against system health. Immediate retry handles transient glitches. Exponential backoff prevents thundering herd when thousands of clients try to reconnect simultaneously after a server restart. Jitter (random variation in backoff timing) prevents synchronized reconnection waves that create periodic load spikes. All three components — immediate retry, exponential backoff, and jitter — are necessary for a production-quality reconnection implementation.

State synchronization after reconnection is where most real-time applications fail. A reconnected client has a stale view of the world — messages were sent, presence changed, and state evolved while it was disconnected. Delta sync (sending the last known event ID and receiving only missed events) is the scalable solution. Full state refresh works for small datasets but becomes prohibitively expensive as the application grows.

Anti-Patterns

  • Reconnecting without exponential backoff — retrying every 100ms without backoff overwhelms the server and wastes client battery; always increase delay between attempts up to a reasonable maximum.

  • Not detecting silent disconnections — without application-level heartbeats, a dead TCP connection can go unnoticed for minutes while the user sees a "connected" indicator; implement ping/pong probing.

  • Discarding messages during disconnection — dropping outbound messages when the connection is down leads to data loss; queue them in an offline outbox and flush on reconnect.

  • Performing full state refresh on every reconnection — downloading all state from scratch works but is wasteful at scale; implement delta sync with event IDs to receive only what was missed.

  • Not persisting the offline queue to storage — if the user closes and reopens the tab, an in-memory queue is lost; persist pending operations to localStorage so they survive tab closures and page refreshes.
