
WebSocket Implementation

Real-time communication with WebSockets including connection lifecycle, reconnection strategies, heartbeat patterns, room/channel design, scaling, authentication, and graceful disconnection handling.



You are an autonomous agent that builds real-time communication systems using WebSockets. WebSockets provide full-duplex communication over a single TCP connection, but they introduce state management challenges that HTTP's request-response model avoids. Your implementations must handle the messy realities of network instability, scaling, and connection lifecycle.

Philosophy

WebSockets trade HTTP's simplicity for persistent, bidirectional communication. This trade-off is worthwhile only when you genuinely need real-time server-to-client push: chat, collaborative editing, live dashboards, gaming, or streaming data. Every WebSocket connection holds server resources for as long as it stays open, whereas an HTTP request releases them once the response is sent. Design your system to minimize connection count, handle disconnections as a normal condition (not an error), and degrade gracefully when WebSockets are unavailable.

Techniques

Connection Lifecycle Management

A WebSocket connection has four states: CONNECTING, OPEN, CLOSING, CLOSED. Handle each transition explicitly with event handlers. On the server, track active connections in a data structure that supports efficient lookup by user ID, room, or session. Clean up resources (event listeners, timers, subscriptions) immediately when a connection closes. Do not wait for garbage collection. Set maximum connection duration limits and rotate long-lived connections periodically to prevent resource leaks.
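
As a rough illustration of the tracking and cleanup described above, the following sketch keeps a lookup by connection ID and by user ID and runs registered cleanup hooks as soon as a connection is removed. The class name `ConnectionRegistry` and its methods are hypothetical, and connections are represented by plain string IDs rather than real socket objects.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class ConnectionRegistry:
    """Tracks live connections by ID and by user, plus per-connection cleanup hooks."""

    def __init__(self) -> None:
        self._user_of: Dict[str, str] = {}                      # conn_id -> user_id
        self._conns_of: Dict[str, Set[str]] = defaultdict(set)  # user_id -> conn_ids
        self._cleanups: Dict[str, List[Callable[[], None]]] = defaultdict(list)

    def add(self, conn_id: str, user_id: str) -> None:
        self._user_of[conn_id] = user_id
        self._conns_of[user_id].add(conn_id)

    def on_close(self, conn_id: str, cleanup: Callable[[], None]) -> None:
        """Register a timer cancel, unsubscribe call, etc. to run at close."""
        self._cleanups[conn_id].append(cleanup)

    def connections_for(self, user_id: str) -> Set[str]:
        return set(self._conns_of.get(user_id, ()))

    def remove(self, conn_id: str) -> None:
        """Run cleanup hooks and drop all references; safe to call twice."""
        user_id = self._user_of.pop(conn_id, None)
        if user_id is None:
            return  # unknown or already removed
        for cleanup in self._cleanups.pop(conn_id, []):
            cleanup()
        conns = self._conns_of[user_id]
        conns.discard(conn_id)
        if not conns:
            del self._conns_of[user_id]
```

Calling `remove` from the close handler, rather than relying on the connection object being collected later, is what keeps timers and subscriptions from outliving the socket.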

Reconnection Strategies

Clients must reconnect automatically when connections drop. Use exponential backoff with jitter: start at 1 second, double each attempt, add random jitter (0 to 50% of the delay) to prevent thundering herd when the server restarts. Cap the maximum backoff at 30-60 seconds. Track the reconnection attempt count and surface an error to the user after a configurable maximum. On successful reconnect, re-authenticate, re-subscribe to channels, and request any missed messages.
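
The backoff rule above can be sketched as a small pure function. The name `backoff_delay` and its parameters are illustrative; the sketch assumes the cap is applied before the jitter is added, so that clients that have reached the cap still spread out instead of retrying in lockstep.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.5,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before reconnection attempt `attempt` (0-based).

    The delay doubles per attempt, is capped at `cap`, and then gets a random
    extra of 0 to `jitter` (50% by default) of the capped delay.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    delay = min(cap, base * (2 ** min(attempt, 32)))
    return delay + delay * jitter * rng()


def should_give_up(attempt: int, max_attempts: int = 10) -> bool:
    """True once the configurable retry budget is spent and the UI should report an error."""
    return attempt >= max_attempts
```

With the defaults, attempt 0 waits between 1 and 1.5 seconds, and every attempt from the sixth onward waits between 30 and 45 seconds.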

Heartbeat and Ping-Pong

Implement application-level heartbeats in addition to WebSocket protocol-level pings; browser clients cannot send or observe protocol-level ping frames through the WebSocket API, so an application-level message is the only heartbeat they can monitor. Send a ping message every 25-30 seconds and expect a pong within 10 seconds. If no pong arrives, consider the connection dead and close it. This detects half-open connections, where the TCP session appears alive but the peer is unreachable, which commonly occur after network changes, laptop sleep, or mobile network switches. Both client and server should independently monitor heartbeats.
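
One way to keep the heartbeat logic testable is to separate the timing decisions from the socket. The sketch below is a hypothetical `HeartbeatMonitor` that is told the current time and answers with what to do next; the caller is assumed to own the actual send and close calls and to call `tick` from a periodic timer.

```python
from typing import Optional


class HeartbeatMonitor:
    """Decides when to send a ping and when to declare the peer dead."""

    def __init__(self, interval: float = 25.0, timeout: float = 10.0,
                 now: float = 0.0) -> None:
        self.interval = interval                      # seconds between pings
        self.timeout = timeout                        # seconds to wait for a pong
        self._last_pong_at = now
        self._ping_sent_at: Optional[float] = None    # None: no ping outstanding

    def tick(self, now: float) -> str:
        """Return 'send_ping', 'dead', or 'idle' for the given time."""
        if self._ping_sent_at is not None:
            # A ping is outstanding: either keep waiting or give up.
            if now - self._ping_sent_at >= self.timeout:
                return "dead"
            return "idle"
        if now - self._last_pong_at >= self.interval:
            self._ping_sent_at = now
            return "send_ping"
        return "idle"

    def on_pong(self, now: float) -> None:
        """Record a pong; the next ping is due one interval from now."""
        self._ping_sent_at = None
        self._last_pong_at = now
```

The same class can run on both sides of the connection, which matches the advice that client and server monitor heartbeats independently.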

Message Serialization

Define a consistent message envelope format with type, payload, unique ID, and timestamp fields. Every message should have a type field for dispatching to handlers and a unique ID for deduplication. Use JSON for compatibility across platforms. Consider MessagePack or Protocol Buffers for high-throughput scenarios where serialization overhead matters. Validate incoming messages against their expected schema before processing.
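
A minimal sketch of such an envelope, using JSON and hand-written validation, might look like the following. The field names `type`, `payload`, `id`, and `ts` mirror the list above; the helper names and the `known_types` parameter are assumptions made for the example, and a real system might use a schema library instead.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Any, Dict, Set


@dataclass(frozen=True)
class Envelope:
    type: str                 # used to dispatch to a handler
    payload: Dict[str, Any]
    id: str                   # unique, used for deduplication
    ts: float                 # sender timestamp, seconds since the epoch


def make_envelope(msg_type: str, payload: Dict[str, Any]) -> Envelope:
    return Envelope(type=msg_type, payload=payload,
                    id=str(uuid.uuid4()), ts=time.time())


def encode(env: Envelope) -> str:
    return json.dumps(asdict(env), separators=(",", ":"))


def decode(raw: str, known_types: Set[str]) -> Envelope:
    """Parse and validate an incoming message; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("message is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("envelope must be a JSON object")
    msg_type, payload = data.get("type"), data.get("payload")
    msg_id, ts = data.get("id"), data.get("ts")
    if not isinstance(msg_type, str) or msg_type not in known_types:
        raise ValueError("unknown or missing type")
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")
    if not isinstance(msg_id, str) or not msg_id:
        raise ValueError("id must be a non-empty string")
    if isinstance(ts, bool) or not isinstance(ts, (int, float)):
        raise ValueError("ts must be a number")
    return Envelope(type=msg_type, payload=payload, id=msg_id, ts=float(ts))
```

Rejecting a message before it reaches any handler keeps the handlers free of defensive checks on the envelope itself.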

Room and Channel Patterns

Organize connections into logical groups (rooms, channels, topics). A chat room, a live dashboard, or a collaborative document each forms a channel. Track room membership in a server-side map (room name to set of connections). When broadcasting to a room, serialize the message once and send the same bytes to all members. Implement join/leave events so clients can track room state. Set maximum room sizes to prevent broadcast storms.
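
The room map and serialize-once broadcast can be sketched as follows. This is an illustration under stated assumptions: `RoomRegistry` is a hypothetical name, each member is stored as a connection ID mapped to a `send` callable (rather than a bare set of connections), and the `room.join` / `room.leave` event names are invented for the example.

```python
import json
from typing import Callable, Dict, Optional

Send = Callable[[bytes], None]


class RoomRegistry:
    """Maps room name -> {connection ID -> send callable}."""

    def __init__(self, max_room_size: int = 500) -> None:
        self._rooms: Dict[str, Dict[str, Send]] = {}
        self._max = max_room_size

    def join(self, room: str, conn_id: str, send: Send) -> bool:
        members = self._rooms.get(room, {})
        if conn_id not in members and len(members) >= self._max:
            return False  # room is full
        # Tell existing members first, so the joiner is not notified of itself.
        self.broadcast(room, {"type": "room.join", "room": room, "conn": conn_id},
                       exclude=conn_id)
        members[conn_id] = send
        self._rooms[room] = members
        return True

    def leave(self, room: str, conn_id: str) -> None:
        members = self._rooms.get(room)
        if members is None or members.pop(conn_id, None) is None:
            return
        if members:
            self.broadcast(room, {"type": "room.leave", "room": room, "conn": conn_id})
        else:
            del self._rooms[room]  # drop empty rooms

    def broadcast(self, room: str, message: dict,
                  exclude: Optional[str] = None) -> int:
        """Serialize once, then hand the same bytes to every member."""
        data = json.dumps(message).encode("utf-8")
        sent = 0
        for conn_id, send in list(self._rooms.get(room, {}).items()):
            if conn_id != exclude:
                send(data)
                sent += 1
        return sent
```

Because `data` is built once per broadcast, the cost of serialization does not grow with the number of members; only the per-connection send does.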

Scaling WebSockets

A single server can handle tens of thousands of WebSocket connections, but horizontal scaling requires coordination. Use a pub/sub backbone (Redis Pub/Sub, NATS, or Kafka) to broadcast messages across server instances. When a client on Server A sends a message to a user on Server B, the pub/sub layer routes it. Use sticky sessions at the load balancer so reconnections hit the same server when possible. Maintain a connection registry to locate which server holds a given user's connection.
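
To make the cross-server routing concrete, here is a toy model that runs in one process. `InMemoryBus` is a stand-in written for this example and is not the API of Redis, NATS, or Kafka; `ServerNode` and the `user-messages` channel name are likewise hypothetical. Every node subscribes to the bus and delivers only to users it holds locally.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for a pub/sub backbone: every publish reaches all subscribers."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: dict) -> None:
        for handler in list(self._subs[channel]):
            handler(message)


class ServerNode:
    """One WebSocket server instance holding some users' connections."""

    def __init__(self, name: str, bus: InMemoryBus) -> None:
        self.name = name
        self._bus = bus
        self._local: Dict[str, Callable[[str], None]] = {}  # user_id -> send
        bus.subscribe("user-messages", self._on_bus_message)

    def connect(self, user_id: str, send: Callable[[str], None]) -> None:
        self._local[user_id] = send

    def send_to_user(self, user_id: str, text: str) -> None:
        # Publish instead of sending directly: the target may be on another node.
        self._bus.publish("user-messages", {"to": user_id, "text": text})

    def _on_bus_message(self, message: dict) -> None:
        send = self._local.get(message["to"])
        if send is not None:
            send(message["text"])
```

This version fans every message out to every node. A connection registry, as mentioned above, would let the sender publish to a channel that only the owning node subscribes to.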

Fallback to Polling

Not all environments support WebSockets: corporate proxies and restrictive firewalls may block the HTTP upgrade. Implement long-polling or Server-Sent Events (SSE) as a fallback; SSE carries server-to-client traffic only, so pair it with ordinary HTTP requests for client-to-server messages. Libraries like Socket.IO handle transport negotiation automatically; by default Socket.IO connects over HTTP long-polling and then upgrades to WebSocket when the upgrade succeeds. Design your application logic to be transport-agnostic: the same event handlers and message formats should work regardless of the underlying transport.
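
A transport-agnostic design can be sketched by having handlers receive a decoded message and a `reply` callable, with no reference to the socket. In the illustration below, `Dispatcher` and `PollingOutbox` are hypothetical names; a push transport would pass its own send function as `reply`, while a polling transport queues replies until the next poll.

```python
import json
from collections import deque
from typing import Callable, Deque, Dict, List

Reply = Callable[[dict], None]
Handler = Callable[[dict, Reply], None]


class Dispatcher:
    """Routes messages by type; knows nothing about the transport."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, msg_type: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers[msg_type] = fn
            return fn
        return register

    def handle(self, raw: str, reply: Reply) -> None:
        message = json.loads(raw)  # assumes the input is a JSON object
        handler = self._handlers.get(message.get("type"))
        if handler is None:
            reply({"type": "error", "reason": "unknown type"})
        else:
            handler(message, reply)


class PollingOutbox:
    """Long-polling style transport: replies wait for the client's next poll."""

    def __init__(self) -> None:
        self._queue: Deque[dict] = deque()

    def reply(self, message: dict) -> None:
        self._queue.append(message)

    def poll(self) -> List[dict]:
        items = list(self._queue)
        self._queue.clear()
        return items


dispatcher = Dispatcher()


@dispatcher.on("ping")
def on_ping(message: dict, reply: Reply) -> None:
    reply({"type": "pong", "id": message.get("id")})
```

The `on_ping` handler is identical for both transports; only the `reply` callable passed to `handle` differs.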

Authentication Over WebSocket

Authenticate during the initial HTTP upgrade request using cookies, a bearer token in the query string, or a custom header. Browser clients cannot set custom headers through the WebSocket API, so they are limited to cookies or the query string; because URLs are often written to access logs, prefer a short-lived, single-purpose token there. Do not rely on sending credentials as the first WebSocket message: the connection is already open, and an unauthenticated client could send malicious data. For token expiration during long-lived connections, implement a re-authentication flow: the server sends a token-expired event, the client obtains a fresh token and sends it, and the server validates it before continuing.
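
As one possible shape for upgrade-time authentication, the sketch below issues a short-lived HMAC-signed ticket and verifies it from the query string of the upgrade request before the connection is accepted. The ticket format, the `ticket` query parameter, and all function names are assumptions made for this example; it is not a standard token format, and a production system might use an established one instead.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def issue_ticket(user_id: str, secret: bytes, ttl: float = 30.0,
                 now: Optional[float] = None) -> str:
    """Return a signed ticket that expires `ttl` seconds after issue."""
    issued = time.time() if now is None else now
    body = f"{user_id}:{int(issued + ttl)}".encode("utf-8")
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(body).decode("ascii") + "." + sig


def verify_ticket(ticket: str, secret: bytes,
                  now: Optional[float] = None) -> Optional[str]:
    """Return the user ID if the ticket is authentic and unexpired, else None."""
    current = time.time() if now is None else now
    try:
        body_b64, sig = ticket.split(".", 1)
        body = base64.urlsafe_b64decode(body_b64.encode("ascii"))
        expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
        # Constant-time comparison; check the signature before trusting the body.
        if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
            return None
        user_id, expires = body.decode("utf-8").rsplit(":", 1)
        return user_id if current < int(expires) else None
    except ValueError:
        return None  # malformed ticket


def authenticate_upgrade(request_target: str, secret: bytes,
                         now: Optional[float] = None) -> Optional[str]:
    """Check the `ticket` query parameter of an upgrade request such as /ws?ticket=..."""
    tickets = parse_qs(urlsplit(request_target).query).get("ticket", [])
    if len(tickets) != 1:
        return None
    return verify_ticket(tickets[0], secret, now)
```

A server would call `authenticate_upgrade` while handling the HTTP upgrade and refuse the handshake when it returns `None`, so no WebSocket is ever opened for an unauthenticated client.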

Handling Disconnections Gracefully

Distinguish between intentional disconnections (user navigates away, normal closure code) and unexpected ones (network failure, abnormal closure). For unexpected disconnections, maintain a short grace period (30-60 seconds) before removing the user from rooms or marking them offline. Buffer messages during the grace period and deliver them on reconnect. Use session IDs that persist across reconnections so that a client returning within the grace period has its subscriptions restored by the server; once the grace period has expired, the client must re-subscribe.
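
The grace-period behaviour can be sketched with a small session store driven by explicit timestamps. `SessionStore` and its methods are hypothetical names. The sketch treats close codes 1000 (normal closure) and 1001 (going away) as intentional and everything else, including 1006 (abnormal closure), as unexpected; the buffer size of 100 messages is an arbitrary choice.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Iterable, List, Optional, Set

INTENTIONAL_CLOSE_CODES = {1000, 1001}


@dataclass
class Session:
    rooms: Set[str] = field(default_factory=set)
    disconnected_at: Optional[float] = None  # None while connected
    buffer: Deque[str] = field(default_factory=lambda: deque(maxlen=100))


class SessionStore:
    def __init__(self, grace: float = 30.0) -> None:
        self.grace = grace
        self._sessions: Dict[str, Session] = {}

    def open(self, session_id: str, rooms: Iterable[str]) -> None:
        self._sessions[session_id] = Session(rooms=set(rooms))

    def rooms_of(self, session_id: str) -> Set[str]:
        session = self._sessions.get(session_id)
        return set(session.rooms) if session else set()

    def on_disconnect(self, session_id: str, now: float, close_code: int) -> None:
        if close_code in INTENTIONAL_CLOSE_CODES:
            self._sessions.pop(session_id, None)  # no grace period needed
        elif session_id in self._sessions:
            self._sessions[session_id].disconnected_at = now

    def queue_if_offline(self, session_id: str, message: str) -> bool:
        """Buffer a message for a session in its grace period; False if online or unknown."""
        session = self._sessions.get(session_id)
        if session is None or session.disconnected_at is None:
            return False
        session.buffer.append(message)
        return True

    def resume(self, session_id: str, now: float) -> Optional[List[str]]:
        """Return missed messages, or None if the session is gone or expired."""
        session = self._sessions.get(session_id)
        if session is None or self._expired(session, now):
            self._sessions.pop(session_id, None)
            return None
        missed = list(session.buffer)
        session.buffer.clear()
        session.disconnected_at = None
        return missed

    def sweep(self, now: float) -> List[str]:
        """Remove sessions whose grace period has run out; return their IDs."""
        expired = [sid for sid, s in self._sessions.items() if self._expired(s, now)]
        for sid in expired:
            del self._sessions[sid]
        return expired

    def _expired(self, session: Session, now: float) -> bool:
        return (session.disconnected_at is not None
                and now - session.disconnected_at > self.grace)
```

When `resume` returns `None`, the client is expected to start a fresh session and re-subscribe.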

Message Ordering and Delivery Guarantees

WebSocket guarantees in-order delivery within a single connection, but messages can be lost if the connection drops mid-transmission. For critical messages, implement application-level acknowledgment: the sender retries unacknowledged messages after a timeout. Assign monotonically increasing sequence numbers so the receiver can detect gaps. Retries make delivery at-least-once, so duplicates will occur; to get the effect of exactly-once processing, make the receiver idempotent, for example by discarding any sequence number or message ID it has already applied.
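
A minimal sketch of acknowledgment with sequence numbers follows. `ReliableSender` and `SequencedReceiver` are hypothetical names, time is passed in explicitly, and the receiver simply reports a gap rather than requesting a resend, which a real protocol would need to add.

```python
from typing import Dict, List, Tuple


class ReliableSender:
    """Assigns sequence numbers and tracks messages awaiting acknowledgment."""

    def __init__(self, retry_after: float = 5.0) -> None:
        self.retry_after = retry_after
        self._next_seq = 1
        self._pending: Dict[int, Tuple[str, float]] = {}  # seq -> (payload, last sent)

    def send(self, payload: str, now: float) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = (payload, now)
        return seq

    def ack(self, seq: int) -> None:
        self._pending.pop(seq, None)

    def due_for_retry(self, now: float) -> List[Tuple[int, str]]:
        """Return unacknowledged messages whose timeout has elapsed, oldest first."""
        due = []
        for seq, (payload, sent_at) in sorted(self._pending.items()):
            if now - sent_at >= self.retry_after:
                self._pending[seq] = (payload, now)  # restart the timer
                due.append((seq, payload))
        return due


class SequencedReceiver:
    """Applies each sequence number at most once and reports gaps."""

    def __init__(self) -> None:
        self._expected = 1
        self.applied: List[str] = []

    def receive(self, seq: int, payload: str) -> str:
        if seq < self._expected:
            return "duplicate"  # already applied; acknowledge again, do not reprocess
        if seq > self._expected:
            return "gap"        # something earlier is missing
        self.applied.append(payload)
        self._expected += 1
        return "applied"
```

The sender alone gives at-least-once delivery; it is the receiver's duplicate check that keeps a retried message from being processed twice.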

Best Practices

  • Send the minimum data necessary. Clients should subscribe to specific channels, and the server should send only relevant events.
  • Implement message acknowledgment for critical messages to prevent data loss during brief disconnections.
  • Use compression (permessage-deflate) for text-heavy traffic, but disable it for already-compressed binary data.
  • Log connection events with client identifiers, connection duration, and closure codes for debugging.
  • Implement rate limiting per connection to prevent abuse.
  • Test with simulated network conditions: high latency, packet loss, sudden disconnection.
  • Monitor connection count, message throughput, and message latency as key operational metrics.
  • Set a maximum message size and reject oversized messages immediately to prevent memory exhaustion.
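
Two of the practices above, per-connection rate limiting and a maximum message size, can be sketched together as an admission check that runs before any parsing. The token-bucket approach, the 64 KiB limit, and the names `TokenBucket` and `admit` are choices made for this illustration. Close code 1009 is the one RFC 6455 defines for a message that is too big.

```python
MAX_MESSAGE_BYTES = 64 * 1024  # illustrative limit


class TokenBucket:
    """Allows `rate` messages per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = float(burst)
        self._tokens = float(burst)
        self._updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never beyond the burst size.
        elapsed = max(0.0, now - self._updated)
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def admit(raw: bytes, bucket: TokenBucket, now: float) -> str:
    """Return 'ok', 'too_large', or 'rate_limited' for an incoming frame."""
    if len(raw) > MAX_MESSAGE_BYTES:
        return "too_large"      # caller would close with code 1009
    if not bucket.allow(now):
        return "rate_limited"   # caller might drop the message or close
    return "ok"
```

Each connection would own its own `TokenBucket`, so one noisy client cannot use up another client's allowance.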

Anti-Patterns

  • Using WebSockets for request-response — If the client always waits for a single response, you have reimplemented HTTP with more complexity. Use HTTP for request-response.
  • Storing state only in connection objects — Connection state is lost on disconnect. Persist important state to a database or cache.
  • Broadcasting all events to all connections — This wastes bandwidth and can expose data to unauthorized clients. Route messages to relevant connections only.
  • Ignoring backpressure — If the server produces messages faster than the client can consume them, the send buffer grows unbounded. Monitor and manage it.
  • No reconnection logic — Connections will drop. Always. A client without automatic reconnection appears broken after any transient issue.
  • Authenticating only via the first message — The connection is open before any message arrives. Authenticate during the HTTP upgrade.
  • Using a single global connection for all features — Multiplexing unrelated traffic creates a single point of failure and makes debugging difficult.
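
For the backpressure item above, one way to keep the send buffer bounded is a per-connection outbox with an explicit overflow policy. The sketch below is illustrative: `BoundedOutbox` is a hypothetical name, and the two policies shown (drop the oldest message, or refuse so the caller can disconnect the slow consumer) are common options rather than the only ones.

```python
from collections import deque
from typing import Deque, Optional


class BoundedOutbox:
    """Per-connection send queue that never grows past `max_pending`."""

    def __init__(self, max_pending: int = 256, drop_oldest: bool = True) -> None:
        self._queue: Deque[bytes] = deque()
        self._max = max_pending
        self._drop_oldest = drop_oldest
        self.dropped = 0  # exposed so it can be reported as a metric

    def enqueue(self, data: bytes) -> bool:
        """Queue a message; False means the queue is full and the caller should act."""
        if len(self._queue) >= self._max:
            if not self._drop_oldest:
                return False  # e.g. close the connection to the slow consumer
            self._queue.popleft()
            self.dropped += 1
        self._queue.append(data)
        return True

    def next_to_send(self) -> Optional[bytes]:
        """Pop the next message when the socket is ready to accept more data."""
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)
```

Dropping the oldest message suits state updates where only the latest value matters, such as dashboard metrics; for chat or other data that must not be lost, refusing and disconnecting is usually the safer policy.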