OTP Supervision
OTP supervision tree design for building fault-tolerant Elixir applications
OTP Supervision Trees — Elixir/Phoenix
You are an expert in OTP supervision for building fault-tolerant Elixir applications.
Overview
OTP supervision trees are the foundation of fault tolerance in Elixir. A supervisor is a process whose sole job is to monitor child processes and restart them according to a defined strategy when they crash. By structuring an application as a tree of supervisors and workers, failures are isolated and automatically recovered.
Core Philosophy
Supervision trees are not about preventing crashes — they are about recovering from them automatically. The "let it crash" philosophy does not mean writing careless code; it means designing your system so that when unexpected failures inevitably occur, the affected component restarts cleanly without bringing down the rest of the application. This is fundamentally different from the defensive programming approach of trying to handle every possible error case inline.
The structure of your supervision tree encodes your application's failure domains. Processes that share fate — where one crashing means the others cannot function correctly — belong under the same supervisor with a :one_for_all or :rest_for_one strategy. Independent processes belong under :one_for_one supervisors where a failure in one has no impact on the others. Getting this structure right requires understanding which parts of your system depend on each other.
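The independent-processes case can be seen in a few lines. This is a minimal sketch (the `Demo.Cache` and `Demo.Stats` module names are hypothetical stand-ins, not part of the skill above): two unrelated Agents sit under a `:one_for_one` supervisor, and killing one restarts only that one.

```elixir
# Hypothetical stand-in workers: two independent Agents.
defmodule Demo.Cache do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

defmodule Demo.Stats do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
end

# Independent processes belong under :one_for_one:
# a crash in one has no impact on the other.
{:ok, _sup} = Supervisor.start_link([Demo.Cache, Demo.Stats], strategy: :one_for_one)

cache_pid = Process.whereis(Demo.Cache)
stats_pid = Process.whereis(Demo.Stats)

# Simulate an unexpected crash in the cache.
Process.exit(cache_pid, :kill)
Process.sleep(50)

# The cache was restarted under a new pid; the stats worker was untouched.
true = Process.whereis(Demo.Cache) != cache_pid
true = Process.whereis(Demo.Stats) == stats_pid
```

If the two workers instead shared fate, the same tree with `strategy: :one_for_all` would restart both on either crash.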
Dynamic supervision with DynamicSupervisor and Registry is the idiomatic way to handle resources created at runtime — user sessions, game rooms, worker processes, and connection handlers. Rather than pre-allocating processes or managing their lifecycle manually, you let the supervisor handle creation, monitoring, and restart while Registry provides fast named lookups without the dangers of dynamic atom creation.
Anti-Patterns
- Flat Supervision Trees: Putting every process in the application as a direct child of the root supervisor. This means any child hitting the restart limit can bring down the entire application. Group related processes under sub-supervisors to isolate failure domains.
- Restart Loops Without Diagnosis: Accepting the default `max_restarts: 3, max_seconds: 5` without tuning for your use case. A process that crashes immediately on start — due to a configuration error or missing dependency — will exhaust restarts in seconds and take down its supervisor. Understand why crashes happen before choosing restart parameters.
- Dynamic Atoms from User Input: Creating process names with `String.to_atom("session_#{user_id}")`. Atoms are never garbage collected. With enough unique users, your node runs out of atom space and crashes. Use `Registry` with `{:via, Registry, {MyRegistry, user_id}}` tuples instead.
- Ignoring Child Start Order: Adding children to a supervisor without considering that they start in list order and stop in reverse order. If a process depends on another being available, the dependency must appear earlier in the children list.
- Supervising Everything: Making every module a GenServer under supervision when a simple function call or Agent would suffice. Supervision adds complexity. Use it for processes that hold important state or perform ongoing work, not for stateless computations that can simply be called as functions.
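The restart-loop anti-pattern is easy to reproduce in a script. In this sketch (`Demo.Flaky` is a hypothetical module), a child crashes about every 10 ms; with `max_restarts: 3, max_seconds: 5` the supervisor exhausts its restart intensity almost immediately and terminates itself:

```elixir
defmodule Demo.Flaky do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # Starts fine, then crashes shortly after: the classic restart loop.
    Process.send_after(self(), :crash, 10)
    {:ok, nil}
  end

  @impl true
  def handle_info(:crash, state), do: {:stop, :boom, state}
end

# Trap exits so the supervisor's death arrives as a message instead of
# crashing this process (Supervisor.start_link links to the caller).
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([Demo.Flaky],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

# After 3 restarts within the 5-second window, the supervisor gives up.
receive do
  {:EXIT, ^sup, _reason} -> :ok
after
  2_000 -> raise "supervisor unexpectedly survived"
end
```

In a real tree this failure would propagate upward, which is exactly why crash causes should be diagnosed before restart parameters are chosen.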
Core Concepts
- Supervisor: A process that monitors and restarts child processes.
- Restart Strategy:
:one_for_one(restart only the failed child),:one_for_all(restart all children),:rest_for_one(restart the failed child and all children started after it). - Child Spec: A map describing how to start, identify, and restart a child process.
- DynamicSupervisor: A supervisor for starting children on demand at runtime.
- Application: The top-level entry point that starts the root supervision tree.
- Registry: A local, decentralized process registry often used with DynamicSupervisor.
Implementation Patterns
Application Supervisor
```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {Phoenix.PubSub, name: MyApp.PubSub},
      {Registry, keys: :unique, name: MyApp.GameRegistry},
      {DynamicSupervisor, name: MyApp.GameSupervisor, strategy: :one_for_one},
      MyApp.SchedulerWorker,
      MyAppWeb.Endpoint
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```
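Once a tree like this is running, `Supervisor.which_children/1` and `Supervisor.count_children/1` let you verify its shape from IEx or a script. A minimal runnable sketch, using a hypothetical `Demo.Repo` Agent as a stand-in for the application's real children:

```elixir
# Hypothetical stand-in for MyApp.Repo so the tree starts in a plain script.
defmodule Demo.Repo do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

children = [
  Demo.Repo,
  {Registry, keys: :unique, name: Demo.GameRegistry},
  {DynamicSupervisor, name: Demo.GameSupervisor, strategy: :one_for_one}
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one, name: Demo.Supervisor)

# Three child specs, all currently running.
3 = length(Supervisor.which_children(sup))
%{specs: 3, active: 3} = Supervisor.count_children(sup)
```

Note that `Registry` and `DynamicSupervisor` are themselves supervisors, so `count_children/1` reports them under `:supervisors` rather than `:workers`.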
Custom Supervisor
```elixir
defmodule MyApp.Pipeline.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      {MyApp.Pipeline.Producer, []},
      {MyApp.Pipeline.Consumer, [subscribe_to: MyApp.Pipeline.Producer]},
      {MyApp.Pipeline.MetricsCollector, []}
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```
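The `:rest_for_one` behavior above can be demonstrated concretely. In this sketch (the `Demo.*` Agents are hypothetical stand-ins for the pipeline stages), killing the middle child restarts it and everything started after it, while the child started before it is untouched:

```elixir
defmodule Demo.Producer do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Demo.Consumer do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Demo.Metrics do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

{:ok, _sup} =
  Supervisor.start_link([Demo.Producer, Demo.Consumer, Demo.Metrics],
    strategy: :rest_for_one
  )

producer = Process.whereis(Demo.Producer)
consumer = Process.whereis(Demo.Consumer)
metrics = Process.whereis(Demo.Metrics)

# Kill the middle child.
Process.exit(consumer, :kill)
Process.sleep(50)

# Producer (started before) keeps its pid; Consumer and Metrics
# (the failed child and everything after it) were restarted.
true = Process.whereis(Demo.Producer) == producer
true = Process.whereis(Demo.Consumer) != consumer
true = Process.whereis(Demo.Metrics) != metrics
```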
DynamicSupervisor with Registry
```elixir
defmodule MyApp.GameManager do
  @doc "Start a new game process dynamically"
  def start_game(game_id, opts \\ []) do
    spec = {MyApp.GameServer, Keyword.put(opts, :game_id, game_id)}
    DynamicSupervisor.start_child(MyApp.GameSupervisor, spec)
  end

  def stop_game(game_id) do
    case Registry.lookup(MyApp.GameRegistry, game_id) do
      [{pid, _}] -> DynamicSupervisor.terminate_child(MyApp.GameSupervisor, pid)
      [] -> {:error, :not_found}
    end
  end

  def list_games do
    DynamicSupervisor.which_children(MyApp.GameSupervisor)
    |> Enum.map(fn {_, pid, _, _} -> pid end)
  end
end

defmodule MyApp.GameServer do
  use GenServer

  def start_link(opts) do
    game_id = Keyword.fetch!(opts, :game_id)
    GenServer.start_link(__MODULE__, opts, name: via(game_id))
  end

  defp via(game_id) do
    {:via, Registry, {MyApp.GameRegistry, game_id}}
  end

  @impl true
  def init(opts) do
    {:ok, %{game_id: opts[:game_id], players: [], state: :waiting}}
  end
end
```
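The pattern can be exercised end to end in a script. This is a self-contained sketch of the same idea with `Demo.*` names substituted for the `MyApp.*` ones; it shows lookup by string id with no atoms created, and that the unique Registry key rejects a duplicate start:

```elixir
defmodule Demo.GameServer do
  use GenServer

  def start_link(opts) do
    game_id = Keyword.fetch!(opts, :game_id)
    GenServer.start_link(__MODULE__, opts, name: via(game_id))
  end

  defp via(game_id), do: {:via, Registry, {Demo.GameRegistry, game_id}}

  @impl true
  def init(opts), do: {:ok, %{game_id: opts[:game_id]}}

  @impl true
  def handle_call(:game_id, _from, state), do: {:reply, state.game_id, state}
end

{:ok, _} = Registry.start_link(keys: :unique, name: Demo.GameRegistry)
{:ok, _} = DynamicSupervisor.start_link(name: Demo.GameSupervisor, strategy: :one_for_one)

{:ok, pid} =
  DynamicSupervisor.start_child(Demo.GameSupervisor, {Demo.GameServer, game_id: "g1"})

# Look up and call the process by its string id: no atoms are created.
[{^pid, _}] = Registry.lookup(Demo.GameRegistry, "g1")
"g1" = GenServer.call({:via, Registry, {Demo.GameRegistry, "g1"}}, :game_id)

# A second start with the same id fails because the Registry key is unique.
{:error, {:already_started, ^pid}} =
  DynamicSupervisor.start_child(Demo.GameSupervisor, {Demo.GameServer, game_id: "g1"})
```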
Custom Child Spec
```elixir
defmodule MyApp.Worker do
  use GenServer, restart: :transient

  # Override the default child_spec
  def child_spec(opts) do
    %{
      id: {__MODULE__, opts[:id]},
      start: {__MODULE__, :start_link, [opts]},
      restart: :transient,
      shutdown: 10_000
    }
  end

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts), do: {:ok, opts}
end
```
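The effect of `restart: :transient` can be verified directly. In this sketch (`Demo.Transient` and the `:t1`/`:t2` names are hypothetical), one child stops normally and is left alone, while another crashes and is restarted:

```elixir
defmodule Demo.Transient do
  use GenServer, restart: :transient

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast(:done, state), do: {:stop, :normal, state}
  def handle_cast(:boom, state), do: {:stop, :boom, state}
end

children = [
  Supervisor.child_spec({Demo.Transient, :t1}, id: :t1),
  Supervisor.child_spec({Demo.Transient, :t2}, id: :t2)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# A :normal exit: the :transient child is NOT restarted.
GenServer.cast(:t1, :done)
Process.sleep(50)
true = Process.whereis(:t1) == nil

# An abnormal exit: the :transient child IS restarted.
old = Process.whereis(:t2)
GenServer.cast(:t2, :boom)
Process.sleep(50)
true = is_pid(Process.whereis(:t2)) and Process.whereis(:t2) != old
```

A `:permanent` child (the default) would have been restarted in both cases, which is rarely what you want for run-to-completion workers.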
Best Practices
- Design the supervision tree to match failure domains. Processes that depend on each other should be under the same supervisor with an appropriate strategy.
- Use `:one_for_one` as the default strategy. Only use `:one_for_all` or `:rest_for_one` when children truly depend on each other's state.
- Use `DynamicSupervisor` for processes created at runtime (user sessions, game rooms, job workers).
- Pair `DynamicSupervisor` with `Registry` for named lookups without atoms.
- Set `:restart` to `:transient` for processes that are expected to stop normally and should only be restarted on abnormal exits.
- Set explicit `:shutdown` values for processes that need time to clean up (default is 5000ms for workers, `:infinity` for supervisors).
- Keep the supervision tree shallow when possible — deep trees add restart latency.
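The `:shutdown` and cleanup advice above fit together: a worker only gets its `terminate/2` callback on supervisor shutdown if it traps exits, and `:shutdown` bounds how long the supervisor waits for it. A sketch with hypothetical `Demo.Cleaner` and `:watcher` names:

```elixir
defmodule Demo.Cleaner do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Extend the default child_spec: allow up to 10s for cleanup
  # instead of the 5000ms worker default.
  def child_spec(arg) do
    arg |> super() |> Map.put(:shutdown, 10_000)
  end

  @impl true
  def init(:ok) do
    # Without trapping exits, the shutdown signal kills the process
    # immediately and terminate/2 never runs.
    Process.flag(:trap_exit, true)
    {:ok, nil}
  end

  @impl true
  def terminate(reason, _state) do
    send(:watcher, {:cleaned_up, reason})
    :ok
  end
end

Process.register(self(), :watcher)
{:ok, sup} = Supervisor.start_link([Demo.Cleaner], strategy: :one_for_one)

# Stopping the supervisor shuts children down in reverse start order,
# giving each one up to its :shutdown window to finish terminate/2.
Supervisor.stop(sup)

receive do
  {:cleaned_up, :shutdown} -> :ok
after
  1_000 -> raise "terminate/2 did not run"
end
```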
Common Pitfalls
- Restart loops: A child that crashes immediately on start will trigger the supervisor's max restart intensity and bring down the supervisor. Set realistic `max_restarts` and `max_seconds` values, and ensure `init/1` can handle degraded conditions.
- Starting order matters: Children start in list order and stop in reverse. Place dependencies before dependents.
- Atoms from user input: Never create process names from user-supplied strings via `String.to_atom/1`. Use `Registry` or `{:via, ...}` tuples instead.
- Ignoring shutdown signals: If a GenServer does not handle termination, it gets killed after the shutdown timeout. Implement `terminate/2` for cleanup when needed.
- Supervising tasks directly: Use `Task.Supervisor` for supervised tasks rather than adding raw `Task` children to a regular supervisor.
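The `Task.Supervisor` approach from the last pitfall looks like this in practice (the `Demo.TaskSupervisor` name is a hypothetical stand-in):

```elixir
# One Task.Supervisor in the tree handles all ad-hoc tasks.
{:ok, _} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

# Supervised, awaitable task.
task = Task.Supervisor.async(Demo.TaskSupervisor, fn -> 21 * 2 end)
42 = Task.await(task)

# Fire-and-forget task that is still supervised: a crash in it is
# logged by the supervisor instead of taking down the caller.
{:ok, _pid} = Task.Supervisor.start_child(Demo.TaskSupervisor, fn -> :ok end)
```

In a real application the `Task.Supervisor` would be a child of the application supervisor, e.g. `{Task.Supervisor, name: MyApp.TaskSupervisor}` in the children list.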
Related Skills
- Channels: Phoenix Channels and PubSub for real-time bidirectional communication
- Concurrency: Elixir processes and message passing for concurrent and parallel programming
- Deployment: Deploying Elixir/Phoenix applications with Mix releases, Docker, and Fly.io
- Ecto: Ecto patterns for database schemas, queries, changesets, and migrations in Elixir
- GenServer: GenServer patterns for stateful processes in Elixir OTP applications
- Phoenix LiveView: Phoenix LiveView patterns for building real-time, server-rendered interactive UIs