OTP Supervision
OTP supervision tree design for building fault-tolerant Elixir applications
OTP Supervision Trees — Elixir/Phoenix
You are an expert in OTP supervision for building fault-tolerant Elixir applications.
Overview
OTP supervision trees are the foundation of fault tolerance in Elixir. A supervisor is a process whose sole job is to monitor child processes and restart them according to a defined strategy when they crash. By structuring an application as a tree of supervisors and workers, failures are isolated and automatically recovered.
Core Philosophy
Supervision trees are not about preventing crashes — they are about recovering from them automatically. The "let it crash" philosophy does not mean writing careless code; it means designing your system so that when unexpected failures inevitably occur, the affected component restarts cleanly without bringing down the rest of the application. This is fundamentally different from the defensive programming approach of trying to handle every possible error case inline.
The structure of your supervision tree encodes your application's failure domains. Processes that share fate — where one crashing means the others cannot function correctly — belong under the same supervisor with a :one_for_all or :rest_for_one strategy. Independent processes belong under :one_for_one supervisors where a failure in one has no impact on the others. Getting this structure right requires understanding which parts of your system depend on each other.
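The independent-processes case can be seen in a few lines. This is a minimal sketch (the `Demo.Cache` and `Demo.Stats` module names are hypothetical stand-ins, not part of the skill above): two unrelated Agents sit under a `:one_for_one` supervisor, and killing one restarts only that one.

```elixir
# Hypothetical stand-in workers: two independent Agents.
defmodule Demo.Cache do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

defmodule Demo.Stats do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
end

# Independent processes belong under :one_for_one:
# a crash in one has no impact on the other.
{:ok, _sup} = Supervisor.start_link([Demo.Cache, Demo.Stats], strategy: :one_for_one)

cache_pid = Process.whereis(Demo.Cache)
stats_pid = Process.whereis(Demo.Stats)

# Simulate an unexpected crash in the cache.
Process.exit(cache_pid, :kill)
Process.sleep(50)

# The cache was restarted under a new pid; the stats worker was untouched.
true = Process.whereis(Demo.Cache) != cache_pid
true = Process.whereis(Demo.Stats) == stats_pid
```

If the two workers instead shared fate, the same tree with `strategy: :one_for_all` would restart both on either crash.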
Dynamic supervision with DynamicSupervisor and Registry is the idiomatic way to handle resources created at runtime — user sessions, game rooms, worker processes, and connection handlers. Rather than pre-allocating processes or managing their lifecycle manually, you let the supervisor handle creation, monitoring, and restart while Registry provides fast named lookups without the dangers of dynamic atom creation.
Anti-Patterns
- Flat Supervision Trees: Putting every process in the application as a direct child of the root supervisor. This means any child hitting the restart limit can bring down the entire application. Group related processes under sub-supervisors to isolate failure domains.
- Restart Loops Without Diagnosis: Accepting the default `max_restarts: 3, max_seconds: 5` without tuning for your use case. A process that crashes immediately on start — due to a configuration error or missing dependency — will exhaust restarts in seconds and take down its supervisor. Understand why crashes happen before choosing restart parameters.
- Dynamic Atoms from User Input: Creating process names with `String.to_atom("session_#{user_id}")`. Atoms are never garbage collected. With enough unique users, your node runs out of atom space and crashes. Use `Registry` with `{:via, Registry, {MyRegistry, user_id}}` tuples instead.
- Ignoring Child Start Order: Adding children to a supervisor without considering that they start in list order and stop in reverse order. If a process depends on another being available, the dependency must appear earlier in the children list.
- Supervising Everything: Making every module a GenServer under supervision when a simple function call or Agent would suffice. Supervision adds complexity. Use it for processes that hold important state or perform ongoing work, not for stateless computations that can simply be called as functions.
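The restart-loop anti-pattern is easy to reproduce in a script. In this sketch (`Demo.Flaky` is a hypothetical module), a child crashes about every 10 ms; with `max_restarts: 3, max_seconds: 5` the supervisor exhausts its restart intensity almost immediately and terminates itself:

```elixir
defmodule Demo.Flaky do
  use GenServer
  def start_link(_), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # Starts fine, then crashes shortly after: the classic restart loop.
    Process.send_after(self(), :crash, 10)
    {:ok, nil}
  end

  @impl true
  def handle_info(:crash, state), do: {:stop, :boom, state}
end

# Trap exits so the supervisor's death arrives as a message instead of
# crashing this process (Supervisor.start_link links to the caller).
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([Demo.Flaky],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

# After 3 restarts within the 5-second window, the supervisor gives up.
receive do
  {:EXIT, ^sup, _reason} -> :ok
after
  2_000 -> raise "supervisor unexpectedly survived"
end
```

In a real tree this failure would propagate upward, which is exactly why crash causes should be diagnosed before restart parameters are chosen.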
Core Concepts
- Supervisor: A process that monitors and restarts child processes.
- Restart Strategy:
:one_for_one(restart only the failed child),:one_for_all(restart all children),:rest_for_one(restart the failed child and all children started after it). - Child Spec: A map describing how to start, identify, and restart a child process.
- DynamicSupervisor: A supervisor for starting children on demand at runtime.
- Application: The top-level entry point that starts the root supervision tree.
- Registry: A local, decentralized process registry often used with DynamicSupervisor.
Implementation Patterns
Application Supervisor
```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {Phoenix.PubSub, name: MyApp.PubSub},
      {Registry, keys: :unique, name: MyApp.GameRegistry},
      {DynamicSupervisor, name: MyApp.GameSupervisor, strategy: :one_for_one},
      MyApp.SchedulerWorker,
      MyAppWeb.Endpoint
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```
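Once a tree like this is running, `Supervisor.which_children/1` and `Supervisor.count_children/1` let you verify its shape from IEx or a script. A minimal runnable sketch, using a hypothetical `Demo.Repo` Agent as a stand-in for the application's real children:

```elixir
# Hypothetical stand-in for MyApp.Repo so the tree starts in a plain script.
defmodule Demo.Repo do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

children = [
  Demo.Repo,
  {Registry, keys: :unique, name: Demo.GameRegistry},
  {DynamicSupervisor, name: Demo.GameSupervisor, strategy: :one_for_one}
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one, name: Demo.Supervisor)

# Three child specs, all currently running.
3 = length(Supervisor.which_children(sup))
%{specs: 3, active: 3} = Supervisor.count_children(sup)
```

Note that `Registry` and `DynamicSupervisor` are themselves supervisors, so `count_children/1` reports them under `:supervisors` rather than `:workers`.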
Custom Supervisor
```elixir
defmodule MyApp.Pipeline.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      {MyApp.Pipeline.Producer, []},
      {MyApp.Pipeline.Consumer, [subscribe_to: MyApp.Pipeline.Producer]},
      {MyApp.Pipeline.MetricsCollector, []}
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```
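The `:rest_for_one` behavior above can be demonstrated concretely. In this sketch (the `Demo.*` Agents are hypothetical stand-ins for the pipeline stages), killing the middle child restarts it and everything started after it, while the child started before it is untouched:

```elixir
defmodule Demo.Producer do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Demo.Consumer do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Demo.Metrics do
  use Agent
  def start_link(_), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

{:ok, _sup} =
  Supervisor.start_link([Demo.Producer, Demo.Consumer, Demo.Metrics],
    strategy: :rest_for_one
  )

producer = Process.whereis(Demo.Producer)
consumer = Process.whereis(Demo.Consumer)
metrics = Process.whereis(Demo.Metrics)

# Kill the middle child.
Process.exit(consumer, :kill)
Process.sleep(50)

# Producer (started before) keeps its pid; Consumer and Metrics
# (the failed child and everything after it) were restarted.
true = Process.whereis(Demo.Producer) == producer
true = Process.whereis(Demo.Consumer) != consumer
true = Process.whereis(Demo.Metrics) != metrics
```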
DynamicSupervisor with Registry
```elixir
defmodule MyApp.GameManager do
  @doc "Start a new game process dynamically"
  def start_game(game_id, opts \\ []) do
    spec = {MyApp.GameServer, Keyword.put(opts, :game_id, game_id)}
    DynamicSupervisor.start_child(MyApp.GameSupervisor, spec)
  end

  def stop_game(game_id) do
    case Registry.lookup(MyApp.GameRegistry, game_id) do
      [{pid, _}] -> DynamicSupervisor.terminate_child(MyApp.GameSupervisor, pid)
      [] -> {:error, :not_found}
    end
  end

  def list_games do
    DynamicSupervisor.which_children(MyApp.GameSupervisor)
    |> Enum.map(fn {_, pid, _, _} -> pid end)
  end
end

defmodule MyApp.GameServer do
  use GenServer

  def start_link(opts) do
    game_id = Keyword.fetch!(opts, :game_id)
    GenServer.start_link(__MODULE__, opts, name: via(game_id))
  end

  defp via(game_id) do
    {:via, Registry, {MyApp.GameRegistry, game_id}}
  end

  @impl true
  def init(opts) do
    {:ok, %{game_id: opts[:game_id], players: [], state: :waiting}}
  end
end
```
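The pattern can be exercised end to end in a script. This is a self-contained sketch of the same idea with `Demo.*` names substituted for the `MyApp.*` ones; it shows lookup by string id with no atoms created, and that the unique Registry key rejects a duplicate start:

```elixir
defmodule Demo.GameServer do
  use GenServer

  def start_link(opts) do
    game_id = Keyword.fetch!(opts, :game_id)
    GenServer.start_link(__MODULE__, opts, name: via(game_id))
  end

  defp via(game_id), do: {:via, Registry, {Demo.GameRegistry, game_id}}

  @impl true
  def init(opts), do: {:ok, %{game_id: opts[:game_id]}}

  @impl true
  def handle_call(:game_id, _from, state), do: {:reply, state.game_id, state}
end

{:ok, _} = Registry.start_link(keys: :unique, name: Demo.GameRegistry)
{:ok, _} = DynamicSupervisor.start_link(name: Demo.GameSupervisor, strategy: :one_for_one)

{:ok, pid} =
  DynamicSupervisor.start_child(Demo.GameSupervisor, {Demo.GameServer, game_id: "g1"})

# Look up and call the process by its string id: no atoms are created.
[{^pid, _}] = Registry.lookup(Demo.GameRegistry, "g1")
"g1" = GenServer.call({:via, Registry, {Demo.GameRegistry, "g1"}}, :game_id)

# A second start with the same id fails because the Registry key is unique.
{:error, {:already_started, ^pid}} =
  DynamicSupervisor.start_child(Demo.GameSupervisor, {Demo.GameServer, game_id: "g1"})
```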
Custom Child Spec
```elixir
defmodule MyApp.Worker do
  use GenServer, restart: :transient

  # Override the default child_spec
  def child_spec(opts) do
    %{
      id: {__MODULE__, opts[:id]},
      start: {__MODULE__, :start_link, [opts]},
      restart: :transient,
      shutdown: 10_000
    }
  end

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts), do: {:ok, opts}
end
```
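The effect of `restart: :transient` can be verified directly. In this sketch (`Demo.Transient` and the `:t1`/`:t2` names are hypothetical), one child stops normally and is left alone, while another crashes and is restarted:

```elixir
defmodule Demo.Transient do
  use GenServer, restart: :transient

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast(:done, state), do: {:stop, :normal, state}
  def handle_cast(:boom, state), do: {:stop, :boom, state}
end

children = [
  Supervisor.child_spec({Demo.Transient, :t1}, id: :t1),
  Supervisor.child_spec({Demo.Transient, :t2}, id: :t2)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# A :normal exit: the :transient child is NOT restarted.
GenServer.cast(:t1, :done)
Process.sleep(50)
true = Process.whereis(:t1) == nil

# An abnormal exit: the :transient child IS restarted.
old = Process.whereis(:t2)
GenServer.cast(:t2, :boom)
Process.sleep(50)
true = is_pid(Process.whereis(:t2)) and Process.whereis(:t2) != old
```

A `:permanent` child (the default) would have been restarted in both cases, which is rarely what you want for run-to-completion workers.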
Best Practices
- Design the supervision tree to match failure domains. Processes that depend on each other should be under the same supervisor with an appropriate strategy.
- Use `:one_for_one` as the default strategy. Only use `:one_for_all` or `:rest_for_one` when children truly depend on each other's state.
- Use `DynamicSupervisor` for processes created at runtime (user sessions, game rooms, job workers).
- Pair `DynamicSupervisor` with `Registry` for named lookups without atoms.
- Set `:restart` to `:transient` for processes that are expected to stop normally and should only be restarted on abnormal exits.
- Set explicit `:shutdown` values for processes that need time to clean up (default is 5000ms for workers, `:infinity` for supervisors).
- Keep the supervision tree shallow when possible — deep trees add restart latency.
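The `:shutdown` and cleanup advice above fit together: a worker only gets its `terminate/2` callback on supervisor shutdown if it traps exits, and `:shutdown` bounds how long the supervisor waits for it. A sketch with hypothetical `Demo.Cleaner` and `:watcher` names:

```elixir
defmodule Demo.Cleaner do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Extend the default child_spec: allow up to 10s for cleanup
  # instead of the 5000ms worker default.
  def child_spec(arg) do
    arg |> super() |> Map.put(:shutdown, 10_000)
  end

  @impl true
  def init(:ok) do
    # Without trapping exits, the shutdown signal kills the process
    # immediately and terminate/2 never runs.
    Process.flag(:trap_exit, true)
    {:ok, nil}
  end

  @impl true
  def terminate(reason, _state) do
    send(:watcher, {:cleaned_up, reason})
    :ok
  end
end

Process.register(self(), :watcher)
{:ok, sup} = Supervisor.start_link([Demo.Cleaner], strategy: :one_for_one)

# Stopping the supervisor shuts children down in reverse start order,
# giving each one up to its :shutdown window to finish terminate/2.
Supervisor.stop(sup)

receive do
  {:cleaned_up, :shutdown} -> :ok
after
  1_000 -> raise "terminate/2 did not run"
end
```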
Common Pitfalls
- Restart loops: A child that crashes immediately on start will trigger the supervisor's max restart intensity and bring down the supervisor. Set realistic `max_restarts` and `max_seconds` values, and ensure `init/1` can handle degraded conditions.
- Starting order matters: Children start in list order and stop in reverse. Place dependencies before dependents.
- Atoms from user input: Never create process names from user-supplied strings via `String.to_atom/1`. Use `Registry` or `{:via, ...}` tuples instead.
- Ignoring shutdown signals: If a GenServer does not handle termination, it gets killed after the shutdown timeout. Implement `terminate/2` for cleanup when needed.
- Supervising tasks directly: Use `Task.Supervisor` for supervised tasks rather than adding raw `Task` children to a regular supervisor.
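The `Task.Supervisor` approach from the last pitfall looks like this in practice (the `Demo.TaskSupervisor` name is a hypothetical stand-in):

```elixir
# One Task.Supervisor in the tree handles all ad-hoc tasks.
{:ok, _} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

# Supervised, awaitable task.
task = Task.Supervisor.async(Demo.TaskSupervisor, fn -> 21 * 2 end)
42 = Task.await(task)

# Fire-and-forget task that is still supervised: a crash in it is
# logged by the supervisor instead of taking down the caller.
{:ok, _pid} = Task.Supervisor.start_child(Demo.TaskSupervisor, fn -> :ok end)
```

In a real application the `Task.Supervisor` would be a child of the application supervisor, e.g. `{Task.Supervisor, name: MyApp.TaskSupervisor}` in the children list.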
Related Skills
- Channels: Phoenix Channels and PubSub for real-time bidirectional communication
- Concurrency: Elixir processes and message passing for concurrent and parallel programming
- Deployment: Deploying Elixir/Phoenix applications with Mix releases, Docker, and Fly.io
- Ecto: Ecto patterns for database schemas, queries, changesets, and migrations in Elixir
- GenServer: GenServer patterns for stateful processes in Elixir OTP applications
- Phoenix LiveView: Phoenix LiveView patterns for building real-time, server-rendered interactive UIs