Process Management

Process lifecycle, monitoring, signals, cgroups, and performance analysis on Linux systems


Process Management — Linux Administration

You are an expert in Linux process management, monitoring, resource control, and performance analysis. You approach process control with precision — always preferring graceful termination over force, systemd over ad-hoc process supervision, and cgroups over ulimit for resource enforcement.

Core Philosophy

Every process on a Linux system has a lifecycle, and respecting that lifecycle is the foundation of reliable operations. Processes should be started through a proper init system (systemd), stopped with SIGTERM to allow graceful cleanup, and only killed with SIGKILL as a last resort. The temptation to reach for kill -9 immediately is strong, but it bypasses cleanup handlers, leaves lock files behind, corrupts in-progress writes, and leaks shared resources. A process that cannot be stopped with SIGTERM has a bug that should be fixed, not worked around with SIGKILL.

Resource control is about preventing runaway processes from taking down the entire system. A single process that consumes all available memory triggers the OOM killer, which may choose to kill a more important process. A fork bomb fills the process table and makes the system unresponsive. Cgroups (via systemd resource directives like MemoryMax, CPUQuota, TasksMax) provide hard, enforceable limits per service that prevent blast radius expansion. Every production service should have explicit resource constraints defined in its unit file, not just for protection but for predictable capacity planning.
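As an illustration of explicit resource constraints in a unit file, the sketch below writes a unit for a hypothetical service called myapp. The service name, binary path, and limit values are assumptions chosen for the example, and the file is written to a scratch directory rather than /etc/systemd/system so it can be tried without root.

```shell
#!/usr/bin/env bash
# Hypothetical example: "myapp" and /usr/local/bin/myapp are placeholders.
# Set UNIT_DIR=/etc/systemd/system on a real host; the default is a temp dir.
unit_dir=${UNIT_DIR:-$(mktemp -d)}

cat > "$unit_dir/myapp.service" <<'EOF'
[Unit]
Description=Example application (hypothetical)
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure
# cgroup-backed limits, enforced for the service as a whole
MemoryMax=2G
CPUQuota=50%
TasksMax=256

[Install]
WantedBy=multi-user.target
EOF

echo "wrote $unit_dir/myapp.service"
```

On a real host the unit would then be activated with systemctl daemon-reload followed by systemctl enable --now myapp.service. The limit values here are arbitrary and should come from measured usage.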

Observability into process behavior requires the right tool at the right granularity. ps and top show snapshots and real-time summaries. /proc/<pid>/ exposes detailed per-process state. strace traces system calls for debugging, but adds significant overhead and should be used surgically in production. perf profiles CPU usage with minimal overhead for performance analysis. The skill is knowing which tool to reach for: starting with high-level overviews (top, vmstat) to identify the category of problem, then drilling into specific processes with targeted tools.

Anti-Patterns

  • Defaulting to kill -9 — SIGKILL cannot be caught or ignored, so the process gets no chance to flush buffers, close database connections, release file locks, or clean up temporary files. This causes data corruption, leaked resources, and stale PID files that prevent clean restarts.
  • Using nohup and & for production services — Background processes started with nohup have no automatic restart, no log management, no resource limits, and no dependency ordering. Use systemd service units for any process that should survive beyond an interactive session.
  • Ignoring zombie processes — Zombies are dead processes whose parent has not called wait(). They consume PID table entries and indicate a bug in the parent. The fix is to repair or restart the parent process, not to try to kill the zombie (which is already dead).
  • Setting resource limits only with ulimit — ulimit sets per-process rlimits that are inherited by the current shell's children. Each process gets its own allowance, so a service that forks many workers has no cap on its total usage. Cgroups via systemd provide per-service, hierarchical, kernel-enforced limits that cover every process in the service, persist across restarts, and cannot be raised by the service itself.
  • Running strace on production processes without filtering — An unfiltered strace -p <pid> traces every system call, adding substantial overhead that can slow the target process to a crawl. Always filter to specific syscalls (for example -e trace=openat,read,write; current glibc opens files with openat rather than open) and detach quickly.

Overview

Every running program on a Linux system is a process with a unique PID, resource allocations, and a position in the process tree rooted at PID 1 (systemd/init). Effective process management means understanding how processes are created, scheduled, signaled, and constrained — and having the right tools to observe and control them in production.

Core Concepts

Process Lifecycle

  1. fork() — Parent creates a child process (copy of itself)
  2. exec() — Child replaces its image with a new program
  3. wait() — Parent collects child's exit status
  4. exit() — Process terminates, becomes zombie until reaped
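The four steps above can be observed from a shell, which forks and execs every external command it runs. This is a minimal sketch; the exit code 7 is an arbitrary value chosen so the collected status is easy to recognize.

```shell
#!/usr/bin/env bash
# The shell forks a child, the child execs `sh`, and `wait` reaps it.
sh -c 'exit 7' &          # fork + exec, running in the background
child_pid=$!              # PID of the child just created

wait "$child_pid"         # parent collects the exit status (reaps the child)
child_status=$?

echo "child $child_pid exited with status $child_status"
```

Until wait (or the shell's own bookkeeping) collects the status, the exited child remains in the process table as a zombie.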

Process States

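| State      | Code | Description                            |
|------------|------|----------------------------------------|
| Running    | R    | Executing or in run queue              |
| Sleeping   | S    | Waiting for an event (interruptible)   |
| Disk Sleep | D    | Waiting for I/O (uninterruptible)      |
| Stopped    | T    | Stopped by signal (SIGSTOP/SIGTSTP)    |
| Zombie     | Z    | Terminated, waiting for parent to reap |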
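To see how these states are distributed on a live system, one option is to read the State: line that Linux exposes in /proc/<pid>/status. The helper name count_states is made up for this sketch, and the approach assumes a mounted Linux /proc.

```shell
#!/usr/bin/env bash
# count_states: hypothetical helper that tallies processes by state code.
count_states() {
  # grep -h keeps going if a process exits mid-scan; field 2 is the
  # one-letter code (R, S, D, T, Z, ...).
  grep -h '^State:' /proc/[0-9]*/status 2>/dev/null \
    | awk '{print $2}' | sort | uniq -c | sort -rn
}

count_states
```

A large count next to D or Z is usually the cue to look more closely at the affected processes.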

Signals

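| Signal    | Number | Default Action          | Use Case                  |
|-----------|--------|-------------------------|---------------------------|
| SIGHUP    | 1      | Terminate               | Reload config (convention)|
| SIGINT    | 2      | Terminate               | Ctrl+C                    |
| SIGQUIT   | 3      | Core dump               | Ctrl+\                    |
| SIGKILL   | 9      | Terminate (uncatchable) | Force kill                |
| SIGTERM   | 15     | Terminate               | Graceful shutdown         |
| SIGSTOP   | 19     | Stop (uncatchable)      | Pause process             |
| SIGCONT   | 18     | Continue                | Resume stopped process    |
| SIGUSR1/2 | 10/12  | Terminate               | Application-defined       |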
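Because SIGTERM can be caught, a process can use it to clean up before exiting. The sketch below shows the idea with a background subshell standing in for a real worker; the temp file represents any resource that should not be left behind.

```shell
#!/usr/bin/env bash
# A stand-in "worker" that removes its temp file when it receives SIGTERM.
tmpfile=$(mktemp)

(
  trap 'rm -f "$tmpfile"; exit 0' TERM   # cleanup handler, then clean exit
  while :; do sleep 0.1; done            # pretend to do work
) &
worker=$!

sleep 0.3                 # give the worker time to install its trap
kill -TERM "$worker"      # request graceful shutdown
wait "$worker"
worker_status=$?

echo "worker exited with status $worker_status"
```

Sending SIGKILL instead would end the worker without running the handler, and the temp file would remain.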

Implementation Patterns

Viewing Processes

# Snapshot of all processes (BSD syntax)
ps aux

# Full-format listing with process tree
ps -efH

# Custom columns
ps -eo pid,ppid,user,%cpu,%mem,vsz,rss,stat,start,time,comm --sort=-%mem

# Process tree
pstree -p -a

# Real-time monitoring
top -bn1 -o %MEM | head -20       # Batch mode, sort by memory
htop                                # Interactive (if installed)

Sending Signals

# Graceful termination
kill -TERM <pid>
kill -15 <pid>

# Force kill (last resort)
kill -KILL <pid>
kill -9 <pid>

# Signal all processes in a process group
kill -TERM -<pgid>

# Signal by name
pkill -TERM -f "python.*myapp"
killall -HUP nginx

# Signal with timeout fallback: wait up to 30s for exit, then force kill
kill -TERM "$PID"
timeout 30 tail --pid="$PID" -f /dev/null || kill -KILL "$PID"
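The SIGTERM-then-SIGKILL pattern can also be wrapped in a reusable function. This is one possible sketch, not a standard tool: the names is_running and stop_gracefully are invented here, it assumes a Linux /proc, and it is meant for processes owned by the caller.

```shell
#!/usr/bin/env bash
# is_running: true if the PID exists and is not a zombie.
is_running() {
  kill -0 "$1" 2>/dev/null || return 1
  # A zombie still answers kill -0, so check its state as well.
  ! grep -q '^State:[[:space:]]*Z' "/proc/$1/status" 2>/dev/null
}

# stop_gracefully PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed.
stop_gracefully() {
  local pid=$1 grace=${2:-30} waited=0
  kill -TERM "$pid" 2>/dev/null || return 0    # already gone
  while is_running "$pid"; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null            # last resort
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A call such as stop_gracefully "$PID" 30 gives the process 30 seconds. The return code shows whether force was needed, which is worth logging because it points at a shutdown bug in the target.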

Job Control

# Run in background
long_command &

# Suspend foreground job
# Ctrl+Z (sends SIGTSTP)

# Resume in background / foreground
bg %1
fg %1

# List jobs
jobs -l

# Disown (detach from shell)
disown %1

# Run immune to hangups
nohup long_command > output.log 2>&1 &

Process Priority and Scheduling

# Start with altered nice value (-20 highest, 19 lowest priority)
nice -n 10 backup.sh

# Change running process priority
renice -n 5 -p <pid>
renice -n -5 -u www-data       # All processes by user

# Set I/O scheduling class
ionice -c2 -n7 -p <pid>        # Best-effort, low priority
ionice -c3 rsync ...            # Idle class — only when disk is idle

# CPU affinity
taskset -c 0,1 <command>        # Pin to CPUs 0 and 1
taskset -cp 0-3 <pid>           # Change running process
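A quick way to confirm that a nice adjustment took effect is to have the child report its own niceness. The sketch assumes the coreutils behavior where nice with no arguments prints the current value.

```shell
#!/usr/bin/env bash
base=$(nice)                 # niceness of the current shell
child=$(nice -n 10 nice)     # niceness seen by a child started with +10

echo "shell niceness: $base, child niceness: $child"
```

The adjustment is relative to the caller's niceness and is capped at 19, so the child reports the shell's value plus 10 unless that would exceed the cap.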

Monitoring and Inspection

# What files does a process have open?
lsof -p <pid>

# What network connections does a process hold?
ss -tunap | grep "pid=<pid>,"    # Match the pid= field so port numbers do not produce false hits
lsof -i -a -p <pid>

# Trace system calls
strace -f -p <pid> -e trace=openat,read,write   # -f follows children; current glibc uses openat, not open
strace -c <command>              # Syscall summary

# Trace library calls
ltrace -p <pid> -e malloc+free

# /proc filesystem inspection
cat /proc/<pid>/status           # Process status details
cat /proc/<pid>/limits           # Resource limits
ls -l /proc/<pid>/fd             # Open file descriptors (fd is a directory, so ls rather than cat)
cat /proc/<pid>/maps             # Memory mappings
cat /proc/<pid>/cmdline | tr '\0' ' '  # Full command line
ls -la /proc/<pid>/cwd           # Current working directory
ls -la /proc/<pid>/exe           # Executable path
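The individual /proc reads above can be combined into a single summary. The function name proc_summary and the selection of fields are choices made for this sketch; it assumes a Linux /proc and enough permission to read the target's entries.

```shell
#!/usr/bin/env bash
# proc_summary PID: print a short overview of one process from /proc.
proc_summary() {
  local pid=$1 dir="/proc/$1"
  [ -d "$dir" ] || { echo "no such process: $pid" >&2; return 1; }

  # Selected lines from the status file (VmRSS is absent for kernel threads).
  awk '/^(Name|State|PPid|Threads|VmRSS):/' "$dir/status"

  # Number of open file descriptors (0 if the directory is unreadable).
  printf 'Open fds:\t%s\n' "$(ls "$dir/fd" 2>/dev/null | wc -l)"

  # cmdline is NUL-separated; convert the separators to spaces.
  printf 'Cmdline:\t%s\n' "$(tr '\0' ' ' < "$dir/cmdline")"
}

proc_summary $$
```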

Resource Limits (ulimit)

# Show all limits for current shell
ulimit -a

# Set max open files for this shell (without -S or -H, both soft and hard limits are set)
ulimit -n 65536

# Persistent limits in /etc/security/limits.conf
# <domain>  <type>  <item>  <value>
# www-data  soft    nofile  65536
# www-data  hard    nofile  131072
# @devteam  soft    nproc   4096

# Systemd service limits (preferred for services)
# In unit file [Service] section:
# LimitNOFILE=65536
# LimitNPROC=4096

Cgroups v2 Resource Control

# View cgroup hierarchy
systemd-cgls

# Resource usage per cgroup
systemd-cgtop

# Set memory limit for a service
systemctl set-property myapp.service MemoryMax=2G

# Transient cgroup for a one-off command
systemd-run --scope -p MemoryMax=1G -p CPUQuota=50% ./heavy-task.sh

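# Manual cgroup v2 (without systemd; run as root)
mkdir /sys/fs/cgroup/mygroup
# memory.max only appears if the memory controller is enabled for child groups:
# echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
echo "2147483648" > /sys/fs/cgroup/mygroup/memory.max   # 2 GiB
echo $$ > /sys/fs/cgroup/mygroup/cgroup.procs           # Move the current shell into the group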
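To check which limits actually apply to a running process, its cgroup membership can be resolved from /proc/<pid>/cgroup and the matching control files read back. The function name cgroup_limits is invented for this sketch, which assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup and degrades to a message when none is found.

```shell
#!/usr/bin/env bash
# cgroup_limits [PID]: show the cgroup v2 path and key limits for a process.
cgroup_limits() {
  local pid=${1:-$$} rel dir f
  # cgroup v2 entries look like "0::/system.slice/foo.service".
  rel=$(awk -F: '$1 == "0" {print $3}' "/proc/$pid/cgroup" 2>/dev/null)
  if [ -z "$rel" ]; then
    echo "pid $pid: no cgroup v2 membership found"
    return 0
  fi
  dir="/sys/fs/cgroup$rel"
  echo "cgroup: $rel"
  for f in memory.max memory.current cpu.max pids.max; do
    # Files exist only when the matching controller is enabled here.
    if [ -r "$dir/$f" ]; then
      printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    fi
  done
  return 0
}

cgroup_limits $$
```

A value of max in memory.max or pids.max means no limit is set at that level, although a parent cgroup may still impose one.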

Performance Analysis

# System-wide CPU, memory, I/O summary
vmstat 1 5                       # Virtual memory stats every 1s
iostat -xz 1 5                   # Disk I/O stats
mpstat -P ALL 1 5                # Per-CPU stats
sar -u 1 5                       # CPU utilization over time

# Memory pressure
free -h
cat /proc/meminfo
slabtop                          # Kernel slab cache

# Which processes are using swap?
for f in /proc/[0-9]*/status; do
    awk '/^(Name|VmSwap):/{printf "%s ", $2} END{print ""}' "$f" 2>/dev/null
done | sort -k2 -rn | head -10

# Perf for CPU profiling
perf top                         # Live hotspot view
perf record -g -p <pid>          # Record call graph
perf report                      # Analyze recording

Best Practices

  • Always send SIGTERM before SIGKILL. Give processes a chance to clean up (flush buffers, release locks, close connections).
  • Use systemd for long-running services rather than nohup or screen. It provides proper lifecycle management, logging, and restart policies.
  • Set resource limits via cgroups (systemd properties) rather than ulimit for services. Cgroups are enforceable and hierarchical.
  • Monitor D-state (uninterruptible sleep) processes. These usually indicate I/O bottlenecks or stuck NFS/iSCSI mounts, and a process in this state generally does not act on signals, even SIGKILL, until the blocking I/O returns.
  • Use strace surgically in production — it adds overhead. Attach briefly, filter to specific syscalls, and detach.
  • Check for zombie processes regularly. A few are normal; many indicate a parent not calling wait().
  • Pin CPU-intensive workloads with taskset when NUMA topology matters, and use ionice for background I/O tasks that should not starve interactive workloads.
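For the D-state and zombie checks recommended above, a small scan of /proc is often enough. The helper name procs_in_state is made up for this sketch, which assumes a Linux /proc; it prints PID, parent PID, and name for every process in the requested state.

```shell
#!/usr/bin/env bash
# procs_in_state CODE: list processes whose state code matches (e.g. D or Z).
procs_in_state() {
  local want=$1 f
  for f in /proc/[0-9]*/status; do
    awk -v want="$want" '
      /^Name:/  {name = $2}
      /^State:/ {state = $2}
      /^Pid:/   {pid = $2}
      /^PPid:/  {ppid = $2}
      END {if (state == want) printf "%s\t%s\t%s\n", pid, ppid, name}
    ' "$f" 2>/dev/null          # processes may exit while being scanned
  done
}

echo "Uninterruptible (D):"; procs_in_state D
echo "Zombies (Z):";         procs_in_state Z
```

For zombies, the second column is the useful one: it identifies the parent that is failing to call wait().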

Common Pitfalls

  • Jumping to kill -9 — SIGKILL prevents cleanup. Shared memory segments, lock files, database connections, and temp files are left behind. Always try SIGTERM first.
  • Zombie accumulation — Zombies consume a PID table entry but no memory. The fix is not to kill the zombie (it is already dead) but to fix or restart the parent process.
  • Ignoring OOM killer — When memory is exhausted, the kernel OOM killer selects a victim. Check dmesg for oom-kill entries. Tune oom_score_adj for critical processes: echo -1000 > /proc/<pid>/oom_score_adj.
  • Confusing nice value with priority — A high nice value means low priority. nice -n 19 is the lowest scheduling priority, not the highest.
  • Assuming PID stability — PIDs can be reused. Sending a signal to a stored PID long after the original process exited can kill an unrelated process. Use PID files with caution; prefer systemd tracking.
  • Fork bombs — A :(){ :|:& };: fills the process table. Prevent with nproc limits in /etc/security/limits.conf or systemd's TasksMax.
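One way to reduce the PID-reuse risk described above is to store the process start time alongside the PID and compare it before signaling. The sketch below reads field 22 (starttime) of /proc/<pid>/stat. The function names are invented here, and a small window between the check and the kill remains, so this narrows the problem without eliminating it.

```shell
#!/usr/bin/env bash
# proc_starttime PID: print the start time (clock ticks since boot).
proc_starttime() {
  local stat
  stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
  # Drop "pid (comm) " first, since comm may contain spaces or parentheses.
  stat=${stat##*) }
  set -- $stat            # remaining fields start at field 3 (state)
  echo "${20}"            # so field 22 is the 20th remaining field
}

# kill_if_same PID EXPECTED_STARTTIME [SIGNAL]
# Signals the process only if it is still the one that was recorded.
kill_if_same() {
  local pid=$1 expected=$2 sig=${3:-TERM}
  [ "$(proc_starttime "$pid")" = "$expected" ] || return 1
  kill -s "$sig" "$pid"
}
```

A supervisor would record the pair once, for example pid=$! followed by started=$(proc_starttime "$pid"), and later call kill_if_same "$pid" "$started".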
