Comprehensive guide to zero-downtime deployment patterns including blue-green deployments, canary releases, rolling updates, database migrations during deployments, health check strategies, rollback mechanisms, and feature flag integration for safe progressive rollouts.

# Zero-Downtime Deployment Patterns

## Why Zero-Downtime Matters

Any deployment that causes user-visible errors or outages erodes trust. Zero-downtime deployment ensures that at every point during the transition from old version to new version, the application is serving traffic correctly.

The core principle: never have a moment where no healthy instance is available.

## Blue-Green Deployment

Two identical environments ("blue" and "green") swap between live and idle.

### How It Works

1. Blue is live, serving all traffic
2. Deploy new version to Green
3. Run smoke tests against Green
4. Switch load balancer from Blue → Green
5. Green is now live
6. Blue becomes the rollback target

### Implementation with Nginx

```nginx
# /etc/nginx/conf.d/app.conf
upstream app {
    # Toggle between blue and green
    server 10.0.1.10:3000;  # blue
    # server 10.0.1.20:3000;  # green (uncomment to switch)
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://app;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

### Implementation with AWS ECS

```bash
#!/bin/bash
# deploy-blue-green.sh

CLUSTER="my-cluster"
SERVICE="my-service"
NEW_TASK_DEF="my-app:42"

# Update service with new task definition
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "$SERVICE" \
  --task-definition "$NEW_TASK_DEF" \
  --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200"

# ECS handles draining old tasks after new ones pass health checks
aws ecs wait services-stable --cluster "$CLUSTER" --services "$SERVICE"
```

### Pros and Cons

**Pros:** Instant rollback (switch back to the old environment), easy to reason about, full testing before the traffic switch.

**Cons:** Requires 2x infrastructure during deployment; the database schema must be compatible with both versions simultaneously.

**Anti-pattern:** Switching traffic before smoke tests pass. Always validate the new environment before the cutover.
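A pre-cutover smoke test can be a short script. The sketch below is one way to do it; the `GREEN_URL` value and endpoint paths are assumptions, not part of the original guide, and should be replaced with your own:

```javascript
// smoke-test.js — run against the idle (Green) environment before cutover.
// GREEN_URL and the endpoint paths are illustrative placeholders.

// Pure helper: decide pass/fail from a list of { path, status } results.
function evaluateSmoke(results) {
  const failures = results.filter((r) => r.status !== 200);
  return { ok: failures.length === 0, failures };
}

// Hit each endpoint on the candidate environment and collect statuses.
async function runSmokeTest(baseUrl, paths) {
  const results = [];
  for (const path of paths) {
    try {
      const res = await fetch(baseUrl + path);
      results.push({ path, status: res.status });
    } catch {
      results.push({ path, status: 0 }); // connection refused, DNS failure, etc.
    }
  }
  return evaluateSmoke(results);
}

// Only runs when GREEN_URL is set, e.g. in the deploy pipeline:
//   GREEN_URL=http://10.0.1.20:3000 node smoke-test.js
if (process.env.GREEN_URL) {
  runSmokeTest(process.env.GREEN_URL, ['/healthz', '/ready']).then(({ ok, failures }) => {
    if (!ok) {
      console.error('Smoke test failed:', failures);
      process.exit(1); // abort the cutover
    }
    console.log('Smoke test passed. Safe to switch traffic.');
  });
}
```

A non-zero exit code lets the pipeline block the load-balancer switch automatically rather than relying on someone reading logs.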

## Canary Deployment

Route a small percentage of traffic to the new version, gradually increasing if metrics look good.

### Traffic Splitting with Nginx

```nginx
upstream app {
    server 10.0.1.10:3000 weight=9;   # stable (90%)
    server 10.0.1.20:3000 weight=1;   # canary (10%)
}
```

### Progressive Rollout Script

```bash
#!/bin/bash
# canary-rollout.sh
# update_traffic_split, get_error_rate, and get_p99_latency are placeholders:
# implement them for your load balancer and metrics backend.

CANARY_STEPS=(5 10 25 50 75 100)
STABILITY_WAIT=300  # 5 minutes between steps

for pct in "${CANARY_STEPS[@]}"; do
    echo "Setting canary traffic to ${pct}%"
    update_traffic_split "$pct"

    echo "Waiting ${STABILITY_WAIT}s for metrics..."
    sleep "$STABILITY_WAIT"

    ERROR_RATE=$(get_error_rate "canary")
    P99_LATENCY=$(get_p99_latency "canary")

    if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
        echo "Error rate too high (${ERROR_RATE}%). Rolling back."
        update_traffic_split 0
        exit 1
    fi

    if (( $(echo "$P99_LATENCY > 500" | bc -l) )); then
        echo "Latency too high (${P99_LATENCY}ms). Rolling back."
        update_traffic_split 0
        exit 1
    fi

    echo "Metrics OK. Proceeding to next step."
done

echo "Canary promotion complete."
```

### Kubernetes Canary with Argo Rollouts

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
```

**Anti-pattern:** Canary without automated metric checks. Manual observation doesn't scale and misses subtle regressions.
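An automated gate can be as small as one function. This sketch queries Prometheus's instant-query HTTP API and applies the same thresholds as the rollout script above (1% error rate, 500 ms p99); the Prometheus URL, metric names, and query expressions are illustrative assumptions:

```javascript
// canary-gate.js — automated metric check between rollout steps.
// Metric names and PromQL expressions below are placeholders for your own.
const THRESHOLDS = { errorRatePct: 1.0, p99LatencyMs: 500 };

// Pure helper: given sampled metrics, decide whether to proceed.
function gate(metrics, thresholds = THRESHOLDS) {
  const reasons = [];
  if (metrics.errorRatePct > thresholds.errorRatePct) {
    reasons.push(`error rate ${metrics.errorRatePct}% > ${thresholds.errorRatePct}%`);
  }
  if (metrics.p99LatencyMs > thresholds.p99LatencyMs) {
    reasons.push(`p99 latency ${metrics.p99LatencyMs}ms > ${thresholds.p99LatencyMs}ms`);
  }
  return { proceed: reasons.length === 0, reasons };
}

// Fetch one instant-query value from Prometheus's HTTP API.
async function promQuery(baseUrl, query) {
  const res = await fetch(`${baseUrl}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  return parseFloat(body.data.result[0]?.value[1] ?? '0');
}

// Example wiring (only runs when PROM_URL is set):
if (process.env.PROM_URL) {
  (async () => {
    const metrics = {
      errorRatePct: await promQuery(process.env.PROM_URL,
        '100 * sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))'),
      p99LatencyMs: await promQuery(process.env.PROM_URL,
        '1000 * histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m])) by (le))'),
    };
    const { proceed, reasons } = gate(metrics);
    if (!proceed) {
      console.error('Canary gate failed:', reasons.join('; '));
      process.exit(1); // the calling rollout script treats this as a rollback signal
    }
    console.log('Canary gate passed.');
  })();
}
```

The rollout script can call this between steps instead of shelling out to `bc`; the exit code carries the decision.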

## Rolling Updates

Replace instances one at a time. The default strategy for most orchestrators.

### Kubernetes Rolling Update

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # At most 1 pod down at a time
      maxSurge: 1        # At most 1 extra pod during update
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
      terminationGracePeriodSeconds: 60
```

### Graceful Shutdown

Your application must handle SIGTERM to drain in-flight requests:

```javascript
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Shutting down gracefully...');

  // Stop accepting new connections
  server.close(() => {
    console.log('All connections drained. Exiting.');
    process.exit(0);
  });

  // Force shutdown after 30 seconds
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});
```

**Anti-pattern:** No graceful shutdown handler. When the orchestrator kills your pod, in-flight requests get 502 errors.

## Database Migrations During Deployment

The hardest part of zero-downtime deployment. The old and new versions of your app must coexist during the transition.

### The Expand-Contract Pattern

Never make breaking schema changes in one step. Use two deployments:

**Step 1: Expand (add new, keep old)**

```sql
-- Migration: Add new column (nullable)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Backfill data
UPDATE users SET display_name = name WHERE display_name IS NULL;
```

Deploy new code that writes to BOTH `name` and `display_name`.

**Step 2: Contract (remove old)**

After all instances run the new code:

```sql
-- Migration: Remove old column
ALTER TABLE users DROP COLUMN name;
```

### Column Rename Example

1. **Deploy 1:** Add `new_column`; write to both old and new
2. **Deploy 2:** Read from `new_column`; write to both
3. **Deploy 3:** Drop `old_column`
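In application code, the rename phases boil down to what each deploy writes and reads. A minimal sketch (the `name`/`display_name` columns and function names are illustrative, and `row` stands in for a fetched record):

```javascript
// What each deploy writes: deploys 1 and 2 dual-write so that whichever
// version of the code handles the next request can still read its column.
function writeColumns(phase, value) {
  return phase < 3
    ? { name: value, display_name: value } // expand: write both columns
    : { display_name: value };             // contract: old column is gone
}

// What each deploy reads: the read path flips one deploy before the drop.
function readColumn(phase, row) {
  return phase === 1 ? row.name : row.display_name;
}
```

Because reads flip in deploy 2 while writes still cover both columns, deploy 3 can drop `name` without any instance ever reading a column that no longer exists.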

### Safe Migration Practices

```sql
-- SAFE: Adding a nullable column
ALTER TABLE orders ADD COLUMN tracking_url TEXT;

-- SAFE: Adding an index concurrently (Postgres)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

-- DANGEROUS: Renaming a column
-- ALTER TABLE users RENAME COLUMN name TO display_name;
-- Old code will break when looking for "name"

-- DANGEROUS: Adding NOT NULL without default
-- ALTER TABLE orders ADD COLUMN priority INT NOT NULL;
-- Fails for existing rows

-- SAFE alternative:
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0 NOT NULL;
```

### Migration Tooling

Run migrations as an explicit CI/CD pipeline step, not during application startup: run migrations first, deploy the new code, then verify.

`package.json`:

```json
{
  "scripts": {
    "migrate": "prisma migrate deploy",
    "deploy": "npm run migrate && npm run start"
  }
}
```

**Anti-pattern:** Running destructive migrations (DROP COLUMN, RENAME) while old code is still serving traffic. Always use expand-contract.

## Health Checks

### Layered Health Checks

```javascript
// Liveness: "Is the process alive?"
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness: "Can this instance serve traffic?"
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await redis.ping();
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});

// Startup: "Has initial setup completed?"
let startupComplete = false;
app.get('/startup', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

// After initial setup
async function initialize() {
  await loadCache();
  await warmConnections();
  startupComplete = true;
}
```

### Kubernetes Probe Configuration

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /startup
    port: 3000
  periodSeconds: 5
  failureThreshold: 30  # 30 * 5s = 150s max startup time
```

**Anti-pattern:** Using a single health check for both liveness and readiness. A pod that fails to connect to the database should be marked unready (stop sending traffic), not killed and restarted.

## Rollback Strategies

### Immediate Rollback

```bash
# Kubernetes
kubectl rollout undo deployment/my-app

# Fly.io
fly releases
fly deploy --image registry.fly.io/my-app:previous-sha

# Vercel
vercel rollback

# Docker Swarm
docker service update --rollback my-service
```

### Automated Rollback

```yaml
# GitHub Actions
- name: Deploy
  id: deploy
  run: ./deploy.sh
  continue-on-error: true

- name: Smoke test
  id: smoke
  run: ./smoke-test.sh
  continue-on-error: true

- name: Rollback on failure
  if: steps.deploy.outcome == 'failure' || steps.smoke.outcome == 'failure'
  run: ./rollback.sh ${{ env.PREVIOUS_VERSION }}
```

### Rollback Checklist

1. Can you identify the last known good version?
2. Is the database schema backward-compatible?
3. Are there new API contracts that clients depend on?
4. Do you need to revert configuration/secrets?

## Feature Flags

Decouple deployment from release. Deploy code dark, then enable via flags:

```javascript
import { getFlag } from './feature-flags';

app.get('/checkout', async (req, res) => {
  const useNewCheckout = await getFlag('new-checkout', {
    userId: req.user.id,
    percentage: 10,  // 10% rollout
  });

  if (useNewCheckout) {
    return newCheckoutHandler(req, res);
  }
  return legacyCheckoutHandler(req, res);
});
```

### Simple Flag Implementation

```javascript
// feature-flags.js
const flags = {
  'new-checkout': {
    enabled: true,
    percentage: 10,        // Percentage rollout
    allowlist: ['user-1'], // Specific users
  },
};

// Deterministic hash so each user lands in a stable bucket (0-99)
function simpleHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

export function getFlag(name, context = {}) {
  const flag = flags[name];
  if (!flag || !flag.enabled) return false;
  if (flag.allowlist?.includes(context.userId)) return true;
  if (flag.percentage) {
    const hash = simpleHash(context.userId + name);
    return (hash % 100) < flag.percentage;
  }
  return true;
}
```

**Anti-pattern:** Leaving stale feature flags in the codebase. Track flags and remove them after full rollout.

## Common Anti-Patterns Summary

1. **No graceful shutdown:** In-flight requests get dropped during pod termination.
2. **Breaking database migrations:** Column renames/drops while old code is running.
3. **No automated rollback:** Manual rollback under pressure leads to mistakes.
4. **Health checks that lie:** Returning 200 without checking dependencies.
5. **Canary without metrics:** Flying blind during progressive rollout.
6. **Feature flags without cleanup:** Technical debt accumulates with every flag.
7. **Ignoring connection draining:** Load balancers need time to stop sending traffic to old instances.
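Point 7 is often handled by delaying shutdown briefly after SIGTERM so the load balancer has time to deregister the instance before it stops accepting connections. A sketch (the delay and timeout values are illustrative; tune them to your load balancer's health-check interval):

```javascript
// Drain-aware shutdown: keep serving during the deregistration window,
// then close, with a hard deadline so a stuck connection can't block exit.
function gracefulShutdown(server, { drainDelayMs = 10000, forceTimeoutMs = 30000 } = {}) {
  return new Promise((resolve) => {
    // 1. Keep serving during the drain window so in-flight LB traffic succeeds.
    setTimeout(() => {
      // 2. Stop accepting new connections; resolve when existing ones finish.
      server.close(() => resolve('drained'));
      // 3. Hard deadline; unref() so this timer alone can't keep the process alive.
      setTimeout(() => resolve('forced'), forceTimeoutMs).unref();
    }, drainDelayMs);
  });
}
```

Wired into the SIGTERM handler from the rolling-update section: `process.on('SIGTERM', () => gracefulShutdown(server).then(() => process.exit(0)))`. The drain delay must fit inside `terminationGracePeriodSeconds`, or the orchestrator will SIGKILL the pod first.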

## Deployment Strategy Decision Matrix

| Strategy   | Rollback Speed | Resource Cost | Risk   | Complexity |
|------------|----------------|---------------|--------|------------|
| Blue-Green | Instant        | 2x            | Low    | Medium     |
| Canary     | Fast           | 1.1x          | Low    | High       |
| Rolling    | Moderate       | 1.25x         | Medium | Low        |
| Recreate   | Slow           | 1x            | High   | Low        |
