# Zero-Downtime Deployment Patterns

Comprehensive guide to zero-downtime deployment patterns including blue-green deployments, canary releases, rolling updates, database migrations during deployments, health check strategies, rollback mechanisms, and feature flag integration for safe progressive rollouts.
## Why Zero-Downtime Matters
Any deployment that causes user-visible errors or outages erodes trust. Zero-downtime deployment ensures that at every point during the transition from old version to new version, the application is serving traffic correctly.
The core principle: never have a moment where no healthy instance is available.
## Blue-Green Deployment
Two identical environments ("blue" and "green") swap between live and idle.
### How It Works
1. Blue is live, serving all traffic
2. Deploy new version to Green
3. Run smoke tests against Green
4. Switch load balancer from Blue → Green
5. Green is now live
6. Blue becomes the rollback target
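Steps 3 and 4 can be sketched as a small shell helper that renders the upstream config for the target color, smoke-tests it, and reloads the proxy. This is a sketch against the nginx setup used in this section; the IPs, the config path, and the `smoke-test.sh` script are assumptions, not a standard:

```shell
#!/usr/bin/env bash
# Sketch of a blue-green cutover via an nginx upstream swap (hypothetical IPs/paths).
set -euo pipefail

render_upstream() {  # render_upstream blue|green -> upstream block on stdout
  local ip
  case "$1" in
    blue)  ip="10.0.1.10" ;;
    green) ip="10.0.1.20" ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
  printf 'upstream app {\n    server %s:3000;\n}\n' "$ip"
}

cutover() {  # cutover green: validate the idle environment, then switch traffic
  local color="$1"
  # ./smoke-test.sh "$color" || return 1                            # step 3 (hypothetical)
  # render_upstream "$color" > /etc/nginx/conf.d/app-upstream.conf  # step 4
  # nginx -t && nginx -s reload
  render_upstream "$color"
}

cutover green   # dry run: prints the green upstream block
```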
### Implementation with Nginx

```nginx
# /etc/nginx/conf.d/app.conf
upstream app {
    # Toggle between blue and green
    server 10.0.1.10:3000;       # blue
    # server 10.0.1.20:3000;     # green (uncomment to switch)
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://app;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
### Implementation with AWS ECS

```bash
#!/bin/bash
# deploy-blue-green.sh
CLUSTER="my-cluster"
SERVICE="my-service"
NEW_TASK_DEF="my-app:42"

# Update service with new task definition
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "$SERVICE" \
  --task-definition "$NEW_TASK_DEF" \
  --deployment-configuration "minimumHealthyPercent=100,maximumPercent=200"

# ECS handles draining old tasks after new ones pass health checks
aws ecs wait services-stable --cluster "$CLUSTER" --services "$SERVICE"
```
### Pros and Cons

**Pros:** Instant rollback (switch back to the old environment), easy to reason about, full testing before the traffic switch.

**Cons:** Requires 2x infrastructure during deployment, and the database schema must be compatible with both versions simultaneously.

**Anti-pattern:** Switching traffic before smoke tests pass. Always validate the new environment before the cutover.
## Canary Deployment
Route a small percentage of traffic to the new version, gradually increasing if metrics look good.
### Traffic Splitting with Nginx

```nginx
upstream app {
    server 10.0.1.10:3000 weight=9;  # stable (90%)
    server 10.0.1.20:3000 weight=1;  # canary (10%)
}
```
### Progressive Rollout Script

```bash
#!/bin/bash
# canary-rollout.sh
# update_traffic_split, get_error_rate, and get_p99_latency are
# environment-specific helpers (load balancer API, metrics backend).
CANARY_STEPS=(5 10 25 50 75 100)
STABILITY_WAIT=300  # 5 minutes between steps

for pct in "${CANARY_STEPS[@]}"; do
  echo "Setting canary traffic to ${pct}%"
  update_traffic_split "$pct"

  echo "Waiting ${STABILITY_WAIT}s for metrics..."
  sleep "$STABILITY_WAIT"

  ERROR_RATE=$(get_error_rate "canary")
  P99_LATENCY=$(get_p99_latency "canary")

  if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
    echo "Error rate too high (${ERROR_RATE}%). Rolling back."
    update_traffic_split 0
    exit 1
  fi

  if (( $(echo "$P99_LATENCY > 500" | bc -l) )); then
    echo "Latency too high (${P99_LATENCY}ms). Rolling back."
    update_traffic_split 0
    exit 1
  fi

  echo "Metrics OK. Proceeding to next step."
done

echo "Canary promotion complete."
```
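The promote/rollback decision itself can be factored into a pure function, which keeps the thresholds in one place and makes the gate testable without a live canary. A sketch using `awk` instead of `bc`, with the same assumed budgets as the script above (1% error rate, 500 ms p99):

```shell
#!/usr/bin/env bash
# within_budget ERROR_RATE_PCT P99_MS -> exit 0 if both metrics are acceptable
within_budget() {
  awk -v err="$1" -v p99="$2" 'BEGIN { exit !(err <= 1.0 && p99 <= 500) }'
}

# Example gate, as it would appear inside the rollout loop:
if within_budget "0.4" "230"; then
  echo "Metrics OK. Proceeding to next step."
else
  echo "Metrics out of budget. Rolling back."
fi
# prints "Metrics OK. Proceeding to next step."
```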
### Kubernetes Canary with Argo Rollouts

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      canaryService: my-app-canary
      stableService: my-app-stable
```
**Anti-pattern:** Canary without automated metric checks. Manual observation doesn't scale and misses subtle regressions.
## Rolling Updates
Replace instances one at a time. The default strategy for most orchestrators.
### Kubernetes Rolling Update

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # At most 1 pod down at a time
      maxSurge: 1         # At most 1 extra pod during update
  template:
    spec:
      containers:
        - name: app
          image: my-app:v2
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
      terminationGracePeriodSeconds: 60
```
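With `replicas: 4`, `maxUnavailable: 1`, and `maxSurge: 1`, Kubernetes keeps the deployment between 3 ready pods and 5 total pods at every moment of the rollout. The bounds are simple arithmetic:

```shell
replicas=4
max_unavailable=1
max_surge=1

echo "minimum ready pods during rollout: $((replicas - max_unavailable))"   # 3
echo "maximum total pods during rollout: $((replicas + max_surge))"         # 5
```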
### Graceful Shutdown

Your application must handle SIGTERM to drain in-flight requests:

```javascript
// app is your HTTP framework instance (e.g. Express)
const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Shutting down gracefully...');

  // Stop accepting new connections; exit once in-flight requests finish
  server.close(() => {
    console.log('All connections drained. Exiting.');
    process.exit(0);
  });

  // Force shutdown after 30 seconds
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});
```
**Anti-pattern:** No graceful shutdown handler. When the orchestrator kills your pod, in-flight requests get 502 errors.
## Database Migrations During Deployment

The hardest part of zero-downtime deployment. The old and new versions of your app must coexist during the transition.

### The Expand-Contract Pattern

Never make breaking schema changes in one step. Use two deployments:

**Step 1: Expand (add new, keep old)**
```sql
-- Migration: Add new column (nullable, with default)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Backfill data
UPDATE users SET display_name = name WHERE display_name IS NULL;
```

Deploy new code that writes to BOTH `name` and `display_name`.
**Step 2: Contract (remove old)**

After all instances run the new code:

```sql
-- Migration: Remove old column
ALTER TABLE users DROP COLUMN name;
```
### Column Rename Example

1. **Deploy 1:** Add `new_column`, write to both old and new
2. **Deploy 2:** Read from `new_column`, write to both
3. **Deploy 3:** Drop `old_column`
### Safe Migration Practices

```sql
-- SAFE: Adding a nullable column
ALTER TABLE orders ADD COLUMN tracking_url TEXT;

-- SAFE: Adding an index concurrently (Postgres)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

-- DANGEROUS: Renaming a column
-- ALTER TABLE users RENAME COLUMN name TO display_name;
-- Old code will break when looking for "name"

-- DANGEROUS: Adding NOT NULL without default
-- ALTER TABLE orders ADD COLUMN priority INT NOT NULL;
-- Fails for existing rows

-- SAFE alternative:
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0 NOT NULL;
```
### Migration Tooling

Run migrations as an explicit CI/CD step before deploying new code, not during application startup: first run migrations, then deploy, then verify. In `package.json`:

```json
{
  "scripts": {
    "migrate": "prisma migrate deploy",
    "deploy": "npm run migrate && npm run start"
  }
}
```
**Anti-pattern:** Running destructive migrations (`DROP COLUMN`, `RENAME`) while old code is still serving traffic. Always use expand-contract.
## Health Checks

### Layered Health Checks

```javascript
// db and redis are your existing database and cache clients

// Liveness: "Is the process alive?"
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness: "Can this instance serve traffic?"
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');
    await redis.ping();
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});

// Startup: "Has initial setup completed?"
let startupComplete = false;

app.get('/startup', (req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

// After initial setup
async function initialize() {
  await loadCache();
  await warmConnections();
  startupComplete = true;
}
```
### Kubernetes Probe Configuration

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /startup
    port: 3000
  periodSeconds: 5
  failureThreshold: 30  # 30 * 5s = 150s max startup time
```
**Anti-pattern:** Using a single health check for both liveness and readiness. A pod that fails to connect to the database should be marked unready (stop sending traffic), not killed and restarted.
## Rollback Strategies

### Immediate Rollback

```bash
# Kubernetes
kubectl rollout undo deployment/my-app

# Fly.io
fly releases
fly deploy --image registry.fly.io/my-app:previous-sha

# Vercel
vercel rollback

# Docker
docker service update --rollback my-service
```
### Automated Rollback

```yaml
# GitHub Actions
- name: Deploy
  id: deploy
  run: ./deploy.sh
  continue-on-error: true

- name: Smoke test
  id: smoke
  run: ./smoke-test.sh
  continue-on-error: true

- name: Rollback on failure
  if: steps.deploy.outcome == 'failure' || steps.smoke.outcome == 'failure'
  run: ./rollback.sh ${{ env.PREVIOUS_VERSION }}
```
### Rollback Checklist
- Can you identify the last known good version?
- Is the database schema backward-compatible?
- Are there new API contracts that clients depend on?
- Do you need to revert configuration/secrets?
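The first checklist item is easiest to answer if every successful deploy is recorded somewhere queryable. A minimal sketch using an append-only log file; the log path, entry format, and version strings here are illustrative assumptions, not a standard:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical append-only deploy log; any queryable store works.
DEPLOY_LOG="$(mktemp)"

record_deploy() {  # record_deploy VERSION -- call only after smoke tests pass
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "$DEPLOY_LOG"
}

last_known_good() {  # second-to-last entry: the version before the current one
  tail -n 2 "$DEPLOY_LOG" | head -n 1 | awk '{ print $2 }'
}

record_deploy "my-app:41"
record_deploy "my-app:42"
echo "rollback target: $(last_known_good)"   # rollback target: my-app:41
```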
## Feature Flags
Decouple deployment from release. Deploy code dark, then enable via flags:
```javascript
import { getFlag } from './feature-flags';

app.get('/checkout', async (req, res) => {
  const useNewCheckout = await getFlag('new-checkout', {
    userId: req.user.id,
    percentage: 10, // 10% rollout
  });

  if (useNewCheckout) {
    return newCheckoutHandler(req, res);
  }
  return legacyCheckoutHandler(req, res);
});
```
### Simple Flag Implementation

```javascript
// feature-flags.js
const flags = {
  'new-checkout': {
    enabled: true,
    percentage: 10,          // Percentage rollout
    allowlist: ['user-1'],   // Specific users
  },
};

// Deterministic hash so a given user always lands in the same bucket
function simpleHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

export function getFlag(name, context = {}) {
  const flag = flags[name];
  if (!flag || !flag.enabled) return false;
  if (flag.allowlist?.includes(context.userId)) return true;

  // A percentage passed by the caller overrides the configured one
  const percentage = context.percentage ?? flag.percentage;
  if (percentage) {
    const hash = simpleHash(context.userId + name);
    return (hash % 100) < percentage;
  }
  return true;
}
```
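The percentage gate only works if the hash is stable: the same user must land in the same bucket on every request, otherwise users flip between old and new behavior mid-session. The same bucketing can be sketched in shell, using `cksum` as the stable hash (an illustrative choice; any deterministic hash works):

```shell
#!/usr/bin/env bash
set -euo pipefail

bucket() {  # bucket USER_ID FLAG_NAME -> deterministic integer in 0..99
  printf '%s' "$1$2" | cksum | awk '{ print $1 % 100 }'
}

in_rollout() {  # in_rollout USER_ID FLAG_NAME PERCENT -> success if user is in the rollout
  [ "$(bucket "$1" "$2")" -lt "$3" ]
}

# Same user + flag always hashes to the same bucket:
bucket "user-42" "new-checkout"
bucket "user-42" "new-checkout"
```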
**Anti-pattern:** Leaving stale feature flags in the codebase. Track flags and remove them after full rollout.
## Common Anti-Patterns Summary

- **No graceful shutdown:** In-flight requests get dropped during pod termination.
- **Breaking database migrations:** Column renames/drops while old code is running.
- **No automated rollback:** Manual rollback under pressure leads to mistakes.
- **Health checks that lie:** Returning 200 without checking dependencies.
- **Canary without metrics:** Flying blind during progressive rollout.
- **Feature flags without cleanup:** Technical debt accumulates with every flag.
- **Ignoring connection draining:** Load balancers need time to stop sending traffic to old instances.
## Deployment Strategy Decision Matrix

| Strategy | Rollback Speed | Resource Cost | Risk | Complexity |
|---|---|---|---|---|
| Blue-Green | Instant | 2x | Low | Medium |
| Canary | Fast | 1.1x | Low | High |
| Rolling | Moderate | 1.25x | Medium | Low |
| Recreate | Slow | 1x | High | Low |