Post-Deployment Monitoring

Comprehensive guide to post-deployment monitoring for web applications, covering uptime checks, error tracking with Sentry, application performance monitoring, log aggregation, alerting strategies, public status pages, and incident response procedures for production systems.
Why Monitoring Matters
Deployment is not done when the code is live. It's done when you've verified it works under real traffic. Without monitoring, bugs ship silently, performance degrades unnoticed, and users leave before you realize anything is wrong.
The three pillars of observability: Metrics, Logs, Traces.
Uptime Checks
External Uptime Monitoring
External checks verify your app is reachable from the outside, catching issues that internal health checks miss (DNS failures, CDN outages, TLS expiry).
Services: BetterUptime, UptimeRobot, Checkly, Pingdom.
Checkly (Programmable Monitoring)
```ts
// __checks__/homepage.check.ts
import { test, expect } from '@playwright/test';

test('Homepage loads correctly', async ({ page }) => {
  const response = await page.goto('https://example.com');
  expect(response?.status()).toBe(200);

  // Verify critical content is present
  await expect(page.locator('h1')).toBeVisible();
  await expect(page.locator('[data-testid="nav"]')).toBeVisible();
});

test('API health check', async ({ request }) => {
  const response = await request.get('https://api.example.com/health');
  expect(response.ok()).toBeTruthy();

  const body = await response.json();
  expect(body.status).toBe('healthy');
  expect(body.database).toBe('connected');
});
```
Basic Health Check Endpoint
```js
app.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: 'unknown',
    redis: 'unknown',
    memory: process.memoryUsage(),
  };

  try {
    await db.query('SELECT 1');
    checks.database = 'connected';
  } catch {
    checks.database = 'disconnected';
  }

  try {
    await redis.ping();
    checks.redis = 'connected';
  } catch {
    checks.redis = 'disconnected';
  }

  const isHealthy = checks.database === 'connected' && checks.redis === 'connected';
  res.status(isHealthy ? 200 : 503).json(checks);
});
```
Anti-pattern: Health checks that only return 200. Test real dependencies (database, cache, external APIs) to detect actual failures.
Error Tracking with Sentry
Setup
```bash
npm install @sentry/node @sentry/profiling-node
```
Node.js/Express Integration
```js
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA || 'unknown',
  integrations: [
    nodeProfilingIntegration(),
  ],
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  profilesSampleRate: 0.1,
});

// Express error handler (must be after all routes)
app.use(Sentry.expressErrorHandler());
```
Next.js Integration
```js
// sentry.client.config.js
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1,
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
  integrations: [
    Sentry.replayIntegration(),
    Sentry.browserTracingIntegration(),
  ],
});
```
Custom Error Context
```js
app.use((req, res, next) => {
  Sentry.setUser({
    id: req.user?.id,
    email: req.user?.email,
  });
  Sentry.setTag('route', req.route?.path);
  Sentry.setContext('request', {
    method: req.method,
    url: req.url,
    query: req.query,
  });
  next();
});

// Capture custom errors with context
try {
  await processPayment(order);
} catch (error) {
  Sentry.captureException(error, {
    extra: {
      orderId: order.id,
      amount: order.total,
      paymentMethod: order.paymentMethod,
    },
    tags: {
      feature: 'checkout',
      severity: 'critical',
    },
  });
  throw error;
}
```
Source Maps
```js
// next.config.js
const { withSentryConfig } = require('@sentry/nextjs');

module.exports = withSentryConfig(nextConfig, {
  org: 'my-org',
  project: 'my-app',
  authToken: process.env.SENTRY_AUTH_TOKEN,
  silent: true,
  hideSourceMaps: true, // Don't expose source maps publicly
});
```
Release Tracking
```bash
# In CI/CD pipeline
export SENTRY_RELEASE=$(git rev-parse --short HEAD)

# Create release
sentry-cli releases new $SENTRY_RELEASE
sentry-cli releases set-commits $SENTRY_RELEASE --auto
sentry-cli sourcemaps upload --release $SENTRY_RELEASE ./dist
sentry-cli releases finalize $SENTRY_RELEASE

# Mark deploy
sentry-cli releases deploys $SENTRY_RELEASE new -e production
```
Anti-pattern: Setting tracesSampleRate: 1.0 in production. This sends every transaction to Sentry, which is expensive and unnecessary. Use 0.05-0.2 for production.
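A flat rate is also a blunt instrument. Sentry's `tracesSampler` option lets you set the rate per transaction; the route names and rates below are illustrative, and the sampling-context shape differs slightly across SDK versions:

```javascript
// Sketch: per-route sampling via tracesSampler instead of one flat rate.
function tracesSampler(ctx) {
  // ctx.name in newer SDKs; ctx.transactionContext.name in older ones
  const name = ctx.name || ctx.transactionContext?.name || '';
  if (name.includes('/health')) return 0;     // never trace health checks
  if (name.includes('/checkout')) return 0.5; // oversample critical flows
  return process.env.NODE_ENV === 'production' ? 0.05 : 1.0;
}

// Passed instead of tracesSampleRate:
// Sentry.init({ dsn: process.env.SENTRY_DSN, tracesSampler });
```

This keeps noisy endpoints out of your quota while giving revenue-critical paths a denser trace sample.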
Performance Monitoring
Web Vitals
Track the Core Web Vitals (LCP, CLS, INP) plus supporting metrics such as TTFB. Note that INP replaced FID as a Core Web Vital in 2024, and `onFID` was removed in web-vitals v4:
```js
// Using the web-vitals library (v4+)
import { onCLS, onLCP, onINP, onTTFB } from 'web-vitals';

function sendMetric(metric) {
  fetch('/api/metrics', {
    method: 'POST',
    body: JSON.stringify({
      name: metric.name,
      value: metric.value,
      rating: metric.rating, // 'good', 'needs-improvement', 'poor'
      navigationType: metric.navigationType,
    }),
    headers: { 'Content-Type': 'application/json' },
  });
}

onCLS(sendMetric);
onLCP(sendMetric);
onINP(sendMetric);
onTTFB(sendMetric);
```
Server-Side Performance
```js
import { performance } from 'node:perf_hooks';

// Middleware to track response times
app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    const duration = performance.now() - start;

    // Log slow requests
    if (duration > 1000) {
      console.warn(`Slow request: ${req.method} ${req.path} - ${duration.toFixed(0)}ms`);
    }

    // Send to metrics system
    metrics.histogram('http_request_duration_ms', duration, {
      method: req.method,
      route: req.route?.path || 'unknown',
      status: res.statusCode,
    });
  });
  next();
});
```
Database Query Monitoring
```js
// Prisma query logging
const prisma = new PrismaClient({
  log: [
    { level: 'query', emit: 'event' },
    { level: 'warn', emit: 'stdout' },
    { level: 'error', emit: 'stdout' },
  ],
});

prisma.$on('query', (e) => {
  if (e.duration > 100) {
    console.warn(`Slow query (${e.duration}ms): ${e.query}`);
    Sentry.addBreadcrumb({
      category: 'db',
      message: `Slow query: ${e.duration}ms`,
      level: 'warning',
    });
  }
});
```
Log Aggregation
Structured Logging
```js
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: 'my-app',
    environment: process.env.NODE_ENV,
    version: process.env.GIT_SHA,
  },
});

// Usage
logger.info({ userId: user.id, action: 'login' }, 'User logged in');
logger.error({ err, orderId: order.id }, 'Payment processing failed');
logger.warn({ queryTime: 1500, query: 'SELECT ...' }, 'Slow database query');
```
Request Logging Middleware
```js
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();
  const childLogger = logger.child({ requestId });
  req.log = childLogger;
  res.setHeader('x-request-id', requestId);

  const start = Date.now();
  res.on('finish', () => {
    childLogger.info({
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration: Date.now() - start,
      userAgent: req.get('user-agent'),
      ip: req.ip,
    }, 'Request completed');
  });
  next();
});
```
Log Destinations
Services: Datadog, Grafana/Loki, Axiom, Logtail, AWS CloudWatch, Papertrail.
```js
// Pino with Axiom transport
import pino from 'pino';

const logger = pino({
  transport: {
    targets: [
      {
        target: 'pino-pretty',
        level: 'debug',
        options: { destination: 1 }, // stdout for local dev
      },
      {
        target: '@axiomhq/pino',
        level: 'info',
        options: {
          dataset: 'my-app',
          token: process.env.AXIOM_TOKEN,
        },
      },
    ],
  },
});
```
Anti-pattern: Using console.log with unstructured strings. Structured JSON logs are searchable, filterable, and parseable by log aggregation tools.
Alerting
Alert Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical | < 5 min | Site down, data loss, security breach |
| High | < 30 min | Error rate spike, payment failures |
| Medium | < 4 hours | Performance degradation, disk 80% |
| Low | Next business day | Non-critical warnings, cert expiry 30d |
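The table's response-time expectations are only useful if routing enforces them. A sketch of severity-based routing (channel names are hypothetical):

```javascript
// Map each alert severity to the channels that match its response-time SLA.
const ROUTES = {
  critical: ['pagerduty', 'slack-oncall'],   // wake someone up
  high: ['slack-oncall'],                    // on-call investigates soon
  medium: ['slack-engineering'],             // team channel, business hours
  low: ['email-digest'],                     // batched, next business day
};

function routeAlert(alert) {
  // Unknown severities degrade to the least disruptive channel
  const channels = ROUTES[alert.severity] || ROUTES.low;
  return channels.map((channel) => ({ channel, title: alert.title }));
}
```

Encoding this mapping in one place keeps individual monitors from inventing their own paging rules.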
Alert Rules
```yaml
# Example: Datadog monitor configuration
- name: High Error Rate
  type: metric alert
  query: "sum(last_5m):sum:http.errors{env:production} / sum:http.requests{env:production} > 0.05"
  message: |
    Error rate is above 5% in production.
    Current value: {{value}}
    @slack-oncall @pagerduty-critical
  thresholds:
    critical: 0.05
    warning: 0.02

- name: Response Time P99
  type: metric alert
  query: "avg(last_10m):p99:http.request_duration{env:production} > 2000"
  message: |
    P99 response time exceeds 2 seconds.
    @slack-engineering
  thresholds:
    critical: 2000
    warning: 1000

- name: Memory Usage
  type: metric alert
  query: "avg(last_5m):system.mem.pct_usable{env:production} < 0.1"
  message: |
    Memory usage is above 90%.
    @slack-oncall
```
Avoiding Alert Fatigue
- Actionable alerts only: Every alert should require a human action.
- Set appropriate thresholds: Too sensitive = noise; too lenient = missed incidents.
- Group related alerts: Don't send 50 alerts for one database outage.
- Use warning vs critical: Warning = investigate soon; Critical = wake someone up.
Anti-pattern: Alerting on every 500 error. Set a rate threshold (e.g., >1% error rate over 5 minutes), not individual events.
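The rate-threshold idea in miniature, using the example values from the text (5-minute window, 1% threshold):

```javascript
const WINDOW_MS = 5 * 60 * 1000; // evaluate over the last 5 minutes
const events = [];               // { ts, isError }

function record(status, ts = Date.now()) {
  events.push({ ts, isError: status >= 500 });
}

function errorRate(now = Date.now()) {
  // Only events inside the sliding window count toward the rate
  const recent = events.filter((e) => now - e.ts <= WINDOW_MS);
  return recent.length ? recent.filter((e) => e.isError).length / recent.length : 0;
}

function shouldAlert(now = Date.now()) {
  return errorRate(now) > 0.01; // fire only when errors exceed 1% of traffic
}
```

One stray 500 among hundreds of requests stays below the threshold; a genuine spike crosses it and fires exactly one alert.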
Status Pages
Hosted Status Pages
Services: Instatus, Statuspage (Atlassian), BetterUptime Status, Cachet (self-hosted).
Components to Track
```text
Status Page: status.example.com

Components:
  - Website           [Operational]
  - API               [Operational]
  - Dashboard         [Operational]
  - Database          [Operational]
  - Background Jobs   [Degraded Performance]
  - Email Delivery    [Operational]

Metrics:
  - API Response Time (p50, p95, p99)
  - Uptime percentage (30-day rolling)
```
Automated Status Updates
```js
// Update status page component via API (Instatus example)
async function updateComponentStatus(componentId, status) {
  await fetch(`https://api.instatus.com/v2/pages/${PAGE_ID}/components/${componentId}`, {
    method: 'PUT',
    headers: {
      'Authorization': `Bearer ${process.env.INSTATUS_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ status }), // OPERATIONAL, DEGRADEDPERFORMANCE, PARTIALOUTAGE, MAJOROUTAGE
  });
}

// In your health check
const isHealthy = await checkServices();
if (!isHealthy) {
  await updateComponentStatus(API_COMPONENT_ID, 'PARTIALOUTAGE');
  await createIncident('API experiencing elevated error rates');
}
```
Incident Response
Incident Workflow
1. DETECT → Automated alert fires
2. TRIAGE → Assess severity and impact
3. RESPOND → Assign responder, begin mitigation
4. MITIGATE → Restore service (rollback, scale, hotfix)
5. RESOLVE → Confirm service is restored
6. REVIEW → Post-incident review (blameless)
Post-Deploy Verification Checklist
Run this after every production deployment:
```bash
#!/bin/bash
# post-deploy-check.sh
APP_URL="https://example.com"
API_URL="https://api.example.com"

echo "Running post-deploy verification..."

# 1. Health check
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "FAIL: Health check returned $HTTP_STATUS"
  exit 1
fi

# 2. Homepage loads
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL")
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "FAIL: Homepage returned $HTTP_STATUS"
  exit 1
fi

# 3. Check error rate (via metrics API); jq -r strips the quotes so bc can parse the number
ERROR_RATE=$(curl -s "$METRICS_URL/api/v1/query?query=rate(http_errors_total[5m])" | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "WARN: Error rate elevated at $ERROR_RATE"
fi

# 4. Check response time
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$API_URL/health")
if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
  echo "WARN: Response time high at ${RESPONSE_TIME}s"
fi

echo "Post-deploy verification complete."
```
Post-Incident Review Template
```markdown
## Incident: [Brief description]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: Critical / High / Medium
**Impact**: [What users experienced]

## Timeline
- HH:MM - Alert fired
- HH:MM - Responder acknowledged
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Service restored

## Root Cause
[What actually broke and why]

## What Went Well
- [Detection was fast]
- [Rollback worked]

## What Went Wrong
- [Alert was noisy, took time to find signal]
- [Runbook was outdated]

## Action Items
- [ ] [Specific improvement with owner and deadline]
- [ ] [Update runbook for this scenario]
- [ ] [Add monitoring for early detection]
```
Common Anti-Patterns
- No monitoring until something breaks: Set up monitoring before the first deploy, not after the first incident.
- Alert fatigue: Too many non-actionable alerts means real alerts get ignored.
- Unstructured logging: console.log("error happened") is useless for debugging at scale.
- No post-deploy verification: "It deployed successfully" is not the same as "It works correctly".
- Monitoring only the happy path: Monitor error rates, slow queries, and edge cases, not just uptime.
- No runbooks: When the pager fires at 3 AM, you need step-by-step instructions, not tribal knowledge.
Monitoring Stack Checklist
- Uptime monitoring configured (external checks every 1-5 minutes)
- Error tracking (Sentry) initialized with source maps
- Structured logging with log aggregation
- Performance metrics collected (response times, Core Web Vitals)
- Alerting rules defined with appropriate thresholds
- Status page created and linked from your app
- Post-deploy verification script automated
- Incident response process documented
- On-call rotation established (if team size warrants)