monitoring-post-deploy

Comprehensive guide to post-deployment monitoring for web applications, covering uptime checks, error tracking with Sentry, application performance monitoring, log aggregation, alerting strategies, public status pages, and incident response procedures for production systems.


# Post-Deployment Monitoring

## Why Monitoring Matters

Deployment is not done when the code is live. It's done when you've verified it works under real traffic. Without monitoring, bugs ship silently, performance degrades unnoticed, and users leave before you realize anything is wrong.

The three pillars of observability: **metrics** (numeric measurements aggregated over time), **logs** (discrete events with context), and **traces** (the path of a request through your services).

## Uptime Checks

### External Uptime Monitoring

External checks verify your app is reachable from the outside, catching issues that internal health checks miss (DNS failures, CDN outages, TLS expiry).

Services: BetterUptime, UptimeRobot, Checkly, Pingdom.
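If you want a quick self-hosted sanity check alongside a hosted service, the core of any probe is classifying the HTTP result; a minimal sketch (function name and thresholds are illustrative, not from any particular service):

```javascript
// Classify the result of an HTTP probe for alerting or status-page updates.
// statusCode is null when the request failed entirely (DNS, TCP, TLS).
function classifyProbe({ statusCode, latencyMs }) {
  if (statusCode == null || statusCode >= 500) return 'down';
  if (statusCode >= 400) return 'degraded';
  if (latencyMs > 2000) return 'degraded'; // responding, but slowly
  return 'up';
}
```

Run probes from a host outside your own infrastructure — probing from the same network defeats the purpose of an external check.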

### Checkly (Programmable Monitoring)

```typescript
// __checks__/homepage.check.ts
import { test, expect } from '@playwright/test';

test('Homepage loads correctly', async ({ page }) => {
  const response = await page.goto('https://example.com');
  expect(response?.status()).toBe(200);

  // Verify critical content is present
  await expect(page.locator('h1')).toBeVisible();
  await expect(page.locator('[data-testid="nav"]')).toBeVisible();
});

test('API health check', async ({ request }) => {
  const response = await request.get('https://api.example.com/health');
  expect(response.ok()).toBeTruthy();

  const body = await response.json();
  expect(body.status).toBe('healthy');
  expect(body.database).toBe('connected');
});
```

### Basic Health Check Endpoint

```javascript
app.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: 'unknown',
    redis: 'unknown',
    memory: process.memoryUsage(),
  };

  try {
    await db.query('SELECT 1');
    checks.database = 'connected';
  } catch {
    checks.database = 'disconnected';
  }

  try {
    await redis.ping();
    checks.redis = 'connected';
  } catch {
    checks.redis = 'disconnected';
  }

  const isHealthy = checks.database === 'connected' && checks.redis === 'connected';
  res.status(isHealthy ? 200 : 503).json(checks);
});
```

Anti-pattern: Health checks that only return 200. Test real dependencies (database, cache, external APIs) to detect actual failures.

## Error Tracking with Sentry

### Setup

```bash
npm install @sentry/node @sentry/profiling-node
```

### Node.js/Express Integration

```javascript
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA || 'unknown',
  integrations: [
    nodeProfilingIntegration(),
  ],
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  profilesSampleRate: 0.1,
});

// Express error handler (must be after all routes)
app.use(Sentry.expressErrorHandler());
```

### Next.js Integration

```javascript
// sentry.client.config.js
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  tracesSampleRate: 0.1,
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
  integrations: [
    Sentry.replayIntegration(),
    Sentry.browserTracingIntegration(),
  ],
});
```

### Custom Error Context

```javascript
app.use((req, res, next) => {
  Sentry.setUser({
    id: req.user?.id,
    email: req.user?.email,
  });

  Sentry.setTag('route', req.route?.path);
  Sentry.setContext('request', {
    method: req.method,
    url: req.url,
    query: req.query,
  });

  next();
});

// Capture custom errors with context
try {
  await processPayment(order);
} catch (error) {
  Sentry.captureException(error, {
    extra: {
      orderId: order.id,
      amount: order.total,
      paymentMethod: order.paymentMethod,
    },
    tags: {
      feature: 'checkout',
      severity: 'critical',
    },
  });
  throw error;
}
```

### Source Maps

```javascript
// next.config.js
const { withSentryConfig } = require('@sentry/nextjs');

module.exports = withSentryConfig(nextConfig, {
  org: 'my-org',
  project: 'my-app',
  authToken: process.env.SENTRY_AUTH_TOKEN,
  silent: true,
  hideSourceMaps: true,  // Don't expose source maps publicly
});
```

### Release Tracking

```bash
# In CI/CD pipeline
export SENTRY_RELEASE=$(git rev-parse --short HEAD)

# Create release
sentry-cli releases new $SENTRY_RELEASE
sentry-cli releases set-commits $SENTRY_RELEASE --auto
sentry-cli sourcemaps upload --release $SENTRY_RELEASE ./dist
sentry-cli releases finalize $SENTRY_RELEASE

# Mark deploy
sentry-cli releases deploys $SENTRY_RELEASE new -e production
```

Anti-pattern: Setting tracesSampleRate: 1.0 in production. This sends every transaction to Sentry, which is expensive and unnecessary. Use 0.05-0.2 for production.
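Beyond a flat rate, recent Sentry SDKs also accept a `tracesSampler` function, which lets you drop known-noisy transactions entirely while sampling the rest; a sketch (route names and rates are illustrative):

```javascript
// Dynamic sampling: never trace health-check traffic, sample everything else at 10%.
function tracesSampler(samplingContext) {
  const name = samplingContext.name || '';
  if (name.includes('/health')) return 0; // drop uptime-probe noise entirely
  return 0.1;
}

// Pass it to Sentry.init({ dsn, tracesSampler }) in place of tracesSampleRate.
```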

## Performance Monitoring

### Web Vitals

Track Core Web Vitals (LCP, CLS, INP, TTFB). Note that FID has been deprecated in favor of INP and was removed in web-vitals v4:

```javascript
// Using the web-vitals library
import { onCLS, onLCP, onINP, onTTFB } from 'web-vitals';

function sendMetric(metric) {
  fetch('/api/metrics', {
    method: 'POST',
    body: JSON.stringify({
      name: metric.name,
      value: metric.value,
      rating: metric.rating,  // 'good', 'needs-improvement', 'poor'
      navigationType: metric.navigationType,
    }),
    headers: { 'Content-Type': 'application/json' },
  });
}

onCLS(sendMetric);
onLCP(sendMetric);
onINP(sendMetric);
onTTFB(sendMetric);
```

### Server-Side Performance

```javascript
import { performance } from 'node:perf_hooks';

// Middleware to track response times
app.use((req, res, next) => {
  const start = performance.now();

  res.on('finish', () => {
    const duration = performance.now() - start;

    // Log slow requests
    if (duration > 1000) {
      console.warn(`Slow request: ${req.method} ${req.path} - ${duration.toFixed(0)}ms`);
    }

    // Send to metrics system
    metrics.histogram('http_request_duration_ms', duration, {
      method: req.method,
      route: req.route?.path || 'unknown',
      status: res.statusCode,
    });
  });

  next();
});
```
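The `metrics.histogram` call above assumes a metrics client (Datadog, StatsD, Prometheus, etc.). For local experimentation, a toy in-memory stand-in shows the shape of the data — this is a sketch, not a production metrics library:

```javascript
// Toy histogram recorder: keeps raw values per metric name.
const store = new Map();

const metrics = {
  histogram(name, value, tags = {}) {
    if (!store.has(name)) store.set(name, []);
    store.get(name).push(value);
  },
};

// Compute a percentile (e.g. p99) over recorded values.
function percentile(name, p) {
  const values = [...(store.get(name) || [])].sort((a, b) => a - b);
  if (values.length === 0) return null;
  const idx = Math.min(values.length - 1, Math.ceil((p / 100) * values.length) - 1);
  return values[idx];
}
```

Real clients aggregate into buckets instead of keeping raw values, but the interface — record a value with tags, query percentiles later — is the same.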

### Database Query Monitoring

```javascript
// Prisma query logging
const prisma = new PrismaClient({
  log: [
    { level: 'query', emit: 'event' },
    { level: 'warn', emit: 'stdout' },
    { level: 'error', emit: 'stdout' },
  ],
});

prisma.$on('query', (e) => {
  if (e.duration > 100) {
    console.warn(`Slow query (${e.duration}ms): ${e.query}`);
    Sentry.addBreadcrumb({
      category: 'db',
      message: `Slow query: ${e.duration}ms`,
      level: 'warning',
    });
  }
});
```

## Log Aggregation

### Structured Logging

```javascript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: 'my-app',
    environment: process.env.NODE_ENV,
    version: process.env.GIT_SHA,
  },
});

// Usage
logger.info({ userId: user.id, action: 'login' }, 'User logged in');
logger.error({ err, orderId: order.id }, 'Payment processing failed');
logger.warn({ queryTime: 1500, query: 'SELECT ...' }, 'Slow database query');
```

### Request Logging Middleware

```javascript
import crypto from 'node:crypto';

app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();
  const childLogger = logger.child({ requestId });

  req.log = childLogger;
  res.setHeader('x-request-id', requestId);

  const start = Date.now();
  res.on('finish', () => {
    childLogger.info({
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration: Date.now() - start,
      userAgent: req.get('user-agent'),
      ip: req.ip,
    }, 'Request completed');
  });

  next();
});
```

### Log Destinations

Services: Datadog, Grafana/Loki, Axiom, Logtail, AWS CloudWatch, Papertrail.

```javascript
// Pino with Axiom transport
import pino from 'pino';

const logger = pino({
  transport: {
    targets: [
      {
        target: 'pino-pretty',
        level: 'debug',
        options: { destination: 1 },  // stdout for local dev
      },
      {
        target: '@axiomhq/pino',
        level: 'info',
        options: {
          dataset: 'my-app',
          token: process.env.AXIOM_TOKEN,
        },
      },
    ],
  },
});
```

Anti-pattern: Using console.log with unstructured strings. Structured JSON logs are searchable, filterable, and parseable by log aggregation tools.

## Alerting

### Alert Levels

| Level    | Response Time     | Examples                                      |
|----------|-------------------|-----------------------------------------------|
| Critical | < 5 min           | Site down, data loss, security breach         |
| High     | < 30 min          | Error rate spike, payment failures            |
| Medium   | < 4 hours         | Performance degradation, disk at 80%          |
| Low      | Next business day | Non-critical warnings, cert expiry in 30 days |

### Alert Rules

```yaml
# Example: Datadog monitor configuration
- name: High Error Rate
  type: metric alert
  query: "sum(last_5m):sum:http.errors{env:production} / sum:http.requests{env:production} > 0.05"
  message: |
    Error rate is above 5% in production.
    Current value: {{value}}
    @slack-oncall @pagerduty-critical
  thresholds:
    critical: 0.05
    warning: 0.02

- name: Response Time P99
  type: metric alert
  query: "avg(last_10m):p99:http.request_duration{env:production} > 2000"
  message: |
    P99 response time exceeds 2 seconds.
    @slack-engineering
  thresholds:
    critical: 2000
    warning: 1000

- name: Memory Usage
  type: metric alert
  query: "avg(last_5m):system.mem.pct_usable{env:production} < 0.1"
  message: |
    Memory usage is above 90%.
    @slack-oncall
```

### Avoiding Alert Fatigue

1. **Actionable alerts only**: Every alert should require a human action.
2. **Set appropriate thresholds**: Too sensitive = noise; too lenient = missed incidents.
3. **Group related alerts**: Don't send 50 alerts for one database outage.
4. **Use warning vs critical**: Warning = investigate soon; Critical = wake someone up.

Anti-pattern: Alerting on every 500 error. Set a rate threshold (e.g., >1% error rate over 5 minutes), not individual events.
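The rate-over-window idea can be sketched in a few lines (window, threshold, and minimum sample size are illustrative values):

```javascript
// Track requests in a sliding window and alert on the error *rate*,
// not on individual errors.
class ErrorRateMonitor {
  constructor({ windowMs = 5 * 60 * 1000, threshold = 0.01, minRequests = 100 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.minRequests = minRequests; // avoid alerting on tiny samples
    this.events = [];
  }

  record(isError, now = Date.now()) {
    this.events.push({ t: now, isError });
  }

  shouldAlert(now = Date.now()) {
    // Drop events that have aged out of the window.
    this.events = this.events.filter((e) => now - e.t <= this.windowMs);
    if (this.events.length < this.minRequests) return false;
    const errors = this.events.filter((e) => e.isError).length;
    return errors / this.events.length > this.threshold;
  }
}
```

In practice your metrics platform does this for you (as in the Datadog query above); the point is that the alert condition is a ratio over a window, never a single event.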

## Status Pages

### Hosted Status Pages

Services: Instatus, Statuspage (Atlassian), BetterUptime Status, Cachet (self-hosted).

### Components to Track

```text
Status Page: status.example.com

Components:
  - Website           [Operational]
  - API               [Operational]
  - Dashboard         [Operational]
  - Database          [Operational]
  - Background Jobs   [Degraded Performance]
  - Email Delivery    [Operational]

Metrics:
  - API Response Time (p50, p95, p99)
  - Uptime percentage (30-day rolling)
```
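The rolling uptime percentage is simple arithmetic over the window; a sketch:

```javascript
// Uptime over a rolling window, given total recorded downtime in minutes.
function uptimePercent(downtimeMinutes, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60; // 43,200 for 30 days
  return 100 * (1 - downtimeMinutes / totalMinutes);
}
```

This makes error budgets concrete: "three nines" (99.9%) over 30 days allows about 43 minutes of downtime.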

### Automated Status Updates

```javascript
// Update status page component via API (Instatus example)
async function updateComponentStatus(componentId, status) {
  await fetch(`https://api.instatus.com/v2/pages/${PAGE_ID}/components/${componentId}`, {
    method: 'PUT',
    headers: {
      'Authorization': `Bearer ${process.env.INSTATUS_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ status }),  // OPERATIONAL, DEGRADEDPERFORMANCE, PARTIALOUTAGE, MAJOROUTAGE
  });
}

// In your health check
const isHealthy = await checkServices();
if (!isHealthy) {
  await updateComponentStatus(API_COMPONENT_ID, 'PARTIALOUTAGE');
  await createIncident('API experiencing elevated error rates');
}
```

## Incident Response

### Incident Workflow

```text
1. DETECT   → Automated alert fires
2. TRIAGE   → Assess severity and impact
3. RESPOND  → Assign responder, begin mitigation
4. MITIGATE → Restore service (rollback, scale, hotfix)
5. RESOLVE  → Confirm service is restored
6. REVIEW   → Post-incident review (blameless)
```

### Post-Deploy Verification Checklist

Run this after every production deployment:

```bash
#!/bin/bash
# post-deploy-check.sh

APP_URL="https://example.com"
API_URL="https://api.example.com"
METRICS_URL="https://metrics.example.com"  # Prometheus-compatible metrics API; adjust to your setup

echo "Running post-deploy verification..."

# 1. Health check
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "FAIL: Health check returned $HTTP_STATUS"
  exit 1
fi

# 2. Homepage loads
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$APP_URL")
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "FAIL: Homepage returned $HTTP_STATUS"
  exit 1
fi

# 3. Check error rate (via metrics API)
ERROR_RATE=$(curl -s "$METRICS_URL/api/v1/query?query=rate(http_errors_total[5m])" | jq '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "WARN: Error rate elevated at $ERROR_RATE"
fi

# 4. Check response time
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" "$API_URL/health")
if (( $(echo "$RESPONSE_TIME > 2.0" | bc -l) )); then
  echo "WARN: Response time high at ${RESPONSE_TIME}s"
fi

echo "Post-deploy verification complete."
```

### Post-Incident Review Template

```markdown
## Incident: [Brief description]
**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: Critical / High / Medium
**Impact**: [What users experienced]

## Timeline
- HH:MM - Alert fired
- HH:MM - Responder acknowledged
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Service restored

## Root Cause
[What actually broke and why]

## What Went Well
- [Detection was fast]
- [Rollback worked]

## What Went Wrong
- [Alert was noisy, took time to find signal]
- [Runbook was outdated]

## Action Items
- [ ] [Specific improvement with owner and deadline]
- [ ] [Update runbook for this scenario]
- [ ] [Add monitoring for early detection]
```

## Common Anti-Patterns

1. **No monitoring until something breaks**: Set up monitoring before the first deploy, not after the first incident.
2. **Alert fatigue**: Too many non-actionable alerts means real alerts get ignored.
3. **Unstructured logging**: `console.log("error happened")` is useless for debugging at scale.
4. **No post-deploy verification**: "It deployed successfully" is not the same as "It works correctly".
5. **Monitoring only the happy path**: Monitor error rates, slow queries, and edge cases, not just uptime.
6. **No runbooks**: When the pager fires at 3 AM, you need step-by-step instructions, not tribal knowledge.
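A runbook doesn't need to be elaborate; a minimal skeleton (sections are a suggestion — adapt to your stack):

```markdown
# Runbook: [Alert name]

## What this alert means
[One sentence: what is broken from the user's perspective]

## First checks (in order)
1. [Dashboard or status page to look at]
2. [Recent deploys? Link to deploy history]
3. [Dependency status: database, cache, third-party APIs]

## Mitigation options
- [Rollback command]
- [Scale-up command]
- [Feature flag to disable]

## Escalation
- [Who to page next, and when]
```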

## Monitoring Stack Checklist

- [ ] Uptime monitoring configured (external checks every 1-5 minutes)
- [ ] Error tracking (Sentry) initialized with source maps
- [ ] Structured logging with log aggregation
- [ ] Performance metrics collected (response times, Core Web Vitals)
- [ ] Alerting rules defined with appropriate thresholds
- [ ] Status page created and linked from your app
- [ ] Post-deploy verification script automated
- [ ] Incident response process documented
- [ ] On-call rotation established (if team size warrants)
