
Datadog Monitoring Skill

Core Philosophy

Datadog is a comprehensive observability platform spanning infrastructure, APM, logs, and real user monitoring. Its guiding principles are:

  • Three pillars in one platform — metrics, traces, and logs should be correlated in a single view. When a metric spikes, click through to the trace that caused it, then to the log lines within that trace.
  • Custom metrics drive business observability — infrastructure metrics are table stakes; the real value comes from tracking business KPIs (orders per minute, payment failures, queue depth) as custom metrics.
  • Distributed tracing connects the dots — in microservice architectures, a single request touches many services. APM traces show the full call chain with latency breakdowns per service.
  • Dashboards are communication tools — a well-built dashboard answers questions before they are asked. Build them for audiences (engineering, product, leadership), not just for yourself.
  • Tag everything consistently — tags like env, service, version, and team applied uniformly across metrics, traces, and logs enable powerful filtering and correlation.

Setup

Node.js APM Tracing

```typescript
// instrument.ts — must be imported BEFORE any other module
import tracer from "dd-trace";

tracer.init({
  service: process.env.DD_SERVICE ?? "api-service",
  env: process.env.DD_ENV ?? "production",
  version: process.env.DD_VERSION ?? "1.0.0",
  logInjection: true,
  runtimeMetrics: true,
  profiling: true,
  appsec: true,
  plugins: true,
});

export default tracer;
```

```typescript
// Entry point: import instrument first
import "./instrument";
import { createServer } from "./server";

createServer();
```

Structured Logging with Trace Correlation

```typescript
// lib/logger.ts
import winston from "winston";

const logger = winston.createLogger({
  level: "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.DD_SERVICE,
    env: process.env.DD_ENV,
    version: process.env.DD_VERSION,
  },
  transports: [new winston.transports.Console()],
});

export { logger };

// dd-trace's logInjection:true automatically adds
// dd.trace_id, dd.span_id, and dd.service to every log line,
// enabling Datadog to link logs to the originating trace.
```
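With log injection enabled, each JSON log line carries a `dd` block alongside the logger's own fields. An illustrative sample (all values are placeholders; exact field placement varies by logger and dd-trace version):

```json
{
  "level": "info",
  "message": "order created",
  "service": "api-service",
  "env": "production",
  "version": "1.0.0",
  "dd": {
    "trace_id": "1234567890123456789",
    "span_id": "9876543210987654321",
    "service": "api-service",
    "env": "production",
    "version": "1.0.0"
  },
  "timestamp": "2024-01-01T00:00:00.000Z"
}
```

Datadog's log pipeline matches `dd.trace_id` against APM, which is what powers the one-click jump from a trace to its logs.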

Real User Monitoring (RUM)

```typescript
// lib/datadog-rum.ts
import { datadogRum } from "@datadog/browser-rum";

export function initDatadogRUM() {
  datadogRum.init({
    applicationId: process.env.NEXT_PUBLIC_DD_RUM_APP_ID!,
    clientToken: process.env.NEXT_PUBLIC_DD_RUM_CLIENT_TOKEN!,
    site: "datadoghq.com",
    service: "web-app",
    env: process.env.NODE_ENV,
    version: process.env.NEXT_PUBLIC_APP_VERSION,
    sessionSampleRate: 100,
    sessionReplaySampleRate: 20,
    trackUserInteractions: true,
    trackResources: true,
    trackLongTasks: true,
    defaultPrivacyLevel: "mask-user-input",
    allowedTracingUrls: [
      { match: /https:\/\/api\.example\.com/, propagatorTypes: ["tracecontext"] },
    ],
  });
}

export function setRUMUser(user: { id: string; name: string; plan: string }) {
  datadogRum.setUser({
    id: user.id,
    name: user.name,
    plan: user.plan,
  });
}

export function trackRUMAction(name: string, context?: Record<string, unknown>) {
  datadogRum.addAction(name, context);
}
```

Key Techniques

Custom Metrics

```typescript
// lib/metrics.ts
import StatsD from "hot-shots";

const dogstatsd = new StatsD({
  host: process.env.DD_AGENT_HOST ?? "localhost",
  port: 8125,
  prefix: "app.",
  globalTags: {
    env: process.env.DD_ENV!,
    service: process.env.DD_SERVICE!,
    version: process.env.DD_VERSION!,
  },
  errorHandler(error) {
    console.error("StatsD error:", error);
  },
});

export const metrics = {
  increment(name: string, tags?: Record<string, string>) {
    dogstatsd.increment(name, 1, formatTags(tags));
  },

  gauge(name: string, value: number, tags?: Record<string, string>) {
    dogstatsd.gauge(name, value, formatTags(tags));
  },

  histogram(name: string, value: number, tags?: Record<string, string>) {
    dogstatsd.histogram(name, value, formatTags(tags));
  },

  timing(name: string, startMs: number, tags?: Record<string, string>) {
    dogstatsd.timing(name, Date.now() - startMs, formatTags(tags));
  },

  async trackAsync<T>(
    name: string,
    fn: () => Promise<T>,
    tags?: Record<string, string>
  ): Promise<T> {
    const start = Date.now();
    try {
      const result = await fn();
      metrics.timing(name, start, { ...tags, status: "success" });
      metrics.increment(`${name}.success`, tags);
      return result;
    } catch (error) {
      metrics.timing(name, start, { ...tags, status: "error" });
      metrics.increment(`${name}.error`, tags);
      throw error;
    }
  },
};

// Convert { key: "value" } records into DogStatsD's "key:value" tag strings.
function formatTags(tags?: Record<string, string>): string[] {
  if (!tags) return [];
  return Object.entries(tags).map(([k, v]) => `${k}:${v}`);
}
```
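`trackAsync` emits one timing (tagged with a `status`) plus a success or error counter per call. A self-contained sketch of that pattern against an in-memory recorder instead of a live DogStatsD client, to show exactly which metrics each path produces:

```typescript
// Sketch of the trackAsync pattern: records emissions in memory
// so the success/error tagging can be inspected without an agent.
type Tags = Record<string, string>;

const emitted: { kind: string; name: string; tags?: Tags }[] = [];

const metrics = {
  timing(name: string, _startMs: number, tags?: Tags) {
    emitted.push({ kind: "timing", name, tags });
  },
  increment(name: string, tags?: Tags) {
    emitted.push({ kind: "count", name, tags });
  },
  async trackAsync<T>(name: string, fn: () => Promise<T>, tags?: Tags): Promise<T> {
    const start = Date.now();
    try {
      const result = await fn();
      metrics.timing(name, start, { ...tags, status: "success" });
      metrics.increment(`${name}.success`, tags);
      return result;
    } catch (error) {
      metrics.timing(name, start, { ...tags, status: "error" });
      metrics.increment(`${name}.error`, tags);
      throw error; // rethrow so callers still see the failure
    }
  },
};

// Success path: one timing tagged status:success, one success counter.
await metrics.trackAsync("orders.process", async () => "ok", { region: "us" });
// Error path: one timing tagged status:error, one error counter; rethrows.
await metrics
  .trackAsync("orders.process", async () => { throw new Error("boom"); })
  .catch(() => {});
```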

Custom Span Instrumentation

```typescript
// lib/tracing.ts
import tracer from "dd-trace";
import type { Span } from "dd-trace";

export async function traced<T>(
  operationName: string,
  resourceName: string,
  fn: (span: Span) => Promise<T>,
  options?: { service?: string; type?: string; tags?: Record<string, string> }
): Promise<T> {
  return tracer.trace(
    operationName,
    {
      resource: resourceName,
      service: options?.service,
      type: options?.type ?? "custom",
      tags: options?.tags,
    },
    async (span) => {
      try {
        const result = await fn(span);
        return result;
      } catch (error) {
        span.setTag("error", true);
        if (error instanceof Error) {
          span.setTag("error.message", error.message);
          span.setTag("error.stack", error.stack);
        }
        throw error;
      }
    }
  );
}

// Usage (db and stripe are assumed application clients)
async function processOrder(orderId: string) {
  return traced("process_order", `order:${orderId}`, async (span) => {
    span.setTag("order.id", orderId);

    const order = await traced("fetch_order", "db.query", async () => {
      return db.order.findUnique({ where: { id: orderId } });
    }, { type: "sql" });

    await traced("charge_payment", "payment.stripe", async (paymentSpan) => {
      paymentSpan.setTag("payment.amount", order!.total);
      return stripe.charges.create({ amount: order!.total });
    }, { service: "payment-service" });

    return order;
  });
}
```

Express/Fastify Middleware for Request Metrics

```typescript
// middleware/datadog-metrics.ts
import type { Request, Response, NextFunction } from "express";
import { metrics } from "@/lib/metrics";

export function datadogRequestMetrics(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const start = Date.now();

  res.on("finish", () => {
    const route = req.route?.path ?? req.path;
    const tags = {
      method: req.method,
      route,
      status_code: String(res.statusCode),
      status_class: `${Math.floor(res.statusCode / 100)}xx`,
    };

    metrics.timing("http.request.duration", start, tags);
    metrics.increment("http.request.count", tags);

    if (res.statusCode >= 500) {
      metrics.increment("http.request.error", tags);
    }
  });

  next();
}
```
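The `status_class` tag buckets responses into families, which keeps tag cardinality low (a handful of values instead of one per status code) while still separating client errors from server errors. Extracted as a standalone helper:

```typescript
// Derive the status_class tag used by the middleware above:
// 200 → "2xx", 404 → "4xx", 503 → "5xx".
export function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}
```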

Monitor Definitions as Code

```typescript
// monitors/api-latency.ts
// Use with Datadog Terraform provider or API

interface DatadogMonitor {
  name: string;
  type: "metric alert" | "log alert" | "apm alert";
  query: string;
  message: string;
  tags: string[];
  thresholds: { critical: number; warning?: number };
}

const monitors: DatadogMonitor[] = [
  {
    name: "[API] P99 latency exceeds 2s",
    type: "metric alert",
    query: "avg(last_5m):percentile(trace.express.request.duration, p99){env:production,service:api-service} > 2",
    message: "@slack-engineering @pagerduty-api-team P99 latency is above 2 seconds. Check recent deploys and database query times.",
    tags: ["team:backend", "service:api-service", "env:production"],
    thresholds: { critical: 2, warning: 1.5 },
  },
  {
    name: "[API] Error rate exceeds 5%",
    type: "metric alert",
    query: "sum(last_5m):sum:http.request.error{env:production,service:api-service}.as_count() / sum:http.request.count{env:production,service:api-service}.as_count() * 100 > 5",
    message: "@slack-engineering Error rate has exceeded 5%. Review error traces in APM.",
    tags: ["team:backend", "service:api-service"],
    thresholds: { critical: 5, warning: 2 },
  },
];

export { monitors };
```
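Definitions like these can be pushed through Datadog's v1 Monitors API (`POST /api/v1/monitor`, authenticated with `DD-API-KEY` and `DD-APPLICATION-KEY` headers). A minimal sketch; the payload mapping follows the public API as commonly documented (thresholds live under `options`), but verify field names against current Datadog docs before relying on it:

```typescript
// Sketch: push monitor definitions to Datadog's v1 Monitors API.
// DatadogMonitor is re-declared here so the sketch is self-contained.
interface DatadogMonitor {
  name: string;
  type: "metric alert" | "log alert" | "apm alert";
  query: string;
  message: string;
  tags: string[];
  thresholds: { critical: number; warning?: number };
}

// Map the local shape to the API body: thresholds move under `options`.
export function toApiPayload(m: DatadogMonitor) {
  return {
    name: m.name,
    type: m.type,
    query: m.query,
    message: m.message,
    tags: m.tags,
    options: { thresholds: m.thresholds },
  };
}

// Assumes DD_API_KEY and DD_APP_KEY are set in the environment.
export async function createMonitor(m: DatadogMonitor): Promise<unknown> {
  const res = await fetch("https://api.datadoghq.com/api/v1/monitor", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "DD-API-KEY": process.env.DD_API_KEY!,
      "DD-APPLICATION-KEY": process.env.DD_APP_KEY!,
    },
    body: JSON.stringify(toApiPayload(m)),
  });
  if (!res.ok) throw new Error(`Monitor create failed: ${res.status}`);
  return res.json();
}
```

Running this in CI after a review gives you the same drift-free workflow as the Terraform provider, with plain TypeScript instead of HCL.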

Best Practices

  1. Import the tracer before anything else — dd-trace monkey-patches modules to auto-instrument them; importing it after express or pg means those libraries will not be traced.
  2. Use unified service tagging — set DD_SERVICE, DD_ENV, and DD_VERSION as environment variables; these propagate to metrics, traces, and logs automatically.
  3. Enable log injection — logInjection: true adds trace and span IDs to every log line, enabling one-click correlation from a trace to its logs.
  4. Connect RUM to APM — use allowedTracingUrls in RUM config so frontend requests are linked to backend traces for full end-to-end visibility.
  5. Define monitors as code — use Terraform or the API to version-control alert definitions; manual dashboard monitors drift and are not reproducible.
  6. Use histograms for latency, not gauges — histograms give you percentiles (p50, p95, p99); gauges only show the latest value.
  7. Tag custom metrics with bounded cardinality — tags like endpoint:/users are fine; tags like user_id:12345 create millions of time series and explode costs.
  8. Set up SLOs — define Service Level Objectives in Datadog to track error budget burn rate and get alerted before the budget is exhausted.

Anti-Patterns

  1. High-cardinality tags on custom metrics — tagging metrics with user IDs, request IDs, or UUIDs creates millions of unique time series, causing cost explosions.
  2. Not instrumenting before imports — dd-trace must be the first import; loading it after HTTP or database libraries means auto-instrumentation silently fails.
  3. Logging at debug level in production — generates enormous log volume, burns through log ingestion quotas, and makes searching for real issues harder.
  4. Manual dashboard creation without code — dashboards created through the UI are hard to version, review, or replicate across environments.
  5. Ignoring the env tag — without it, production and staging metrics mix together, making alerts unreliable and dashboards misleading.
  6. Alerting on averages instead of percentiles — average latency can look fine while the p99 is catastrophic; always alert on percentiles for latency.
  7. Not connecting RUM to backend traces — without allowedTracingUrls, frontend and backend observability remain siloed, losing the end-to-end view.
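Anti-pattern 1 can also be guarded against in code. A hypothetical sanitizer sketch (the UNBOUNDED pattern and sanitizeTags are illustrative helpers, not part of any Datadog API) that drops tag values that look like unbounded identifiers before they reach DogStatsD:

```typescript
// Hypothetical guard: drop tag values that look unbounded
// (UUID prefixes, long numeric IDs) to keep series counts sane.
const UNBOUNDED = /^[0-9a-f]{8}-[0-9a-f]{4}-|^\d{5,}$/i;

export function sanitizeTags(
  tags: Record<string, string>
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(tags)) {
    // Keep bounded values like "endpoint:/users"; skip ID-shaped ones.
    if (!UNBOUNDED.test(value)) out[key] = value;
  }
  return out;
}
```

Put IDs on spans and logs instead, where high cardinality is cheap; reserve metric tags for values with a small, fixed set of possibilities.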

