# Datadog

Datadog: APM, log management, infrastructure monitoring, RUM, custom metrics, dashboards, Node.js tracing.
## Core Philosophy
Datadog is a comprehensive observability platform spanning infrastructure, APM, logs, and real user monitoring. Its guiding principles are:
- **Three pillars in one platform** — metrics, traces, and logs should be correlated in a single view. When a metric spikes, click through to the trace that caused it, then to the log lines within that trace.
- **Custom metrics drive business observability** — infrastructure metrics are table stakes; the real value comes from tracking business KPIs (orders per minute, payment failures, queue depth) as custom metrics.
- **Distributed tracing connects the dots** — in microservice architectures, a single request touches many services. APM traces show the full call chain with latency breakdowns per service.
- **Dashboards are communication tools** — a well-built dashboard answers questions before they are asked. Build them for audiences (engineering, product, leadership), not just for yourself.
- **Tag everything consistently** — tags like `env`, `service`, `version`, and `team` applied uniformly across metrics, traces, and logs enable powerful filtering and correlation.
## Setup

### Node.js APM Tracing
```typescript
// instrument.ts — must be imported BEFORE any other module
import tracer from "dd-trace";

tracer.init({
  service: process.env.DD_SERVICE ?? "api-service",
  env: process.env.DD_ENV ?? "production",
  version: process.env.DD_VERSION ?? "1.0.0",
  logInjection: true,
  runtimeMetrics: true,
  profiling: true,
  appsec: true,
  plugins: true,
});

export default tracer;
```

```typescript
// Entry point: import instrument first
import "./instrument";
import { createServer } from "./server";

createServer();
```
### Structured Logging with Trace Correlation
```typescript
// lib/logger.ts
import winston from "winston";

const logger = winston.createLogger({
  level: "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.DD_SERVICE,
    env: process.env.DD_ENV,
    version: process.env.DD_VERSION,
  },
  transports: [new winston.transports.Console()],
});

export { logger };

// dd-trace's logInjection: true automatically adds
// dd.trace_id, dd.span_id, and dd.service to every log line,
// enabling Datadog to link logs to the originating trace.
```
### Real User Monitoring (RUM)
```typescript
// lib/datadog-rum.ts
import { datadogRum } from "@datadog/browser-rum";

export function initDatadogRUM() {
  datadogRum.init({
    applicationId: process.env.NEXT_PUBLIC_DD_RUM_APP_ID!,
    clientToken: process.env.NEXT_PUBLIC_DD_RUM_CLIENT_TOKEN!,
    site: "datadoghq.com",
    service: "web-app",
    env: process.env.NODE_ENV,
    version: process.env.NEXT_PUBLIC_APP_VERSION,
    sessionSampleRate: 100,
    sessionReplaySampleRate: 20,
    trackUserInteractions: true,
    trackResources: true,
    trackLongTasks: true,
    defaultPrivacyLevel: "mask-user-input",
    allowedTracingUrls: [
      { match: /https:\/\/api\.example\.com/, propagatorTypes: ["tracecontext"] },
    ],
  });
}

export function setRUMUser(user: { id: string; name: string; plan: string }) {
  datadogRum.setUser({
    id: user.id,
    name: user.name,
    plan: user.plan,
  });
}

export function trackRUMAction(name: string, context?: Record<string, unknown>) {
  datadogRum.addAction(name, context);
}
```
## Key Techniques

### Custom Metrics
```typescript
// lib/metrics.ts
import StatsD from "hot-shots";

const dogstatsd = new StatsD({
  host: process.env.DD_AGENT_HOST ?? "localhost",
  port: 8125,
  prefix: "app.",
  globalTags: {
    env: process.env.DD_ENV!,
    service: process.env.DD_SERVICE!,
    version: process.env.DD_VERSION!,
  },
  errorHandler(error) {
    console.error("StatsD error:", error);
  },
});

export const metrics = {
  increment(name: string, tags?: Record<string, string>) {
    dogstatsd.increment(name, 1, formatTags(tags));
  },
  gauge(name: string, value: number, tags?: Record<string, string>) {
    dogstatsd.gauge(name, value, formatTags(tags));
  },
  histogram(name: string, value: number, tags?: Record<string, string>) {
    dogstatsd.histogram(name, value, formatTags(tags));
  },
  timing(name: string, startMs: number, tags?: Record<string, string>) {
    dogstatsd.timing(name, Date.now() - startMs, formatTags(tags));
  },
  async trackAsync<T>(
    name: string,
    fn: () => Promise<T>,
    tags?: Record<string, string>
  ): Promise<T> {
    const start = Date.now();
    try {
      const result = await fn();
      metrics.timing(name, start, { ...tags, status: "success" });
      metrics.increment(`${name}.success`, tags);
      return result;
    } catch (error) {
      metrics.timing(name, start, { ...tags, status: "error" });
      metrics.increment(`${name}.error`, tags);
      throw error;
    }
  },
};

function formatTags(tags?: Record<string, string>): string[] {
  if (!tags) return [];
  return Object.entries(tags).map(([k, v]) => `${k}:${v}`);
}
```
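The `key:value` strings that DogStatsD expects are produced by `formatTags`; the helper is pure, so it can be sanity-checked standalone (a copy of the function above, repeated here for self-containment):

```typescript
// Standalone copy of the formatTags helper from lib/metrics.ts:
// converts a tag record into DogStatsD "key:value" strings.
function formatTags(tags?: Record<string, string>): string[] {
  if (!tags) return [];
  return Object.entries(tags).map(([k, v]) => `${k}:${v}`);
}

console.log(formatTags({ env: "production", service: "api-service" }));
// → [ 'env:production', 'service:api-service' ]
console.log(formatTags());
// → []
```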
### Custom Span Instrumentation
```typescript
// lib/tracing.ts
import tracer from "dd-trace";
import type { Span } from "dd-trace";

export async function traced<T>(
  operationName: string,
  resourceName: string,
  fn: (span: Span) => Promise<T>,
  options?: { service?: string; type?: string; tags?: Record<string, string> }
): Promise<T> {
  return tracer.trace(
    operationName,
    {
      resource: resourceName,
      service: options?.service,
      type: options?.type ?? "custom",
      tags: options?.tags,
    },
    async (span) => {
      try {
        const result = await fn(span);
        return result;
      } catch (error) {
        span.setTag("error", true);
        if (error instanceof Error) {
          span.setTag("error.message", error.message);
          span.setTag("error.stack", error.stack);
        }
        throw error;
      }
    }
  );
}

// Usage (db and stripe are assumed to be initialized elsewhere)
async function processOrder(orderId: string) {
  return traced("process_order", `order:${orderId}`, async (span) => {
    span.setTag("order.id", orderId);
    const order = await traced("fetch_order", "db.query", async () => {
      return db.order.findUnique({ where: { id: orderId } });
    }, { type: "sql" });
    await traced("charge_payment", "payment.stripe", async (paymentSpan) => {
      paymentSpan.setTag("payment.amount", order!.total);
      return stripe.charges.create({ amount: order!.total });
    }, { service: "payment-service" });
    return order;
  });
}
```
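The error-tagging path of the wrapper can be exercised without a running agent by substituting a minimal span double. This is an illustration of the pattern only — `FakeSpan` and `tracedSketch` are stand-ins, not dd-trace APIs:

```typescript
// Minimal stand-in for a span that records tags, mimicking the
// error-handling path of the traced() wrapper above.
class FakeSpan {
  tags: Record<string, unknown> = {};
  setTag(key: string, value: unknown) {
    this.tags[key] = value;
  }
}

async function tracedSketch<T>(
  fn: (span: FakeSpan) => Promise<T>
): Promise<{ span: FakeSpan; error?: Error }> {
  const span = new FakeSpan();
  try {
    await fn(span);
    return { span };
  } catch (error) {
    // Same tagging as the real wrapper: mark the span and attach details.
    span.setTag("error", true);
    if (error instanceof Error) {
      span.setTag("error.message", error.message);
    }
    return { span, error: error as Error };
  }
}

tracedSketch(async () => {
  throw new Error("payment declined");
}).then(({ span }) => {
  console.log(span.tags["error"], span.tags["error.message"]);
  // → true payment declined
});
```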
### Express/Fastify Middleware for Request Metrics
```typescript
// middleware/datadog-metrics.ts
import type { Request, Response, NextFunction } from "express";
import { metrics } from "@/lib/metrics";

export function datadogRequestMetrics(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const start = Date.now();
  res.on("finish", () => {
    const route = req.route?.path ?? req.path;
    const tags = {
      method: req.method,
      route,
      status_code: String(res.statusCode),
      status_class: `${Math.floor(res.statusCode / 100)}xx`,
    };
    metrics.timing("http.request.duration", start, tags);
    metrics.increment("http.request.count", tags);
    if (res.statusCode >= 500) {
      metrics.increment("http.request.error", tags);
    }
  });
  next();
}
```
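The tag construction inside the `finish` handler can be factored into a pure helper, which makes the status-class bucketing easy to verify in isolation (`requestTags` is a sketch, not part of the skill above):

```typescript
// Derive the same tag set the middleware emits, as a pure function.
function requestTags(method: string, route: string, statusCode: number) {
  return {
    method,
    route,
    status_code: String(statusCode),
    status_class: `${Math.floor(statusCode / 100)}xx`, // 503 → "5xx", 201 → "2xx"
  };
}

console.log(requestTags("GET", "/users/:id", 503).status_class); // → 5xx
console.log(requestTags("POST", "/orders", 201).status_class);   // → 2xx
```

Using `req.route?.path` (the route pattern, e.g. `/users/:id`) rather than the raw URL is what keeps the `route` tag's cardinality bounded.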
### Monitor Definitions as Code
```typescript
// monitors/api-latency.ts
// Use with the Datadog Terraform provider or API
interface DatadogMonitor {
  name: string;
  type: "metric alert" | "log alert" | "apm alert";
  query: string;
  message: string;
  tags: string[];
  thresholds: { critical: number; warning?: number };
}

const monitors: DatadogMonitor[] = [
  {
    name: "[API] P99 latency exceeds 2s",
    type: "metric alert",
    query: "avg(last_5m):percentile(trace.express.request.duration, p99){env:production,service:api-service} > 2",
    message: "@slack-engineering @pagerduty-api-team P99 latency is above 2 seconds. Check recent deploys and database query times.",
    tags: ["team:backend", "service:api-service", "env:production"],
    thresholds: { critical: 2, warning: 1.5 },
  },
  {
    name: "[API] Error rate exceeds 5%",
    type: "metric alert",
    query: "sum(last_5m):sum:http.request.error{env:production,service:api-service}.as_count() / sum:http.request.count{env:production,service:api-service}.as_count() * 100 > 5",
    message: "@slack-engineering Error rate has exceeded 5%. Review error traces in APM.",
    tags: ["team:backend", "service:api-service"],
    thresholds: { critical: 5, warning: 2 },
  },
];

export { monitors };
```
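Keeping monitors in code also makes them lintable before they ever reach the Datadog API. A minimal validation sketch — the checks and the `validateMonitor` name are illustrative, not part of any Datadog tooling, and the interface is repeated for self-containment:

```typescript
// Same shape as the DatadogMonitor interface above.
interface DatadogMonitor {
  name: string;
  type: "metric alert" | "log alert" | "apm alert";
  query: string;
  message: string;
  tags: string[];
  thresholds: { critical: number; warning?: number };
}

// Catch common mistakes before applying monitors via Terraform or the API.
function validateMonitor(m: DatadogMonitor): string[] {
  const problems: string[] = [];
  if (m.thresholds.warning !== undefined && m.thresholds.warning >= m.thresholds.critical) {
    problems.push("warning threshold should be below critical");
  }
  if (!m.query.includes("env:")) {
    problems.push("query should be scoped to an env tag");
  }
  if (!m.message.includes("@")) {
    problems.push("message has no notification handle");
  }
  return problems;
}

const ok = validateMonitor({
  name: "[API] P99 latency exceeds 2s",
  type: "metric alert",
  query: "avg(last_5m):p99{env:production} > 2",
  message: "@slack-engineering latency high",
  tags: ["team:backend"],
  thresholds: { critical: 2, warning: 1.5 },
});
console.log(ok); // → []
```

A check like this fits naturally in CI, next to `terraform plan`.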
## Best Practices
- **Import the tracer before anything else** — `dd-trace` monkey-patches modules to auto-instrument them; importing it after `express` or `pg` means those libraries will not be traced.
- **Use unified service tagging** — set `DD_SERVICE`, `DD_ENV`, and `DD_VERSION` as environment variables; these propagate to metrics, traces, and logs automatically.
- **Enable log injection** — `logInjection: true` adds trace and span IDs to every log line, enabling one-click correlation from a trace to its logs.
- **Connect RUM to APM** — use `allowedTracingUrls` in RUM config so frontend requests are linked to backend traces for full end-to-end visibility.
- **Define monitors as code** — use Terraform or the API to version-control alert definitions; manual dashboard monitors drift and are not reproducible.
- **Use histograms for latency, not gauges** — histograms give you percentiles (p50, p95, p99); gauges only show the latest value.
- **Tag custom metrics with bounded cardinality** — tags like `endpoint:/users` are fine; tags like `user_id:12345` create millions of time series and explode costs.
- **Set up SLOs** — define Service Level Objectives in Datadog to track error budget burn rate and get alerted before the budget is exhausted.
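For the cardinality rule in particular, a common trick is to normalize tag values before emitting them. A sketch — the `normalizeEndpointTag` helper and its patterns are assumptions for illustration, not a Datadog API:

```typescript
// Collapse unbounded path segments (numeric IDs, UUIDs) into placeholders
// so the endpoint tag stays within a bounded set of values.
function normalizeEndpointTag(path: string): string {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "/:uuid")
    .replace(/\/\d+(?=\/|$)/g, "/:id");
}

console.log(normalizeEndpointTag("/users/12345"));       // → /users/:id
console.log(normalizeEndpointTag("/orders/42/items/7")); // → /orders/:id/items/:id
```

With this in place, a million distinct user URLs collapse into one `endpoint:/users/:id` series instead of a million.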
## Anti-Patterns
- **High-cardinality tags on custom metrics** — tagging metrics with user IDs, request IDs, or UUIDs creates millions of unique time series, causing cost explosions.
- **Not instrumenting before imports** — `dd-trace` must be the first import; loading it after HTTP or database libraries means auto-instrumentation silently fails.
- **Logging at debug level in production** — generates enormous log volume, burns through log ingestion quotas, and makes searching for real issues harder.
- **Manual dashboard creation without code** — dashboards created through the UI are hard to version, review, or replicate across environments.
- **Ignoring the `env` tag** — without it, production and staging metrics mix together, making alerts unreliable and dashboards misleading.
- **Alerting on averages instead of percentiles** — average latency can look fine while the p99 is catastrophic; always alert on percentiles for latency.
- **Not connecting RUM to backend traces** — without `allowedTracingUrls`, frontend and backend observability remain siloed, losing the end-to-end view.
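The averages-vs-percentiles point is easy to see numerically. A quick sketch with made-up latency samples and a nearest-rank percentile:

```typescript
// 95 fast requests and 5 pathological ones: the average looks healthy
// while the p99 blows through a 2000ms alert threshold.
const samplesMs: number[] = [...Array(95).fill(100), ...Array(5).fill(10_000)];

const mean = samplesMs.reduce((a, b) => a + b, 0) / samplesMs.length;

// Nearest-rank percentile over a sorted copy of the samples.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[rank];
}

console.log(mean);                      // → 595   (under the 2000ms threshold)
console.log(percentile(samplesMs, 99)); // → 10000 (5x over the threshold)
```

An average-based monitor stays green here; a p99-based monitor pages immediately.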
## Related Skills
- **Baselime** — a serverless-native observability platform designed for AWS, unifying logs, traces, and metrics with real-time, contextualized insights for troubleshooting distributed serverless applications.
- **BetterStack** — formerly Better Uptime + Logtail: uptime monitoring, log management, status pages, incident management, and alerting.
- **Checkly** — synthetic monitoring, API checks, browser checks, Playwright-based E2E monitoring, and a monitoring-as-code CLI.
- **Cronitor** — monitoring for background jobs (cron jobs, scheduled tasks, async workers) and APIs, alerting instantly on missed runs, failures, or delays.
- **Grafana Cloud** — a fully managed observability platform unifying metrics (Prometheus/Graphite), logs (Loki), and traces (Tempo) in a single Grafana interface, without the overhead of running your own stack.
- **Highlight.io** — open-source monitoring with session replay, error tracking, logging, tracing, a Next.js SDK, and a self-hosted option.