# Prometheus

Configure Prometheus monitoring with scrape targets, recording rules, and alerting.
You are an expert in Prometheus monitoring. You guide developers through metric instrumentation, scrape configuration, PromQL queries, recording rules, and Alertmanager integration for production-grade monitoring.
## Key Points
- **Cardinality explosion** - Never use unbounded values (user IDs, request IDs, email addresses) as metric labels.
- **Missing `rate()` on counters** - Always apply `rate()` or `increase()` to counters before aggregating; raw counter values are meaningless across restarts.
- **Scraping too frequently** - Sub-5s scrape intervals stress Prometheus and rarely improve insight; 15s is the standard default.
## Quick Example
```typescript
// WRONG - user_id creates unbounded cardinality
const requests = new Counter({
  name: 'requests_total',
  labelNames: ['user_id', 'endpoint'], // user_id will explode series count
});
```

# Prometheus Integration
## Core Philosophy

### Pull-Based Collection

Prometheus scrapes HTTP endpoints on a schedule. Services expose `/metrics` in the Prometheus exposition format. This pull model simplifies service discovery and firewall rules.

### Dimensional Data Model

Every metric is identified by a name and a set of key-value label pairs. Labels enable powerful filtering and aggregation but must be kept low-cardinality.

### Recording Rules for Performance

Precompute expensive queries as recording rules. This reduces dashboard load times and ensures alerting rules evaluate quickly.
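To make the dimensional model concrete: a time series is identified by its metric name plus its label set, so the same labels in any order name the same series. A simplified illustration (not Prometheus internals):

```typescript
// Simplified illustration (assumption): a series is keyed by metric name
// plus its sorted label pairs; Prometheus stores one series per unique combination.
function seriesId(name: string, labels: Record<string, string>): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`)
    .join(',');
  return `${name}{${pairs}}`;
}

// Label order does not matter; the identity is the sorted set:
console.log(seriesId('http_requests_total', { route: '/api', method: 'GET' }));
// http_requests_total{method="GET",route="/api"}
```

Every distinct label-value combination creates a new series, which is why label cardinality drives memory and storage cost.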
## Setup

Prometheus config (`prometheus.yml`):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app'
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Join the pod address with the annotated port; both source labels are
      # needed so the regex yields the two capture groups used in ${1}:${2}.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
```
Instrument a Node.js app with `prom-client`:

```typescript
import express from 'express';
import { Registry, collectDefaultMetrics, Counter, Histogram } from 'prom-client';

const app = express();
const register = new Registry();
collectDefaultMetrics({ register });

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'] as const,
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'] as const,
  buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Express middleware: observe duration and count every finished response
app.use((req, res, next) => {
  const route = req.route?.path || req.path;
  const end = httpRequestDuration.startTimer({ method: req.method, route });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, route, status_code: String(res.statusCode) });
    end();
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
## Key Patterns

**Do:** Use histograms for latency with appropriate buckets

```typescript
const dbQueryDuration = new Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type'] as const,
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// Query percentiles with PromQL:
// histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
```
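Histogram buckets are cumulative counters: each observation increments every `le` bucket at or above its value. A sketch of that mechanic (an assumption-level illustration, not prom-client's internals):

```typescript
// Illustrative model of cumulative histogram buckets (not prom-client internals).
const bounds = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1];
const counts = new Array(bounds.length + 1).fill(0); // final slot is le="+Inf"

function observe(seconds: number): void {
  // Cumulative: every bucket whose upper bound covers the value is incremented.
  bounds.forEach((le, i) => {
    if (seconds <= le) counts[i] += 1;
  });
  counts[bounds.length] += 1; // the +Inf bucket counts all observations
}

observe(0.004); // counted from le=0.005 upward
observe(0.2);   // counted from le=0.5 upward
console.log(counts); // [0, 1, 1, 1, 1, 2, 2, 2]
```

`histogram_quantile()` interpolates within these cumulative buckets, which is why bucket bounds should bracket the latencies you care about (e.g. your SLO threshold).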
**Not:** Using high-cardinality labels on metrics

```typescript
// WRONG - user_id creates unbounded cardinality
const requests = new Counter({
  name: 'requests_total',
  labelNames: ['user_id', 'endpoint'], // user_id will explode series count
});
```
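The series count multiplies across each label's value cardinality, which is why a single unbounded label dominates everything else. Illustrative arithmetic (the figures below are hypothetical):

```typescript
// Hypothetical numbers: series count is the product of label cardinalities.
const seriesCount = (cardinalities: number[]): number =>
  cardinalities.reduce((total, c) => total * c, 1);

// Bounded labels: 5 methods x 50 endpoints x 10 status codes
console.log(seriesCount([5, 50, 10])); // 2500

// One unbounded label (say 1M user_ids) x 50 endpoints
console.log(seriesCount([1_000_000, 50])); // 50000000
```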
**Do:** Write recording rules for dashboard queries

```yaml
# rules/recording.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
```
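Dashboards and ad-hoc tooling can then read the precomputed series through Prometheus's standard `/api/v1/query` endpoint. A hedged sketch (the host and port are assumptions; adjust to your deployment):

```typescript
// Sketch: fetch a recording rule's current value via the Prometheus HTTP API.
// /api/v1/query is the standard instant-query endpoint; localhost:9090 is assumed.
const base = 'http://localhost:9090/api/v1/query';
const expr = 'job:http_requests:rate5m';
const url = `${base}?query=${encodeURIComponent(expr)}`;

async function currentRates(): Promise<unknown[]> {
  const res = await fetch(url);   // GET, instant query evaluated at "now"
  const body = await res.json();  // { status: "success", data: { resultType, result } }
  return body.data.result;        // one sample per distinct job label value
}
```

Because the rule was precomputed every 15s, this query is a cheap series lookup rather than a fan-out over raw `http_requests_total` samples.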
## Common Patterns

### Alerting Rules

```yaml
# rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: HighLatency
        expr: job:http_request_duration:p99 > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s on {{ $labels.job }}"
```
### Alertmanager Routing

```yaml
# alertmanager.yml
route:
  receiver: default
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    webhook_configs:
      - url: http://webhook:5001/
  - name: pagerduty
    pagerduty_configs:
      - service_key: '<key>'
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
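The routing above resolves by first match on the child routes, falling back to the top-level receiver. A simplified model of that dispatch (an illustration only; real Alertmanager walks a nested route tree and honors `continue` flags):

```typescript
// Simplified first-match routing model (assumption: flat routes, no `continue`).
type Route = { match: Record<string, string>; receiver: string };

const childRoutes: Route[] = [
  { match: { severity: 'critical' }, receiver: 'pagerduty' },
  { match: { severity: 'warning' }, receiver: 'slack' },
];

function pickReceiver(labels: Record<string, string>, fallback = 'default'): string {
  const hit = childRoutes.find((r) =>
    Object.entries(r.match).every(([k, v]) => labels[k] === v)
  );
  return hit ? hit.receiver : fallback;
}

console.log(pickReceiver({ alertname: 'HighErrorRate', severity: 'critical' })); // pagerduty
console.log(pickReceiver({ alertname: 'Heartbeat' }));                           // default
```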
### SLO Burn Rate Alert (Multi-Window)

```yaml
- alert: SLOBurnRate
  expr: |
    (
      job:http_error_rate:ratio5m > (14.4 * 0.001)
      and
      sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (job)
        / sum(rate(http_requests_total[1h])) by (job) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical for {{ $labels.job }}"
```
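The 14.4 factor is the standard multi-window burn-rate multiplier from the Google SRE Workbook: for a 30-day SLO window, a burn rate of 14.4 sustained over one hour consumes 2% of the monthly error budget in that hour. A quick check of the arithmetic:

```typescript
// Burn-rate factor derivation (30-day SLO window, page on 2% budget burn per hour).
const sloWindowHours = 30 * 24;  // 720h in the SLO period
const budgetSpent = 0.02;        // page when 2% of the budget burns...
const alertWindowHours = 1;      // ...within one hour

const burnRate = (budgetSpent * sloWindowHours) / alertWindowHours;
console.log(burnRate); // 14.4 -> with a 99.9% SLO, threshold = 14.4 * 0.001 = 1.44% errors
```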
## Anti-Patterns

- **Cardinality explosion** - Never use unbounded values (user IDs, request IDs, email addresses) as metric labels.
- **Missing `rate()` on counters** - Always apply `rate()` or `increase()` to counters before aggregating; raw counter values are meaningless across restarts.
- **Scraping too frequently** - Sub-5s scrape intervals stress Prometheus and rarely improve insight; 15s is the standard default.
- **No retention planning** - Default 15-day retention fills disk fast; configure `--storage.tsdb.retention.time` and use remote write for long-term storage.
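For long-term storage, a `remote_write` block in `prometheus.yml` ships samples to an external store; the endpoint below is a hypothetical placeholder for whatever backend you run (Thanos Receive, Mimir, Cortex, etc.):

```yaml
# prometheus.yml (fragment) - the URL is a hypothetical example endpoint
remote_write:
  - url: http://mimir:9009/api/v1/push
    queue_config:
      max_samples_per_send: 1000
```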
## When to Use

- You need pull-based metrics collection for containerized or Kubernetes workloads.
- You are building SLO-based alerting with multi-window burn rate calculations.
- You want a well-established CNCF metrics backend compatible with Grafana and Thanos.
- You need service discovery for dynamic infrastructure (Kubernetes, Consul, EC2).
- You are instrumenting applications with RED (Rate, Errors, Duration) metrics.
Install this skill directly: `skilldb add observability-services-skills`
## Related Skills

- **Axiom** - Integrate Axiom for log management, analytics, and real-time dashboards.
- **Elastic APM** - Instrument applications with Elastic APM and the ELK Stack for traces, logs, and metrics.
- **Grafana** - Build Grafana dashboards, configure data sources, and set up alerting rules.
- **Honeycomb** - Integrate Honeycomb for event-driven observability with high-cardinality tracing.
- **Jaeger** - Deploy and integrate Jaeger for distributed tracing across microservices.
- **New Relic** - Integrate New Relic APM for application performance monitoring and distributed tracing.