
Prometheus

Configure Prometheus monitoring with scrape targets, recording rules, and alerting.

Quick Summary
You are an expert in Prometheus monitoring. You guide developers through metric instrumentation, scrape configuration, PromQL queries, recording rules, and Alertmanager integration for production-grade monitoring.

## Key Points

- "rules/*.yml"
- job_name: 'node-exporter'
- job_name: 'app'
- name: http_rules
- name: service_alerts
- name: default
- name: pagerduty
- name: slack
- alert: SLOBurnRate
- **Cardinality explosion** - Never use unbounded values (user IDs, request IDs, email addresses) as metric labels.
- **Missing `rate()` on counters** - Always apply `rate()` or `increase()` to counters before aggregating; raw counter values are meaningless across restarts.
- **Scraping too frequently** - Sub-5s scrape intervals stress Prometheus and rarely improve insight; 15s is the standard default.

## Quick Example

```typescript
// WRONG - user_id creates unbounded cardinality
const requests = new Counter({
  name: 'requests_total',
  labelNames: ['user_id', 'endpoint'], // user_id will explode series count
});
```
skilldb get observability-services-skills/Prometheus

Full skill: 236 lines. Paste into your CLAUDE.md or agent config.

Prometheus Integration

You are an expert in Prometheus monitoring. You guide developers through metric instrumentation, scrape configuration, PromQL queries, recording rules, and Alertmanager integration for production-grade monitoring.

Core Philosophy

Pull-Based Collection

Prometheus scrapes HTTP endpoints on a schedule. Services expose /metrics in the Prometheus exposition format. This pull model simplifies service discovery and firewall rules.

Dimensional Data Model

Every metric is identified by a name and a set of key-value label pairs. Labels enable powerful filtering and aggregation but must be kept low-cardinality.
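For example, labels let one metric answer many questions at query time. These PromQL queries assume the `http_requests_total` counter instrumented later in this skill:

```promql
# total request rate per route, aggregated across all instances
sum by (route) (rate(http_requests_total[5m]))

# 5xx error rate for the 'app' job only
sum(rate(http_requests_total{job="app", status_code=~"5.."}[5m]))
```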

Recording Rules for Performance

Precompute expensive queries as recording rules. This reduces dashboard load times and ensures alerting rules evaluate quickly.

Setup

Prometheus config (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app'
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
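The relabel rules above follow the common `prometheus.io/*` annotation convention. A pod opts into scraping with a manifest fragment like this (the port value is illustrative):

```yaml
# Kubernetes pod metadata fragment: the scrape config above keeps only
# pods with prometheus.io/scrape: "true" and rewrites the scrape address
# to use the port named in prometheus.io/port
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
```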

Instrument a Node.js app with prom-client:

import express from 'express';
import { Registry, collectDefaultMetrics, Counter, Histogram } from 'prom-client';

const app = express();

const register = new Registry();
collectDefaultMetrics({ register });

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'] as const,
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'] as const,
  buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Express middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, route: req.route?.path || req.path });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, route: req.route?.path || req.path, status_code: String(res.statusCode) });
    end();
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Key Patterns

Do: Use histograms for latency with appropriate buckets

const dbQueryDuration = new Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type'] as const,
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// Query percentiles with PromQL
// histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))

Not: Using high-cardinality labels on metrics

// WRONG - user_id creates unbounded cardinality
const requests = new Counter({
  name: 'requests_total',
  labelNames: ['user_id', 'endpoint'], // user_id will explode series count
});
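One mitigation sketch (not part of the original skill): normalize unbounded values into a small fixed set before they ever become label values. The helper names and regexes here are illustrative assumptions, not a prom-client API:

```typescript
// Collapse individual status codes into five bounded classes: 1xx..5xx
function statusClass(code: number): string {
  return `${Math.floor(code / 100)}xx`;
}

// Replace numeric IDs and UUID-like path segments with a placeholder so
// /users/123 and /users/456 map to the same time series
function normalizeRoute(path: string): string {
  return path
    .split('/')
    .map((seg) =>
      /^\d+$/.test(seg) || /^[0-9a-f-]{16,}$/i.test(seg) ? ':id' : seg
    )
    .join('/');
}
```

Apply these before calling `inc()` so every label value comes from a known, bounded set; the raw user or request ID belongs in logs or exemplars, not labels.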

Do: Write recording rules for dashboard queries

# rules/recording.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
            /
          sum(rate(http_requests_total[5m])) by (job)

Common Patterns

Alerting Rules

# rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: job:http_request_duration:p99 > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s on {{ $labels.job }}"

Alertmanager Routing

# alertmanager.yml
route:
  receiver: default
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack

receivers:
  - name: default
    webhook_configs:
      - url: http://webhook:5001/
  - name: pagerduty
    pagerduty_configs:
      - service_key: '<key>'
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

SLO Burn Rate Alert (Multi-Window)

- alert: SLOBurnRate
  expr: |
    (
      job:http_error_rate:ratio5m > (14.4 * 0.001)
      and
      sum(rate(http_requests_total{status_code=~"5.."}[1h])) by (job)
        / sum(rate(http_requests_total[1h])) by (job) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical for {{ $labels.job }}"

Anti-Patterns

  • Cardinality explosion - Never use unbounded values (user IDs, request IDs, email addresses) as metric labels.
  • Missing rate() on counters - Always apply rate() or increase() to counters before aggregating; raw counter values are meaningless across restarts.
  • Scraping too frequently - Sub-5s scrape intervals stress Prometheus and rarely improve insight; 15s is the standard default.
  • No retention planning - Default 15-day retention fills disk fast; configure --storage.tsdb.retention.time and use remote write for long-term storage.
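For the retention point, a sketch of the relevant startup flags (values are illustrative; size them to your disk budget):

```
# Prometheus startup flags (illustrative values)
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=200GB
```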

When to Use

  • You need pull-based metrics collection for containerized or Kubernetes workloads.
  • You are building SLO-based alerting with multi-window burn rate calculations.
  • You want a well-established CNCF metrics backend compatible with Grafana and Thanos.
  • You need service discovery for dynamic infrastructure (Kubernetes, Consul, EC2).
  • You are instrumenting applications with RED (Rate, Errors, Duration) metrics.

Install this skill directly: skilldb add observability-services-skills
