
Metrics Collection

Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health

Metrics Collection — Observability

You are an expert in Prometheus and Grafana metrics for building observable systems.

Overview

Metrics are numeric time-series data that quantify system behavior — request rates, error counts, latencies, queue depths, resource utilization. Prometheus is the dominant open-source metrics system, using a pull-based model where it scrapes HTTP endpoints exposing metrics in a text or OpenMetrics format. Grafana provides visualization and dashboarding on top of Prometheus (and other data sources). Together they form the backbone of most modern monitoring stacks.

Core Concepts

  • Metric types: Counter (monotonically increasing), Gauge (can go up or down), Histogram (bucketed distributions), Summary (quantiles computed client-side).
  • Labels: Key-value pairs attached to a metric that create dimensional time series. E.g., http_requests_total{method="GET", status="200"}.
  • Scrape target: An HTTP endpoint (usually /metrics) that Prometheus polls at a configured interval.
  • PromQL: Prometheus Query Language for querying and aggregating time-series data. Supports rate, aggregation, vector matching, and subqueries.
  • Recording rules: Precomputed PromQL expressions stored as new time series to speed up dashboards and alerts.
  • Alerting rules: PromQL expressions that fire alerts when conditions are met, sent to Alertmanager for routing.
  • Cardinality: The total number of unique time series. High cardinality (from high-dimensional labels) is the primary scaling challenge.
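
To make the scrape-target concept concrete, the text exposition format that Prometheus reads from a /metrics endpoint looks roughly like this (sample values are illustrative):

```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="+Inf"} 24588
http_request_duration_seconds_sum 1863.4
http_request_duration_seconds_count 24588
```

Note that histogram buckets are cumulative: each `le` bucket counts all observations at or below its bound, and the `+Inf` bucket always equals the total count.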

Implementation Patterns

Python — prometheus_client

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics at module level
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests currently being processed",
)

# Middleware / decorator usage
def observe_request(method, endpoint, handler):
    IN_PROGRESS.inc()
    start = time.perf_counter()
    try:
        response = handler()
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=response.status_code).inc()
        return response
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=500).inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(time.perf_counter() - start)
        IN_PROGRESS.dec()

# Start a /metrics endpoint on port 9090
start_http_server(9090)

Go — prometheus/client_golang

package main

import (
    "net/http"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    requestCount = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(requestCount, requestDuration)
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    if err := http.ListenAndServe(":9090", nil); err != nil {
        panic(err)
    }
}

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

scrape_configs:
  - job_name: "api-servers"
    metrics_path: /metrics
    static_configs:
      - targets: ["api-1:9090", "api-2:9090"]

  # Kubernetes service discovery
  - job_name: "k8s-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Recording rules

# recording_rules.yml
groups:
  - name: http_request_rates
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
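
The rule_files section of prometheus.yml above also lists alerting_rules.yml, which is not shown; a minimal sketch of such a file (the alert name, threshold, and labels are illustrative) might look like:

```yaml
# alerting_rules.yml
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
```

The `for` clause requires the condition to hold continuously before the alert fires, which suppresses flapping on brief spikes.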

Key PromQL queries

# Request rate per second over the last 5 minutes
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Saturation — CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100

Grafana dashboard JSON model (partial)

{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
          "legendFormat": "{{ endpoint }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p99"
        }
      ]
    }
  ]
}

Core Philosophy

Metrics are the quantitative pulse of your system — the numeric signals that answer "how is the system behaving right now?" and "how has behavior changed over time?" Unlike logs (which are per-event) and traces (which are per-request), metrics are aggregated by design: a counter that says "5,000 requests in the last minute" is cheap to store, fast to query, and immediately meaningful. This aggregation is both the strength and the limitation of metrics. They tell you that something changed but not what happened to any specific request. Metrics should be your first line of detection and your always-on dashboard, complemented by traces and logs for investigation.

Cardinality is the central constraint of any metrics system. Every unique combination of metric name and label values creates a separate time series that Prometheus must scrape, store, and query. A single metric with a high-cardinality label (like user_id or request_path with full URLs) can create millions of time series, consuming memory and degrading query performance until Prometheus crashes. The discipline of metrics instrumentation is choosing labels that create meaningful dimensions for aggregation (method, endpoint, status code) while keeping total series counts bounded. If you need per-user or per-request analysis, that is what traces and logs are for.
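
One common mitigation is to normalize dynamic path segments into route templates before using them as label values, so cardinality tracks the number of routes rather than the number of distinct URLs. A sketch in Python (the specific regex patterns and placeholder names are illustrative, not a standard API):

```python
import re

# Patterns that collapse high-cardinality path segments into bounded
# placeholders. Match the UUID pattern first so a UUID that begins
# with digits is not partially rewritten by the numeric-ID pattern.
_PATTERNS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "/{uuid}"),
    (re.compile(r"/\d+"), "/{id}"),
]

def normalize_endpoint(path: str) -> str:
    """Collapse dynamic segments so label cardinality stays proportional
    to the number of route templates, not distinct URLs observed."""
    for pattern, placeholder in _PATTERNS:
        path = pattern.sub(placeholder, path)
    return path

print(normalize_endpoint("/users/12345/orders/987"))  # /users/{id}/orders/{id}
```

A normalized path like this is safe to use as the `endpoint` label in the instrumentation examples above, since its value set is bounded by the routing table.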

The RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources provide a framework for knowing which metrics matter before you start instrumenting. Without a framework, teams either instrument too little (missing critical signals) or too much (creating cardinality problems and dashboard sprawl). Start with RED for every service and USE for every infrastructure component. Add custom business metrics (order rate, payment success rate, queue depth) only after the foundational signals are in place. A service with good RED metrics and no custom metrics is better instrumented than one with fifty custom metrics and no latency histogram.

Anti-Patterns

  • Unbounded label values. Using full URLs, user IDs, or request IDs as metric labels creates a cardinality explosion that can OOM Prometheus. Label values must come from a small, known set. If you need per-user or per-path analysis, use traces or logs, not metrics.

  • Summaries instead of histograms. Using Summary metrics for latency means quantiles are computed client-side and cannot be aggregated across instances. If you have ten pods and want the p99 across all of them, summaries cannot provide it. Use Histogram metrics, which are aggregatable with histogram_quantile().

  • Alerting on raw counter values. A counter only goes up, so http_requests_total > 10000 is meaningless — it will always be true eventually. Alert on rates (rate(http_requests_total[5m])) which represent the current throughput, not the cumulative total.

  • Mismatched scrape interval and rate window. Using rate(...[15s]) with a 15-second scrape interval produces gaps and unreliable results because rate() needs at least two data points within the window. The rate window must be at least 2x the scrape interval (e.g., rate(...[5m]) with 15s scrape).

  • Dashboard queries without recording rules. Complex PromQL expressions computed on every dashboard load put load on Prometheus proportional to the number of people viewing dashboards. Precompute expensive aggregations as recording rules so dashboards read pre-materialized time series instead of running ad-hoc queries.
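
As a concrete contrast for the counter-alerting anti-pattern above (the thresholds here are illustrative):

```
# Meaningless: a counter only grows, so this eventually always fires
http_requests_total > 10000

# Meaningful: alert on current throughput
sum(rate(http_requests_total[5m])) > 1000
```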

Best Practices

  • Follow the RED method for services: Rate (requests/sec), Errors (error rate), Duration (latency distribution).
  • Follow the USE method for resources: Utilization, Saturation, Errors for CPU, memory, disk, network.
  • Use histograms over summaries for latency. Histograms are aggregatable across instances; summaries are not.
  • Choose bucket boundaries carefully. Align histogram buckets with your SLO thresholds (e.g., if your SLO is p99 < 500ms, have a bucket at 0.5).
  • Control cardinality. Never use unbounded values (user IDs, request IDs, full URLs) as label values. Keep total series count per metric under a few thousand.
  • Use recording rules for dashboard queries. Precompute expensive aggregations so dashboards load instantly and do not overload Prometheus.
  • Name metrics following conventions. Use <namespace>_<name>_<unit> with _total suffix for counters, _seconds / _bytes for units.

Common Pitfalls

  • Cardinality explosion. Adding a user_id label to a counter creates millions of time series and can OOM Prometheus. Use logs or traces for per-user analysis.
  • Using rate() on a gauge. rate() is for counters. For gauges, use the value directly or deriv().
  • Missing le label in histogram aggregations. histogram_quantile requires the le (less-than-or-equal) label to be preserved; aggregating it away produces wrong results.
  • Scrape interval mismatch with rate window. The rate window must be at least 2x the scrape interval (e.g., rate(...[5m]) with a 15s scrape). Using rate(...[15s]) with a 15s scrape yields gaps.
  • Alert on raw values instead of rates. Alerting on http_requests_total > 10000 is meaningless since counters only grow. Alert on rate(http_requests_total[5m]) instead.
