# Metrics Collection
Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health
You are an expert in Prometheus and Grafana metrics for building observable systems.
## Overview

Metrics are numeric time-series data that quantify system behavior — request rates, error counts, latencies, queue depths, resource utilization. Prometheus is the dominant open-source metrics system; it uses a pull-based model in which it scrapes HTTP endpoints that expose metrics in the Prometheus text format or OpenMetrics format. Grafana provides visualization and dashboarding on top of Prometheus (and other data sources). Together they form the backbone of most modern monitoring stacks.
## Core Concepts

- **Metric types**: Counter (monotonically increasing), Gauge (can go up or down), Histogram (bucketed distributions), Summary (quantiles computed client-side).
- **Labels**: Key-value pairs attached to a metric that create dimensional time series. E.g., `http_requests_total{method="GET", status="200"}`.
- **Scrape target**: An HTTP endpoint (usually `/metrics`) that Prometheus polls at a configured interval.
- **PromQL**: Prometheus Query Language for querying and aggregating time-series data. Supports rate, aggregation, vector matching, and subqueries.
- **Recording rules**: Precomputed PromQL expressions stored as new time series to speed up dashboards and alerts.
- **Alerting rules**: PromQL expressions that fire alerts when conditions are met, sent to Alertmanager for routing.
- **Cardinality**: The total number of unique time series. High cardinality (from high-dimensional labels) is the primary scaling challenge.
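To make the scrape target and label concepts concrete, here is what a `/metrics` endpoint returns on the wire in the Prometheus text exposition format. The metric names match the instrumentation examples below; the label values and sample counts are invented for illustration:

```text
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 1027
http_requests_total{method="POST",endpoint="/api/orders",status="500"} 3

# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",endpoint="/api/orders",le="0.1"} 900
http_request_duration_seconds_bucket{method="GET",endpoint="/api/orders",le="0.5"} 1020
http_request_duration_seconds_bucket{method="GET",endpoint="/api/orders",le="+Inf"} 1027
http_request_duration_seconds_sum{method="GET",endpoint="/api/orders"} 112.7
http_request_duration_seconds_count{method="GET",endpoint="/api/orders"} 1027
```

Note that every unique combination of label values is its own time series, and that histogram buckets are cumulative, carried in the `le` (less-than-or-equal) label.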
## Implementation Patterns

### Python — prometheus_client
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics at module level
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests currently being processed",
)

# Middleware / decorator usage
def observe_request(method, endpoint, handler):
    IN_PROGRESS.inc()
    start = time.perf_counter()
    try:
        response = handler()
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=response.status_code).inc()
        return response
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=500).inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(time.perf_counter() - start)
        IN_PROGRESS.dec()

# Start a /metrics endpoint on port 9090
start_http_server(9090)
```
### Go — prometheus/client_golang

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(requestCount, requestDuration)
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```
### Prometheus configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

scrape_configs:
  - job_name: "api-servers"
    metrics_path: /metrics
    static_configs:
      - targets: ["api-1:9090", "api-2:9090"]

  # Kubernetes service discovery
  - job_name: "k8s-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
### Recording rules

```yaml
# recording_rules.yml
groups:
  - name: http_request_rates
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
```
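The `prometheus.yml` above also loads `alerting_rules.yml`, which is not shown. A minimal sketch of one alerting rule — the alert name, threshold, and `for` duration here are illustrative, not recommendations — could look like:

```yaml
# alerting_rules.yml (illustrative)
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRate
        # Fire when the 5xx ratio exceeds 5% continuously for 10 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
```

Rules are evaluated every `evaluation_interval`, and firing alerts are sent to Alertmanager for routing.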
### Key PromQL queries

```promql
# Request rate per second over the last 5 minutes
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Saturation — CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100
```
### Grafana dashboard JSON model (partial)

```json
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
          "legendFormat": "{{ endpoint }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "p99"
        }
      ]
    }
  ]
}
```
## Core Philosophy
Metrics are the quantitative pulse of your system — the numeric signals that answer "how is the system behaving right now?" and "how has behavior changed over time?" Unlike logs (which are per-event) and traces (which are per-request), metrics are aggregated by design: a counter that says "5,000 requests in the last minute" is cheap to store, fast to query, and immediately meaningful. This aggregation is both the strength and the limitation of metrics. They tell you that something changed but not what happened to any specific request. Metrics should be your first line of detection and your always-on dashboard, complemented by traces and logs for investigation.
Cardinality is the central constraint of any metrics system. Every unique combination of metric name and label values creates a separate time series that Prometheus must scrape, store, and query. A single metric with a high-cardinality label (like user_id or request_path with full URLs) can create millions of time series, consuming memory and degrading query performance until Prometheus crashes. The discipline of metrics instrumentation is choosing labels that create meaningful dimensions for aggregation (method, endpoint, status code) while keeping total series counts bounded. If you need per-user or per-request analysis, that is what traces and logs are for.
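The arithmetic behind cardinality is simply the product of each label's distinct-value count. A small sketch (the label counts below are hypothetical, chosen to illustrate the scale difference):

```python
from math import prod

def series_count(labels):
    """Unique time series for one metric name: the product of the
    number of distinct values each label can take."""
    return prod(labels.values())

# Bounded labels keep the series count manageable.
bounded = series_count({"method": 5, "endpoint": 40, "status": 6})
print(bounded)  # 1200 series

# Adding a user_id label with 100k users multiplies everything.
exploded = series_count({"method": 5, "endpoint": 40, "status": 6,
                         "user_id": 100_000})
print(exploded)  # 120000000 series
```

The same metric goes from 1,200 series to 120 million by adding one unbounded label, which is why label sets must be chosen from small, known value sets.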
The RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources provide a framework for knowing which metrics matter before you start instrumenting. Without a framework, teams either instrument too little (missing critical signals) or too much (creating cardinality problems and dashboard sprawl). Start with RED for every service and USE for every infrastructure component. Add custom business metrics (order rate, payment success rate, queue depth) only after the foundational signals are in place. A service with good RED metrics and no custom metrics is better instrumented than one with fifty custom metrics and no latency histogram.
## Anti-Patterns

- **Unbounded label values.** Using full URLs, user IDs, or request IDs as metric labels creates a cardinality explosion that can OOM Prometheus. Label values must come from a small, known set. If you need per-user or per-path analysis, use traces or logs, not metrics.
- **Summaries instead of histograms.** Using Summary metrics for latency means quantiles are computed client-side and cannot be aggregated across instances. If you have ten pods and want the p99 across all of them, summaries cannot provide it. Use Histogram metrics, which are aggregatable with `histogram_quantile()`.
- **Alerting on raw counter values.** A counter only goes up, so `http_requests_total > 10000` is meaningless — it will always be true eventually. Alert on rates (`rate(http_requests_total[5m])`), which represent the current throughput, not the cumulative total.
- **Mismatched scrape interval and rate window.** Using `rate(...[15s])` with a 15-second scrape interval produces gaps and unreliable results because `rate()` needs at least two data points within the window. The rate window must be at least 2x the scrape interval (e.g., `rate(...[5m])` with a 15s scrape).
- **Dashboard queries without recording rules.** Complex PromQL expressions computed on every dashboard load put load on Prometheus proportional to the number of people viewing dashboards. Precompute expensive aggregations as recording rules so dashboards read pre-materialized time series instead of running ad-hoc queries.
## Best Practices

- Follow the RED method for services: Rate (requests/sec), Errors (error rate), Duration (latency distribution).
- Follow the USE method for resources: Utilization, Saturation, Errors for CPU, memory, disk, network.
- Use histograms over summaries for latency. Histograms are aggregatable across instances; summaries are not.
- Choose bucket boundaries carefully. Align histogram buckets with your SLO thresholds (e.g., if your SLO is p99 < 500ms, have a bucket at 0.5).
- Control cardinality. Never use unbounded values (user IDs, request IDs, full URLs) as label values. Keep total series count per metric under a few thousand.
- Use recording rules for dashboard queries. Precompute expensive aggregations so dashboards load instantly and do not overload Prometheus.
- Name metrics following conventions. Use `<namespace>_<name>_<unit>` with a `_total` suffix for counters and `_seconds`/`_bytes` for units.
## Common Pitfalls

- **Cardinality explosion.** Adding a `user_id` label to a counter creates millions of time series and can OOM Prometheus. Use logs or traces for per-user analysis.
- **Using `rate()` on a gauge.** `rate()` is for counters. For gauges, use the value directly or `deriv()`.
- **Missing `le` label in histogram aggregations.** `histogram_quantile` requires the `le` (less-than-or-equal) label to be preserved; aggregating it away produces wrong results.
- **Scrape interval mismatch with rate window.** The rate window must be at least 2x the scrape interval (e.g., `rate(...[5m])` with a 15s scrape). Using `rate(...[15s])` with a 15s scrape yields gaps.
- **Alerting on raw values instead of rates.** Alerting on `http_requests_total > 10000` is meaningless since counters only grow. Alert on `rate(http_requests_total[5m])` instead.
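To see why the `le` label must survive aggregation, here is a simplified re-implementation of the linear interpolation that `histogram_quantile` performs over cumulative bucket counts. This is a sketch of the idea, not Prometheus's actual code (which also handles edge cases such as the `+Inf` bucket and NaN inputs); the bucket data is invented:

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) sorted by bound --
    exactly the per-series data carried by the `le` label. Returns the
    estimated q-quantile by linear interpolation inside the bucket that
    contains the requested rank."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate the rank's position between the bucket's bounds.
            return lower_bound + (upper_bound - lower_bound) * (
                (rank - lower_count) / (count - lower_count)
            )
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 observations: 50 under 0.1s, 90 under 0.5s, all 100 under 1.0s.
p99 = histogram_quantile(0.99, [(0.1, 50), (0.5, 90), (1.0, 100)])
print(p99)  # ~0.95
```

The computation needs the full ladder of cumulative counts per `le` bound; summing buckets without `by (le)` collapses that ladder into a single number, leaving nothing to interpolate over.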
## Related Skills

- **Alerting Strategies**: On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times.
- **Distributed Tracing**: OpenTelemetry distributed tracing patterns for end-to-end request visibility across microservices.
- **Health Checks**: Health check endpoint patterns for liveness, readiness, and startup probes in distributed services.
- **Incident Response**: Incident response and postmortem patterns for structured handling, communication, and learning from production incidents.
- **Log Aggregation**: Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems.
- **SLI / SLO**: SLI, SLO, and error budget patterns for defining and managing service reliability targets.