Monitoring and Observability
Build observability systems using metrics, logs, and traces to understand system behavior.
Core Philosophy
Observability is the ability to understand the internal state of a system by examining its external outputs. While monitoring tells you when something is wrong, observability helps you understand why. The three pillars — metrics, logs, and traces — provide complementary views into system behavior. Metrics show trends and aggregates, logs capture discrete events with context, and traces follow individual requests across service boundaries.
Key Techniques
- RED Method: Monitor Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution) for every service. Covers the essential user-facing health signals.
- USE Method: Monitor Utilization, Saturation, and Errors for every resource (CPU, memory, disk, network). Covers infrastructure health.
- Distributed Tracing: Propagate trace context across service boundaries to reconstruct the full path of a request through a microservices architecture.
- SLO-Based Alerting: Define Service Level Objectives and alert on error budget burn rate rather than arbitrary thresholds, reducing alert fatigue.
- Structured Logging: Emit logs as structured data (JSON) with consistent fields (timestamp, service, trace ID, level) to enable machine parsing and correlation.
- Custom Metrics: Instrument application code to emit business-relevant metrics (orders processed, payments completed) alongside technical metrics.
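The RED method above can be sketched as a small in-process collector. This is an illustration only, with hypothetical names (`RedMetrics`, `observe`, `snapshot`); a real service would use an established metrics library with proper histograms rather than storing raw durations.

```python
import time


class RedMetrics:
    """Minimal in-process RED (Rate, Errors, Duration) collector.

    Illustrative sketch: production code would use a metrics library
    with histogram buckets instead of keeping every raw duration.
    """

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.durations = []  # seconds per request

    def observe(self, handler, *args):
        """Run a request handler, recording count, errors, and latency."""
        start = time.monotonic()
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.requests += 1
            self.durations.append(time.monotonic() - start)

    def snapshot(self, window_seconds):
        """Summarize Rate, Errors, and Duration over a reporting window."""
        ordered = sorted(self.durations)
        p95 = ordered[int(0.95 * (len(ordered) - 1))] if ordered else 0.0
        return {
            "rate": self.requests / window_seconds,
            "error_rate": self.errors / window_seconds,
            "p95_seconds": p95,
        }
```

Wrapping each request through `observe` gives the three RED signals from a single choke point, which keeps instrumentation consistent across handlers.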
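Structured logging as described above can be done with the Python standard library alone. A minimal sketch, assuming a service named "checkout" and a caller-supplied trace ID (both hypothetical):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with consistent fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",  # assumed service name for illustration
            "message": record.getMessage(),
            # trace_id is attached via the `extra` argument at call sites
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

# Each line is machine-parseable and carries the trace ID for correlation.
logger.info("order placed", extra={"trace_id": "abc123"})
```

Because every line carries the same field names, a log pipeline can index on `trace_id` and join logs with traces without regex parsing.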
Best Practices
- Instrument before you need it. Adding observability after an outage is too late.
- Use consistent naming conventions for metrics across all services.
- Set alerts on symptoms (user-visible errors, latency), not causes (CPU usage). High CPU that does not affect users is not an emergency.
- Include a runbook with every alert explaining what the alert means and the first diagnostic steps to take.
- Retain high-resolution metrics for days, downsampled metrics for months, and logs for weeks unless compliance requires longer.
- Correlate metrics, logs, and traces using shared identifiers (trace ID, request ID) to enable seamless debugging workflows.
- Dashboard for understanding, alert for action. Dashboards should tell a story; alerts should be actionable.
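The SLO-based alerting mentioned under Key Techniques comes down to simple arithmetic. A sketch, assuming a 99.9% availability SLO over a 30-day window and the widely used multi-window recipe (thresholds like 14.4 are a common convention, not a universal rule):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    14.4 sustained for 1 hour spends ~2% of a 30-day budget, a common
    paging threshold.
    """
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001
    return error_ratio / budget


def should_page(err_1h, err_5m, slo_target=0.999, threshold=14.4):
    """Page only if both a long and a short window burn fast.

    Requiring both windows filters out brief blips (alert fatigue)
    while still catching sustained budget-destroying error rates.
    """
    return (burn_rate(err_1h, slo_target) >= threshold
            and burn_rate(err_5m, slo_target) >= threshold)
```

The two-window condition is what replaces arbitrary static thresholds: a momentary spike trips only the short window, and a slow-burning regression shows up in both.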
Common Patterns
- Golden Signals Dashboard: A single dashboard per service showing latency, traffic, errors, and saturation — the four signals that matter most.
- On-Call Escalation: Tiered alerting that pages the primary on-call, escalates to secondary after a timeout, and notifies management for extended incidents.
- Anomaly Detection: Use statistical methods or ML to detect unusual patterns in metrics that static thresholds would miss.
- Chaos Engineering Validation: Use controlled failure injection to verify that monitoring detects and alerts on real failure modes.
Anti-Patterns
- Alert fatigue from too many low-priority or flapping alerts. Every ignored alert makes it more likely that a real incident gets missed.
- Dashboard sprawl with hundreds of charts nobody looks at. Curate dashboards for specific audiences and purposes.
- Logging everything at DEBUG level in production, creating massive storage costs and making it impossible to find signal in noise.
- Monitoring only infrastructure metrics while ignoring application-level and business-level signals.
- Not testing alerting pipelines. Verify that alerts actually reach on-call engineers through regular fire drills.
- Treating observability as a separate team's responsibility rather than an integral part of every developer's workflow.
Related Skills
CI/CD Pipelines
Design and maintain continuous integration and continuous delivery pipelines
Cloud Architecture
Design scalable, resilient, and cost-effective systems on cloud platforms
Configuration Management
Manage system configurations consistently across environments using automation
Container Orchestration
Manage containerized applications at scale using orchestration platforms
Cloud Cost Optimization
Reduce and optimize cloud infrastructure spending without sacrificing performance
Incident Management
Coordinate effective incident response from detection through resolution