Monitoring and Observability

Core Philosophy

Observability is the ability to understand the internal state of a system by examining its external outputs. While monitoring tells you when something is wrong, observability helps you understand why. The three pillars — metrics, logs, and traces — provide complementary views into system behavior. Metrics show trends and aggregates, logs capture discrete events with context, and traces follow individual requests across service boundaries.

Key Techniques

  • RED Method: Monitor Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution) for every service. Covers the essential user-facing health signals (see the instrumentation sketch after this list).
  • USE Method: Monitor Utilization, Saturation, and Errors for every resource (CPU, memory, disk, network). Covers infrastructure health.
  • Distributed Tracing: Propagate trace context across service boundaries to reconstruct the full path of a request through a microservices architecture.
  • SLO-Based Alerting: Define Service Level Objectives and alert on error budget burn rate rather than arbitrary thresholds, reducing alert fatigue (see the burn-rate sketch after this list).
  • Structured Logging: Emit logs as structured data (JSON) with consistent fields (timestamp, service, trace ID, level) to enable machine parsing and correlation.
  • Custom Metrics: Instrument application code to emit business-relevant metrics (orders processed, payments completed) alongside technical metrics.
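
A minimal sketch of RED-method instrumentation is shown below, using the Prometheus Python client. The metric names, labels, and simulated handler are illustrative assumptions, not part of this skill.

```python
# Sketch: RED-method instrumentation with the Prometheus Python client.
# Metric names, labels, and the simulated handler are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["service", "endpoint"])
ERRORS = Counter("http_request_errors_total", "Failed requests", ["service", "endpoint"])
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["service", "endpoint"])

def handle_request(endpoint: str) -> None:
    """Record Rate, Errors, and Duration around the real handler."""
    labels = {"service": "checkout", "endpoint": endpoint}
    REQUESTS.labels(**labels).inc()
    with DURATION.labels(**labels).time():
        try:
            time.sleep(random.uniform(0.01, 0.1))   # placeholder for real work
            if random.random() < 0.05:
                raise RuntimeError("simulated failure")
        except Exception:
            ERRORS.labels(**labels).inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        try:
            handle_request("/api/orders")
        except RuntimeError:
            pass
```

From these three series, request rate, error ratio, and latency percentiles can be derived for dashboards and alerts.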
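
As a companion to the SLO bullet above, here is a rough sketch of multi-window burn-rate alerting. The SLO target, window sizes, and 14.4x threshold are illustrative assumptions, loosely modelled on common multiwindow burn-rate policies.

```python
# Sketch: multi-window error-budget burn-rate alerting.
# SLO target, window sizes, and threshold are illustrative assumptions.
SLO_TARGET = 0.999              # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail

def burn_rate(errors: int, total: int) -> float:
    """How fast the error budget is being spent, relative to the SLO."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                threshold: float = 14.4) -> bool:
    """Page only when both a long and a short window burn fast, which
    filters brief blips while still catching sustained problems quickly."""
    return (burn_rate(*long_window) > threshold
            and burn_rate(*short_window) > threshold)

# Example: (errors, total) over a 1h window and a 5m window.
print(should_page((200, 10_000), (30, 1_000)))  # True: budget burning 20-30x too fast
```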

Best Practices

  • Instrument before you need it. Adding observability after an outage is too late.
  • Use consistent naming conventions for metrics across all services.
  • Set alerts on symptoms (user-visible errors, latency), not causes (CPU usage). High CPU that does not affect users is not an emergency.
  • Include runbooks with every alert that explain what the alert means and the first diagnostic steps to take.
  • Retain high-resolution metrics for days, downsampled metrics for months, and logs for weeks unless compliance requires longer.
  • Correlate metrics, logs, and traces using shared identifiers (trace ID, request ID) to enable seamless debugging workflows (see the logging sketch after this list).
  • Dashboard for understanding, alert for action. Dashboards should tell a story; alerts should be actionable.
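
The structured logging and correlation points above can be pictured with a small sketch. The JSON field names, service name, and ContextVar plumbing are assumptions for illustration; a real service would read the trace ID from an incoming header (e.g. W3C traceparent) rather than minting one per request.

```python
# Sketch: structured JSON logs carrying a shared trace ID.
# Field names and ContextVar plumbing are illustrative assumptions.
import json
import logging
import time
import uuid
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request() -> None:
    trace_id_var.set(uuid.uuid4().hex)  # normally propagated from the caller
    logger.info("order received")
    logger.info("payment completed")

handle_request()
```

With the same trace_id attached to spans and log lines, a single identifier is enough to pivot between pillars during an incident.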

Common Patterns

  • Golden Signals Dashboard: A single dashboard per service showing latency, traffic, errors, and saturation — the four signals that matter most.
  • On-Call Escalation: Tiered alerting that pages the primary on-call, escalates to secondary after a timeout, and notifies management for extended incidents.
  • Anomaly Detection: Use statistical methods or ML to detect unusual patterns in metrics that static thresholds would miss (see the sketch after this list).
  • Chaos Engineering Validation: Use controlled failure injection to verify that monitoring detects and alerts on real failure modes.
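
One simple way to picture the anomaly detection pattern is a rolling z-score check. The window size and cutoff below are illustrative assumptions; production systems typically use more robust statistics or ML models, as noted above.

```python
# Sketch: rolling z-score anomaly detection for a single metric series.
# Window size and cutoff are illustrative assumptions.
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 60, cutoff: float = 3.0):
    history: deque = deque(maxlen=window)

    def is_anomalous(value: float) -> bool:
        anomalous = False
        if len(history) == window:              # only judge once the window is full
            mu, sigma = mean(history), stdev(history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > cutoff
        history.append(value)
        return anomalous

    return is_anomalous

detector = make_detector(window=10)
samples = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 250]
print([detector(v) for v in samples])  # only the final spike is flagged
```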

Anti-Patterns

  • Alert fatigue from too many low-priority or flapping alerts. Every ignored alert makes it more likely that a real incident gets missed.
  • Dashboard sprawl with hundreds of charts nobody looks at. Curate dashboards for specific audiences and purposes.
  • Logging everything at DEBUG level in production, creating massive storage costs and making it impossible to find signal in noise.
  • Monitoring only infrastructure metrics while ignoring application-level and business-level signals.
  • Not testing alerting pipelines. Verify that alerts actually reach on-call engineers through regular fire drills.
  • Treating observability as a separate team's responsibility rather than an integral part of every developer's workflow.
