Monitoring and Observability
Build observability systems using metrics, logs, and traces to understand system behavior.
Core Philosophy
Observability is the ability to understand the internal state of a system by examining its external outputs. While monitoring tells you when something is wrong, observability helps you understand why. The three pillars — metrics, logs, and traces — provide complementary views into system behavior. Metrics show trends and aggregates, logs capture discrete events with context, and traces follow individual requests across service boundaries.
Key Techniques
- RED Method: Monitor Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution) for every service. Covers the essential user-facing health signals.
- USE Method: Monitor Utilization, Saturation, and Errors for every resource (CPU, memory, disk, network). Covers infrastructure health.
- Distributed Tracing: Propagate trace context across service boundaries to reconstruct the full path of a request through a microservices architecture.
- SLO-Based Alerting: Define Service Level Objectives and alert on error budget burn rate rather than arbitrary thresholds, reducing alert fatigue.
- Structured Logging: Emit logs as structured data (JSON) with consistent fields (timestamp, service, trace ID, level) to enable machine parsing and correlation.
- Custom Metrics: Instrument application code to emit business-relevant metrics (orders processed, payments completed) alongside technical metrics.
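The RED method above can be sketched as a small in-process collector. This is an illustration only, with hypothetical names (`RedMetrics`, `observe`, `snapshot`); a real service would use an established metrics library with proper histograms rather than storing raw durations.

```python
import time


class RedMetrics:
    """Minimal in-process RED (Rate, Errors, Duration) collector.

    Illustrative sketch: production code would use a metrics library
    with histogram buckets instead of keeping every raw duration.
    """

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.durations = []  # seconds per request

    def observe(self, handler, *args):
        """Run a request handler, recording count, errors, and latency."""
        start = time.monotonic()
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.requests += 1
            self.durations.append(time.monotonic() - start)

    def snapshot(self, window_seconds):
        """Summarize Rate, Errors, and Duration over a reporting window."""
        ordered = sorted(self.durations)
        p95 = ordered[int(0.95 * (len(ordered) - 1))] if ordered else 0.0
        return {
            "rate": self.requests / window_seconds,
            "error_rate": self.errors / window_seconds,
            "p95_seconds": p95,
        }
```

Wrapping each request through `observe` gives the three RED signals from a single choke point, which keeps instrumentation consistent across handlers.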
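Structured logging as described above can be done with the Python standard library alone. A minimal sketch, assuming a service named "checkout" and a caller-supplied trace ID (both hypothetical):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object with consistent fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",  # assumed service name for illustration
            "message": record.getMessage(),
            # trace_id is attached via the `extra` argument at call sites
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("checkout")
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

# Each line is machine-parseable and carries the trace ID for correlation.
logger.info("order placed", extra={"trace_id": "abc123"})
```

Because every line carries the same field names, a log pipeline can index on `trace_id` and join logs with traces without regex parsing.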
Best Practices
- Instrument before you need it. Adding observability after an outage is too late.
- Use consistent naming conventions for metrics across all services.
- Set alerts on symptoms (user-visible errors, latency), not causes (CPU usage). High CPU that does not affect users is not an emergency.
- Include a runbook with every alert explaining what the alert means and the first diagnostic steps to take.
- Retain high-resolution metrics for days, downsampled metrics for months, and logs for weeks unless compliance requires longer.
- Correlate metrics, logs, and traces using shared identifiers (trace ID, request ID) to enable seamless debugging workflows.
- Dashboard for understanding, alert for action. Dashboards should tell a story; alerts should be actionable.
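The SLO-based alerting mentioned under Key Techniques comes down to simple arithmetic. A sketch, assuming a 99.9% availability SLO over a 30-day window and the widely used multi-window recipe (thresholds like 14.4 are a common convention, not a universal rule):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    14.4 sustained for 1 hour spends ~2% of a 30-day budget, a common
    paging threshold.
    """
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001
    return error_ratio / budget


def should_page(err_1h, err_5m, slo_target=0.999, threshold=14.4):
    """Page only if both a long and a short window burn fast.

    Requiring both windows filters out brief blips (alert fatigue)
    while still catching sustained budget-destroying error rates.
    """
    return (burn_rate(err_1h, slo_target) >= threshold
            and burn_rate(err_5m, slo_target) >= threshold)
```

The two-window condition is what replaces arbitrary static thresholds: a momentary spike trips only the short window, and a slow-burning regression shows up in both.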
Common Patterns
- Golden Signals Dashboard: A single dashboard per service showing latency, traffic, errors, and saturation — the four signals that matter most.
- On-Call Escalation: Tiered alerting that pages the primary on-call, escalates to secondary after a timeout, and notifies management for extended incidents.
- Anomaly Detection: Use statistical methods or ML to detect unusual patterns in metrics that static thresholds would miss.
- Chaos Engineering Validation: Use controlled failure injection to verify that monitoring detects and alerts on real failure modes.
Anti-Patterns
- Alert fatigue from too many low-priority or flapping alerts. Every ignored alert makes it more likely that a real incident gets missed.
- Dashboard sprawl with hundreds of charts nobody looks at. Curate dashboards for specific audiences and purposes.
- Logging everything at DEBUG level in production, creating massive storage costs and making it impossible to find signal in noise.
- Monitoring only infrastructure metrics while ignoring application-level and business-level signals.
- Not testing alerting pipelines. Verify that alerts actually reach on-call engineers through regular fire drills.
- Treating observability as a separate team's responsibility rather than an integral part of every developer's workflow.
Related Skills
CI/CD Pipelines
Design and maintain continuous integration and continuous delivery pipelines
Cloud Architecture
Design scalable, resilient, and cost-effective systems on cloud platforms
Configuration Management
Manage system configurations consistently across environments using automation
Container Orchestration
Manage containerized applications at scale using orchestration platforms
Cloud Cost Optimization
Reduce and optimize cloud infrastructure spending without sacrificing performance
Incident Management
Coordinate effective incident response from detection through resolution