# Log Aggregation
Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems
You are an expert in centralized log aggregation for building observable systems.
## Overview
In distributed systems, logs are produced by dozens or hundreds of services across many hosts. Without centralized aggregation, debugging requires SSH-ing into individual machines and grepping through files — a process that does not scale. A log aggregation pipeline collects logs from all sources, ships them to a central store, and provides a query interface for search, filtering, and correlation. The two dominant open-source stacks are the Grafana Loki ecosystem (lightweight, label-indexed) and the Elasticsearch/OpenSearch ecosystem (full-text indexed).
## Core Concepts
- **Log shipper/agent**: A lightweight process on each host that tails log files or reads from stdout/journal and forwards them. Examples: Promtail, Fluent Bit, Fluentd, Filebeat, Vector.
- **Log pipeline**: Intermediate processing — parsing, enriching, filtering, sampling, and routing logs before storage.
- **Log storage/index**: The central store. Loki stores compressed log chunks indexed only by labels. Elasticsearch/OpenSearch provides full-text inverted indexes.
- **Query interface**: Grafana Explore (for Loki), Kibana/OpenSearch Dashboards (for Elasticsearch), or API-based querying.
- **Retention policy**: How long logs are kept. Balances cost, compliance, and debugging needs. Often tiered: hot (fast SSD, 7-14 days), warm (cheaper storage, 30-90 days), cold (object storage, 1+ years).
- **Log levels**: Severity tiers (DEBUG, INFO, WARN, ERROR) that control what gets shipped and how long it is retained.
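The pipeline stages named above (parse, enrich, filter) can be sketched as plain functions run at the agent before anything is forwarded. This is an illustrative Python sketch, not any particular shipper's API; the field names (`level`, `service`, `cluster`, `region`) are assumptions, not a fixed schema:

```python
import json

def parse(raw_line):
    """Parse a JSON log line into structured fields."""
    return json.loads(raw_line)

def enrich(record, cluster, region):
    """Add deployment metadata before shipping."""
    return {**record, "cluster": cluster, "region": region}

def keep(record, min_level="INFO"):
    """Drop entries below the minimum severity."""
    order = ["DEBUG", "INFO", "WARN", "ERROR"]
    return order.index(record.get("level", "INFO")) >= order.index(min_level)

raw = '{"level": "ERROR", "service": "order-service", "msg": "timeout"}'
record = enrich(parse(raw), cluster="production", region="us-east-1")
if keep(record):
    print(json.dumps(record))  # forward the enriched record to the store
```

Real agents express the same stages as configuration (see the Promtail and Fluent Bit examples below); the point is that this work belongs at the edge, not in the central store.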
## Implementation Patterns

### Grafana Loki stack (Promtail + Loki + Grafana)
#### Promtail configuration

```yaml
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Scrape all container logs from Docker
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: log

  # Kubernetes pod logs via service discovery
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      - cri: {}
      # Parse JSON logs and extract structured fields
      - json:
          expressions:
            level: level
            trace_id: trace_id
            service: service
      - labels:
          level:
          service:
```
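When debugging ingestion, it helps to know the body shape Promtail POSTs to the `clients` URL above (`/loki/api/v1/push`): a list of streams, each with a label set and nanosecond-timestamped lines. A minimal Python sketch of that payload, assuming the label names from the config:

```python
import json
import time

def loki_push_payload(labels, lines):
    """Build a Loki push body: one stream, ns-precision epoch timestamps."""
    now_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,  # low-cardinality labels only
                "values": [[now_ns, line] for line in lines],
            }
        ]
    }

body = loki_push_payload(
    {"job": "docker", "service": "order-service"},
    ['{"level": "error", "msg": "timeout"}'],
)
# POST json.dumps(body) with Content-Type: application/json
# to http://loki:3100/loki/api/v1/push to smoke-test ingestion
```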
#### Loki configuration

```yaml
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /loki

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://us-east-1/loki-logs-bucket
    bucketnames: loki-logs-bucket
    region: us-east-1

limits_config:
  retention_period: 30d
  max_query_series: 500
  max_query_parallelism: 16

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
```
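The `limits_config` above applies one global retention period. Recent Loki versions also support per-stream overrides via `retention_stream` (enforced by the compactor), which is one way to implement tiered, level-based retention. A sketch with illustrative selectors and periods:

```yaml
limits_config:
  retention_period: 30d            # default tier
  retention_stream:
    - selector: '{level="error"}'
      priority: 1
      period: 90d                  # keep errors longer
    - selector: '{level="debug"}'
      priority: 2
      period: 24h                  # expire debug quickly
```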
### Fluent Bit for lightweight log shipping

```ini
# fluent-bit.conf
[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    Tag           app.*
    Parser        json
    Mem_Buf_Limit 10MB

[FILTER]
    Name  modify
    Match app.*
    Add   cluster production
    Add   region us-east-1

[FILTER]
    Name    grep
    Match   app.*
    Exclude level DEBUG

[OUTPUT]
    Name        loki
    Match       app.*
    Host        loki.internal
    Port        3100
    Labels      job=app,cluster=production
    Label_keys  $level,$service
    Line_Format json
```
### LogQL queries (Loki query language)

```logql
# All error logs from the order service in the last hour
{service="order-service", level="error"}

# Full-text filter within a label selection
{namespace="production"} |= "timeout" | json | duration_ms > 5000

# Aggregate: error rate per service
sum(rate({level="error"}[5m])) by (service)

# Top 10 most frequent error messages
topk(10, sum(count_over_time({level="error"} | json | pattern `<msg>` [1h])) by (msg))

# Logs correlated to a trace
{service=~"order-service|payment-service"} | json | trace_id = "abc-123-def-456"
```
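These queries can also be run over Loki's HTTP API. A small Python sketch that builds a `/loki/api/v1/query_range` request for the last hour; the host is an assumption, and actually sending the request is left to the caller:

```python
import time
import urllib.parse

def build_query_range_url(base, query, minutes=60, limit=100):
    """Build a Loki /loki/api/v1/query_range URL for the last N minutes."""
    end = time.time_ns()                        # Loki accepts ns epoch timestamps
    start = end - minutes * 60 * 1_000_000_000
    params = urllib.parse.urlencode({
        "query": query,
        "start": start,
        "end": end,
        "limit": limit,
        "direction": "backward",                # newest entries first
    })
    return f"{base}/loki/api/v1/query_range?{params}"

url = build_query_range_url(
    "http://localhost:3100", '{service="order-service", level="error"}'
)
# GET this URL (urllib.request, requests, or curl) to retrieve matching entries
```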
### Docker Compose for local Loki stack

```yaml
services:
  loki:
    image: grafana/loki:2.9.4
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:2.9.4
    command: -config.file=/etc/promtail/config.yaml
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  grafana:
    image: grafana/grafana:10.3.1
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
```
## Core Philosophy
Logs are the narrative record of what your system did — the detailed, human-readable accounting of events, decisions, and errors that no other telemetry signal provides. Metrics tell you that something is wrong; traces tell you where in the request path it happened; logs tell you exactly what happened in the code at that moment. But this richness comes at a cost: logs are the most expensive observability signal to store and query at scale. The art of log aggregation is capturing enough detail to debug any problem without generating so much volume that the logs become unsearchable and unaffordable.
The pipeline model — collect, process, ship, store, query — must be treated as a first-class system with its own monitoring, alerting, and capacity planning. A log pipeline that silently drops entries during peak load is worse than no pipeline at all, because engineers trust that the logs exist and make decisions based on their absence. "I checked the logs and there were no errors" is only meaningful if you can verify the pipeline was healthy during the time window in question. Monitor agent backpressure, ingestion lag, and dropped log counts as carefully as you monitor application metrics.
The choice between Loki (label-indexed, lightweight) and Elasticsearch (full-text indexed, powerful) reflects a fundamental trade-off in log architecture. Loki is cheaper to operate because it indexes only labels, not log content, but this means full-text searches are slower. Elasticsearch provides instant full-text search but requires significant resources for indexing and storage. The right choice depends on your query patterns: if most queries start with "show me errors from service X in the last hour" (label-first), Loki excels. If most queries start with "find all log lines containing this error message across all services" (content-first), Elasticsearch is the better fit.
## Anti-Patterns

- **High-cardinality labels in Loki.** Using `pod_id`, `request_id`, or `user_id` as Loki labels creates millions of streams, degrades ingestion performance, and can crash the cluster. Loki labels should be low-cardinality dimensions (service, namespace, level). Use log line content filters (`|= "user-42"`) for high-cardinality searches.
- **Logging everything at INFO.** Services that emit an INFO log line for every incoming request generate enormous volume that drowns out meaningful events. Reserve INFO for significant state changes (service started, configuration reloaded, batch job completed). Use DEBUG for per-request detail and keep it disabled in production by default.
- **No retention policy.** Storing all logs forever is the default path of least resistance, and it leads to unbounded storage costs and degraded query performance as indexes grow. Define retention policies from day one, tiered by log level: ERROR logs retained longer than DEBUG, with clear cost projections.
- **Unmonitored log pipeline.** A log shipper that crashes, a full disk buffer, or an overwhelmed ingestion endpoint silently drops logs during exactly the moments when you need them most — incidents. Alert on agent health, ingestion lag, and dropped log counts so pipeline failures are detected before they matter.
- **Ignoring multi-line log entries.** Stack traces, multi-line exception messages, and formatted data structures get split into separate log entries by default, making them impossible to read or search. Configure the log shipper's multi-line parser to merge patterns that start with whitespace or match known exception formats.
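The high-cardinality anti-pattern is worth making concrete. Both queries below find one user's logs, but they put the burden in different places; the label and value names are illustrative:

```logql
# Bad: a user_id label creates one stream per user and explodes the index
{service="order-service", user_id="user-42"}

# Good: keep the selector low-cardinality and filter on line content
{service="order-service"} |= "user-42"
```

The second form scans more chunks at query time, but that cost is paid only when you search, not on every ingested line.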
## Best Practices

- **Ship logs to stdout, not files.** Let the container runtime or log agent handle collection. Writing to files inside containers adds complexity and risks filling disk.
- **Use labels sparingly in Loki.** Loki indexes by labels, not full text. High-cardinality labels (like `user_id`) create too many streams and degrade performance. Use filter expressions (`|= "user-42"`) for high-cardinality searches.
- **Parse and enrich at the agent, not the store.** Extract structured fields, drop unnecessary verbosity, and add metadata (cluster, namespace, region) in the shipper pipeline.
- **Implement tiered retention.** Keep ERROR logs longer than DEBUG logs. Use separate Loki tenants or index lifecycle policies in Elasticsearch for different retention tiers.
- **Correlate logs with traces.** Include `trace_id` in every log line so you can jump from a log entry to the full distributed trace in your tracing backend.
- **Set ingestion rate limits.** Protect the log store from runaway logging (e.g., a tight-loop error that produces millions of lines/minute). Use per-tenant or per-service rate limits.
- **Monitor the log pipeline itself.** Alert on agent backpressure, dropped logs, ingestion lag, and storage utilization. A broken log pipeline is invisible until you need the logs.
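Trace correlation can be wired into standard logging. A sketch using a stdlib `logging.Filter` to stamp `trace_id` onto every record; the fixed trace id and service name are placeholders for values you would read from the active span context and deployment config:

```python
import json
import logging
import sys

CURRENT_TRACE_ID = "abc-123-def-456"  # placeholder: normally read from the active span

class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to every log record."""
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True

class JsonFormatter(logging.Formatter):
    """Render one JSON object per line for the shipper to parse."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "service": "order-service",  # illustrative service name
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # ship to stdout, not files
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error("payment gateway timeout")
```

Every line this logger emits carries the same `trace_id` the tracing SDK put on the wire, so a Grafana data link can jump straight from the log entry to the trace.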
## Common Pitfalls

- **Logging too much at high severity.** Services that log INFO for every request generate enormous volume. Reserve INFO for significant state changes; use DEBUG for per-request detail and disable it in production by default.
- **High-cardinality labels in Loki.** Using `pod_id` or `request_id` as a Loki label creates millions of streams. Use log line content filters instead.
- **No retention policy.** Unbounded log retention fills storage and increases costs linearly. Define retention from day one.
- **Ignoring multi-line logs.** Stack traces and multi-line exceptions get split into separate log entries. Configure the shipper to merge multi-line patterns (e.g., Fluent Bit's `multiline` parser).
- **Single point of failure in the pipeline.** If the log aggregator goes down and agents have no local buffer, logs are lost during the outage. Use agents with disk-backed buffers (Fluent Bit's `storage.type filesystem`).
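For the multi-line pitfall, Fluent Bit's multiline parser (available since v1.8) can stitch stack traces back together at the agent. A sketch; the regexes assume timestamped log lines and Java-style traces, so adjust them to your log format:

```ini
# parsers_multiline.conf
[MULTILINE_PARSER]
    name          java_stack
    type          regex
    flush_timeout 1000
    # A new entry starts with a timestamp; continuation lines are indented
    rule      "start_state"  "/^\d{4}-\d{2}-\d{2}/"    "cont"
    rule      "cont"         "/^\s+(at\s|Caused by)/"  "cont"

# In fluent-bit.conf, reference it from the tail input:
# [INPUT]
#     Name              tail
#     Path              /var/log/app/*.log
#     multiline.parser  java_stack
```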
## Related Skills

- **Alerting Strategies**: On-call alerting strategies for actionable, low-noise alert systems that reduce fatigue and improve response times
- **Distributed Tracing**: OpenTelemetry distributed tracing patterns for end-to-end request visibility across microservices
- **Health Checks**: Health check endpoint patterns for liveness, readiness, and startup probes in distributed services
- **Incident Response**: Incident response and postmortem patterns for structured handling, communication, and learning from production incidents
- **Metrics Collection**: Prometheus and Grafana metrics collection patterns for monitoring application and infrastructure health
- **SLI/SLO**: SLI, SLO, and error budget patterns for defining and managing service reliability targets