Log Aggregation

Centralized log aggregation patterns for collecting, indexing, and querying logs across distributed systems

Log Aggregation — Observability

You are an expert in centralized log aggregation for building observable systems.

Overview

In distributed systems, logs are produced by dozens or hundreds of services across many hosts. Without centralized aggregation, debugging requires SSH-ing into individual machines and grepping through files — a process that does not scale. A log aggregation pipeline collects logs from all sources, ships them to a central store, and provides a query interface for search, filtering, and correlation. The dominant open-source choices are the Grafana Loki stack (lightweight, label-indexed) and the Elasticsearch/OpenSearch stack (full-text indexed).

Core Concepts

  • Log shipper/agent: A lightweight process on each host that tails log files or reads from stdout/journal and forwards them. Examples: Promtail, Fluent Bit, Fluentd, Filebeat, Vector.
  • Log pipeline: Intermediate processing — parsing, enriching, filtering, sampling, and routing logs before storage.
  • Log storage/index: The central store. Loki stores compressed log chunks indexed only by labels. Elasticsearch/OpenSearch provides full-text inverted indexes.
  • Query interface: Grafana Explore (for Loki), Kibana/OpenSearch Dashboards (for Elasticsearch), or API-based querying.
  • Retention policy: How long logs are kept. Balances cost, compliance, and debugging needs. Often tiered: hot (fast SSD, 7-14 days), warm (cheaper storage, 30-90 days), cold (object storage, 1+ years).
  • Log levels: Severity tiers (DEBUG, INFO, WARN, ERROR) that control what gets shipped and how long it is retained.
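
A representative structured log line, shown here for reference, is what the configurations below operate on. The field names (level, service, trace_id, duration_ms) are illustrative and match the examples in this skill; they are not required by any particular tool:

{"timestamp": "2024-05-01T12:34:56.789Z", "level": "error", "service": "order-service", "trace_id": "abc-123-def-456", "message": "payment authorization timed out", "duration_ms": 5231}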

Implementation Patterns

Grafana Loki stack (Promtail + Loki + Grafana)

Promtail configuration

# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Scrape all container logs from Docker
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: log

  # Kubernetes pod logs via service discovery
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      - cri: {}
      # Parse JSON logs and extract structured fields
      - json:
          expressions:
            level: level
            trace_id: trace_id
            service: service
      - labels:
          level:
          service:

Loki configuration

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /loki

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://us-east-1/loki-logs-bucket
    bucketnames: loki-logs-bucket
    region: us-east-1

limits_config:
  retention_period: 30d
  max_query_series: 500
  max_query_parallelism: 16

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
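
Tiered retention per log level (Loki)

The tiered retention described under Core Concepts can be expressed as per-stream overrides in limits_config, provided the compactor has retention_enabled: true as above and level is promoted to a label by the shipper (as in the Promtail config). A minimal sketch; the selectors and periods are illustrative, not recommendations:

# loki-config.yaml (excerpt): keep errors longer than debug noise
limits_config:
  retention_period: 30d            # default for streams not matched below
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 24h                  # drop DEBUG after a day
    - selector: '{level="error"}'
      priority: 1
      period: 2160h                # keep ERROR for ~90 days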

Fluent Bit for lightweight log shipping

# fluent-bit.conf
[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name         tail
    Path         /var/log/app/*.log
    Tag          app.*
    Parser       json
    Mem_Buf_Limit 10MB

[FILTER]
    Name         modify
    Match        app.*
    Add          cluster production
    Add          region us-east-1

[FILTER]
    Name         grep
    Match        app.*
    Exclude      level DEBUG

[OUTPUT]
    Name         loki
    Match        app.*
    Host         loki.internal
    Port         3100
    Labels       job=app,cluster=production
    Label_keys   $level,$service
    Line_Format  json
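
Multi-line log handling (Fluent Bit)

Stack traces and other multi-line entries can be merged before shipping by defining a multiline parser and referencing it from the tail input. A sketch assuming Java-style stack traces; the parser name and regex rules are illustrative and must be adapted to your log format:

# parsers.conf: lines starting with a timestamp open a record, indented lines continue it
[MULTILINE_PARSER]
    name          java_stack
    type          regex
    flush_timeout 1000
    rule          "start_state"  "/^\d{4}-\d{2}-\d{2}/"     "cont"
    rule          "cont"         "/^\s+(at |Caused by)/"    "cont"

# fluent-bit.conf: reference the multiline parser from the tail input
[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Tag               app.*
    multiline.parser  java_stack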

LogQL queries (Loki query language)

# All error logs from the order service in the last hour
{service="order-service", level="error"}

# Full-text filter within a label selection
{namespace="production"} |= "timeout" | json | duration_ms > 5000

# Aggregate: error rate per service
sum(rate({level="error"}[5m])) by (service)

# Top 10 most frequent error messages (assumes a JSON "message" field)
topk(10, sum by (message) (count_over_time({level="error"} | json [1h])))

# Logs correlated to a trace
{service=~"order-service|payment-service"} | json | trace_id = "abc-123-def-456"

Docker Compose for local Loki stack

services:
  loki:
    image: grafana/loki:2.9.4
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:2.9.4
    command: -config.file=/etc/promtail/config.yaml
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro

  grafana:
    image: grafana/grafana:10.3.1
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
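
Grafana datasource provisioning for log-to-trace links

To make trace_id correlation clickable in Grafana Explore, the Loki datasource can be provisioned with a derived field that links matched IDs to a tracing backend. A sketch assuming a Tempo datasource with UID tempo; the regex must match how trace_id actually appears in your log lines:

# grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":\s*"([\w-]+)"'
          datasourceUid: tempo        # assumed UID of an existing Tempo datasource
          url: '$${__value.raw}'      # $$ escapes env-var interpolation in provisioning files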

Core Philosophy

Logs are the narrative record of what your system did — the detailed, human-readable accounting of events, decisions, and errors that no other telemetry signal provides. Metrics tell you that something is wrong; traces tell you where in the request path it happened; logs tell you exactly what happened in the code at that moment. But this richness comes at a cost: logs are the most expensive observability signal to store and query at scale. The art of log aggregation is capturing enough detail to debug any problem without generating so much volume that the logs become unsearchable and unaffordable.

The pipeline model — collect, process, ship, store, query — must be treated as a first-class system with its own monitoring, alerting, and capacity planning. A log pipeline that silently drops entries during peak load is worse than no pipeline at all, because engineers trust that the logs exist and make decisions based on their absence. "I checked the logs and there were no errors" is only meaningful if you can verify the pipeline was healthy during the time window in question. Monitor agent backpressure, ingestion lag, and dropped log counts as carefully as you monitor application metrics.
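
As a concrete sketch of that monitoring, the alert rules below (Prometheus format) assume Promtail's and Loki's own metrics are being scraped; the metric names exist in current releases, but verify them against the versions you run:

# alert when the shipper drops entries or Loki rejects samples
groups:
  - name: log-pipeline
    rules:
      - alert: LogShipperDroppingEntries
        expr: sum by (host) (rate(promtail_dropped_entries_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Promtail on {{ $labels.host }} is dropping log entries"
      - alert: LokiDiscardingSamples
        expr: sum by (reason) (rate(loki_discarded_samples_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki is discarding samples ({{ $labels.reason }})"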

The choice between Loki (label-indexed, lightweight) and Elasticsearch (full-text indexed, powerful) reflects a fundamental trade-off in log architecture. Loki is cheaper to operate because it indexes only labels, not log content, but this means full-text searches are slower. Elasticsearch provides instant full-text search but requires significant resources for indexing and storage. The right choice depends on your query patterns: if most queries start with "show me errors from service X in the last hour" (label-first), Loki excels. If most queries start with "find all log lines containing this error message across all services" (content-first), Elasticsearch is the better fit.
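
In practice the two query styles look like this; index, label, and field names are illustrative:

# Label-first (Loki/LogQL): narrow by indexed labels, then grep the content
{service="order-service", level="error"} |= "connection reset"

# Content-first (Elasticsearch): full-text search across everything, then facet by service
GET /logs-*/_search
{
  "query": { "match_phrase": { "message": "connection reset by peer" } },
  "aggs": { "by_service": { "terms": { "field": "service.keyword" } } }
}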

Anti-Patterns

  • High-cardinality labels in Loki. Using pod_id, request_id, or user_id as Loki labels creates millions of streams, degrades ingestion performance, and can crash the cluster. Loki labels should be low-cardinality dimensions (service, namespace, level). Use log line content filters (|= "user-42") for high-cardinality searches.

  • Logging everything at INFO. Services that emit an INFO log line for every incoming request generate enormous volume that drowns out meaningful events. Reserve INFO for significant state changes (service started, configuration reloaded, batch job completed). Use DEBUG for per-request detail and keep it disabled in production by default.

  • No retention policy. Storing all logs forever is the default path of least resistance and it leads to unbounded storage costs and degraded query performance as indexes grow. Define retention policies from day one, tiered by log level: ERROR logs retained longer than DEBUG, with clear cost projections.

  • Unmonitored log pipeline. A log shipper that crashes, a full disk buffer, or an overwhelmed ingestion endpoint silently drops logs during exactly the moments when you need them most — incidents. Alert on agent health, ingestion lag, and dropped log counts so pipeline failures are detected before they matter.

  • Ignoring multi-line log entries. Stack traces, multi-line exception messages, and formatted data structures get split into separate log entries by default, making them impossible to read or search. Configure the log shipper's multi-line parser to merge lines that start with whitespace or match known exception formats (see the Fluent Bit sketch under Implementation Patterns above).

Best Practices

  • Ship logs to stdout, not files. Let the container runtime or log agent handle collection. Writing to files inside containers adds complexity and risks filling disk.
  • Use labels sparingly in Loki. Loki indexes by labels, not full text. High-cardinality labels (like user_id) create too many streams and degrade performance. Use filter expressions (|= "user-42") for high-cardinality searches.
  • Parse and enrich at the agent, not the store. Extract structured fields, drop unnecessary verbosity, and add metadata (cluster, namespace, region) in the shipper pipeline.
  • Implement tiered retention. Keep ERROR logs longer than DEBUG logs. Use separate Loki tenants or index lifecycle policies in Elasticsearch for different retention tiers.
  • Correlate logs with traces. Include trace_id in every log line so you can jump from a log entry to the full distributed trace in your tracing backend.
  • Set ingestion rate limits. Protect the log store from runaway logging (e.g., a tight-loop error that produces millions of lines/minute). Use per-tenant or per-service rate limits.
  • Monitor the log pipeline itself. Alert on agent backpressure, dropped logs, ingestion lag, and storage utilization. A broken log pipeline is invisible until you need the logs.
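
The rate-limit practice above maps directly onto Loki's limits_config. A sketch with illustrative values, not sizing recommendations:

# loki-config.yaml (excerpt): protect the store from runaway producers
limits_config:
  ingestion_rate_mb: 10              # per-tenant average ingest rate
  ingestion_burst_size_mb: 20        # allowance for short bursts
  per_stream_rate_limit: 3MB         # cap a single noisy stream
  per_stream_rate_limit_burst: 10MB
  max_streams_per_user: 10000        # guard against label-cardinality explosions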

Common Pitfalls

  • Logging too much at high severity. Services that log INFO for every request generate enormous volume. Reserve INFO for significant state changes; use DEBUG for per-request detail and disable it in production by default.
  • High-cardinality labels in Loki. Using pod_id or request_id as a Loki label creates millions of streams. Use log line content filters instead.
  • No retention policy. Unbounded log retention fills storage and increases costs linearly. Define retention from day one.
  • Ignoring multi-line logs. Stack traces and multi-line exceptions get split into separate log entries. Configure the shipper to merge multi-line patterns (e.g., Fluent Bit's multiline parser).
  • Single-point-of-failure in the pipeline. If the log aggregator goes down and agents have no local buffer, logs are lost during the outage. Use agents with disk-backed buffers (Fluent Bit storage.type filesystem).
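
A sketch of the disk-backed buffering mentioned in the last pitfall, using Fluent Bit's filesystem storage; paths and size limits are illustrative:

# fluent-bit.conf (excerpt): buffer chunks on disk so logs survive an aggregator outage
[SERVICE]
    storage.path              /var/lib/fluent-bit/buffer
    storage.sync              normal
    storage.backlog.mem_limit 50MB

[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    Tag           app.*
    storage.type  filesystem

[OUTPUT]
    Name                     loki
    Match                    app.*
    Host                     loki.internal
    Port                     3100
    storage.total_limit_size 1G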

Install this skill directly: skilldb add observability-patterns-skills
