Grafana

Build Grafana dashboards, configure data sources, and set up alerting rules.

## Quick Summary

You are an expert in Grafana dashboarding and Grafana Cloud. You help developers create effective visualizations, configure data sources, design alert rules, and build dashboard-as-code workflows using provisioning and Terraform.

## Key Points

- **Too many panels per dashboard** - Keep dashboards focused; split into overview and detail dashboards linked with drill-down.
- **No variable templates** - Every dashboard should use variables for namespace, service, and environment to stay reusable.
- **Alerting on raw metrics without a `for` duration** - Brief spikes cause alert storms; set a `for` period of 2-5 minutes so the condition must persist before the alert fires.
- **Manual dashboard creation in production** - Use provisioning YAML or Terraform; manual dashboards drift and get lost on upgrades.
- You need a central visualization layer for Prometheus, Loki, Tempo, or any supported data source.
- You are implementing SLO dashboards with RED or USE method panels.
- You want unified alerting across multiple data sources with a single notification pipeline.
- You need dashboard-as-code workflows for reproducible infrastructure.

## Quick Example

```bash
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:10.4.0
```

```promql
rate(http_requests_total{namespace=~"$namespace"}[5m])
```

skilldb get observability-services-skills/Grafana

# Grafana Integration

You are an expert in Grafana dashboarding and Grafana Cloud. You help developers create effective visualizations, configure data sources, design alert rules, and build dashboard-as-code workflows using provisioning and Terraform.

## Core Philosophy

### Dashboards Tell Stories

Every dashboard should answer a specific question. Start with the user journey or SLO, then add panels that diagnose failures. Avoid wall-of-charts dashboards with no narrative.

### Data Source Abstraction

Grafana connects to Prometheus, Loki, Tempo, Elasticsearch, PostgreSQL, and dozens more. Design dashboards that leverage the right source for each signal type.

### Alerting Close to the Data

Grafana-managed alert rules run their queries against the configured data sources on a schedule, evaluate the result, and route notifications through contact points. Prefer Grafana-managed alerts for unified rule management across data sources.

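## Setup

Run Grafana with Docker, mounting the local `provisioning/` and `dashboards/` directories (assumed here to sit in the current working directory) so the files described in this section are loaded at startup:

```bash
# Local testing only: change the admin password before exposing this instance.
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  -v grafana-storage:/var/lib/grafana \
  -v "$(pwd)/provisioning:/etc/grafana/provisioning" \
  -v "$(pwd)/dashboards:/var/lib/grafana/dashboards" \
  grafana/grafana-oss:10.4.0
```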

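Provision data sources via YAML (`provisioning/datasources/default.yaml`). Each data source gets an explicit `uid` so that other configuration, such as trace-to-logs links and alert rules, can reference it reliably:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus          # referenced by alert rules via datasourceUid
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s      # should match the Prometheus scrape interval
  - name: Loki
    type: loki
    uid: loki                # referenced by Tempo's tracesToLogsV2 below
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki  # must equal the uid of the Loki data source
        filterByTraceID: true
      nodeGraph:
        enabled: true
```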
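Cross-references such as `tracesToLogsV2.datasourceUid` only resolve when some data source declares that `uid`. The following is a minimal Python sketch of a pre-deploy check. It works on plain dictionaries that mirror the YAML structure (loading the file itself is left out), and the function name `find_dangling_uids` and the sample data are hypothetical.

```python
def find_dangling_uids(datasources):
    """Return (data source name, referenced uid) pairs that match no declared uid."""
    declared = {ds["uid"] for ds in datasources if "uid" in ds}
    dangling = []

    def walk(name, node):
        # Recursively visit jsonData and collect unresolved datasourceUid values.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "datasourceUid" and value not in declared:
                    dangling.append((name, value))
                else:
                    walk(name, value)
        elif isinstance(node, list):
            for item in node:
                walk(name, item)

    for ds in datasources:
        walk(ds["name"], ds.get("jsonData", {}))
    return dangling


# Hypothetical sample: Loki declares no uid, so Tempo's reference cannot resolve.
sources = [
    {"name": "Prometheus", "type": "prometheus", "uid": "prometheus"},
    {"name": "Loki", "type": "loki"},
    {"name": "Tempo", "type": "tempo",
     "jsonData": {"tracesToLogsV2": {"datasourceUid": "loki"}}},
]
print(find_dangling_uids(sources))  # [('Tempo', 'loki')]
```

Running a check like this in CI is one way to catch a broken trace-to-logs link before Grafana starts with it.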

Provision dashboards (`provisioning/dashboards/default.yaml`):

```yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
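The file provider loads dashboard JSON files found under `path`, and with `foldersFromFilesStructure` enabled the subdirectory name becomes the Grafana folder. As a rough illustration of a dashboard-as-code step, the Python sketch below writes a minimal dashboard file into such a layout. The helper `write_dashboard`, the `services` subdirectory, and the reduced set of JSON fields are assumptions for the example; real dashboards carry many more fields.

```python
import json
import tempfile
from pathlib import Path


def write_dashboard(directory, uid, title, panels):
    """Write a minimal dashboard JSON file named after its uid and return the path."""
    dashboard = {"uid": uid, "title": title, "panels": panels}
    path = Path(directory) / f"{uid}.json"
    path.write_text(json.dumps(dashboard, indent=2))
    return path


# A temporary directory stands in for /var/lib/grafana/dashboards.
root = Path(tempfile.mkdtemp())
services = root / "services"  # subdirectory that would map to a "services" folder
services.mkdir()
path = write_dashboard(services, "service-overview", "Service Overview", panels=[])
print(path.relative_to(root))
```

Keeping the `uid` stable and deriving the filename from it makes reloads update the same dashboard instead of creating a new one.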

## Key Patterns

### Do: Use template variables for reusable dashboards

```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up, namespace)",
        "refresh": 2,
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
```

Then reference `$namespace` in panel queries:

```promql
rate(http_requests_total{namespace=~"$namespace"}[5m])
```
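The query uses the regex matcher `=~` because a multi-value variable expands to an alternation such as `(production|staging)` for Prometheus data sources. The Python sketch below imitates that expansion in a simplified way; it is not Grafana's implementation, the escaping rules differ in detail, and `interpolate_regex` is a hypothetical name.

```python
import re

QUERY = 'rate(http_requests_total{namespace=~"$namespace"}[5m])'


def interpolate_regex(query, name, values):
    """Replace $name with a regex alternation built from the selected values."""
    if not values:
        raise ValueError("at least one value is required")
    escaped = [re.escape(v) for v in values]
    replacement = escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"
    # A function replacement avoids backslash processing in the substituted text.
    return re.sub(r"\$" + re.escape(name) + r"\b", lambda _: replacement, query)


print(interpolate_regex(QUERY, "namespace", ["production", "staging"]))
# rate(http_requests_total{namespace=~"(production|staging)"}[5m])
```

With an equality matcher (`=`) the expanded alternation would be compared as a literal string and match nothing, which is why multi-value variables need `=~`.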

### Don't: Hardcode label values in every panel query

```promql
# WRONG - brittle, not reusable
rate(http_requests_total{namespace="production", service="api"}[5m])
```

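### Do: Link traces, logs, and metrics panels together

In a `timeseries` panel, data links belong under `fieldConfig.defaults.links`. Aggregating by `service` as well as `le` keeps the `service` label on each series, so `${__field.labels.service}` has a value to resolve:

```json
{
  "panels": [{
    "title": "Request Latency",
    "type": "timeseries",
    "datasource": "Prometheus",
    "targets": [{
      "expr": "histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket{service=\"$service\"}[5m])))"
    }],
    "fieldConfig": {
      "defaults": {
        "links": [{
          "title": "View traces",
          "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"queryType\":\"traceqlSearch\",\"filters\":[{\"id\":\"service-name\",\"value\":[\"${__field.labels.service}\"]}]}]}"
        }]
      }
    }
  }]
}
```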
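The `url` in the data link embeds raw JSON in a query string, which is fragile once quotes and braces pass through a browser. One option is to percent-encode the JSON while leaving the `${...}` placeholder untouched so Grafana can still substitute it. The Python sketch below shows that idea. The `left` parameter and the filter shape are copied from the example above and may differ between Grafana versions; `explore_link` and the placeholder token are hypothetical.

```python
import json
from urllib.parse import quote, unquote

PLACEHOLDER = "SERVICE_PLACEHOLDER"  # hypothetical token, swapped for the template variable


def explore_link(datasource, template_var):
    """Build an Explore URL with percent-encoded state and an unencoded template variable."""
    state = {
        "datasource": datasource,
        "queries": [{
            "queryType": "traceqlSearch",
            "filters": [{"id": "service-name", "value": [PLACEHOLDER]}],
        }],
    }
    encoded = quote(json.dumps(state, separators=(",", ":")), safe="")
    # The token contains only unreserved characters, so it survives encoding intact.
    return "/explore?left=" + encoded.replace(PLACEHOLDER, template_var)


link = explore_link("Tempo", "${__field.labels.service}")
print(link)
```

Generating the link in code also keeps the escaped JSON out of hand-edited dashboard files, where a missing backslash is easy to overlook.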

## Common Patterns

### RED Method Dashboard (Rate, Errors, Duration)

```promql
# Rate
sum(rate(http_requests_total{service="$service"}[5m]))

# Errors (ratio of 5xx responses to all responses)
sum(rate(http_requests_total{service="$service", status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total{service="$service"}[5m]))

# Duration (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
```
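To make the Duration query less of a black box: `histogram_quantile` estimates a quantile by locating the cumulative bucket that contains the target rank and interpolating linearly inside it. The Python sketch below is a simplified re-implementation for classic histograms, written for illustration only. It assumes `0 < q <= 1`, at least one finite bucket plus the `+Inf` bucket, and a lowest bucket that starts at 0; the sample bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs, +Inf included."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Made-up per-second rates for le="0.1", "0.5", "1" and "+Inf".
sample = [(0.1, 60.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, sample))
```

The interpolation step is why bucket boundaries matter: a p99 that falls into a wide bucket can only be as precise as that bucket's width.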

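### Grafana Alerting Rule

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: SLO Alerts
    folder: Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus   # must match the uid of the Prometheus data source
            relativeTimeRange:
              from: 600                 # seconds before now
              to: 0
            model:
              refId: A
              instant: true             # single value per series, so the threshold can use it directly
              expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: A             # input query for the threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.01]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```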
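The combination of `interval: 1m` and `for: 5m` means a breach must be seen on consecutive evaluations spanning five minutes before the rule fires. The Python sketch below models that pending period in a simplified form. It is an illustration of the concept rather than Grafana's scheduler, and `alert_states` is a hypothetical helper.

```python
def alert_states(breaches, for_evals):
    """Map per-evaluation breach flags to Normal, Pending, or Alerting.

    for_evals is the `for` duration divided by the evaluation interval
    (5 for `for: 5m` with `interval: 1m`).
    """
    states = []
    streak = 0  # consecutive breaching evaluations so far
    for breached in breaches:
        streak = streak + 1 if breached else 0
        if streak == 0:
            states.append("Normal")
        elif streak > for_evals:
            # The condition has now held for the full `for` duration.
            states.append("Alerting")
        else:
            states.append("Pending")
    return states


# A two-minute spike recovers on its own; a sustained breach eventually fires.
timeline = [True, True, False] + [True] * 7
print(alert_states(timeline, for_evals=5))
```

In this model the two-minute spike never leaves `Pending`, which is the alert-storm protection the `for` period is meant to provide.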

### LogQL Panel for Loki Logs

```logql
{service="api-gateway"} |= "error" | json | line_format "{{.level}} {{.message}}"
```
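The LogQL query is a pipeline of three stages: a line filter, a JSON parser, and a formatter. The Python sketch below walks the same stages over a few made-up log lines to show what each one does. It is a simplification: lines that are not JSON objects are skipped here, whereas Loki keeps them and marks them with an `__error__` label. The function name `logql_pipeline` is hypothetical.

```python
import json


def logql_pipeline(lines, needle="error"):
    """Imitate: |= "error" | json | line_format "{{.level}} {{.message}}"."""
    formatted = []
    for line in lines:
        if needle not in line:              # |= "error": substring match on the raw line
            continue
        try:
            fields = json.loads(line)       # | json: extract fields from the line
        except json.JSONDecodeError:
            continue                        # simplification, see the note above
        if not isinstance(fields, dict):
            continue
        # line_format: rewrite the line from the extracted fields
        formatted.append(f"{fields.get('level', '')} {fields.get('message', '')}")
    return formatted


# Made-up log lines for illustration.
logs = [
    '{"level":"error","message":"upstream timeout"}',
    '{"level":"info","message":"request ok"}',
    '{"level":"warn","message":"retrying after error"}',
]
print(logql_pipeline(logs))  # ['error upstream timeout', 'warn retrying after error']
```

The `warn` line passes because `|=` matches the substring anywhere in the raw line; filtering on the parsed field instead (for example `| level="error"` after `| json`) would be stricter.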

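### Terraform Dashboard Provisioning

```hcl
# Folder referenced by the dashboard below; the title is an illustrative assumption.
resource "grafana_folder" "services" {
  title = "Services"
}

resource "grafana_dashboard" "service_overview" {
  config_json = file("${path.module}/dashboards/service-overview.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"
}
```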

## Anti-Patterns

- **Too many panels per dashboard** - Keep dashboards focused; split into overview and detail dashboards linked with drill-down.
- **No variable templates** - Every dashboard should use variables for namespace, service, and environment to stay reusable.
- **Alerting on raw metrics without a `for` duration** - Brief spikes cause alert storms; set a `for` period of 2-5 minutes so the condition must persist before the alert fires.
- **Manual dashboard creation in production** - Use provisioning YAML or Terraform; manual dashboards drift and get lost on upgrades.

## When to Use

- You need a central visualization layer for Prometheus, Loki, Tempo, or any supported data source.
- You are implementing SLO dashboards with RED or USE method panels.
- You want unified alerting across multiple data sources with a single notification pipeline.
- You need dashboard-as-code workflows for reproducible infrastructure.
- You are correlating traces, logs, and metrics in a single pane of glass.

Install this skill directly: skilldb add observability-services-skills
