# Grafana
Build Grafana dashboards, configure data sources, and set up alerting rules.
You are an expert in Grafana dashboarding and Grafana Cloud. You help developers create effective visualizations, configure data sources, design alert rules, and build dashboard-as-code workflows using provisioning and Terraform.
## Key Points
- **Too many panels per dashboard** - Keep dashboards focused; split into overview and detail dashboards linked with drill-down.
- **No variable templates** - Every dashboard should use variables for namespace, service, and environment to stay reusable.
- **Alert on raw metrics without `for` duration** - Brief spikes cause alert storms; always set a `for` period of at least 2-5 minutes.
- **Manual dashboard creation in production** - Use provisioning YAML or Terraform; manual dashboards drift and get lost on upgrades.
- You need a central visualization layer for Prometheus, Loki, Tempo, or any supported data source.
- You are implementing SLO dashboards with RED or USE method panels.
- You want unified alerting across multiple data sources with a single notification pipeline.
- You need dashboard-as-code workflows for reproducible infrastructure.
## Quick Example
```bash
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:10.4.0
```
```promql
rate(http_requests_total{namespace=~"$namespace"}[5m])
```

## Grafana Integration
## Core Philosophy

### Dashboards Tell Stories

Every dashboard should answer a specific question. Start with the user journey or SLO, then add panels that diagnose failures. Avoid wall-of-charts dashboards with no narrative.

### Data Source Abstraction

Grafana connects to Prometheus, Loki, Tempo, Elasticsearch, PostgreSQL, and dozens more. Design dashboards that leverage the right source for each signal type.

### Alerting Close to the Data

Grafana Alerting evaluates queries at the source and routes notifications through contact points. Prefer Grafana-managed alerts for unified rule management across data sources.
## Setup

Provision Grafana with Docker:
```bash
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss:10.4.0
```
Provision data sources via YAML (`provisioning/datasources/default.yaml`). Note that the Loki data source needs an explicit `uid` so Tempo's `tracesToLogsV2.datasourceUid` reference resolves:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
      nodeGraph:
        enabled: true
```
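Data sources that require credentials can be provisioned the same way, with secrets placed under `secureJsonData` so Grafana encrypts them at rest. A minimal sketch, assuming a password-protected Prometheus endpoint and a `PROM_PASSWORD` environment variable (both hypothetical):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus-Auth          # hypothetical authenticated instance
    type: prometheus
    access: proxy
    url: https://prometheus.example.com
    basicAuth: true
    basicAuthUser: viewer
    secureJsonData:
      basicAuthPassword: $PROM_PASSWORD   # interpolated from the environment
```

Grafana expands `$VAR` references in provisioning files from the process environment, which keeps credentials out of version control.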
Provision dashboards (`provisioning/dashboards/default.yaml`):

```yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
## Key Patterns

**Do:** use template variables for reusable dashboards:
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up, namespace)",
        "refresh": 2,
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
```
Then reference `$namespace` in panel queries:

```promql
rate(http_requests_total{namespace=~"$namespace"}[5m])
```
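Variables can also be chained so that one selection narrows the next, e.g. a `service` variable scoped to the chosen namespace. A sketch (the `up` metric and `service` label are assumptions about your metric schema):

```json
{
  "name": "service",
  "type": "query",
  "datasource": "Prometheus",
  "query": "label_values(up{namespace=~\"$namespace\"}, service)",
  "refresh": 2
}
```

With `refresh: 2` the options re-query on time-range change, so the service list always reflects what was actually reporting in the selected window.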
**Not:** hardcoding label values in every panel query:

```promql
# WRONG - brittle, not reusable
rate(http_requests_total{namespace="production", service="api"}[5m])
```
**Do:** link traces, logs, and metrics panels together. For `timeseries` panels, data links live under `fieldConfig.defaults.links` (the old `options.dataLinks` location belongs to the legacy graph panel):

```json
{
  "panels": [{
    "title": "Request Latency",
    "type": "timeseries",
    "datasource": "Prometheus",
    "targets": [{
      "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service=\"$service\"}[5m]))"
    }],
    "fieldConfig": {
      "defaults": {
        "links": [{
          "title": "View traces",
          "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"queryType\":\"traceqlSearch\",\"filters\":[{\"id\":\"service-name\",\"value\":[\"${__field.labels.service}\"]}]}]}"
        }]
      }
    }
  }]
}
```
## Common Patterns

### RED Method Dashboard (Rate, Errors, Duration)
```promql
# Rate
sum(rate(http_requests_total{service="$service"}[5m]))

# Errors (ratio of 5xx responses)
sum(rate(http_requests_total{service="$service", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))

# Duration (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))
```
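The same error-ratio query can drive an SLO burn-rate panel. A sketch for a 99.9% availability target (the 14.4x multiplier and 1h window follow the common fast-burn convention from SRE practice; adjust both to your SLO):

```promql
# Fast burn: error ratio over 1h consumes budget 14.4x faster than allowed
(
  sum(rate(http_requests_total{service="$service", status_code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{service="$service"}[1h]))
) > (14.4 * 0.001)
```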
### Grafana Alerting Rule

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: SLO Alerts
    folder: Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [0.01]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
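Notifications for rules like this are routed through contact points, which can also be provisioned from YAML. A minimal sketch, assuming a Slack incoming webhook whose URL is supplied via a hypothetical `SLACK_WEBHOOK` environment variable:

```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: oncall-slack           # referenced by notification policies
    receivers:
      - uid: oncall-slack
        type: slack
        settings:
          url: $SLACK_WEBHOOK    # hypothetical env var holding the webhook URL
```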
### LogQL Panel for Loki Logs

```logql
{service="api-gateway"} |= "error" | json | line_format "{{.level}} {{.message}}"
```
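LogQL can also produce metrics from log streams, so a Loki panel can graph error frequency alongside Prometheus series. A sketch using the same assumed `api-gateway` stream selector:

```promql
# Log lines containing "error" per second, over a 5-minute window
sum(rate({service="api-gateway"} |= "error" [5m]))
```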
### Terraform Dashboard Provisioning

```hcl
resource "grafana_dashboard" "service_overview" {
  config_json = file("dashboards/service-overview.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"
}
```
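Both resources assume a configured `grafana/grafana` provider; a sketch authenticating with a service account token passed in as a variable (the variable name and URL are assumptions):

```hcl
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 2.0"
    }
  }
}

provider "grafana" {
  url  = "http://localhost:3000"
  auth = var.grafana_service_account_token  # hypothetical variable
}
```

A service account token scoped to the target org is preferable to the admin password, since it can be rotated and limited to dashboard and data source permissions.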
## Anti-Patterns

- **Too many panels per dashboard** - Keep dashboards focused; split into overview and detail dashboards linked with drill-down.
- **No variable templates** - Every dashboard should use variables for namespace, service, and environment to stay reusable.
- **Alert on raw metrics without `for` duration** - Brief spikes cause alert storms; always set a `for` period of at least 2-5 minutes.
- **Manual dashboard creation in production** - Use provisioning YAML or Terraform; manual dashboards drift and get lost on upgrades.
## When to Use
- You need a central visualization layer for Prometheus, Loki, Tempo, or any supported data source.
- You are implementing SLO dashboards with RED or USE method panels.
- You want unified alerting across multiple data sources with a single notification pipeline.
- You need dashboard-as-code workflows for reproducible infrastructure.
- You are correlating traces, logs, and metrics in a single pane of glass.
Install this skill directly: `skilldb add observability-services-skills`
## Related Skills

- **Axiom** - Integrate Axiom for log management, analytics, and real-time dashboards.
- **Elastic APM** - Instrument applications with Elastic APM and the ELK Stack for traces, logs, and metrics.
- **Honeycomb** - Integrate Honeycomb for event-driven observability with high-cardinality tracing.
- **Jaeger** - Deploy and integrate Jaeger for distributed tracing across microservices.
- **New Relic** - Integrate New Relic APM for application performance monitoring and distributed tracing.
- **OpenTelemetry** - Instrument applications with OpenTelemetry for distributed traces, metrics, and logs.