Grafana Cloud
Grafana Cloud is a fully managed observability platform that unifies metrics (Prometheus/Graphite), logs (Loki), and traces (Tempo) within a single Grafana interface. Use it to gain deep insights into your applications and infrastructure without the operational overhead of managing your own observability stack, allowing you to focus on building and improving your services.
You are an expert in full-stack observability, adept at leveraging Grafana Cloud to gain deep insights into complex systems, from infrastructure health to application performance and user experience. You master its open-source-native approach to metrics, logs, and traces, enabling seamless monitoring and troubleshooting across diverse environments.
Core Philosophy
Grafana Cloud's core philosophy centers on providing a unified, scalable, and open-source-native observability stack as a managed service. It embraces the industry-standard tools—Prometheus for metrics, Loki for logs, and Tempo for traces—integrating them seamlessly into the familiar Grafana dashboarding and alerting interface. This approach allows developers and operations teams to leverage powerful, community-driven tools without the burden of maintaining their underlying infrastructure.
You choose Grafana Cloud when you prioritize speed to insight, operational simplicity, and the flexibility of open standards. It's particularly well-suited for teams building cloud-native applications, microservices architectures, or those already familiar with Grafana, Prometheus, or Loki. By offloading the complexities of scaling and managing these systems, Grafana Cloud empowers you to focus on instrumenting your applications effectively and extracting actionable intelligence from your observability data, leading to faster incident resolution and continuous improvement.
Setup
Getting started with Grafana Cloud involves signing up for an account and then configuring your applications or infrastructure to send metrics, logs, and traces to the respective endpoints. The Grafana Agent is often the simplest and most robust way to achieve this, as it can collect all three types of telemetry data.
First, create a Grafana Cloud account. Navigate to your Grafana Cloud portal to find your Stack ID, Prometheus/Loki/Tempo URLs, and API keys.
Next, install the Grafana Agent on your hosts or as a sidecar in your Kubernetes cluster. For a simple host, you might download and run it.
# Download the Grafana Agent (adjust the version, OS, and architecture for your environment)
wget https://github.com/grafana/agent/releases/download/v0.38.0/grafana-agent-linux-amd64.zip
unzip grafana-agent-linux-amd64.zip
mv grafana-agent-linux-amd64 grafana-agent
chmod +x grafana-agent
# Create an agent configuration file (agent-config.yaml)
# (See examples in Key Techniques for content)
# Run the agent (replace with the actual path to your config)
./grafana-agent -config.file=./agent-config.yaml
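To keep the agent running across reboots on a Linux host, you can wrap the binary in a systemd unit. This is a minimal sketch; the binary path, config path, and service user are assumptions to adapt to your layout.

```ini
# /etc/systemd/system/grafana-agent.service (illustrative paths)
[Unit]
Description=Grafana Agent
After=network-online.target

[Service]
ExecStart=/usr/local/bin/grafana-agent -config.file=/etc/grafana-agent/agent-config.yaml
Restart=always
RestartSec=5
User=grafana-agent

[Install]
WantedBy=multi-user.target
```

Then enable it with `sudo systemctl daemon-reload && sudo systemctl enable --now grafana-agent`.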
For Kubernetes, you would deploy it as a DaemonSet or StatefulSet. A common approach is to use the official Helm chart:
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Grafana Agent, providing your Grafana Cloud details
# Replace <YOUR_CLOUD_ORG_ID>, <YOUR_PROMETHEUS_USERNAME>, <YOUR_PROMETHEUS_PASSWORD> etc.
helm install grafana-agent grafana/grafana-agent \
  --set agent.mode=flow \
  --set agent.deploymentMode=DaemonSet \
  --set "prometheus.remote_write[0].url=https://prometheus-us-central1.grafana.net/api/prom/push" \
  --set "prometheus.remote_write[0].basic_auth.username=<YOUR_PROMETHEUS_USERNAME>" \
  --set "prometheus.remote_write[0].basic_auth.password=<YOUR_PROMETHEUS_PASSWORD>" \
  --set "logs.configs[0].clients[0].url=https://logs-us-central1.grafana.net/loki/api/v1/push" \
  --set "logs.configs[0].clients[0].basic_auth.username=<YOUR_LOKI_USERNAME>" \
  --set "logs.configs[0].clients[0].basic_auth.password=<YOUR_LOKI_PASSWORD>" \
  --set "traces.configs[0].receivers.otlp.protocols.grpc=" \
  --set "traces.configs[0].receivers.otlp.protocols.http=" \
  --set "traces.configs[0].clients[0].url=https://tempo-us-central1.grafana.net:443" \
  --set "traces.configs[0].clients[0].basic_auth.username=<YOUR_TEMPO_USERNAME>" \
  --set "traces.configs[0].clients[0].basic_auth.password=<YOUR_TEMPO_PASSWORD>" \
  --namespace monitoring --create-namespace
Note: Replace us-central1 with your actual Grafana Cloud region and fill in your specific credentials.
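A long chain of --set flags is hard to review and easy to mistype; the same settings can live in a values file checked into version control. A sketch of the equivalent values (key paths are assumed to match the chart version you install; verify them with `helm show values grafana/grafana-agent`):

```yaml
# values.yaml -- illustrative; verify key names against your chart version
prometheus:
  remote_write:
    - url: https://prometheus-us-central1.grafana.net/api/prom/push
      basic_auth:
        username: <YOUR_PROMETHEUS_USERNAME>
        password: <YOUR_PROMETHEUS_PASSWORD>
logs:
  configs:
    - clients:
        - url: https://logs-us-central1.grafana.net/loki/api/v1/push
          basic_auth:
            username: <YOUR_LOKI_USERNAME>
            password: <YOUR_LOKI_PASSWORD>
```

Install with `helm install grafana-agent grafana/grafana-agent -f values.yaml --namespace monitoring --create-namespace`.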
Key Techniques
1. Sending Prometheus Metrics with Grafana Agent
You configure the Grafana Agent to scrape Prometheus-compatible endpoints from your applications and forward them to Grafana Cloud. This is ideal for services exposing /metrics endpoints.
# agent-config.yaml for Grafana Agent (standalone)
server:
  http_listen_port: 12345

metrics:
  wal_directory: /tmp/grafana-agent-wal  # required in static mode
  configs:
    - name: default
      host_filter: false
      scrape_configs:
        - job_name: 'node_exporter' # Example: scraping a node exporter
          static_configs:
            - targets: ['localhost:9100'] # Or your application's metrics endpoint
          relabel_configs:
            - source_labels: [__address__]
              target_label: instance
              regex: (.+):9100
              replacement: $1
        - job_name: 'my-web-app' # Example: scraping your web application
          metrics_path: /metrics
          static_configs:
            - targets: ['my-web-app:8080'] # Your application's internal service endpoint
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: <YOUR_PROMETHEUS_USERNAME>
            password: <YOUR_PROMETHEUS_PASSWORD>
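If a scrape target exposes high-cardinality labels or metric families you never query, you can drop them at the agent before they count against your Grafana Cloud series limits. A hedged sketch using Prometheus-style `metric_relabel_configs` (the `user_id` label and the `go_gc_.*` pattern are hypothetical examples):

```yaml
scrape_configs:
  - job_name: 'my-web-app'
    static_configs:
      - targets: ['my-web-app:8080']
    metric_relabel_configs:
      # Drop a hypothetical high-cardinality label before remote_write
      - action: labeldrop
        regex: user_id
      # Drop entire metric families you never query
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
```

Relabeling runs after the scrape but before samples are written, so dropped series never leave the host.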
2. Sending Loki Logs with Grafana Agent
You instruct the Grafana Agent to tail log files from your application or system and send them to Loki. This works well for applications writing logs to standard output or log files.
# agent-config.yaml for Grafana Agent (standalone), continued from above or separate
logs:
  configs:
    - name: default
      clients:
        - url: https://logs-us-central1.grafana.net/loki/api/v1/push
          basic_auth:
            username: <YOUR_LOKI_USERNAME>
            password: <YOUR_LOKI_PASSWORD>
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: system
          static_configs:
            - targets: [localhost]
              labels:
                job: varlogs
                __path__: /var/log/*log # Tail all .log files in /var/log
        - job_name: my-web-app-logs
          static_configs:
            - targets: [localhost]
              labels:
                job: my-web-app
                __path__: /var/log/my-app/access.log # Tail a specific application log file
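If your application writes structured JSON logs, the agent can parse each line and promote selected fields to Loki labels at scrape time. A sketch using Promtail-style `pipeline_stages` (the `level` and `msg` field names are assumptions about your log format):

```yaml
scrape_configs:
  - job_name: my-web-app-json-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: my-web-app
          __path__: /var/log/my-app/app.json
    pipeline_stages:
      # Parse each line as JSON and extract two fields
      - json:
          expressions:
            level: level
            msg: msg
      # Promote the low-cardinality level field to a Loki label
      - labels:
          level:
```

Only promote low-cardinality fields (like a log level) to labels; high-cardinality values belong in the log line itself, where LogQL can still filter on them.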
3. Sending Tempo Traces via OpenTelemetry
You instrument your application using OpenTelemetry SDKs to generate traces and configure the exporter to send them to Grafana Cloud Tempo. This is crucial for distributed tracing.
For a Node.js web application using Express:
// tracing.ts (or a dedicated OpenTelemetry setup file)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Metadata } from '@grpc/grpc-js';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';

// Configure the OTLP gRPC exporter to send traces to Grafana Cloud Tempo.
// The gRPC exporter authenticates via gRPC metadata rather than HTTP headers,
// and its url is a host:port, not a path. Confirm your stack's exact Tempo
// endpoint in the Grafana Cloud portal.
const metadata = new Metadata();
metadata.set(
  'Authorization',
  `Basic ${Buffer.from('<YOUR_TEMPO_USERNAME>:<YOUR_TEMPO_PASSWORD>').toString('base64')}`,
);
const exporter = new OTLPTraceExporter({
  url: 'https://tempo-us-central1.grafana.net:443', // Your Tempo OTLP gRPC endpoint
  metadata,
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-web-app',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter: exporter,
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    // Add other relevant instrumentations (e.g., for database clients)
  ],
});

sdk.start()
  .then(() => console.log('OpenTelemetry tracing initialized for Grafana Cloud Tempo.'))
  .catch((error) => console.error('Error initializing tracing:', error));

// Ensure the SDK is shut down gracefully on process exit
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry SDK shut down successfully.'))
    .catch((error) => console.error('Error shutting down OpenTelemetry SDK:', error))
    .finally(() => process.exit(0));
});
Then, ensure this tracing.ts file is imported and run at the very beginning of your application's entry point before other modules load.
Best Practices
- Establish a strong labeling strategy: Consistently apply labels like service_name, environment, cluster, namespace, and instance to all metrics, logs, and traces. This enables powerful filtering, aggregation, and correlation across your observability data.
- Leverage the Grafana Agent for comprehensive collection: Use the Grafana Agent whenever possible to collect metrics, logs, and traces from your infrastructure and applications, simplifying deployment and management.
- Implement correlated dashboards: Design Grafana dashboards that combine metrics, logs, and traces relevant to a specific service or component. Use variables and linking to jump seamlessly between views for efficient troubleshooting.
- Configure robust alerting: Set up alerts in Grafana for critical metrics (e.g., CPU, memory, error rates, latency) and specific log patterns. Use alert groups and notification channels to ensure the right people are notified without creating alert fatigue.
- Optimize for cardinality: Be mindful of high-cardinality labels in Prometheus metrics and Loki logs, as they can significantly increase costs and degrade query performance. Use relabel_configs to drop or sanitize unnecessary labels.
- Utilize trace-to-logs/metrics linking: Configure your Grafana data sources to link automatically from a trace span to relevant logs in Loki or metrics in Prometheus, making root cause analysis much faster.
- Centralize configuration management: Store your Grafana Agent configurations and application instrumentation details in version control, and automate their deployment using tools like Helm, Ansible, or Terraform.
Anti-Patterns
Ignoring Cardinality. Sending metrics with constantly changing or unique labels (e.g., a timestamp or a unique user ID) creates high cardinality, which can lead to expensive storage and slow queries in Prometheus. Instead, strip or aggregate such labels before sending them to Grafana Cloud.
Siloed Observability. Only sending metrics or logs, but not traces, prevents a holistic view of your system's behavior and makes complex issue diagnosis difficult. Instead, implement a full observability strategy by sending metrics, logs, and traces to Grafana Cloud for complete correlation.
Over-alerting on Symptoms. Setting up too many alerts on minor deviations or symptoms rather than root causes leads to alert fatigue and missed critical incidents. Instead, focus on actionable alerts based on service-level objectives (SLOs) and key performance indicators (KPIs) that truly indicate user impact.
Lack of Contextual Labels. Sending raw metrics or logs without sufficient contextual labels (e.g., environment, service_name, version) makes it impossible to effectively filter, group, or understand the data in large environments. Always enrich your telemetry data with meaningful, consistent labels.
Manual Agent Updates. Manually updating Grafana Agents on individual servers or containers is time-consuming and prone to errors. Instead, automate agent deployment and updates using infrastructure-as-code tools and CI/CD pipelines to ensure consistency and efficiency.
Install this skill directly: skilldb add monitoring-services-skills
Related Skills
Baselime
Baselime is a serverless-native observability platform designed for AWS, unifying logs, traces, and metrics. It provides real-time insights and contextualized data to help you understand and troubleshoot your distributed serverless applications.
BetterStack
BetterStack (formerly Better Uptime + Logtail) combines uptime monitoring, log management, status pages, incident management, and alerting.
Checkly
Checkly provides synthetic monitoring, API checks, browser checks, Playwright-based E2E monitoring, and a monitoring-as-code CLI.
Cronitor
Cronitor is a robust monitoring service designed to ensure your background jobs (cron jobs, scheduled tasks, async workers) and APIs run reliably. It actively monitors the health and execution of automated processes, alerting you instantly to missed runs, failures, or delays. Use Cronitor to gain peace of mind and critical visibility into your application's backend operations.
Datadog
Datadog offers APM, log management, infrastructure monitoring, RUM, custom metrics, dashboards, and Node.js tracing.
Highlight.io
Highlight.io is an open-source monitoring platform with session replay, error tracking, logging, tracing, a Next.js SDK, and a self-hosted option.