
Grafana Cloud

Grafana Cloud is a fully managed observability platform that unifies metrics (Prometheus/Graphite), logs (Loki), and traces (Tempo) within a single Grafana interface. Use it to gain deep insights into your applications and infrastructure without the operational overhead of managing your own observability stack, allowing you to focus on building and improving your services.

Quick Summary
You are an expert in full-stack observability, adept at leveraging Grafana Cloud to gain deep insights into complex systems, from infrastructure health to application performance and user experience. You master its open-source-native approach to metrics, logs, and traces, enabling seamless monitoring and troubleshooting across diverse environments.


Core Philosophy

Grafana Cloud's core philosophy centers on providing a unified, scalable, and open-source-native observability stack as a managed service. It embraces the industry-standard tools—Prometheus for metrics, Loki for logs, and Tempo for traces—integrating them seamlessly into the familiar Grafana dashboarding and alerting interface. This approach allows developers and operations teams to leverage powerful, community-driven tools without the burden of maintaining their underlying infrastructure.

You choose Grafana Cloud when you prioritize speed to insight, operational simplicity, and the flexibility of open standards. It's particularly well-suited for teams building cloud-native applications, microservices architectures, or those already familiar with Grafana, Prometheus, or Loki. By offloading the complexities of scaling and managing these systems, Grafana Cloud empowers you to focus on instrumenting your applications effectively and extracting actionable intelligence from your observability data, leading to faster incident resolution and continuous improvement.

Setup

Getting started with Grafana Cloud involves signing up for an account and then configuring your applications or infrastructure to send metrics, logs, and traces to the respective endpoints. The Grafana Agent is often the simplest and most robust way to achieve this, as it can collect all three types of telemetry data.

First, create a Grafana Cloud account. Navigate to your Grafana Cloud portal to find your Stack ID, Prometheus/Loki/Tempo URLs, and API keys.

Next, install the Grafana Agent on your hosts or as a sidecar in your Kubernetes cluster. For a simple host, you might download and run it.

# Download the Grafana Agent (adjust the version, OS, and architecture as needed)
wget https://github.com/grafana/agent/releases/download/v0.38.0/grafana-agent-linux-amd64.zip
unzip grafana-agent-linux-amd64.zip
mv grafana-agent-linux-amd64 grafana-agent
chmod +x grafana-agent

# Create an agent configuration file (agent-config.yaml)
# (See examples in Key Techniques for content)

# Run the agent (replace with the actual path to your config)
./grafana-agent -config.file=./agent-config.yaml
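For a long-running host deployment, you would typically run the agent under a process supervisor rather than in the foreground. A minimal systemd unit sketch, assuming the binary and config were placed under /opt/grafana-agent (the paths and service user are illustrative):

```ini
# /etc/systemd/system/grafana-agent.service
[Unit]
Description=Grafana Agent
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/grafana-agent/grafana-agent -config.file=/opt/grafana-agent/agent-config.yaml
Restart=on-failure
User=grafana-agent

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now grafana-agent` and check its status with `systemctl status grafana-agent`.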

For Kubernetes, you would deploy it as a DaemonSet or StatefulSet. A common approach is to use the official Helm chart:

# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana Agent as a DaemonSet, supplying the agent configuration
# (see Key Techniques for content) through the chart's values. Value key
# names can differ between chart versions; check them first with:
#   helm show values grafana/grafana-agent
cat > agent-values.yaml <<'EOF'
agent:
  mode: 'static'
  configMap:
    content: |
      # Paste your agent-config.yaml content here
      # (the metrics, logs, and traces sections from Key Techniques)
controller:
  type: 'daemonset'
EOF

helm install grafana-agent grafana/grafana-agent \
  -f agent-values.yaml \
  --namespace monitoring --create-namespace

Note: Replace us-central1 with your actual Grafana Cloud region and fill in your specific credentials.
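After the chart installs, it is worth confirming the agent pods are healthy before moving on. A quick check (the label selector is an assumption based on common chart conventions; adjust it to match your release):

```shell
# Verify the agent pods are running in the monitoring namespace
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana-agent

# Inspect recent agent logs for remote_write or auth errors
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana-agent --tail=20
```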

Key Techniques

1. Sending Prometheus Metrics with Grafana Agent

You configure the Grafana Agent to scrape Prometheus-compatible endpoints from your applications and forward them to Grafana Cloud. This is ideal for services exposing /metrics endpoints.

# agent-config.yaml for Grafana Agent (standalone)
server:
  http_listen_port: 12345

metrics:
  configs:
    - name: default
      host_filter: false
      scrape_configs:
        - job_name: 'node_exporter' # Example: scraping a node exporter
          static_configs:
            - targets: ['localhost:9100'] # Or your application's metrics endpoint
          relabel_configs:
            - source_labels: [__address__]
              target_label: instance
              regex: (.+):9100
              replacement: $1
        - job_name: 'my-web-app' # Example: scraping your web application
          metrics_path: /metrics
          static_configs:
            - targets: ['my-web-app:8080'] # Your application's internal service endpoint
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: <YOUR_PROMETHEUS_USERNAME>
            password: <YOUR_PROMETHEUS_PASSWORD>

2. Sending Loki Logs with Grafana Agent

You instruct the Grafana Agent to tail log files from your application or system and send them to Loki. This works well for applications writing logs to standard output or log files.

# agent-config.yaml for Grafana Agent (standalone), continued from above or separate
logs:
  configs:
    - name: default
      clients:
        - url: https://logs-us-central1.grafana.net/loki/api/v1/push
          basic_auth:
            username: <YOUR_LOKI_USERNAME>
            password: <YOUR_LOKI_PASSWORD>
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: system
          static_configs:
            - targets: [localhost]
              labels:
                job: varlogs
                __path__: /var/log/*log # Tail all .log files in /var/log
        - job_name: my-web-app-logs
          static_configs:
            - targets: [localhost]
              labels:
                job: my-web-app
                __path__: /var/log/my-app/access.log # Tail a specific application log file
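With the log scrape configs above in place, you can query the resulting streams in Grafana with LogQL. Two illustrative queries (the second assumes the access log uses a combined-style format with the status code surrounded by spaces):

```logql
# All lines containing "error" from the application stream defined above
{job="my-web-app"} |= "error"

# Approximate rate of 5xx responses parsed from the access log
sum(rate({job="my-web-app"} |~ " 5[0-9]{2} " [5m]))
```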

3. Sending Tempo Traces via OpenTelemetry

You instrument your application using OpenTelemetry SDKs to generate traces and configure the exporter to send them to Grafana Cloud Tempo. This is crucial for distributed tracing.

For a Node.js web application using Express:

// tracing.ts (or a dedicated OpenTelemetry setup file)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { Metadata } from '@grpc/grpc-js';

// The gRPC exporter authenticates via gRPC metadata, not HTTP headers
const metadata = new Metadata();
metadata.set(
  'authorization',
  `Basic ${Buffer.from('<YOUR_TEMPO_USERNAME>:<YOUR_TEMPO_PASSWORD>').toString('base64')}`
);

// Configure the OTLP exporter to send traces to Grafana Cloud Tempo
const exporter = new OTLPTraceExporter({
  url: 'https://tempo-us-central1.grafana.net:443', // Your Tempo OTLP gRPC endpoint (see your stack details in the Grafana Cloud portal)
  metadata,
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-web-app',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  }),
  traceExporter: exporter,
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    // Add other relevant instrumentations (e.g., for database clients)
  ],
});

sdk.start()
  .then(() => console.log('OpenTelemetry tracing initialized for Grafana Cloud Tempo.'))
  .catch((error) => console.error('Error initializing tracing:', error));

// Ensure the SDK is shut down gracefully on process exit
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry SDK shut down successfully.'))
    .catch((error) => console.error('Error shutting down OpenTelemetry SDK:', error))
    .finally(() => process.exit(0));
});

Then import this tracing.ts file at the very top of your application's entry point, before any other module loads, so the instrumentations can patch those modules.
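As a sketch, assuming the setup file above is compiled alongside your app, the entry point might look like this (the route and port are illustrative):

```typescript
// index.ts -- load tracing first so the HTTP and Express instrumentations
// can patch those modules before anything else imports them
import './tracing';

import express from 'express';

const app = express();
app.get('/healthz', (_req, res) => res.send('ok'));
app.listen(8080, () => console.log('my-web-app listening on :8080'));
```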

Best Practices

  • Establish a strong labeling strategy: Consistently apply labels like service_name, environment, cluster, namespace, and instance to all metrics, logs, and traces. This enables powerful filtering, aggregation, and correlation across your observability data.
  • Leverage Grafana Agent for comprehensive collection: Use the Grafana Agent whenever possible to collect metrics, logs, and traces from your infrastructure and applications, simplifying deployment and management.
  • Implement correlated dashboards: Design Grafana dashboards that combine metrics, logs, and traces relevant to a specific service or component. Use variables and linking to seamlessly jump between different views for efficient troubleshooting.
  • Configure robust alerting: Set up alerts in Grafana for critical metrics (e.g., CPU, memory, error rates, latency) and specific log patterns. Use alert groups and notification channels to ensure the right people are notified without creating alert fatigue.
  • Optimize for cardinality: Be mindful of high-cardinality labels in Prometheus metrics and Loki logs, as they can significantly increase costs and impact query performance. Use relabel_configs to drop or sanitize unnecessary labels.
  • Utilize Trace to Logs/Metrics linking: Configure your Grafana dashboards to automatically link from a trace span to relevant logs in Loki or metrics in Prometheus, making root cause analysis much faster.
  • Centralize configuration management: Store your Grafana Agent configurations and application instrumentation details in version control. Automate their deployment using tools like Helm, Ansible, or Terraform.
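For the cardinality point above, a hedged sketch of metric_relabel_configs in an agent scrape config that strips high-cardinality data before remote_write (the request_id label and the histogram metric name are hypothetical examples):

```yaml
scrape_configs:
  - job_name: 'my-web-app'
    static_configs:
      - targets: ['my-web-app:8080']
    metric_relabel_configs:
      # Drop a per-request unique label entirely (hypothetical 'request_id')
      - action: labeldrop
        regex: request_id
      # Drop a noisy histogram series that is never queried
      - action: drop
        source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
```

metric_relabel_configs run after scraping but before the samples are written upstream, so dropped labels and series never reach Grafana Cloud or count toward billing.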

Anti-Patterns

Ignoring Cardinality. Sending metrics with constantly changing or unique labels (e.g., a timestamp or a unique user ID) creates high cardinality, which can lead to expensive storage and slow queries in Prometheus. Instead, strip or aggregate such labels before sending them to Grafana Cloud.

Siloed Observability. Only sending metrics or logs, but not traces, prevents a holistic view of your system's behavior and makes complex issue diagnosis difficult. Instead, implement a full observability strategy by sending metrics, logs, and traces to Grafana Cloud for complete correlation.

Over-alerting on Symptoms. Setting up too many alerts on minor deviations or symptoms rather than root causes leads to alert fatigue and missed critical incidents. Instead, focus on actionable alerts based on service-level objectives (SLOs) and key performance indicators (KPIs) that truly indicate user impact.

Lack of Contextual Labels. Sending raw metrics or logs without sufficient contextual labels (e.g., environment, service_name, version) makes it impossible to effectively filter, group, or understand the data in large environments. Always enrich your telemetry data with meaningful, consistent labels.

Manual Agent Updates. Manually updating Grafana Agents on individual servers or containers is time-consuming and prone to errors. Instead, automate agent deployment and updates using infrastructure-as-code tools and CI/CD pipelines to ensure consistency and efficiency.
