
Grafana dashboards that don't lie: from Prometheus to panels.


Most Grafana dashboards are decorative. They look busy, they scroll forever, and when an incident happens nobody knows where to look. This is almost always a PromQL problem and a design problem, not a Grafana problem. This post covers the queries that give you useful signal, the panel design decisions that make dashboards usable under pressure, Loki log correlation so you can move from a metric spike to the logs that explain it, and the alerting setup that means your AlertManager config and your dashboard actually agree with each other.

The RED method: three queries for every service

Every HTTP service needs exactly three things on its primary dashboard: Rate (requests per second), Errors (error rate as a percentage), and Duration (latency distribution). These three panels tell you whether the service is healthy; everything else is detail.

# Rate: requests per second over a 5-minute window
# Replace 'payments_api' with your service label
rate(http_requests_total{service="payments_api"}[5m])

# Error rate: percentage of requests that returned 5xx
100 * (
  rate(http_requests_total{service="payments_api", status=~"5.."}[5m])
  /
  rate(http_requests_total{service="payments_api"}[5m])
)

# Latency: 50th, 95th, 99th percentile (requires histogram metric type)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))

Put all three latency quantiles on a single time-series panel. When p95 diverges sharply from p50, you have a slow-tail problem affecting a minority of requests. When all three rise together, you have a general latency problem. The shape of the divergence tells you where to look.
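To track that divergence as a single series, you can chart the ratio of the two quantiles directly (a sketch; the useful threshold depends entirely on your traffic shape):

```promql
# Tail amplification: how much slower the 95th percentile is than the median
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))
/
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))
```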

Variable templating: one dashboard for all environments

The fastest way to double your dashboard count is to create separate staging and production dashboards. Use Grafana variables instead:

# In Grafana dashboard settings -> Variables -> Add variable:
# Name: env
# Type: Custom
# Values: staging, production
# Default: production

# Name: service  
# Type: Query (populate from Prometheus labels)
# Query: label_values(http_requests_total{env="$env"}, service)

# Now your queries become:
rate(http_requests_total{env="$env", service="$service"}[5m])

Add an instance variable too (sourced from the instance label), and you can drill from fleet-level to single-host in one click. The $__rate_interval built-in variable automatically adjusts the rate() window based on your dashboard's time range โ€” use it instead of hardcoding [5m].
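With all three variables plus the built-in interval in place, a single drill-down panel query looks like this ($instance is assumed to be a multi-value variable, hence the regex match):

```promql
rate(http_requests_total{env="$env", service="$service", instance=~"$instance"}[$__rate_interval])
```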

Infrastructure panels: USE method for hosts

For hosts and infrastructure components, the USE method (Utilisation, Saturation, Errors) complements RED:

# CPU utilisation: fraction of time spent non-idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))

# Memory saturation: swap activity (pages in/out per second) signals pressure beyond physical RAM
rate(node_vmstat_pswpin[$__rate_interval]) + rate(node_vmstat_pswpout[$__rate_interval])

# Disk saturation: I/O utilisation (fraction of time disk was busy)
rate(node_disk_io_time_seconds_total{device!~"loop.*"}[$__rate_interval])

# Network saturation: receive/transmit bytes per second
rate(node_network_receive_bytes_total{device!~"lo"}[$__rate_interval])
rate(node_network_transmit_bytes_total{device!~"lo"}[$__rate_interval])
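One gap in the list above: memory utilisation, as opposed to saturation. Assuming node_exporter on a Linux kernel recent enough to export MemAvailable (3.14+), the standard query is:

```promql
# Memory utilisation: fraction of RAM not available for new allocations
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```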

Loki integration: jump from metric to logs

Grafana's Explore view and Data Links let you click on a spike in a metric panel and jump directly to the Loki logs for that time window and service. This is the feature that makes dashboards useful in incidents rather than just decorative:

# Loki datasource configuration in Grafana
# Add to your panel's Data Links:
# URL: /explore?left={"datasource":"loki","queries":[{"expr":"{service=\"${service}\"}"}],"range":{"from":"${__from}","to":"${__to}"}}
# Title: "View logs for ${service}"

# Useful Loki queries for error investigation:
# All error logs for a service in the last hour:
{service="payments_api"} |= "error" | logfmt | level="error"

# Slow requests (requires structured logging with duration field):
{service="payments_api"} | logfmt | duration > 1000ms

# Correlate with trace ID (if you have distributed tracing):
{service="payments_api"} | logfmt | traceID="abc123"

Structured logging matters here. Loki's logfmt and JSON parsers only work if your logs are actually structured. If you're still writing freeform log strings, you can still search them with |= text matching, but you lose filtering by specific fields. The investment in structured logging pays off immediately when you start using Loki.
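If your application doesn't already emit structured logs, the change is small. Here is a minimal sketch of a logfmt emitter in Python; the field names (service, level, duration_ms, msg) are illustrative, not a required schema, and any logfmt library would do the same job:

```python
# Emit logfmt lines that Loki's logfmt parser can split into fields.
def logfmt(**fields):
    """Render key=value pairs; quote values containing spaces or quotes."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        if " " in text or '"' in text:
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)

line = logfmt(service="payments_api", level="error",
              duration_ms=1450, msg="charge failed: card declined")
print(line)
# service=payments_api level=error duration_ms=1450 msg="charge failed: card declined"
```

Once every log line looks like this, the `| logfmt | level="error"` filters above work without any further parsing configuration.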

Alert rules from panels: keep them in sync

The most common observability failure mode is having alert thresholds that don't match what's on the dashboard. Fix this by defining alert rules directly in Grafana (Grafana-managed alerts), which are stored alongside the dashboard and share the same PromQL:

# In Grafana: Alerting -> Alert rules -> New alert rule
# Set the query to match your dashboard panel exactly.
# Caveat: Grafana-managed alert queries cannot interpolate dashboard
# variables, so $service and $__rate_interval must be replaced with
# concrete values (one rule per service, or template rules via provisioning).

# Alert: High error rate
# Query A:
100 * (
  rate(http_requests_total{service="payments_api", status=~"5.."}[5m])
  /
  rate(http_requests_total{service="payments_api"}[5m])
)
# Condition: A IS ABOVE 5  (fire when error rate > 5%)
# For: 2m                  (must persist for 2 minutes to avoid flapping)
# Labels: severity=warning, service={{ $labels.service }}
# Annotations: summary="Error rate on {{ $labels.service }} is {{ $values.A }}%"

# Alert: Latency SLA breach
# Query B:
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payments_api"}[5m]))
# Condition: B IS ABOVE 2  (p99 over 2 seconds)
# For: 5m
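If your alerts live in Prometheus and route through Alertmanager instead, the same error-rate query translates directly into a rule file. Rule files can't use dashboard variables either, so the service label is hardcoded (the group name and threshold here are illustrative):

```yaml
groups:
  - name: payments-api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          100 * (
            rate(http_requests_total{service="payments_api", status=~"5.."}[5m])
            /
            rate(http_requests_total{service="payments_api"}[5m])
          ) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Error rate on {{ $labels.service }} is {{ $value | humanize }}%"
```

Whichever home you choose, the point stands: the expression in the alert rule and the expression on the panel must be the same text, or they will drift.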

Dashboard organisation that scales

As dashboards multiply, finding the right one under pressure becomes its own problem. A structure that works:

  • Fleet overview: one panel per service, showing a single health signal (green/yellow/red). The entry point for any investigation.
  • Service dashboards: one per service, RED method plus business metrics. Linked from the fleet overview.
  • Infrastructure dashboards: one per host type (Kubernetes nodes, database servers, Nginx instances). USE method.
  • Incident dashboards: ephemeral, created during incidents to track specific hypotheses. Deleted after the postmortem.

In 47Network Studio engagements we provision Grafana dashboards as code via the Grafana API during every hardware and zero-trust deployment. Dashboard JSON is committed to the client's GitOps repository, so dashboards are version-controlled, peer-reviewed on change, and restored automatically if Grafana is wiped. The 47Sentry product ships a pre-configured Grafana instance with fleet overview and per-service dashboards populated at install time.
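The dashboards-as-code workflow needs nothing more exotic than the Grafana HTTP API's create-or-update endpoint. A minimal sketch (GRAFANA_URL and API_TOKEN are placeholders, and the dashboard body here is deliberately stripped down):

```python
# Push a dashboard to Grafana via POST /api/dashboards/db.
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # assumption: local Grafana
API_TOKEN = "REPLACE_ME"               # service-account token

def dashboard_payload(title, panels):
    """Create-or-update body: id=None plus overwrite=True lets
    Grafana match an existing dashboard rather than erroring."""
    return {
        "dashboard": {"id": None, "title": title, "panels": panels},
        "overwrite": True,
    }

def push_dashboard(payload):
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return json.load(resp)

payload = dashboard_payload("Fleet Overview", panels=[])
```

In a GitOps setup the payload comes straight from the committed dashboard JSON, and a CI job calls push_dashboard on merge.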


โ† Back to Blog Prometheus Alerting โ†’