Prometheus gives you metrics. Grafana makes them visible. But when an alert fires and you want to see the actual log lines from the moment of the incident, you need Loki. Grafana Loki is a log aggregation system designed specifically to complement Prometheus: same label model, same query language family, and native integration in Grafana that lets you jump from a metric spike to the correlated log lines in one click. Promtail is the log shipper that reads from files and the systemd journal and pushes to Loki. This post covers the setup from scratch to a production-ready logging pipeline.
Why Loki instead of Elasticsearch
Elasticsearch (or OpenSearch) is powerful but expensive to operate: it indexes every field in every log line, consuming significant CPU and disk IOPS. Loki takes a different approach: it only indexes the labels you define (like app, host, env), and stores the actual log content as compressed chunks. This makes Loki dramatically cheaper to run and simpler to operate, at the cost of slower full-text search on unindexed fields. For most infrastructure logging use cases, where you know which service you're looking at and want to grep through recent logs, Loki's performance is more than adequate.
Deploying Loki + Promtail
# docker-compose.yml - Loki + Promtail + Grafana
services:
  loki:
    image: grafana/loki:3.3.2
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
  promtail:
    image: grafana/promtail:3.3.2
    volumes:
      - /var/log:/var/log:ro # System logs
      - /var/lib/docker/containers:/var/lib/docker/containers:ro # Container logs
      - /var/run/docker.sock:/var/run/docker.sock # Required by docker_sd_configs below
      - ./promtail-config.yml:/etc/promtail/config.yml
    environment:
      HOSTNAME: ${HOSTNAME:-docker-host} # Pass the host's name through; containers default to the container ID
    # -config.expand-env=true lets the config reference ${HOSTNAME}
    command: -config.file=/etc/promtail/config.yml -config.expand-env=true
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
volumes:
  loki-data:
# loki-config.yml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 744h # 31 days
# retention_period alone only declares the limit - the compactor does the deleting
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem
Promtail configuration and label design
Labels are the most important design decision in a Loki deployment. Every unique combination of label values creates a separate stream, so too many high-cardinality labels (like user IDs or request IDs) cause stream explosion and degrade performance. Stick to low-cardinality labels: app, host, env, level.
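To make the trap concrete, here is a hypothetical pipeline fragment showing what not to do, and the LogQL alternative that keeps the high-cardinality value in the log body:

# Hypothetical anti-pattern - do NOT promote per-request fields to labels:
# - labels:
#     request_id:   # one stream per request -> stream explosion
# Keep request_id in the log body instead and filter at query time:
#   {app="api"} | json | request_id="abc-123"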
# promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml # Tracks read position in each file
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  # Nginx access logs
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          app: nginx
          host: ${HOSTNAME} # Expanded from the environment via -config.expand-env=true
          env: production
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      # Parse nginx JSON log format
      - json:
          expressions:
            status: status
            method: method
            path: uri
            duration: request_time
            upstream: upstream_addr
      # Extract status code as a label for filtering
      - labels:
          status:
      # Derive a level label: error for 5xx responses, info otherwise
      - template:
          source: level
          template: '{{ if ge (int .status) 500 }}error{{ else }}info{{ end }}'
      - labels:
          level:
  # Docker container logs (all containers)
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
        target_label: app
      - source_labels: [__meta_docker_container_name]
        target_label: container
      - replacement: production
        target_label: env
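Promtail can also read the systemd journal directly, the third source mentioned in the introduction. A sketch of an additional scrape_configs entry, assuming /var/log/journal and /etc/machine-id are mounted into the Promtail container:

# Sketch: systemd journal scraping (assumes journal volumes are mounted)
- job_name: journal
  journal:
    max_age: 12h # Ignore entries older than this on first start
    labels:
      app: systemd-journal
      env: production
  relabel_configs:
    # Expose the unit name (e.g. sshd.service) as a queryable label
    - source_labels: ['__journal__systemd_unit']
      target_label: unit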
LogQL: querying logs
LogQL is Loki's query language, modelled on PromQL. Log queries filter streams by label, then optionally filter or parse the log content:
# Show all logs from the nginx app in the last hour
{app="nginx"}
# Filter to error-level logs only
{app="nginx", level="error"}
# Full-text search within the stream - slower but necessary for unindexed fields
{app="nginx"} |= "500 Internal Server Error"
# Parse JSON and filter on an extracted field - find slow requests (>1 second)
{app="nginx"} | json | duration > 1.0
# Per-second error rate over a 5-minute window - a metric query computed from logs
sum(rate({app="nginx", level="error"}[5m])) by (app)
# Parse JSON logs inline and extract a field
{app="payments"} | json | line_format "{{.user_id}} {{.amount}} {{.status}}"
Correlating logs with Prometheus metrics in Grafana
The most powerful feature of the Loki + Prometheus combination is Grafana's Explore split view: left panel shows a Prometheus metric (request rate, error rate, latency), right panel shows Loki logs for the same time range and service. When the metric shows a spike, you can see the exact log lines that correspond to it without switching tools or correlating timestamps manually.
Configure this in Grafana by adding a derived field to your Loki data source that links trace IDs in log lines to your Tempo tracing backend: clicking a trace ID in a log line opens the full distributed trace for that request. The three pillars (metrics, logs, and traces) are now one integrated view.
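A sketch of the data source provisioning file for this, assuming a Tempo data source with UID tempo and log lines that carry a trace_id= field (both are assumptions; adjust the regex to your log format):

# grafana/provisioning/datasources/loki.yml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn "trace_id=<id>" in any log line into a clickable Tempo link
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}' # $$ escapes provisioning's env-var expansion
          datasourceUid: tempo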
Log-based alerting with Loki ruler
Loki's ruler component evaluates LogQL metric queries on a schedule and fires alerts into Alertmanager, the same Alertmanager that handles your Prometheus alerts. This means log-derived alerts flow through the same routing, silencing, and notification pipelines as metric alerts:
# loki-rules.yml - alert when error rate from a service is elevated
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="payments", level="error"}[5m])) by (app)
            /
          sum(rate({app="payments"}[5m])) by (app)
            > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate: {{ $labels.app }}"
          description: "{{ $value | humanizePercentage }} of requests are errors"
      - alert: ServiceDownNoLogs
        # No logs for 5 minutes = service may be down
        expr: |
          absent_over_time({app="payments"}[5m]) == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No logs from payments service for 5+ minutes"
Retention and storage sizing
Loki's retention is controlled by the retention_period setting in limits_config and enforced by the compactor (see the config above); per-stream overrides are possible via retention_stream. For most infrastructure use cases, 31 days covers incident investigation windows while keeping storage manageable. A rough sizing guide: a server emitting typical nginx + application logs produces around 2-5 GB of compressed Loki storage per month. For a 50-host environment at 31-day retention, budget 100-250 GB of storage for Loki's data directory.
For cost-sensitive deployments, use Loki's object storage backend (S3, GCS, or MinIO for self-hosted) for chunks, keeping only the index on local disk. This moves the bulk of storage to cheaper object storage while keeping query performance acceptable for recent logs.
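A sketch of the storage change, assuming an S3 bucket named loki-chunks in us-east-1 (bucket name, region, and credential source are placeholders to adapt):

# loki-config.yml (sketch) - chunks in S3 instead of the local filesystem
common:
  storage:
    s3:
      region: us-east-1
      bucketnames: loki-chunks # Placeholder bucket name
      # Credentials come from the usual AWS env vars or instance profile
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3 # Was: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h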
Loki is part of every 47Network observability stack. The standard deployment is Prometheus + Alertmanager + Grafana + Loki + Promtail + (optionally) Grafana Tempo for traces. This stack runs on a single modest VM for most clients: 4 vCPU and 8 GB RAM handle 50+ hosts comfortably at 31-day log retention. The 47Sentry product includes Loki as the log backend for all monitored services.