The most common failure mode in Prometheus alerting isn't false negatives; it's false positives. Teams set up alerts, get paged constantly for things that resolve themselves, mute the channel, and then miss the real incident. Alert fatigue is the silent killer of on-call rotations, and it's almost always a design problem rather than a tooling problem.
This post covers the alert architecture that 47Network uses across its own platform and deploys for Studio clients: severity tiers with specific routing, inhibition rules that suppress lower-severity alerts when a higher one is firing, for durations that separate transient spikes from sustained issues, and the review process that keeps alert quality high over time.
The problem with default alerting
Most teams start with a list of things that feel worth alerting on: CPU high, memory high, disk high, error rate elevated. These get copy-pasted from runbooks or blog posts, set to reasonable-sounding thresholds, and pointed at a Slack channel. Then the channel starts filling up overnight. Disk usage ticks above 80% on a machine that spikes to 82% every night during backups. A batch job causes a brief CPU spike that fires an alert that resolves itself in 90 seconds. Over time, the channel becomes noise, and the team learns to ignore it.
The fix isn't tighter thresholds; it's a more deliberate architecture.
Severity tiers: be specific about what requires a response
Not all alerts are equal, and routing them identically destroys signal. A useful four-tier model:
P0: Service is down or data is at risk
Pages immediately via PagerDuty or phone call. No throttling. Requires acknowledgement within 5 minutes. Examples: health check failing for >2 minutes, certificate expired, database unreachable, Vault sealed.
P1: Service degraded but running
Sends to Slack with @here during business hours; PagerDuty after hours with 30-min escalation. Examples: error rate >1% sustained for 5 min, p99 latency >2s, queue depth growing for 10 min.
P2: Worth knowing, not urgent
Ticket created automatically; Slack message without ping. Examples: disk >75%, certificate expiry in <14 days, sustained high memory (not yet critical), deployment failure on non-prod.
P3: No action needed
Logged to audit trail, visible in Grafana. Not sent to any human channel. Examples: config reload triggered, deployment started, certificate renewed, autoscaling event.
Every alert rule must specify a severity label. The Alertmanager routing tree uses this label to determine where the alert goes. Nothing without a severity label should route anywhere โ make it a linting rule in CI.
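Such a lint is a few lines once rule files are parsed. A minimal sketch, assuming the rules have already been loaded from YAML into dicts (the loader, and the p3 label for the lowest tier, are assumptions; adapt to your rule layout):

```python
# CI lint: reject any alert rule without a valid severity label.
# The p0-p3 label set mirrors the four-tier model described above.
VALID_SEVERITIES = {"p0", "p1", "p2", "p3"}

def lint_rules(rules):
    """Return a list of problems; an empty list means the rules pass."""
    problems = []
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        severity = rule.get("labels", {}).get("severity")
        if severity not in VALID_SEVERITIES:
            problems.append(f"{name}: missing or invalid severity label")
    return problems
```

Fail the CI job whenever the returned list is non-empty, and no unlabeled rule can reach production.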
The for duration is your most important parameter
The for field in a Prometheus alert rule specifies how long a condition must be continuously true before the alert fires. This is the single most effective tool against false positives:
# Bad - fires immediately on any spike
- alert: HighCPU
  expr: cpu_usage_percent > 85
  labels:
    severity: p1

# Better - must be sustained for 5 minutes
- alert: HighCPU
  expr: cpu_usage_percent > 85
  for: 5m
  labels:
    severity: p1

# Best - averages over a rolling window instead of an instantaneous value
- alert: HighCPUSustained
  expr: avg_over_time(cpu_usage_percent[5m]) > 80
  for: 5m
  labels:
    severity: p1
    service: "{{ $labels.service }}"
The combination of avg_over_time (or rate() for counters) with a for duration creates a double buffer: the value must average above the threshold for the rolling window and must continue to do so for the for duration before the alert fires. This is extremely effective at filtering out transient spikes while still catching sustained problems.
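To make the double-buffer effect concrete, here is a small simulation, with one hypothetical sample per evaluation interval, of a rolling average combined with a for-style requirement of consecutive above-threshold evaluations:

```python
from collections import deque

def alert_fires(samples, window, threshold, for_samples):
    # The rolling average over `window` samples must exceed `threshold`
    # for `for_samples` consecutive evaluations before the alert fires.
    buf = deque(maxlen=window)
    consecutive = 0
    for value in samples:
        buf.append(value)
        if len(buf) == window and sum(buf) / window > threshold:
            consecutive += 1
            if consecutive >= for_samples:
                return True
        else:
            consecutive = 0
    return False

spike = [30] * 10 + [95] * 2 + [30] * 20  # brief 2-sample spike
sustained = [30] * 5 + [90] * 15          # genuinely sustained load
```

With window=5, threshold=80, for_samples=5, the brief spike never fires (the two 95s barely move the 5-sample average) while the sustained load does.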
General guidance for for durations by severity:
- P0: 1-2 minutes. Something genuinely down should be caught quickly.
- P1: 5-15 minutes. Sustained degradation, not a brief spike.
- P2: 30-60 minutes. Trends that need attention, not urgent responses.
Inhibition rules: suppress noise when the big thing fires
When a node goes down, Prometheus will fire alerts for every service that was running on it. If you have 12 services on that node, you get 12 alerts instead of 1. Inhibition rules suppress lower-severity alerts when a higher-severity alert is already firing for the same target:
# alertmanager.yml
inhibit_rules:
  # If a node is down (P0), suppress all P1/P2 alerts for that node
  - source_matchers:
      - severity = "p0"
      - alertname = "NodeDown"
    target_matchers:
      - severity =~ "p1|p2"
    equal:
      - node

  # If a service is down (P0), suppress degraded alerts (P1) for same service
  - source_matchers:
      - severity = "p0"
    target_matchers:
      - severity = "p1"
    equal:
      - service

  # If Vault is sealed (critical dependency), suppress all secrets-related alerts
  - source_matchers:
      - alertname = "VaultSealed"
    target_matchers:
      - component = "secrets"
The equal field is crucial: it specifies which labels must match between the source and target alert for the inhibition to apply. Without it, a P0 on one service would suppress P1 alerts across all services, which is too broad.
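The matching logic can be sketched roughly as follows. This is a simplification (real Alertmanager matchers also support negation and exact-match operators), and the label values are hypothetical; matcher values here are treated as full-match regexes so that "p1|p2" works:

```python
import re

def is_inhibited(target, firing_alerts, rule):
    """True if `target` is suppressed by some firing source alert under `rule`."""
    def matches(labels, matchers):
        # Every matcher's regex must fully match the corresponding label value.
        return all(re.fullmatch(pat, labels.get(k, "")) for k, pat in matchers.items())

    if not matches(target["labels"], rule["target_matchers"]):
        return False
    for source in firing_alerts:
        # Source must match, and all `equal` labels must agree between the two.
        if matches(source["labels"], rule["source_matchers"]) and all(
            source["labels"].get(l) == target["labels"].get(l)
            for l in rule["equal"]
        ):
            return True
    return False
```

With a NodeDown P0 firing for node=web-3, a P1 alert on web-3 is suppressed while an identical P1 on web-7 still pages, which is exactly the behavior the equal field buys you.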
Grouping: aggregate before routing
Alertmanager's grouping configuration determines how alerts are bundled before they're sent. Without grouping, each alert fires a separate notification. With grouping, related alerts are combined into a single message:
# alertmanager.yml route tree
route:
  receiver: "default-silence"
  group_by: [alertname, cluster, service]
  group_wait: 30s       # wait 30s for more alerts to join the group
  group_interval: 5m    # how often to send updates for an existing group
  repeat_interval: 4h   # re-notify at most every 4h while still firing
  routes:
    - matchers:
        - severity = "p0"
      receiver: pagerduty-critical
      group_wait: 10s         # faster grouping for P0
      repeat_interval: 30m    # re-notify every 30 min if unresolved
    - matchers:
        - severity = "p1"
      receiver: slack-alerts
      group_wait: 1m
      repeat_interval: 2h
    - matchers:
        - severity = "p2"
      receiver: slack-tickets
      group_wait: 5m
      repeat_interval: 24h
group_wait vs group_interval: group_wait is the initial delay before the first notification for a new group; it waits for other alerts to join the group. group_interval is the minimum time between subsequent notifications for an existing group. Keep group_wait short for P0 (10-30s) and longer for lower severities (1-5 min).
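The grouping step itself is simple to picture: alerts are bucketed by the tuple of their group_by label values, and each bucket becomes a single notification. A sketch with hypothetical alerts:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    # One bucket per distinct tuple of group_by label values;
    # each bucket is delivered as one combined notification.
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "service": "api"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "service": "api"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "service": "billing"}},
]
grouped = group_alerts(alerts, ["alertname", "cluster", "service"])
# Three firing alerts collapse into two notifications.
```

Missing labels collapse to an empty string here; Alertmanager treats an absent label similarly as a distinct (empty) value for grouping purposes.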
Writing alert rules that stay accurate
A few patterns that appear in well-maintained alert rule sets:
Use recording rules to pre-compute expensive queries
Alert evaluation runs on a fixed interval (usually 15-60s). A complex query evaluated every cycle, multiplied across dozens of rules, adds up. Record expensive aggregations into a new metric and alert on the recording rule output:
# Recording rule (in rules/recordings.yml)
- record: job:request_error_rate:rate5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
    /
    sum(rate(http_requests_total[5m])) by (job)

# Alert on the pre-computed recording rule
- alert: HighErrorRate
  expr: job:request_error_rate:rate5m > 0.01
  for: 5m
  labels:
    severity: p1
  annotations:
    summary: "{{ $labels.job }} error rate above 1%"
    runbook: "https://runbooks.internal/high-error-rate"
Always include a runbook URL in annotations
When an alert fires at 3am, the on-call engineer needs to know immediately what to look at. A runbook annotation with a URL to the relevant runbook is the single highest-value annotation to include. Pages without runbook links are incomplete alert definitions.
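This is easy to enforce in the same CI lint that checks severity labels. A sketch, again assuming rules already parsed into dicts (the p0/p1 paging set comes from the tier model above; the runbook annotation key matches the examples in this post):

```python
def check_runbook(rule):
    # Rules that page a human (P0/P1) must link a runbook URL.
    # Lower severities go to tickets/audit trails, so they pass unconditionally.
    if rule.get("labels", {}).get("severity") in {"p0", "p1"}:
        runbook = rule.get("annotations", {}).get("runbook", "")
        return runbook.startswith("http")
    return True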
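This is easy to enforce in the same CI lint that checks severity labels. A sketch, again assuming rules already parsed into dicts (the p0/p1 paging set comes from the tier model above; the runbook annotation key matches the examples in this post):

```python
def check_runbook(rule):
    # Rules that page a human (P0/P1) must link a runbook URL.
    # Lower severities go to tickets/audit trails, so they pass unconditionally.
    if rule.get("labels", {}).get("severity") in {"p0", "p1"}:
        runbook = rule.get("annotations", {}).get("runbook", "")
        return runbook.startswith("http")
    return True
```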
Test alert rules before deploying
promtool includes a test framework for alert rules. Write test cases that verify your alert fires when expected and doesn't fire when it shouldn't, and run them in CI with promtool test rules tests/alert_test.yml:
# tests/alert_test.yml
rule_files:
  - ../rules/recordings.yml   # the recording rule the alert depends on
  - ../rules/services.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500", job="api"}'
        values: '0+10x10'  # 10 errors/min for 10 minutes
      - series: 'http_requests_total{status="200", job="api"}'
        values: '0+90x10'  # 90 ok/min for 10 minutes
    alert_rule_test:
      - eval_time: 6m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: p1
              job: api
Alert hygiene: ongoing maintenance
Alert rules decay. A threshold that was correct 6 months ago may be too tight or too loose today. Treating alert maintenance as a regular practice prevents quality decay:
- Weekly review of muted alerts. If an alert has been muted for more than a week, either fix the underlying condition or delete the alert. Chronic mutes are failed alerts.
- Post-incident alert audit. After every P0, review whether existing alerts would have caught the issue earlier. Add a coverage gap alert if not.
- False positive rate tracking. Track what percentage of pages required no action (auto-resolved or "nothing to do"). Target under 10% for P0, under 20% for P1.
- Annually review all P2 thresholds. P2s that never actually escalate to P1 are either correctly filtered or capturing issues that never materialize; review which.
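The false-positive numbers can come from any page log that records an outcome per page. A sketch of the calculation (the record shape and the "no_action" outcome value are assumptions; map them onto whatever your paging tool exports):

```python
def false_positive_rate(pages):
    """pages: list of {"severity": ..., "outcome": ...} records.
    An outcome of "no_action" covers auto-resolved and nothing-to-do pages."""
    rates = {}
    for severity in sorted({p["severity"] for p in pages}):
        matching = [p for p in pages if p["severity"] == severity]
        noise = sum(1 for p in matching if p["outcome"] == "no_action")
        rates[severity] = noise / len(matching)
    return rates
```

Compare the per-severity rates against the targets above (under 10% for P0, under 20% for P1) in the weekly review.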
The test: Could a new team member, woken at 3am by this alert, understand what it means and know what action to take in under 2 minutes? If not, the alert needs a better name, clearer annotations, or a runbook link.
47Sentry integration
47Sentry exports eBPF-observed network metrics to Prometheus โ connection rates, blocked traffic, DNS resolution latency, observed topology changes. These become first-class alert signals alongside application metrics. The same severity framework and inhibition rules apply: a 47Sentry alert that the firewall is dropping >100 requests/second to an internal service inhibits the resulting application P1 latency alerts (the cause is known; the symptoms don't need separate pages).