
Prometheus alerting that doesn't cry wolf.


The most common failure mode in Prometheus alerting isn't false negatives; it's false positives. Teams set up alerts, get paged constantly for things that resolve themselves, mute the channel, and then miss the real incident. Alert fatigue is the silent killer of on-call rotations, and it's almost always a design problem rather than a tooling problem.

This post covers the alert architecture that 47Network uses across its own platform and deploys for Studio clients: severity tiers with specific routing, inhibition rules that suppress lower-severity alerts when a higher one is firing, for durations that separate transient spikes from sustained issues, and the review process that keeps alert quality high over time.

The problem with default alerting

Most teams start with a list of things that feel worth alerting on: CPU high, memory high, disk high, error rate elevated. These get copy-pasted from runbooks or blog posts, set to reasonable-sounding thresholds, and pointed at a Slack channel. Then the channel starts filling up overnight. Disk usage ticks above 80% on a machine that spikes to 82% every night during backups. A batch job causes a brief CPU spike that fires an alert that resolves itself in 90 seconds. Over time, the channel becomes noise, and the team learns to ignore it.

The fix isn't tighter thresholds; it's a more deliberate architecture.

Severity tiers: be specific about what requires a response

Not all alerts are equal, and routing them identically destroys signal. A useful four-tier model:

P0 - WAKE SOMEONE UP

Service is down or data is at risk

Pages immediately via PagerDuty or phone call. No throttling. Requires acknowledgement within 5 minutes. Examples: health check failing for >2 minutes, certificate expired, database unreachable, Vault sealed.

P1 - NEEDS ATTENTION TODAY

Service degraded but running

Sends to Slack with @here during business hours; PagerDuty after hours with 30-min escalation. Examples: error rate >1% sustained for 5 min, p99 latency >2s, queue depth growing for 10 min.

P2 - REVIEW TOMORROW

Worth knowing, not urgent

Ticket created automatically; Slack message without ping. Examples: disk >75%, certificate expiry in <14 days, sustained high memory (not yet critical), deployment failure on non-prod.

INFO - SITUATIONAL AWARENESS

No action needed

Logged to audit trail, visible in Grafana. Not sent to any human channel. Examples: config reload triggered, deployment started, certificate renewed, autoscaling event.

Every alert rule must specify a severity label. The Alertmanager routing tree uses this label to determine where the alert goes. Nothing without a severity label should route anywhere; make it a linting rule in CI.
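One way such a CI check could look is sketched below. It assumes the rule files have already been parsed into Python dicts (for example with a YAML parser) in the standard groups/rules layout. The function name find_rules_missing_severity and the label values p0, p1, p2 and info are illustrative assumptions, not part of any Prometheus tooling.

```python
# Assumed label values for the four tiers described above.
ALLOWED_SEVERITIES = {"p0", "p1", "p2", "info"}


def find_rules_missing_severity(rule_file: dict) -> list[str]:
    """Return "group/alert" names whose severity label is missing or unknown."""
    problems = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no severity label
            labels = rule.get("labels") or {}
            if labels.get("severity") not in ALLOWED_SEVERITIES:
                problems.append(f"{group.get('name', '?')}/{rule['alert']}")
    return problems


# Hypothetical parsed rule file: one valid alert, one unlabelled alert,
# and one recording rule that the check should skip.
example = {
    "groups": [
        {
            "name": "services",
            "rules": [
                {"alert": "HighCPU", "expr": "cpu_usage_percent > 85",
                 "labels": {"severity": "p1"}},
                {"alert": "DiskFilling", "expr": "disk_used_percent > 75"},
                {"record": "job:request_error_rate:rate5m", "expr": "..."},
            ],
        }
    ]
}

print(find_rules_missing_severity(example))
```

A CI job would fail the build whenever the returned list is non-empty.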

The for duration is your most important parameter

The for field in a Prometheus alert rule specifies how long a condition must be continuously true before the alert fires. This is the single most effective tool against false positives:

# Bad - fires immediately on any spike
alert: HighCPU
expr: cpu_usage_percent > 85
labels:
  severity: p1

# Better - must be sustained for 5 minutes
alert: HighCPU
expr: cpu_usage_percent > 85
for: 5m
labels:
  severity: p1

# Best - averages over a 5-minute window instead of using an instantaneous value
alert: HighCPUSustained
expr: avg_over_time(cpu_usage_percent[5m]) > 80
for: 5m
labels:
  severity: p1
  service: "{{ $labels.service }}"

The combination of avg_over_time (or rate() for counters) with a for duration creates a double buffer: the value must average above the threshold over the rolling window, and it must keep doing so for the full for duration before the alert fires. This is extremely effective at filtering out transient spikes while still catching sustained problems.
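The double buffer can be illustrated with a small simulation. This is a simplified model rather than Prometheus itself: it assumes one sample per evaluation, counts the averaging window in samples, and expresses the for duration as a number of evaluation intervals. The function name fires and the sample data are made up for the example.

```python
def fires(samples, threshold=80.0, window=5, for_steps=5):
    """Return True if a rule like `avg_over_time(x[window]) > threshold`
    with a `for` of `for_steps` evaluation intervals would ever fire."""
    consecutive_true = 0
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        if sum(recent) / len(recent) > threshold:
            consecutive_true += 1
            # The alert is pending from the first true evaluation and fires
            # once the condition has held for `for_steps` further intervals.
            if consecutive_true > for_steps:
                return True
        else:
            consecutive_true = 0  # any false evaluation resets the pending state
    return False


# One sample per minute: a 3-minute spike versus a sustained problem.
spike = [50] * 10 + [100] * 3 + [50] * 10
sustained = [50] * 10 + [95] * 15

print(fires(spike, window=1, for_steps=0))  # instantaneous rule, no `for`
print(fires(spike))                         # averaged and held
print(fires(sustained))                     # averaged and held
```

In this model the instantaneous rule fires on the short spike, while the averaged rule with a hold time ignores the spike and still fires on the sustained load.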

General guidance for for durations by severity:

  • P0: 1–2 minutes. Something genuinely down should be caught quickly.
  • P1: 5–15 minutes. Sustained degradation, not a brief spike.
  • P2: 30–60 minutes. Trends that need attention, not urgent responses.

Inhibition rules: suppress noise when the big thing fires

When a node goes down, Prometheus will fire alerts for every service that was running on it. If you have 12 services on that node, you get 12 alerts instead of 1. Inhibition rules suppress lower-severity alerts when a higher-severity alert is already firing for the same target:

# alertmanager.yml
inhibit_rules:
  # If a node is down (P0), suppress all P1/P2 alerts for that node
  - source_matchers:
      - severity = "p0"
      - alertname = "NodeDown"
    target_matchers:
      - severity =~ "p1|p2"
    equal:
      - node

  # If a service is down (P0), suppress degraded alerts (P1) for same service
  - source_matchers:
      - severity = "p0"
    target_matchers:
      - severity = "p1"
    equal:
      - service

  # If Vault is sealed (critical dependency), suppress all secrets-related alerts
  - source_matchers:
      - alertname = "VaultSealed"
    target_matchers:
      - component = "secrets"

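The equal field is crucial: it specifies which labels must match between the source and target alert for the inhibition to apply. Without it, a P0 on one service would suppress P1 alerts across all services, which is too broad. The Vault rule can omit equal because a sealed Vault affects every alert labelled component = "secrets", whichever service raised it.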
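The effect of equal can be sketched in a few lines. This is a simplified model of the label comparison only, assuming the source and target matchers have already matched. The function name inhibits and the alert dicts are illustrative. It follows Alertmanager's documented behaviour of treating a missing label and an empty label value as the same thing.

```python
def inhibits(source: dict, target: dict, equal: list[str]) -> bool:
    """True if `source` suppresses `target`, given the `equal` label names.

    A label missing from an alert is treated as the empty string, so two
    alerts that both lack a label count as equal on that label.
    """
    return all(source.get(name, "") == target.get(name, "") for name in equal)


service_down = {"alertname": "ServiceDown", "severity": "p0", "service": "api"}
api_latency = {"alertname": "HighLatency", "severity": "p1", "service": "api"}
billing_errors = {"alertname": "HighErrorRate", "severity": "p1", "service": "billing"}

print(inhibits(service_down, api_latency, ["service"]))     # same service
print(inhibits(service_down, billing_errors, ["service"]))  # different service
print(inhibits(service_down, billing_errors, []))           # no `equal` at all
```

With an empty equal list the unrelated billing alert is suppressed as well, which is the over-broad behaviour described above. The missing-label rule also means every alert on both sides should actually carry the labels listed in equal.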

Grouping: aggregate before routing

Alertmanager's grouping configuration determines how alerts are bundled before they're sent. Without grouping, each alert fires a separate notification. With grouping, related alerts are combined into a single message:

# alertmanager.yml route tree
route:
  receiver: "default-silence"
  group_by: [alertname, cluster, service]
  group_wait: 30s      # wait 30s for more alerts to group
  group_interval: 5m   # wait before notifying about new alerts in an existing group
  repeat_interval: 4h  # re-send a still-firing notification after 4h

  routes:
    - matchers:
        - severity = "p0"
      receiver: pagerduty-critical
      group_wait: 10s      # faster grouping for P0
      repeat_interval: 30m # re-notify every 30min if unresolved

    - matchers:
        - severity = "p1"
      receiver: slack-alerts
      group_wait: 1m
      repeat_interval: 2h

    - matchers:
        - severity = "p2"
      receiver: slack-tickets
      group_wait: 5m
      repeat_interval: 24h

group_wait vs group_interval: group_wait is the initial delay before the first notification for a new group; it gives other alerts time to join the group. group_interval is the minimum time between subsequent notifications for an existing group. Keep group_wait short for P0 (10–30s) and longer for lower severities (1–5min).
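The interaction of the two timers can be sketched as a timeline for a single group. This is a simplified model: it ignores repeat_interval and resolved notifications, and it assumes the group is flushed at group_wait after the first alert and then every group_interval, sending only when new alerts have joined. The function name notification_times is made up for the example.

```python
def notification_times(arrivals, group_wait, group_interval):
    """Return the times (in seconds) at which one alert group notifies.

    `arrivals` holds the times at which alerts join the group.
    """
    if not arrivals:
        return []
    arrivals = sorted(arrivals)
    flush = arrivals[0] + group_wait  # first flush waits for stragglers
    already_sent = 0
    times = []
    while already_sent < len(arrivals):
        joined = sum(1 for t in arrivals if t <= flush)
        if joined > already_sent:  # only notify when something new joined
            times.append(flush)
            already_sent = joined
        flush += group_interval
    return times


# Five alerts for the same group, with group_wait=30s and group_interval=5m.
print(notification_times([0, 5, 20, 45, 400], group_wait=30, group_interval=300))
```

In this model the first three alerts share one notification, and the two later alerts each wait for the next group_interval flush, so five alerts produce three messages.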

Writing alert rules that stay accurate

A few patterns that appear in well-maintained alert rule sets:

Use recording rules to pre-compute expensive queries

Alert evaluation runs on a fixed interval (usually 15–60s). Complex queries that are re-evaluated on every cycle, across many rules and dashboards, add up. Record expensive aggregations into a new metric and alert on the recording rule output:

# Recording rule (in rules/recordings.yml)
- record: job:request_error_rate:rate5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
    /
    sum(rate(http_requests_total[5m])) by (job)

# Alert on the pre-computed recording rule
- alert: HighErrorRate
  expr: job:request_error_rate:rate5m > 0.01
  for: 5m
  labels:
    severity: p1
  annotations:
    summary: "{{ $labels.job }} error rate above 1%"
    runbook: "https://runbooks.internal/high-error-rate"

Always include a runbook URL in annotations

When an alert fires at 3am, the on-call engineer needs to know immediately what to look at. A runbook annotation with a URL to the relevant runbook is the single highest-value annotation to include. Pages without runbook links are incomplete alert definitions.

Test alert rules before deploying

promtool includes a test framework for alert rules. Write test cases that verify your alert fires when expected and doesn't fire when it shouldn't:

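# tests/alert_test.yml
rule_files:
  - ../rules/recordings.yml   # defines job:request_error_rate:rate5m
  - ../rules/services.yml     # assumed to define the HighErrorRate alert

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500", job="api"}'
        values: '0+10x10'   # 10 errors/min for 10 minutes
      - series: 'http_requests_total{status="200", job="api"}'
        values: '0+90x10'   # 90 ok/min for 10 minutes

    alert_rule_test:
      # Leave margin past the 5m for duration; 6m sits exactly on the boundary
      - eval_time: 8m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: p1
              job: api
            # promtool compares annotations as well as labels
            exp_annotations:
              summary: "api error rate above 1%"
              runbook: "https://runbooks.internal/high-error-rate"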

Alert hygiene: ongoing maintenance

Alert rules decay. A threshold that was correct 6 months ago may be too tight or too loose today. Treating alert maintenance as a regular practice prevents quality decay:

  • Weekly review of muted alerts. If an alert has been muted for more than a week, either fix the underlying condition or delete the alert. Chronic mutes are failed alerts.
  • Post-incident alert audit. After every P0, review whether existing alerts would have caught the issue earlier. If not, add an alert that closes the coverage gap.
  • False positive rate tracking. Track what percentage of pages required no action (auto-resolved or "nothing to do"). Target under 10% for P0, under 20% for P1.
  • Annual review of P2 thresholds. P2s that never actually escalate to P1 are either correctly filtered or capturing issues that never materialize; work out which.
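The false positive tracking in the list above can be computed from a simple export of pages. The sketch below assumes each page is recorded as a (severity, action_required) pair, which is a made-up format; the function names and sample data are illustrative, and the targets are the ones stated above.

```python
from collections import defaultdict

# Targets from the list above: under 10% for P0, under 20% for P1.
TARGETS = {"p0": 0.10, "p1": 0.20}


def false_positive_rates(pages):
    """Map each severity to the fraction of its pages that needed no action."""
    totals = defaultdict(int)
    no_action = defaultdict(int)
    for severity, action_required in pages:
        totals[severity] += 1
        if not action_required:
            no_action[severity] += 1
    return {sev: no_action[sev] / totals[sev] for sev in totals}


def severities_over_target(rates):
    """Return the severities whose rate is not under its target."""
    return sorted(sev for sev, rate in rates.items()
                  if sev in TARGETS and rate >= TARGETS[sev])


# Hypothetical month of pages: 20 P0s (1 needed no action), 4 P1s (1 needed no action).
pages = ([("p0", True)] * 19 + [("p0", False)]
         + [("p1", True)] * 3 + [("p1", False)])

rates = false_positive_rates(pages)
print(rates)
print(severities_over_target(rates))
```

Any severity reported by severities_over_target is a candidate for the threshold and for duration review described in this section.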

The test: Could a new team member, woken at 3am by this alert, understand what it means and know what action to take in under 2 minutes? If not, the alert needs a better name, clearer annotations, or a runbook link.

47Sentry integration

47Sentry exports eBPF-observed network metrics to Prometheus: connection rates, blocked traffic, DNS resolution latency, and observed topology changes. These become first-class alert signals alongside application metrics. The same severity framework and inhibition rules apply: a 47Sentry alert that the firewall is dropping >100 requests/second to an internal service inhibits the resulting application P1 latency alerts (the cause is known; the symptoms don't need separate pages).

