Ugrás a tartalomhoz
← Back to the journal

NIP — on-call: 5-person rotation, pager budget, runbook discipline

Five engineers, primary+secondary in 1-week shifts, pager budget of 3/week, runbook on every alert. Median 0.7 pages/week last quarter.

On-call at NIP

NIP is kept alive by a 5-person on-call rotation. On-call isn't a gadget — people live with it or don't. Here is how we structure it: who's in, with what budget, what severity tiers, and what runbook discipline.

The rotation: primary + secondary, 1-week shift

Five engineers, paired up: primary (first responder) and secondary (escalation point after 15 minutes, plus daytime review of the primary's work). One-week shift, handover on Monday.

  • Primary: once a week, about every 7 weeks.
  • Secondary: once a week, in a different week, about every 5 weeks.

That means an engineer is primary on-call roughly once a month. Most of the year, you are not on-call.

The pager budget

Our key metric: pages per week.

  • Target: ≤3 pages/week for primary on-call.
  • Retro trigger: ≥5 pages/week — automatic root-cause meeting the following Monday.
  • Sev3+ pages on Saturday/Sunday: any count → retro (weekend sleep beats speed-of-fix).

The budget isn't punishment — it's a signal that the infrastructure is noisy. If the count is regularly 5+, something architectural has to change (better alert thresholds, deduplication, or splitting one alert into separate ones).

Actual numbers from last quarter: median 0.7 pages/week, max 4 pages/week (once, after a DB upgrade), 9 of 13 weeks with zero pages.

Severity tiers

Three tiers:

  • Sev1 (Critical) — production-affecting outage, customer writing or calling. Example: nortinia.com returning 500s. Response: within 5 minutes. Loud pager, secondary paged at the same time.
  • Sev2 (Major) — partial degradation, or a fault in a non-customer-facing service. Example: NIP UI slow, deploys still working. Response: within 30 minutes. Vibrating pager, secondary escalated after 15 minutes.
  • Sev3 / FYI — informational, not necessarily on-call work. Example: "Postgres replication lag 2 minutes", "backup retention at 80%". Response: next business day. Arrives as a Slack message, not a page.

The pager budget only counts Sev1-2. Sev3 can fire all it wants — it doesn't wake anyone up.

Runbook discipline

The primary rule: every alert links to a runbook. No exceptions.

If an alert would fire but no runbook exists, then:

  1. The alert cannot page. Slack-only.
  2. The on-call engineer writes the runbook within 1 week, or:
  3. The alert is deleted.

The logic: if it's not important enough that somebody bothered to write down what to do, it's not important enough to wake somebody up over. This rule landed in late 2025; since then the pager budget median has dropped from 2.1 to 0.7.

The runbook format:

# Alert: <alert name>

## What it means
<1-2 sentences>

## First diagnostic command
<one kubectl/sql/curl command, copy-pasteable>

## Typical causes (top 3)
1. ...
2. ...
3. ...

## Escalation
<when to wake the secondary, when to wake the CTO>

Useful? Not always. But the Pareto 80 holds: on-call reads the first diagnostic command, compares the output against the top 3 causes, and most of the time finds the fix.

The deleted alerts

Over one year we deleted 23 alerts. Typical reasons:

  • Metric wasn't reliable enough (false-positive rate >10%).
  • An automatic heal already handles it (e.g. pod restart).
  • Not actionable during on-call hours ("look during business hours").

Deleting alerts is its own discipline: adding a new alert is easy, but on-call burns out when everything is always on.

On-call compensation

  • 5% fixed shift bonus (independent of page count).
  • +50 EUR per Sev1 page (regardless of on-call status).
  • Weekend multiplier: 1.5x on everything.

This isn't meant to seduce, but it matters that on-call isn't unpaid labour.

What we learned

The single most important rule: runbooks, not heroes. An on-call engineer should not be working from tribal knowledge. If something isn't documented, somebody else will fail to handle it tomorrow. That rule overrides all the others.

Let's talk about your project

Tell us what you are building — we will figure out how to help.