On-call at NIP
NIP is kept alive by a 5-person on-call rotation. On-call is not a tool you switch on; it is something people either manage to live with or don't. Here is how we structure it: who is in the rotation, what the pager budget is, which severity tiers we use, and what runbook discipline we enforce.
The rotation: primary + secondary, 1-week shift
Five engineers, paired up: primary (first responder) and secondary (escalation point after 15 minutes, plus daytime review of the primary's work). One-week shift, handover on Monday.
- Primary: one week at a time, about every 5 weeks.
- Secondary: one week at a time, in a different week, about every 5 weeks.
That means an engineer is primary on-call roughly once a month. Most of the year, you are not on-call.
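The rotation can be sketched as a plain round-robin. This is a minimal illustration, not our actual scheduler: the roster names and the `SECONDARY_OFFSET` value are assumptions, chosen only so that nobody is primary and secondary in the same week.

```python
from datetime import date, timedelta

# Hypothetical roster; the text only says there are five engineers.
ENGINEERS = ["alice", "bob", "carol", "dave", "erin"]

# Assumption: the secondary is the engineer two slots ahead in the roster.
# Any offset that is not a multiple of the roster size keeps the two roles
# on different people in the same week.
SECONDARY_OFFSET = 2


def shift_for_week(week_index: int) -> tuple[str, str]:
    """Return (primary, secondary) for a zero-based week index."""
    n = len(ENGINEERS)
    primary = ENGINEERS[week_index % n]
    secondary = ENGINEERS[(week_index + SECONDARY_OFFSET) % n]
    return primary, secondary


def handover_monday(rotation_start: date, week_index: int) -> date:
    """Return the Monday on which the given shift starts."""
    if rotation_start.weekday() != 0:  # 0 is Monday
        raise ValueError("rotation_start must be a Monday")
    return rotation_start + timedelta(weeks=week_index)
```

With five names in the roster, each engineer comes up as primary every fifth week and as secondary every fifth week, in a different week.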
The pager budget
Our key metric: pages per week.
- Target: ≤3 pages/week for primary on-call.
- Retro trigger: ≥5 pages/week — automatic root-cause meeting the following Monday.
- Pages on Saturday/Sunday, at any severity: any count → retro (weekend sleep beats speed-of-fix).
The budget isn't punishment — it's a signal that the infrastructure is noisy. If the count is regularly 5+, something architectural has to change (better alert thresholds, deduplication, or splitting one alert into separate ones).
Actual numbers from last quarter: mean 0.7 pages/week, max 4 pages/week (once, after a DB upgrade), 9 of 13 weeks with zero pages.
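The budget rules can be expressed as a small weekly check. The sketch below is illustrative only: the `Page` type and `evaluate_week` function are hypothetical names, timestamps are assumed to be in the on-call engineer's local time, and the weekend rule is read as "any counted page on a Saturday or Sunday triggers a retro".

```python
from dataclasses import dataclass
from datetime import datetime

TARGET_PAGES_PER_WEEK = 3  # from the text: target is at most 3 pages/week
RETRO_TRIGGER_PAGES = 5    # from the text: 5 or more pages triggers a retro


@dataclass(frozen=True)
class Page:
    fired_at: datetime  # assumed local time of the on-call engineer
    severity: int       # 1 or 2; Sev3 is Slack-only and is not counted


def evaluate_week(pages: list[Page]) -> dict:
    """Summarise one primary shift against the pager budget."""
    counted = [p for p in pages if p.severity in (1, 2)]
    # weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = [p for p in counted if p.fired_at.weekday() >= 5]
    count = len(counted)
    return {
        "pages": count,
        "within_target": count <= TARGET_PAGES_PER_WEEK,
        "retro_required": count >= RETRO_TRIGGER_PAGES or bool(weekend),
    }
```

A week with four weekday pages is over target but does not yet force a retro; a single Saturday page does.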
Severity tiers
Three tiers:
- Sev1 (Critical): production-affecting outage, customer writing or calling. Example: nortinia.com returning 500s. Response: within 5 minutes. Loud pager, secondary paged at the same time.
- Sev2 (Major): partial degradation, or a fault in a non-customer-facing service. Example: NIP UI slow, deploys still working. Response: within 30 minutes. Vibrating pager, secondary escalated after 15 minutes.
- Sev3 / FYI: informational, not necessarily on-call work. Example: "Postgres replication lag 2 minutes", "backup retention at 80%". Response: next business day. Arrives as a Slack message, not a page.
The pager budget only counts Sev1-2. Sev3 can fire all it wants — it doesn't wake anyone up.
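The three tiers map naturally onto a lookup table. The following is a sketch under stated assumptions: `SeverityPolicy`, `route`, and the channel strings are hypothetical names, while the response times and escalation delays are the ones listed above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    response_within: Optional[timedelta]  # None means "next business day"
    channel: str                          # hypothetical channel identifier
    secondary_after: Optional[timedelta]  # None means the secondary is not paged
    counts_toward_budget: bool


POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=5), "pager-loud",
                      timedelta(0), True),
    2: SeverityPolicy("Major", timedelta(minutes=30), "pager-vibrate",
                      timedelta(minutes=15), True),
    3: SeverityPolicy("FYI", None, "slack", None, False),
}


def route(severity: int) -> SeverityPolicy:
    """Look up how an alert of the given severity should be delivered."""
    try:
        return POLICIES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity}") from None
```

Keeping `counts_toward_budget` next to the delivery channel makes it hard for a Sev3 alert to end up in the pager budget by accident.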
Runbook discipline
The primary rule: every alert links to a runbook. No exceptions.
If an alert would fire but no runbook exists, then:
- The alert cannot page. Slack-only.
- The on-call engineer writes the runbook within 1 week, or:
- The alert is deleted.
The logic: if an alert is not important enough for somebody to have written down what to do, it is not important enough to wake somebody up over. This rule landed in late 2025; since then the mean weekly page count has dropped from 2.1 to 0.7.
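The no-runbook rule can be sketched as a gate in front of the pager. This is an illustration, not our alerting config: the `Alert` fields and the `alert_action` function are hypothetical, and the one-week deadline is assumed to start on the day the alert is first seen without a runbook.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RUNBOOK_GRACE = timedelta(weeks=1)  # from the text: runbook within 1 week


@dataclass(frozen=True)
class Alert:
    name: str
    runbook_url: Optional[str]
    # Assumption: we record the day the alert was first seen without a runbook.
    first_seen_without_runbook: Optional[date] = None


def alert_action(alert: Alert, today: date) -> str:
    """Return 'may_page', 'slack_only', or 'delete' for an alert."""
    if alert.runbook_url:
        return "may_page"  # severity still decides whether it actually pages
    if alert.first_seen_without_runbook is None:
        return "slack_only"  # the clock has not started yet
    if today - alert.first_seen_without_runbook > RUNBOOK_GRACE:
        return "delete"
    return "slack_only"
```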
The runbook format:
# Alert: <alert name>
## What it means
<1-2 sentences>
## First diagnostic command
<one kubectl/sql/curl command, copy-pasteable>
## Typical causes (top 3)
1. ...
2. ...
3. ...
## Escalation
<when to wake the secondary, when to wake the CTO>
Is the format always useful? No. But the 80/20 rule holds: the on-call engineer runs the first diagnostic command, compares the output against the top 3 causes, and most of the time finds the fix.
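A fixed format is also easy to check mechanically. The sketch below assumes runbooks are stored as text files using exactly the headings from the template; the function names are hypothetical.

```python
import re

# Section headings taken from the runbook template.
REQUIRED_SECTIONS = [
    "What it means",
    "First diagnostic command",
    "Typical causes (top 3)",
    "Escalation",
]


def has_alert_title(runbook_text: str) -> bool:
    """True if the runbook has a '# Alert: <name>' title line."""
    return re.search(r"^#[ \t]+Alert:[ \t]*\S", runbook_text, re.MULTILINE) is not None


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required '## ' sections that the runbook lacks."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##[ \t]+(.+)$", runbook_text, re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]
```

A check like this could run in CI so that a runbook missing its escalation section never gets linked from a paging alert.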
The deleted alerts
Over one year we deleted 23 alerts. Typical reasons:
- Metric wasn't reliable enough (false-positive rate >10%).
- An automatic heal already handles it (e.g. pod restart).
- Not actionable during on-call hours ("look during business hours").
Deleting alerts is its own discipline: adding a new alert is easy, but on-call burns out when everything is always on.
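The three deletion reasons can be turned into a periodic review. This is a sketch only: the `AlertStats` fields are hypothetical, and how false positives, auto-heals, and actionability get recorded is left open. The 10% threshold is the one given above.

```python
from dataclasses import dataclass

MAX_FALSE_POSITIVE_RATE = 0.10  # from the text: false-positive rate >10%


@dataclass(frozen=True)
class AlertStats:
    name: str
    fired: int                # how often the alert fired in the review period
    false_positives: int      # firings judged to be false positives
    auto_healed: bool         # an automatic heal (e.g. pod restart) handles it
    actionable_on_call: bool  # False if it can wait for business hours


def deletion_reasons(stats: AlertStats) -> list[str]:
    """Return the reasons, if any, why an alert is a deletion candidate."""
    reasons = []
    if stats.fired > 0 and stats.false_positives / stats.fired > MAX_FALSE_POSITIVE_RATE:
        reasons.append("unreliable metric")
    if stats.auto_healed:
        reasons.append("handled by automatic heal")
    if not stats.actionable_on_call:
        reasons.append("not actionable during on-call hours")
    return reasons
```

An empty list means the alert stays; anything else goes on the agenda for the next review.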
On-call compensation
- 5% fixed shift bonus (independent of page count).
- +50 EUR per Sev1 page (regardless of on-call status).
- Weekend multiplier: 1.5x on everything.
The point is not to lure people into on-call, but it matters that on-call isn't unpaid labour.
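As a worked illustration, the three rules can be combined into one weekly figure. The text does not say what the 5% is a percentage of or how the weekend multiplier is applied, so the sketch below makes two explicit assumptions: the shift bonus is 5% of weekly base pay and accrues evenly over seven days, and the 1.5x multiplier applies to the two weekend days of the shift bonus and to Sev1 pages that fire on a weekend. The function name is hypothetical.

```python
SHIFT_BONUS_RATE = 0.05     # from the text: 5% fixed shift bonus
SEV1_PAGE_BONUS_EUR = 50.0  # from the text: +50 EUR per Sev1 page
WEEKEND_MULTIPLIER = 1.5    # from the text: 1.5x on weekends


def weekly_on_call_pay(weekly_base_eur: float,
                       sev1_weekday_pages: int,
                       sev1_weekend_pages: int) -> float:
    """Estimate on-call compensation in EUR for a one-week shift."""
    # Assumption: 5% of weekly base pay, accruing evenly over 7 days.
    daily_bonus = weekly_base_eur * SHIFT_BONUS_RATE / 7
    # Assumption: the weekend multiplier covers Saturday and Sunday.
    shift_bonus = daily_bonus * 5 + daily_bonus * 2 * WEEKEND_MULTIPLIER
    page_bonus = SEV1_PAGE_BONUS_EUR * (
        sev1_weekday_pages + sev1_weekend_pages * WEEKEND_MULTIPLIER
    )
    return round(shift_bonus + page_bonus, 2)
```

The shift bonus does not depend on the page count, so a quiet week still pays; only the Sev1 component varies.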
What we learned
The single most important rule: runbooks, not heroes. An on-call engineer should not be working from tribal knowledge. If something isn't documented, somebody else will fail to handle it tomorrow. That rule overrides all the others.