nip-platform · 4 May 2026 · EN

NIP — on-call: 5-person rotation, pager budget, runbook discipline

Five engineers, primary + secondary in 1-week shifts, pager budget of 3/week, runbook on every alert. Average 0.7 pages/week last quarter.

On-call at NIP

NIP is kept alive by a 5-person on-call rotation. On-call isn't a tool you install: it is something people either can live with or can't. Here is how we structure it: who is in the rotation, what the pager budget is, which severity tiers we use, and what runbook discipline we enforce.

The rotation: primary + secondary, 1-week shift

Five engineers, paired up: primary (first responder) and secondary (escalation point after 15 minutes, plus daytime review of the primary's work). One-week shift, handover on Monday.

Primary: one full week, about every 5 weeks.
Secondary: one full week, in a different week, about every 5 weeks.

That means an engineer is primary on-call roughly once a month. Most of the year, you are not on-call.
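The cadence above can be sketched as a simple round-robin. This is a minimal illustration, not our actual scheduling tool: the roster names are placeholders, and the rule that this week's secondary becomes next week's primary is an assumption made so that nobody holds both roles in the same week.

```python
from datetime import date, timedelta

# Hypothetical roster; the names are placeholders, not the real team.
ENGINEERS = ["ana", "ben", "chloe", "dev", "eli"]


def monday_on_or_before(day: date) -> date:
    """Return the Monday that starts the week containing `day`."""
    return day - timedelta(days=day.weekday())


def schedule(start: date, weeks: int, roster=ENGINEERS):
    """Yield (handover_monday, primary, secondary) for each weekly shift.

    Assumption: the secondary is the next person in the list, so this
    week's secondary becomes next week's primary.
    """
    first_monday = monday_on_or_before(start)
    n = len(roster)
    for week in range(weeks):
        handover = first_monday + timedelta(weeks=week)
        primary = roster[week % n]
        secondary = roster[(week + 1) % n]
        yield handover, primary, secondary


if __name__ == "__main__":
    for handover, primary, secondary in schedule(date(2026, 5, 4), 6):
        print(handover.isoformat(), "primary:", primary, "secondary:", secondary)
```

With five people in the list, each engineer comes up as primary every fifth week and as secondary in a different week of the same five-week cycle.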

Our key metric: pages per week.

Target: ≤3 pages/week for primary on-call.
Retro trigger: ≥5 pages/week — automatic root-cause meeting the following Monday.
Sev1-2 pages on Saturday/Sunday: any count → retro (weekend sleep beats speed-of-fix).

The budget isn't punishment — it's a signal that the infrastructure is noisy. If the count is regularly 5+, something architectural has to change (better alert thresholds, deduplication, or splitting one alert into separate ones).
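The budget rules are easy to check mechanically at the end of a shift. The sketch below is one way to do it, under stated assumptions: the `Page` record and the `review_week` function are hypothetical names, timestamps are taken to be in the on-call engineer's local time, and any page that lands on a Saturday or Sunday is treated as a retro trigger.

```python
from dataclasses import dataclass
from datetime import datetime

TARGET_PAGES_PER_WEEK = 3   # budget from the text
RETRO_THRESHOLD = 5         # retro trigger from the text


@dataclass
class Page:
    severity: int        # 1 or 2; Sev3 goes to Slack and never pages
    fired_at: datetime   # assumed to be in the on-call engineer's local time


def review_week(pages: list[Page]) -> dict:
    """Classify one primary shift against the pager budget."""
    counted = [p for p in pages if p.severity in (1, 2)]
    weekend = [p for p in counted if p.fired_at.weekday() >= 5]  # Sat=5, Sun=6
    total = len(counted)
    return {
        "pages": total,
        "within_target": total <= TARGET_PAGES_PER_WEEK,
        "retro_required": total >= RETRO_THRESHOLD or bool(weekend),
        "weekend_pages": len(weekend),
    }
```

A week can be within target and still require a retro: a single Sev1 page at 03:00 on a Saturday is enough.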

Actual numbers from last quarter: average 0.7 pages/week, max 4 pages/week (once, after a DB upgrade), 9 of 13 weeks with zero pages.

Severity tiers

Three tiers:

Sev1 (Critical): production-affecting outage, customers are writing in or calling. Example: nortinia.com returning 500s. Response: within 5 minutes. Loud pager, secondary paged at the same time.
Sev2 (Major) — partial degradation, or a fault in a non-customer-facing service. Example: NIP UI slow, deploys still working. Response: within 30 minutes. Vibrating pager, secondary escalated after 15 minutes.
Sev3 / FYI — informational, not necessarily on-call work. Example: "Postgres replication lag 2 minutes", "backup retention at 80%". Response: next business day. Arrives as a Slack message, not a page.

The pager budget only counts Sev1-2. Sev3 can fire all it wants — it doesn't wake anyone up.
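The three tiers map naturally onto a small routing table. The following is an illustrative sketch only: the `Tier` fields and channel names such as `pager-loud` are made up for the example and do not describe any particular paging product's configuration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    channel: str                               # hypothetical channel label
    respond_within: Optional[timedelta]        # None = next business day
    page_secondary_after: Optional[timedelta]  # None = secondary is never paged
    counts_toward_budget: bool


TIERS = {
    # Sev1: loud pager, secondary paged immediately (zero delay).
    1: Tier("Sev1", "pager-loud", timedelta(minutes=5), timedelta(0), True),
    # Sev2: vibrating pager, secondary escalated after 15 minutes.
    2: Tier("Sev2", "pager-vibrate", timedelta(minutes=30), timedelta(minutes=15), True),
    # Sev3: Slack message only, outside the pager budget.
    3: Tier("Sev3", "slack", None, None, False),
}


def route(severity: int) -> Tier:
    """Look up the handling rules for a severity level."""
    try:
        return TIERS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity}") from None
```

Keeping `counts_toward_budget` in the same table as the routing makes it hard for the budget and the pager behaviour to drift apart.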

Runbook discipline

The primary rule: every alert links to a runbook. No exceptions.

If an alert would fire but no runbook exists:

The alert cannot page. It goes to Slack only.
The on-call engineer writes the runbook within 1 week.
If there is still no runbook after that week, the alert is deleted.

The logic: if it's not important enough that somebody bothered to write down what to do, it's not important enough to wake somebody up over. This rule landed in late 2025; since then the average page count has dropped from 2.1 to 0.7 pages/week.
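The no-runbook rule can be expressed as a small gate in front of the pager. This is a sketch under assumptions: the `Alert` record, its field names, and the three return labels are hypothetical, and "within 1 week" is read as seven days from the first time the alert was seen without a runbook.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RUNBOOK_GRACE = timedelta(weeks=1)


@dataclass
class Alert:
    name: str
    runbook_url: Optional[str]   # None means no runbook has been written yet
    first_seen_without_runbook: Optional[date] = None


def delivery(alert: Alert, today: date) -> str:
    """Return "may-page", "slack-only" or "delete" for an alert."""
    if alert.runbook_url:
        # Severity still decides whether it actually pages.
        return "may-page"
    started = alert.first_seen_without_runbook or today
    if today - started > RUNBOOK_GRACE:
        return "delete"
    return "slack-only"
```

The gate never returns a paging outcome for an alert without a runbook, which is the whole point of the rule.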

The runbook format:

# Alert: <alert name>

## What it means
<1-2 sentences>

## First diagnostic command
<one kubectl/sql/curl command, copy-pasteable>

## Typical causes (top 3)
1. ...
2. ...
3. ...

## Escalation
<when to wake the secondary, when to wake the CTO>

Useful? Not always. But the 80/20 rule holds: the on-call engineer runs the first diagnostic command, compares the output against the top 3 causes, and most of the time finds the fix.
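A fixed format is also easy to lint. The sketch below checks that a runbook file follows the template; `lint_runbook` is a hypothetical helper, and it only verifies the title line and the four section headings, not whether the content is any good.

```python
import re

REQUIRED_SECTIONS = [
    "What it means",
    "First diagnostic command",
    "Typical causes (top 3)",
    "Escalation",
]


def lint_runbook(markdown: str) -> list[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    lines = markdown.splitlines()
    # The template starts with a single title line naming the alert.
    if not lines or not re.match(r"^# Alert: \S", lines[0]):
        problems.append("first line must be '# Alert: <alert name>'")
    # Collect the text of every second-level heading.
    headings = [line[3:].strip() for line in lines if line.startswith("## ")]
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    return problems
```

A check like this could run wherever runbooks are stored, so that a runbook missing its escalation section is caught before the alert is allowed to page.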

The deleted alerts

Over one year we deleted 23 alerts. Typical reasons:

Metric wasn't reliable enough (false-positive rate >10%).
An automatic heal already handles it (e.g. pod restart).
Not actionable during on-call hours ("look during business hours").

Deleting alerts is its own discipline: adding a new alert is easy, but on-call burns out when everything is always on.
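The first deletion criterion is plain arithmetic. As a rough sketch, assuming a "false positive" means a firing that needed no action from anyone (our definition for this example; the function names are hypothetical):

```python
def false_positive_rate(fired: int, actionable: int) -> float:
    """Share of firings that did not need any action."""
    if fired <= 0:
        raise ValueError("alert must have fired at least once")
    if not 0 <= actionable <= fired:
        raise ValueError("actionable must be between 0 and fired")
    return (fired - actionable) / fired


def is_deletion_candidate(fired: int, actionable: int, threshold: float = 0.10) -> bool:
    """True when the false-positive rate exceeds the 10% threshold."""
    return false_positive_rate(fired, actionable) > threshold
```

An alert that fired 20 times with 17 actionable firings has a 15% false-positive rate and becomes a candidate; 9 actionable out of 10 sits exactly at 10% and does not.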

On-call compensation

5% fixed shift bonus (independent of page count).
+50 EUR per Sev1 page (regardless of on-call status).
Weekend multiplier: 1.5x on everything.

This isn't meant as a lure, but it matters that on-call isn't unpaid labour.

What we learned

The single most important rule: runbooks, not heroes. An on-call engineer should not be working from tribal knowledge. If something isn't documented, somebody else will fail to handle it tomorrow. That rule overrides all the others.