NIP — the 3-2-1 backup rule in practice, plus a 47-minute restore drill

Our 3-2-1 backup: 4 copies, 3 media, 2 offsite, monthly restore drill at a 29-minute median (30-minute mean), plus a 47-minute drill we learned from.

The 3-2-1 backup rule in practice

The 3-2-1 rule is simple: 3 copies of the data, on 2 different media, with 1 offsite. Industry default for decades. The hard part isn't the description — it's the follow-through: routinely verify every copy, run restore drills, hit the recovery targets (RTO/RPO). Here is how NIP does it for Nortinia's data.

What 3-2-1 means for us

Primary data: the Postgres cluster (NIP control-plane DB, ~80 GB), plus per-workload Postgres instances (~600 GB total), plus MinIO object storage (S3-compatible, ~4 TB).

  • Copy 1 (live): on the Postgres primary node, NVMe SSD. The hot data.
  • Copy 2 (logical dump, on-demand restore): pg_dump every 6 hours to an NFS-mounted NAS (Synology RS-series, RAID6). 7-day retention. This is for fast restores when an engineer truncates a table in production by mistake.
  • Copy 3 (physical, PITR): continuous WAL streaming to Hetzner Object Storage (S3-compatible). Provides Point-In-Time Recovery. 30-day retention. This is our first offsite copy, sitting in a different Hetzner region from the primary.
  • Plus a cold copy: a weekly restic job replicates the 6-hour dump to a NAS in Salzburg (at a partner company, physically elsewhere). This is the "even Hetzner Falkenstein (FSN1) burns down" scenario. 90-day retention.

That's 4 copies, 3 media types (NVMe, SAS HDD in Synology, object storage), 2 offsite locations (Hetzner FSN1 + Salzburg). Comfortably above the 3-2-1 minimum.
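
The count above can be checked mechanically. The following is a minimal sketch, not our actual tooling: the BackupCopy type, the site labels, and the assumption that the Salzburg NAS is HDD-backed are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str                    # storage medium class: "nvme", "hdd", "object"
    location: str                  # hypothetical label for the physical site
    offsite: bool                  # True if the site differs from the primary's
    retention_days: Optional[int]  # None for the live copy


def summarize(copies):
    """Return (copy count, distinct media, distinct offsite locations)."""
    media = {c.medium for c in copies}
    offsite_sites = {c.location for c in copies if c.offsite}
    return len(copies), len(media), len(offsite_sites)


def satisfies_3_2_1(copies):
    """True if the inventory meets the 3 copies / 2 media / 1 offsite minimum."""
    total, media, offsite = summarize(copies)
    return total >= 3 and media >= 2 and offsite >= 1


INVENTORY = [
    BackupCopy("live", "nvme", "primary", False, None),
    BackupCopy("pg_dump", "hdd", "primary", False, 7),
    BackupCopy("wal-pitr", "object", "hetzner-fsn1", True, 30),
    # Medium of the Salzburg NAS is assumed to be HDD for this sketch.
    BackupCopy("restic-cold", "hdd", "salzburg", True, 90),
]

print(summarize(INVENTORY))        # (4, 3, 2)
print(satisfies_3_2_1(INVENTORY))  # True
```

Keeping the inventory as data makes it cheap to re-run the check whenever a copy is added or retired.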

The targets: RTO 30 min, RPO 5 min

  • RTO (Recovery Time Objective): 30 minutes — given a full cluster loss, we aim to be back up within half an hour.
  • RPO (Recovery Point Objective): 5 minutes — at most 5 minutes of transactions can be lost (thanks to WAL streaming).
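
An RPO only holds if the WAL stream keeps up, so the lag is worth alerting on. Below is a small sketch of such a check. Where the timestamp comes from is an assumption: with WAL archiving it could be read from last_archived_time in Postgres's pg_stat_archiver view, but any "newest segment offsite" timestamp would do.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=5)


def wal_lag(last_archived_at, now=None):
    """Time elapsed since the newest WAL segment reached offsite storage."""
    now = now or datetime.now(timezone.utc)
    return now - last_archived_at


def rpo_breached(last_archived_at, now=None, rpo=RPO):
    """True if a failure right now would lose more than `rpo` of WAL."""
    return wal_lag(last_archived_at, now) > rpo


# Fixed reference time so the example is deterministic.
now = datetime(2026, 1, 7, 12, 0, tzinfo=timezone.utc)
print(rpo_breached(now - timedelta(minutes=3), now))  # False
print(rpo_breached(now - timedelta(minutes=8), now))  # True
```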

These aren't arbitrary. Our Nortinia business SLA is 99.9% (~43 minutes of downtime per month). If a single monthly incident consumes the full 43 minutes there's zero buffer left — which is exactly why we picked a 30-minute RTO.
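
The arithmetic behind the 43 minutes, as a short sketch assuming a 30-day month:

```python
def downtime_budget_minutes(sla_percent, days=30):
    """Minutes of allowed downtime for an availability SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


budget = downtime_budget_minutes(99.9)
rto = 30

print(round(budget, 1))        # 43.2
print(round(budget - rto, 1))  # 13.2
```

A 30-minute RTO therefore leaves roughly 13 minutes of monthly budget for detection, decision-making, and anything else that goes wrong.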

The monthly restore drill

On the first Wednesday of every month, the on-call engineer runs a full restore drill. The rules:

  1. Blind drill. The drill environment is fully isolated (its own namespace, its own hostname, no DNS exposure).
  2. Must go end-to-end: load 6-hour logical dump, replay WAL up to 12 hours ago, start the application, run the smoke test.
  3. Timer starts on kubectl apply and stops when the smoke test reports OK.
  4. Result logged into NIP, summary posted into #infra.
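
The timing rule can be sketched as a small harness. This is an illustration under assumptions, not our drill tooling: run_drill and the step names are hypothetical, and in a real drill each step would shell out (kubectl apply, the restore jobs, the smoke test) instead of returning True.

```python
import time


def run_drill(steps, clock=time.monotonic):
    """Run named steps in order; return (ok, total_minutes, per-step minutes).

    `steps` is a list of (name, callable) pairs. Each callable returns True
    on success. The drill stops at the first failing step.
    """
    timings = {}
    start = clock()  # the timer starts with the first step
    ok = True
    for name, step in steps:
        step_start = clock()
        ok = bool(step())
        timings[name] = (clock() - step_start) / 60
        if not ok:
            break
    total = (clock() - start) / 60  # stops once the last step has reported
    return ok, total, timings


# Placeholder steps; each lambda stands in for a real command.
steps = [
    ("kubectl_apply", lambda: True),
    ("load_dump", lambda: True),
    ("replay_wal", lambda: True),
    ("start_app", lambda: True),
    ("smoke_test", lambda: True),
]

ok, total_minutes, timings = run_drill(steps)
print(ok)  # True
```

Recording per-step timings alongside the total makes it easier to see which step is responsible when a drill runs long.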

Last year's drill times in minutes: 28, 31, 29, 47, 33, 27, 30, 25, 29, 31, 28, 26.
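
A quick summary of those twelve numbers against the 30-minute RTO, using only the Python standard library:

```python
from statistics import mean, median

drill_minutes = [28, 31, 29, 47, 33, 27, 30, 25, 29, 31, 28, 26]
RTO_MINUTES = 30

print(median(drill_minutes))          # 29.0
print(round(mean(drill_minutes), 1))  # 30.3

# Drills that exceeded the RTO target, in chronological order.
over_rto = [m for m in drill_minutes if m > RTO_MINUTES]
print(over_rto)                       # [31, 47, 33, 31]
```

The median is barely affected by the 47-minute outlier, while the mean is pulled up by it, which is why both are worth tracking.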

The 47-minute drill

Drill #4 (October 2025) clocked 47 minutes. Findings:

  1. WAL replay took 22 minutes instead of the usual 8-10. Cause: the wal_buffers and maintenance_work_mem Postgres parameters were at defaults in the restore environment. Fix: tuned parameters baked into the restore Helm chart.
  2. Smoke test failed for 9 minutes because the CORS config wasn't applied to the restore namespace, and the smoke test was hitting the backend (BE) through the frontend (FE). Fix: smoke test now curls the backend directly, bypassing the frontend.
  3. The restic cold copy was outside the drill scope — we realised we'd never tested the full chain. Next drill includes it, costing an extra ~2 minutes.

Net outcome: the 47 minutes never repeated. The very next drill came in at 33 minutes (with the tuned parameters and the cleaner smoke test).

What we don't do

  • Continuous data protection (CDP) — replicate every transaction in real time. Too expensive (network bandwidth + storage cost), and a 5-minute RPO is plenty.
  • Per-customer-key backup encryption: nice to have, but our multi-tenant data is already encrypted in-cluster via SealedSecrets, TLS protects it in transit, and Hetzner S3 server-side encryption covers it at rest. Customer-specific keys only on compliance request.
  • Tape backups — in 2026, at Nortinia's scale, there's no good reason left.

Numbers

  • 4 copies / 3 media / 2 offsite
  • RTO 30 min / RPO 5 min
  • 12 monthly drills last year, 29-minute median restore time (30-minute mean)
  • 0 real disaster events (knock on wood)
