The 3-2-1 backup rule in practice
The 3-2-1 rule is simple: 3 copies of the data, on 2 different media, with 1 offsite. Industry default for decades. The hard part isn't the description, it's the follow-through: routinely verifying every copy, running restore drills, and hitting the recovery targets (RTO/RPO). Here is how NIP does it for Nortinia's data.
What 3-2-1 means for us
Primary data: the Postgres cluster (NIP control-plane DB, ~80 GB), plus per-workload PGs (~600 GB total), plus MinIO object storage (S3-compatible, ~4 TB).
- Copy 1 (live): on the Postgres primary node, NVMe SSD. The hot data.
- Copy 2 (logical dump, on-demand restore): pg_dump every 6 hours to an NFS-mounted NAS (Synology RS-series, RAID6). 7-day retention. This is for fast restores when an engineer truncates a table in production by mistake.
- Copy 3 (physical, PITR): continuous WAL streaming to Hetzner Object Storage (S3-compatible). Provides Point-In-Time Recovery. 30-day retention. This copy is offsite, sitting in a different Hetzner region.
- Plus: cold copy. A weekly restic job replicates the 6-hour dump to a NAS in Salzburg (at a partner company, physically elsewhere). This is the "even Hetzner Frankfurt burns down" scenario. 90-day retention.
That's 4 copies, 3 media types (NVMe, SAS HDD in Synology, object storage), 2 offsite locations (Hetzner FSN1 + Salzburg). Comfortably above the 3-2-1 minimum.
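The tally above can be checked mechanically. What follows is a minimal sketch, assuming a hand-maintained inventory: the BackupCopy type, the field names and the media labels are illustrative and not part of NIP, and the Salzburg NAS is assumed to be HDD-backed because the text does not say.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str                      # storage media type (label is illustrative)
    offsite_location: Optional[str]  # None means same site as the primary


# Inventory transcribed from the list above.
# Assumption: the Salzburg NAS is HDD-backed (not stated in the text).
COPIES = [
    BackupCopy("live primary", "nvme", None),
    BackupCopy("pg_dump every 6 h", "hdd", None),
    BackupCopy("WAL stream (PITR)", "object-storage", "Hetzner FSN1"),
    BackupCopy("weekly restic cold copy", "hdd", "Salzburg"),
]


def summarize(copies: List[BackupCopy]) -> Tuple[int, int, int]:
    """Return (number of copies, distinct media, distinct offsite locations)."""
    media = {c.medium for c in copies}
    offsite = {c.offsite_location for c in copies if c.offsite_location}
    return len(copies), len(media), len(offsite)


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """True when the inventory meets the 3-2-1 minimum."""
    n_copies, n_media, n_offsite = summarize(copies)
    return n_copies >= 3 and n_media >= 2 and n_offsite >= 1


print(summarize(COPIES))        # (4, 3, 2)
print(satisfies_3_2_1(COPIES))  # True
```

Keeping the inventory as data means that dropping a copy (say, decommissioning the Salzburg NAS) shows up as a failed check rather than as a stale sentence in a document.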
The targets: RTO 30 min, RPO 5 min
- RTO (Recovery Time Objective): 30 minutes. Given a full cluster loss, we aim to be back up within half an hour.
- RPO (Recovery Point Objective): 5 minutes. At most 5 minutes of transactions can be lost (thanks to WAL streaming).
These aren't arbitrary. Our Nortinia business SLA is 99.9% (~43 minutes of downtime per month). If a single monthly incident consumed the full 43 minutes there would be zero buffer left, which is exactly why we picked a 30-minute RTO.
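The arithmetic behind the ~43 minutes, as a short sketch. It assumes a 30-day month; the function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLA over `days` days."""
    return (1 - sla_percent / 100) * days * MINUTES_PER_DAY


budget = downtime_budget_minutes(99.9)
rto = 30  # minutes, the target stated above

print(f"monthly budget: {budget:.1f} min")                      # monthly budget: 43.2 min
print(f"headroom after one full-RTO outage: {budget - rto:.1f} min")  # 13.2 min
```

One outage that uses the whole 30-minute RTO still leaves roughly 13 minutes of the monthly budget; a second one would not fit.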
The monthly restore drill
First Wednesday of every month, the on-call must run a full restore drill. The rules:
- Blind drill. The drill environment is fully isolated (its own namespace, its own hostname, no DNS exposure).
- Must go end-to-end: load the 6-hour logical dump, replay WAL up to 12 hours ago, start the application, run the smoke test.
- Timer starts on kubectl apply and stops when the smoke test reports OK.
- Result logged into NIP, summary posted into #infra.
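The timing rule can be sketched as a small harness. This is an illustration under assumptions, not the actual drill tooling: start_restore stands in for whatever wraps kubectl apply, smoke_test_ok for the smoke test, and the polling interval and timeout are made-up defaults.

```python
import time
from typing import Callable


def time_drill(
    start_restore: Callable[[], None],   # hypothetical: wraps `kubectl apply`
    smoke_test_ok: Callable[[], bool],   # hypothetical: True once the smoke test reports OK
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return elapsed drill time in minutes, measured from the start of the restore."""
    started = clock()          # timer starts when the restore is kicked off
    start_restore()
    while not smoke_test_ok():
        if clock() - started > timeout_seconds:
            raise TimeoutError("smoke test did not report OK within the timeout")
        sleep(poll_seconds)
    return (clock() - started) / 60  # timer stops on the first OK
```

The clock and sleep parameters are injectable so the harness itself can be tested without waiting half an hour; time.monotonic is used so a wall-clock adjustment during the drill cannot skew the result.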
Last year's drill times in minutes: 28, 31, 29, 47, 33, 27, 30, 25, 29, 31, 28, 26.
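A quick summary of those twelve numbers against the 30-minute RTO, using only the Python standard library:

```python
from statistics import mean, median

RTO_MINUTES = 30
drill_minutes = [28, 31, 29, 47, 33, 27, 30, 25, 29, 31, 28, 26]

# Drills that took longer than the RTO target.
over_rto = [m for m in drill_minutes if m > RTO_MINUTES]

print(f"median: {median(drill_minutes)} min")   # median: 29.0 min
print(f"mean: {mean(drill_minutes):.1f} min")   # mean: 30.3 min
print(f"{len(over_rto)} of {len(drill_minutes)} drills over RTO: {over_rto}")
# 4 of 12 drills over RTO: [31, 47, 33, 31]
```

The median sits just under the target, but four of the twelve drills ran over 30 minutes, three of them by only 1 to 3 minutes.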
The 47-minute drill
Drill #4 (October 2025) clocked 47 minutes. Findings:
- WAL replay took 22 minutes instead of the usual 8-10. Cause: the wal_buffers and maintenance_work_mem Postgres parameters were at defaults in the restore environment. Fix: tuned parameters baked into the restore Helm chart.
- Smoke test failed for 9 minutes because CORS config wasn't applied to the restore namespace, and the smoke test was hitting the backend (BE) via the frontend (FE). Fix: smoke test now curls BE directly, bypassing FE.
- The restic cold copy was outside the drill scope; we realised we'd never tested the full chain. Next drill includes it, costing an extra ~2 minutes.
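For the smoke-test fix, a direct backend probe might look like the sketch below. It is an assumption-laden illustration: the real check uses curl, and the URL, the /healthz path and the "HTTP 200 means OK" criterion are all hypothetical.

```python
import urllib.error
import urllib.request


def backend_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Probe the backend directly (no frontend, so no CORS involved).

    Returns True only on an HTTP 200 response; connection errors,
    timeouts and non-2xx statuses all count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


# Hypothetical in-cluster service address for the restore namespace:
# backend_is_healthy("http://backend.restore-drill.svc.cluster.local:8080/healthz")
```

Going straight to the backend keeps the drill timer measuring the restore itself and not the browser-facing configuration of the drill namespace.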
Net outcome: the 47 minutes never repeated. The very next drill came in at 33 minutes (with the tuned parameters and the cleaner smoke test).
What we don't do
- Continuous data protection (CDP), i.e. replicating every transaction in real time. Too expensive (network bandwidth + storage cost), and a 5-minute RPO is plenty.
- Per-customer-key backup encryption. Nice to have, but our multi-tenant data is already encrypted in-cluster via SealedSecrets, and Hetzner S3 server-side encryption covers transit/at-rest. Customer-specific keys only on compliance request.
- Tape backups. In 2026, at Nortinia's scale, there's no good reason left.
Numbers
- 4 copies / 3 media / 2 offsite
- RTO 30 min / RPO 5 min
- 12 monthly drills last year, 29-minute median restore time (mean ~30 minutes)
- 0 real disaster events (knock on wood)