Why XCP-ng, and not VMware
Nortinia's infrastructure runs roughly 100 VMs across several physical hosts, plus Kubernetes clusters and support services. The hypervisor choice was a deliberate call, not a default. Two years ago the question was: VMware vSphere, Proxmox VE, or XCP-ng? XCP-ng won. Here is why.
The cost math
VMware vSphere Standard in 2024 was roughly 400 EUR / socket / year. On a typical dual-socket Xeon host that's 800 EUR/year/host. Across six hosts: 4,800 EUR/year on hypervisor licensing alone, with vCenter on top. Since Broadcom completed its acquisition of VMware (November 2023), per-core pricing and minimum commits have only made this worse.
XCP-ng: 0 EUR in licensing. Vates (the XCP-ng vendor) sells a Pro support tier (~1,250 EUR/year/host) which we don't need at our SLA: the Xen Project and Vates communities are responsive enough that the one or two questions we have per month get answered there.
Net savings: ~4,800 EUR/year on licensing alone, plus vCenter (~2,000 EUR/year) avoided.
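The arithmetic above, as a small sketch. The function and parameter names are ours for illustration; the figures are the approximate ones quoted in this section, not vendor list prices.

```python
def vmware_annual_eur(hosts=6, sockets_per_host=2, eur_per_socket=400,
                      vcenter_eur=2_000):
    """Approximate yearly VMware spend, using the figures quoted above."""
    licensing = hosts * sockets_per_host * eur_per_socket
    return {
        "licensing": licensing,
        "vcenter": vcenter_eur,
        "total": licensing + vcenter_eur,
    }


def xcpng_annual_eur(pro_support_hosts=0, pro_eur_per_host=1_250):
    """XCP-ng itself is free; only optional Pro support costs money."""
    return pro_support_hosts * pro_eur_per_host


costs = vmware_annual_eur()
print(costs)               # {'licensing': 4800, 'vcenter': 2000, 'total': 6800}
print(xcpng_annual_eur())  # 0, since we run without the Pro tier
```

Changing hosts or sockets_per_host shows how the gap scales with fleet size.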
License sovereignty
After the Broadcom acquisition we read multiple stories of perpetual licenses being converted to subscription-only, and customers seeing 5x cost jumps overnight. That is a structural risk: if the vendor restructures tomorrow, our options are limited. With an open-source hypervisor that risk largely disappears. The Xen Project (and with it XCP-ng) lives under the Linux Foundation umbrella; Vates is the maintainer, and a community fork is always possible.
The Xen security model
Xen is a Type 1 (bare-metal) hypervisor: dom0 is the Linux management VM, the domUs are guests. Compared to KVM, Xen has a smaller attack surface: no in-kernel kvm module, hypervisor and host OS cleanly separated. AWS ran EC2 on Xen for years (now on Nitro) precisely for that isolation model.
In practice: over the last 5 years zero CVEs materialised that would have compromised a Nortinia production workload via Xen. (Two Xen security advisories did land; both were patched in our weekly patch window.)
Snapshot performance
Xen xen-vbd snapshots are copy-on-write at the Storage Repository level. Snapshot of a 100 GB VM: ~2 seconds. Restore: ~5 minutes (SR-level copy-back). This made daily backup jobs (xe vm-export cron) trivial. Under VMware in our environment a comparable snapshot was 8-15 seconds, restore 7-10 minutes.
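A minimal sketch of what such a snapshot-then-export job can look like. This is not our actual cron script: backup_vm, the backup-DATE label and the target path layout are illustrative names. The sequence (snapshot, clear the template flag, export, remove the snapshot) is a commonly used xe pattern; check the parameters against your XCP-ng version before relying on it.

```python
import subprocess
from datetime import date


def xe(*args, runner=subprocess.run):
    """Run one xe command and return its trimmed stdout."""
    result = runner(["xe", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


def backup_vm(vm_uuid, backup_dir, today=None, runner=subprocess.run):
    """Snapshot a VM, export the snapshot to an .xva file, then drop the snapshot."""
    today = today or date.today()
    label = f"backup-{today.isoformat()}"  # hypothetical naming scheme
    # vm-snapshot prints the UUID of the snapshot it created.
    snap = xe("vm-snapshot", f"uuid={vm_uuid}", f"new-name-label={label}",
              runner=runner)
    target = f"{backup_dir}/{vm_uuid}-{today.isoformat()}.xva"
    try:
        # Snapshots are stored as templates; clear the flag so vm-export accepts it.
        xe("template-param-set", "is-a-template=false", f"uuid={snap}",
           runner=runner)
        xe("vm-export", f"vm={snap}", f"filename={target}", runner=runner)
    finally:
        # Remove the temporary snapshot even if the export failed.
        xe("vm-uninstall", f"uuid={snap}", "force=true", runner=runner)
    return target
```

The runner parameter exists only so the command sequence can be checked without a real pool; a cron entry would call backup_vm once per VM UUID.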
The 3 missing VMware features
What did we lose by leaving VMware? Three things:
- vMotion (live migration UI): XCP-ng's xe vm-migrate works from the CLI, but vSphere's drag-and-drop UX is nicer. We built our own "Migrate VM" button in NIP that wraps xe vm-migrate. Solved.
- DRS (Distributed Resource Scheduler): VMware auto-balances load across hosts. XCP-ng has nothing like it. We built nip-balancer, a 15-minute cron that watches host load and, when any host crosses 80%, proposes a migration plan to the on-call (not automatic; we don't want 3 a.m. surprises). Solved.
- Fault Tolerance (FT, lockstep): sub-second mirror of a VM on another host, instant failover. We did not build this. Instead: critical workloads run on Kubernetes (replicas), where pod failover is built in. The one true single-instance VM (Postgres primary) gets HA via PG streaming replicas, not at the hypervisor layer.
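To make the balancer idea concrete, here is a rough sketch of a propose-only planner. It is not the nip-balancer source: the VM dataclass, the single load number per VM and the largest-VM-first heuristic are all assumptions made for illustration. Only the 80% threshold and the "propose, never execute" rule come from the description above.

```python
from dataclasses import dataclass

THRESHOLD = 0.80  # the 80% trigger mentioned above


@dataclass
class VM:
    name: str
    load: float  # share of its host's capacity, 0..1 (assumed metric)


def propose_migrations(hosts: dict[str, list[VM]], threshold: float = THRESHOLD):
    """Return (vm, source, target) proposals; nothing is executed here."""
    load = {h: sum(vm.load for vm in vms) for h, vms in hosts.items()}
    plan = []
    for src, vms in hosts.items():
        # Try the largest VMs first so fewer moves are needed.
        for vm in sorted(vms, key=lambda v: v.load, reverse=True):
            if load[src] <= threshold:
                break
            dst = min((h for h in load if h != src), key=load.get, default=None)
            if dst is None or load[dst] + vm.load > threshold:
                continue  # even the least-loaded host cannot take this VM
            load[src] -= vm.load
            load[dst] += vm.load
            plan.append((vm.name, src, dst))
    return plan


def migrate_command(vm: str, host: str) -> list[str]:
    """The xe call a 'Migrate VM' button could wrap once a human approves."""
    return ["xe", "vm-migrate", f"vm={vm}", f"host={host}", "live=true"]
```

The plan is just data, so it can be posted to the on-call for review; migrate_command is only turned into a real call after someone approves a proposed move.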
Why not Proxmox
Proxmox VE is also open source, KVM-based, with an excellent community. We did look. Two reasons we didn't pick it:
- KVM vs Xen attack surface: subjective, but the Xen model is smaller. AWS precedent.
- Storage integration story: XCP-ng pairs cleanly with XOSAN/XOSTOR (Vates' distributed storage), but we already had Ceph, and Xen's XAPI was easier to stitch into than Proxmox's pveproxy. Historical reason; today I'd probably re-evaluate.
What we would not change now
Two years in: the choice was right. The annual ~7,000 EUR saved (licensing + vCenter + support) is enough to fund one engineer-day per week if we ever needed deeper xe CLI expertise. We haven't.