Flux GitOps and image rollout
Every Nortinia Kubernetes cluster runs Flux GitOps. That isn't trend-following — it gives us a concrete separation of duties: Git is the source of truth, Flux is the reconciler, and NIP is the control plane that manages image tags and deploy audit. Here is the layout, plus the story of one real race condition.
One repo per cluster
Four clusters, four repos:
- nip-cluster-prod: production workloads
- nip-cluster-staging: staging
- nip-cluster-data: data services (Postgres, Redis, MinIO)
- nip-cluster-observability: Grafana, Prometheus, Loki
Each repo has a Flux Kustomization root, with per-namespace base manifests plus per-app overlays. Reasons for the split: (1) cleaner RBAC, (2) reconcile-failure blast radius limited to one cluster, (3) cluster rebuilds are easier.
The reconcile loop: 60 seconds
Flux reconciles Kustomizations every 60 seconds by default. We didn't lower it — we tried 30 seconds, but the Git LFS quota (occasional git pulls for image caching) climbed for no good reason. 60 seconds fits our 9-minute end-to-end deploy SLA just fine.
The ImageUpdateAutomation controller polls GHCR separately, every minute. It sees new tags and commits a bump to Git, which the Kustomization picks up one cycle later.
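The timing above can be checked with a little arithmetic. The sketch below is a rough upper bound, not a measurement: it assumes the new tag lands just after a poll and the bump commit lands just after a reconcile, and it ignores commit time and any separate Git source fetch interval, which the text does not mention. The names `Intervals` and `worst_case_pickup_s` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervals:
    image_poll_s: int = 60   # GHCR polling interval, from the text
    reconcile_s: int = 60    # Kustomization reconcile interval, from the text


def worst_case_pickup_s(iv: Intervals) -> int:
    """Upper bound on the wait between a tag landing in GHCR and the
    Kustomization starting to reconcile the bumped manifest.

    Assumption: the tag arrives just after a poll and the bump commit
    lands just after a reconcile, so both intervals are paid in full.
    """
    return iv.image_poll_s + iv.reconcile_s


SLA_S = 9 * 60  # 9-minute end-to-end deploy SLA, from the text

if __name__ == "__main__":
    wait = worst_case_pickup_s(Intervals())
    print(f"worst-case pickup: {wait}s of a {SLA_S}s budget")
```

Under these assumptions the two intervals cost at most 120 seconds of the 540-second budget. Halving both to 30 seconds would save at most one minute.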
PR-driven prod, push-to-main staging
- staging: push-to-main rule. We deploy the main branch immediately. Goal: surface obvious failures fast.
- prod: PR-driven. An engineer opens a PR, it needs at least one review, and CI gates must pass before it can merge. Once merged, Flux reconciles it.
The PR template surfaces: which image tag, which commit, the change log, the rollback path. That is the audit trail.
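The prod gate can be sketched as a small check. This is an illustration only, not our actual CI: the `PullRequest` shape and the field keys in `REQUIRED_FIELDS` are hypothetical names standing in for the four items the template surfaces.

```python
from dataclasses import dataclass, field

# Hypothetical keys for the four items the PR template surfaces.
REQUIRED_FIELDS = ("image_tag", "commit", "changelog", "rollback")


@dataclass
class PullRequest:
    approvals: int
    ci_passed: bool
    fields: dict = field(default_factory=dict)


def merge_blockers(pr: PullRequest) -> list:
    """Return the reasons a prod PR cannot merge yet (empty list = mergeable)."""
    blockers = []
    if pr.approvals < 1:
        blockers.append("needs at least one review")
    if not pr.ci_passed:
        blockers.append("CI gates have not passed")
    for key in REQUIRED_FIELDS:
        # A missing or blank field weakens the audit trail, so it blocks too.
        if not str(pr.fields.get(key, "")).strip():
            blockers.append(f"PR template field missing: {key}")
    return blockers
```

Returning the full list of blockers, rather than stopping at the first, lets the author fix everything in one pass.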
The image promotion ladder
Life of a new image:
- dev-<sha>: built on a developer machine, local only.
- staging-<sha>: CI build on the staging branch, auto-deployed to the staging cluster.
- prod-<sha>: CI build on main, awaits manual promotion (a PR into the prod GitOps repo).
Promotion does NOT duplicate Docker layers: same SHA, same image. The new tag is a retag of the existing image (docker buildx imagetools create), so no layers are rebuilt or re-pushed. Cheap and fast.
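A minimal sketch of the retag step follows. It only builds the command line and does not run it. The repository name in the usage note is a placeholder, the helper names are hypothetical, and the 7 to 40 hex character SHA length is an assumption. Only the staging-to-prod step is modeled, since dev images never leave the developer machine.

```python
import re

# Matches the staging tag shape from the ladder; the hex length is an assumption.
_STAGING_TAG = re.compile(r"^staging-([0-9a-f]{7,40})$")


def prod_tag_for(staging_tag: str) -> str:
    """Map staging-<sha> to prod-<sha>, keeping the SHA unchanged."""
    match = _STAGING_TAG.match(staging_tag)
    if match is None:
        raise ValueError(f"not a staging tag: {staging_tag!r}")
    return f"prod-{match.group(1)}"


def retag_command(repo: str, staging_tag: str) -> list[str]:
    """Build the argv for a registry-side retag of an existing image."""
    src = f"{repo}:{staging_tag}"
    dst = f"{repo}:{prod_tag_for(staging_tag)}"
    return ["docker", "buildx", "imagetools", "create", "--tag", dst, src]
```

For example, `retag_command("ghcr.io/example/api", "staging-abc1234")` yields a command that tags the same image as `ghcr.io/example/api:prod-abc1234`. Rejecting anything that is not a staging tag keeps dev images from being promoted by accident.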
The March 2026 race condition
In early March 2026 we had a two-day stretch where deploy times grew from ~9 minutes to ~25 minutes. Symptom: the Flux Kustomization status got stuck in Reconciling, then flipped to Suspended. Investigation followed.
Root cause: the ImageUpdateAutomation controller was committing new image tags to the SAME kustomization.yaml file that the Kustomization reconciler was reading. Flux provides no lock between the two controllers, so the reconciler kept looping because the file kept changing mid-reconcile.
The fix: we split the files. ImageUpdateAutomation now writes to a separate image-tags.yaml, which kustomization.yaml pulls in via patchesStrategicMerge. Now the two controllers work on disjoint files. Reconcile time dropped back to 60-90 seconds.
Lesson learned: Flux controllers are logically isolated, but nothing isolates them at the file level, so two of them can end up working on the same file in Git. patchesStrategicMerge is the cheap way to fix that.
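To see why the split works, it helps to picture what the patch does: the base manifest stays hand-edited, and a separate document overrides only the image field. The sketch below is a toy model of that merge, written for illustration. It is not kustomize's implementation, it only handles a flat list of containers keyed by name, and the container and image names are placeholders.

```python
def merge_containers(base: list, patch: list) -> list:
    """Merge two container lists by 'name', with patch fields winning.

    Toy model of a strategic-merge-style overlay: entries are matched on
    their 'name' key, and unmatched patch entries are appended.
    """
    merged = {c["name"]: dict(c) for c in base}  # copy so base is not mutated
    for p in patch:
        merged.setdefault(p["name"], {}).update(p)
    return list(merged.values())


# Hand-edited base, standing in for what kustomization.yaml references.
base = [{"name": "api", "image": "ghcr.io/example/api:prod-aaaaaaa", "ports": [8080]}]

# Automation-owned overlay, standing in for image-tags.yaml.
image_tags = [{"name": "api", "image": "ghcr.io/example/api:prod-bbbbbbb"}]

result = merge_containers(base, image_tags)
```

Here `result` carries the new image while keeping `ports` from the base, and `base` itself is untouched. The automation only ever rewrites the small overlay.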
What we did not build
- Multi-cluster GitOps (one repo, many clusters): per-cluster repos are simpler, with cleaner RBAC and shallower folder hierarchies.
- Our own reconciler: Flux works, no reason to replace it.
- Automatic prod promotion: intentionally manual. The engineer's PR is the final gate.
Numbers
- 4 clusters / 4 repos
- 60-second reconcile (default)
- 1-minute GHCR image polling
- 1,247 deploys in the last quarter, 0 lost (thanks to the PR audit trail)