nip-platform · 25 May 2026 · EN

NIP — Flux GitOps and image rollout, the way we do it

Four clusters, four GitOps repos, Flux on a 60-second reconcile, manual prod promotion. Plus a March 2026 race condition fixed at the file level.

Flux GitOps and image rollout

Every Nortinia Kubernetes cluster runs Flux GitOps. That isn't trend-following — it gives us a concrete separation of duties: Git is the source of truth, Flux is the reconciler, and NIP is the control plane that manages image tags and deploy audit. Here is the layout, plus the story of one real race condition.

One repo per cluster

Four clusters, four repos:

nip-cluster-prod — production workloads
nip-cluster-staging — staging
nip-cluster-data — data services (Postgres, Redis, MinIO)
nip-cluster-observability — Grafana, Prometheus, Loki

Each repo has a Flux Kustomization root, with per-namespace base manifests plus per-app overlays. Reasons for the split: (1) cleaner RBAC, (2) reconcile-failure blast radius limited to one cluster, (3) cluster rebuilds are easier.
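As a rough illustration of that layout, here is a minimal sketch that maps a cluster to its repo and builds the path of a per-app overlay. The repo names come from the list above; the directory names (namespaces, overlays) and the example namespace and app are hypothetical, since the actual folder structure is not spelled out here.

```python
from pathlib import PurePosixPath

# Repo names are taken from the list above.
CLUSTER_REPOS = {
    "prod": "nip-cluster-prod",
    "staging": "nip-cluster-staging",
    "data": "nip-cluster-data",
    "observability": "nip-cluster-observability",
}


def overlay_path(cluster: str, namespace: str, app: str) -> PurePosixPath:
    """Return the path of an app overlay inside the cluster's repo.

    The "namespaces/<ns>/overlays/<app>" convention is an assumption
    made for illustration, not the documented layout.
    """
    repo = CLUSTER_REPOS[cluster]  # raises KeyError for an unknown cluster
    return PurePosixPath(repo) / "namespaces" / namespace / "overlays" / app
```

Because each cluster resolves to exactly one repo, a broken overlay can only stall reconciliation for the cluster that owns that repo.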

The reconcile loop: 60 seconds

Flux reconciles our Kustomizations every 60 seconds, our default interval. We did not lower it: we tried 30 seconds, but usage of our Git LFS quota climbed (the extra git pulls occasionally fetch objects used for image caching) with nothing to show for it. 60 seconds fits comfortably within our 9-minute end-to-end deploy SLA.

Image automation polls GHCR separately, every minute. Flux's image-reflector-controller scans the registry for new tags, and ImageUpdateAutomation then commits the tag bump to Git, which the Kustomization picks up one cycle later.
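To see why 60 seconds is enough, it helps to add up the waiting time. The sketch below assumes the worst case is one full polling interval plus one full reconcile interval, and ignores commit, push, and fetch time; the function names are illustrative.

```python
POLL_INTERVAL_S = 60       # GHCR polling interval
RECONCILE_INTERVAL_S = 60  # Kustomization reconcile interval
DEPLOY_SLA_S = 9 * 60      # 9-minute end-to-end deploy SLA


def worst_case_pickup_s(poll_s: int, reconcile_s: int) -> int:
    """Upper bound on the time spent waiting for the two loops.

    A tag pushed just after a registry scan waits a full poll interval;
    a commit landing just after a reconcile waits a full reconcile interval.
    """
    return poll_s + reconcile_s


def sla_share(poll_s: int, reconcile_s: int, sla_s: int) -> float:
    """Fraction of the deploy SLA consumed by worst-case waiting."""
    return worst_case_pickup_s(poll_s, reconcile_s) / sla_s
```

Under these assumptions the waiting time is at most 120 seconds, roughly 22% of the SLA. Dropping the reconcile interval to 30 seconds would save at most 30 seconds of that.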

PR-driven prod, push-to-main staging

staging — push-to-main rule. We deploy the main branch immediately. Goal: surface obvious failures fast.
prod — PR-driven. An engineer opens a PR, it needs at least one review, CI gates must pass, only then can it merge. Once merged, Flux reconciles it.
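The prod rule can be written down as a small predicate. This is a sketch only: the PullRequest type and its fields are hypothetical, and in practice the rule would be enforced by branch protection on the Git host rather than by custom code. It also assumes a PR with no CI checks at all should not merge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PullRequest:
    approvals: int                                  # number of approving reviews
    ci_checks: dict = field(default_factory=dict)   # check name -> passed?


def can_merge_to_prod(pr: PullRequest, min_approvals: int = 1) -> bool:
    """True when the PR has enough reviews and every CI gate passed."""
    has_reviews = pr.approvals >= min_approvals
    # Assumption: an empty set of checks counts as "gates not run".
    gates_green = bool(pr.ci_checks) and all(pr.ci_checks.values())
    return has_reviews and gates_green
```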

The PR template surfaces: which image tag, which commit, the change log, the rollback path. That is the audit trail.
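One way to keep that audit trail complete is a check that rejects a PR body with missing fields. The sketch below assumes the template uses plain "Field: value" lines; the exact field labels and the sample values are hypothetical.

```python
# Assumed labels for the four items the template surfaces.
REQUIRED_FIELDS = ("Image tag", "Commit", "Change log", "Rollback path")


def missing_fields(pr_body: str) -> list:
    """Return the required fields that are absent or left empty."""
    found = {}
    for line in pr_body.splitlines():
        # Split on the first colon only, so values may contain colons
        # (for example an image reference like registry/app:tag).
        key, sep, value = line.partition(":")
        if sep:
            found[key.strip().lower()] = value.strip()
    return [name for name in REQUIRED_FIELDS if not found.get(name.lower())]
```

For example, a body that fills in the image tag, commit, and change log but leaves "Rollback path:" blank yields ["Rollback path"].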

The image promotion ladder

Life of a new image:

dev-<sha> — built on a developer machine, local only.
staging-<sha> — CI build on the staging branch, auto-deployed to the staging cluster.
prod-<sha> — CI build on main, awaits manual promotion (a PR into the prod GitOps repo).

Promotion does NOT duplicate Docker layers: same SHA, same image. The new tag is just a retag of the existing image (docker buildx imagetools create), which is cheap and fast.
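A minimal sketch of the retag step, assuming promotion means pointing prod-<sha> at the image already tagged staging-<sha>. The helper name and the example image reference are hypothetical; the only real command used is docker buildx imagetools create with its --tag flag.

```python
import re

# Short or full hexadecimal git SHA.
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def promotion_command(image: str, sha: str,
                      src_stage: str = "staging",
                      dst_stage: str = "prod") -> list:
    """Build the argv that retags <image>:<src>-<sha> as <image>:<dst>-<sha>.

    The command creates a new tag for the existing manifest in the
    registry, so no layers are rebuilt or re-uploaded.
    """
    if not SHA_RE.match(sha):
        raise ValueError(f"not a git SHA: {sha!r}")
    source = f"{image}:{src_stage}-{sha}"
    target = f"{image}:{dst_stage}-{sha}"
    return ["docker", "buildx", "imagetools", "create", "--tag", target, source]
```

Returning the argv as a list rather than a shell string lets the caller pass it directly to subprocess.run without quoting concerns.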

The March 2026 race condition

In early March 2026, deploy times stretched from ~9 minutes to ~25 minutes over a two-day period. Symptom: the Flux Kustomization status stuck in Reconciling, then flipped to Suspended. Investigation followed.

Root cause: the ImageUpdateAutomation controller was committing a new image tag to the SAME kustomization.yaml file that the Kustomization reconciler was reading. Flux provides no lock between the two controllers, so the reconciler entered an infinite cycle because the file kept mutating during reconcile.

The fix: we split the files. ImageUpdateAutomation now writes to a separate image-tags.yaml, which kustomization.yaml pulls in via patchesStrategicMerge. Now the two controllers work on disjoint files. Reconcile time dropped back to 60-90 seconds.
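To keep the files disjoint over time, a repo lint can flag any kustomization.yaml that image automation could still rewrite. Flux image automation only edits lines carrying an $imagepolicy setter marker, so the sketch below looks for that marker. The function name and the in-memory dict of path to content are illustrative choices, not part of Flux.

```python
import re

# Flux setter marker, e.g.:
#   newTag: staging-abc1234 # {"$imagepolicy": "flux-system:app:tag"}
MARKER_RE = re.compile(r'#\s*\{"\$imagepolicy":\s*"[^"]+"\}')

KUSTOMIZATION_NAMES = ("kustomization.yaml", "kustomization.yml")


def files_at_risk(files: dict) -> list:
    """Return kustomization files that still contain a setter marker.

    `files` maps a repo-relative path to the file's text content.
    """
    return sorted(
        path
        for path, content in files.items()
        if path.rsplit("/", 1)[-1] in KUSTOMIZATION_NAMES
        and MARKER_RE.search(content)
    )
```

An empty result means the automation can only write to files other than kustomization.yaml, such as the separate image-tags.yaml.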

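Lesson learned: Flux controllers are logically isolated, but nothing isolates them at the file level, so two controllers can still read and write the same file in Git. Splitting the file and pulling it in via patchesStrategicMerge is the cheap way to fix that.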

What we did not build

Multi-cluster GitOps (one repo, many clusters) — per-environment repos are simpler, with cleaner RBAC and shallower folder hierarchies.
Our own reconciler — Flux works, no reason to replace it.
Automatic prod promotion — intentionally manual. The engineer's PR is the final gate.

Numbers

4 clusters / 4 repos
60-second reconcile (default)
1-minute GHCR image polling
1,247 deploys in the last quarter, 0 lost (thanks to the PR audit trail)