What happens during a deploy on NIP
The Nortinia Infrastructure Platform (NIP) is where an engineer's commit turns into a running service. It is not a CI/CD tool; it is an infrastructure controller that ties Kubernetes clusters, GitOps state, image rollouts, and notifications into one coherent flow. This article walks through what happens between a simple git push and a new pod taking traffic.
Eight steps (median: 9 minutes)
- Commit: engineer pushes to a main branch (or merges the PR). GitHub Actions workflow fires.
- CI build: the workflow builds an OCI image. Layer cache makes a typical backend build take 3-4 minutes.
- Push to GHCR: image is uploaded to ghcr.io/nortinia-ltd/<repo> with two tags: main-<sha> and main-latest.
- Webhook to NIP: CI POSTs an image-built payload to NIP at /api/v1/deploy/image-built. Payload includes repo, commit SHA, image tag, and target environment (staging/prod).
- NIP dedupes: using an idempotency key (repo+sha+env), NIP checks whether the webhook has already been processed. If so, it returns 200 and does nothing.
- Flux reconcile: NIP updates the Kustomization image tag in the GitOps repo (one commit bumping the tag in kustomization.yaml). Flux notices within 60 seconds and starts the sync.
- kubectl rollout: the Deployment gets a new ReplicaSet. Each new pod must pass its readiness probe before taking traffic. Default surge: 25%, max unavailable: 0.
- Notification: Slack message in #deploys tagging the commit author. Green on success, red plus on-call on failure.
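To make the webhook step concrete, here is a minimal sketch of what the image-built payload and its validation might look like. The field names (repo, sha, imageTag, env) and the helper functions are assumptions for illustration; the article only states which four pieces of information the payload carries, not its exact schema.

```typescript
// Hypothetical payload shape: field names are assumptions based on the four
// items the payload is said to include (repo, commit SHA, image tag, env).
type Environment = "staging" | "prod";

interface ImageBuiltPayload {
  repo: string;
  sha: string;
  imageTag: string;
  env: Environment;
}

// The two tags CI pushes for a commit on main.
function imageTags(sha: string): [string, string] {
  return [`main-${sha}`, "main-latest"];
}

// Full GHCR image reference for a repo and tag.
function imageRef(repo: string, tag: string): string {
  return `ghcr.io/nortinia-ltd/${repo}:${tag}`;
}

// Validate an untrusted JSON body; returns null when anything is missing or malformed.
function parseImageBuilt(body: unknown): ImageBuiltPayload | null {
  if (typeof body !== "object" || body === null) return null;
  const raw = body as Record<string, unknown>;
  const repo = raw["repo"];
  const sha = raw["sha"];
  const imageTag = raw["imageTag"];
  if (typeof repo !== "string" || repo === "") return null;
  // Assumption: SHAs are abbreviated or full lowercase hex git hashes.
  if (typeof sha !== "string" || !/^[0-9a-f]{7,40}$/.test(sha)) return null;
  if (typeof imageTag !== "string" || imageTag === "") return null;
  let env: Environment;
  if (raw["env"] === "staging") env = "staging";
  else if (raw["env"] === "prod") env = "prod";
  else return null;
  return { repo, sha, imageTag, env };
}
```

Rejecting malformed payloads up front means everything downstream (dedup, the GitOps commit) can rely on a typed value instead of re-checking fields.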
End-to-end median: 9 minutes (commit to first ready pod). The most common bottleneck is the CI build (image size + Next.js production build).
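The rollout defaults from the kubectl rollout step (25% surge, 0 max unavailable) translate into concrete pod counts. The sketch below works through that arithmetic using Kubernetes' rounding rules for percentages: maxSurge rounds up and maxUnavailable rounds down. The function name and return shape are our own, chosen for illustration.

```typescript
// Pod budget for a rolling update when both limits are given as percentages.
// Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down.
function rolloutBudget(replicas: number, maxSurgePct: number, maxUnavailablePct: number) {
  const surge = Math.ceil((replicas * maxSurgePct) / 100);
  const unavailable = Math.floor((replicas * maxUnavailablePct) / 100);
  return {
    surge,                               // extra pods allowed above the desired count
    unavailable,                         // pods allowed to be missing during the rollout
    maxTotalPods: replicas + surge,      // upper bound on pods running at once
    minReadyPods: replicas - unavailable // lower bound on pods that must stay ready
  };
}

// With 10 replicas and the defaults above, at most 3 extra pods start at a time
// and all 10 original pods' worth of capacity stays ready throughout.
const example = rolloutBudget(10, 25, 0);
console.log(example); // { surge: 3, unavailable: 0, maxTotalPods: 13, minReadyPods: 10 }
```

With max unavailable at 0, capacity never dips below the desired replica count, at the cost of needing headroom on the nodes for the surge pods.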
The four-step rollback
When something goes wrong, rollback is not git revert; there's a faster path:
- "Rollback to previous tag" button in the NIP UI.
- NIP commits to the GitOps repo, pinning the last known good tag.
- Flux syncs within 60 seconds.
- kubectl rollout brings up pods on the known-good tag (Kubernetes reuses the earlier ReplicaSet if it is still in the revision history), and the broken one drains.
Execution time: 2-3 minutes. The git revert path would be 8+ minutes (CI build + push + webhook).
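The two NIP-side pieces of the rollback, choosing the last known good tag and pinning it in the GitOps repo, can be sketched as follows. Both the deploy-history record and the kustomization.yaml layout (a Kustomize images entry with a newTag field) are assumptions for illustration, not NIP's actual data model.

```typescript
// Hypothetical deploy-history record; NIP's real schema is not described here.
type DeployStatus = "SUCCEEDED" | "FAILED";

interface DeployRecord {
  tag: string;
  status: DeployStatus;
  finishedAt: number; // epoch milliseconds
}

// Most recent successfully deployed tag other than the one currently running.
function lastKnownGoodTag(history: DeployRecord[], currentTag: string): string | null {
  const candidates = history
    .filter((d) => d.status === "SUCCEEDED" && d.tag !== currentTag)
    .sort((a, b) => b.finishedAt - a.finishedAt);
  return candidates[0]?.tag ?? null;
}

// Rewrite the first "newTag:" line in a kustomization.yaml, keeping indentation.
// Assumes the file pins the image through Kustomize's images/newTag field.
function pinTag(kustomization: string, newTag: string): string {
  return kustomization.replace(
    /^([ \t]*newTag:[ \t]*).*$/m,
    (_match: string, prefix: string) => prefix + newTag
  );
}
```

Because the rollback is a single small commit, it skips the CI build entirely, which is where the time saving over git revert comes from.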
Why webhook deduplication matters
GitHub Actions retries if NIP responds slowly or with a 5xx. Without dedup, each retry produced a second commit to the GitOps repo for the same SHA, which is pointless reconcile churn. The dedup key is ${repo}:${sha}:${env}, stored in Redis with a 24-hour TTL. Simple, but it filters 30+ duplicate hits per day.
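As a rough sketch of the dedup check, the class below is an in-memory stand-in for the Redis pattern of claiming a key with SET ... NX EX: the first delivery claims the key, and any retry within the TTL is recognised as a duplicate. The class and function names are hypothetical, and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
// In-memory stand-in for Redis "SET key value NX EX <ttl>".
class DedupStore {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true if this call claimed the key (first delivery),
  // false if an unexpired claim already exists (duplicate).
  claim(key: string, nowMs: number): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiresAt.set(key, nowMs + this.ttlMs);
    return true;
  }
}

// Key format from the article: ${repo}:${sha}:${env}
function dedupKey(repo: string, sha: string, env: string): string {
  return `${repo}:${sha}:${env}`;
}

const TTL_24H_MS = 24 * 60 * 60 * 1000;

// Duplicates still get a 200 so the sender stops retrying; only the action differs.
function handleWebhook(
  store: DedupStore,
  repo: string,
  sha: string,
  env: string,
  nowMs: number
): { status: number; action: "deploy" | "skip" } {
  const first = store.claim(dedupKey(repo, sha, env), nowMs);
  return { status: 200, action: first ? "deploy" : "skip" };
}
```

Including env in the key matters: the same SHA going to staging and then to prod are two distinct deploys, not a duplicate.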
What happens when Flux gets stuck
Flux exposes a failureSeverity field on the HelmRelease/Kustomization status. NIP polls it every 30 seconds. On failure (invalid manifest, image pull error, etc.) the deploy state moves to FAILED, and on-call receives a Slack DM plus a PagerDuty page (Sev2).
The most common failures over the past six months: (1) memory request too large for the node; (2) missing secret (SealedSecret not yet deployed); (3) ConfigMap key typo. All three have runbooks.
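The polling logic boils down to mapping each polled status onto a deploy state and a set of notifications. The sketch below uses a simplified, hypothetical status shape (a ready flag plus a message) rather than Flux's actual API objects, and the notification target strings are placeholders.

```typescript
type DeployState = "IN_PROGRESS" | "SUCCEEDED" | "FAILED";

// Simplified view of what one poll returns; not Flux's real status schema.
interface PolledStatus {
  ready: "True" | "False" | "Unknown";
  message: string;
}

interface Transition {
  state: DeployState;
  notify: string[]; // placeholder channel identifiers
}

const POLL_INTERVAL_MS = 30_000; // NIP polls every 30 seconds

// Decide the next deploy state and who to notify from a single poll result.
function evaluate(status: PolledStatus): Transition {
  if (status.ready === "True") {
    return { state: "SUCCEEDED", notify: ["slack:#deploys"] };
  }
  if (status.ready === "Unknown") {
    // Still reconciling: keep polling, notify nobody.
    return { state: "IN_PROGRESS", notify: [] };
  }
  // Failure: on-call gets a Slack DM plus a Sev2 PagerDuty page.
  return { state: "FAILED", notify: ["slack-dm:on-call", "pagerduty:sev2"] };
}
```

Keeping the decision in a pure function like this makes the paging rules easy to unit test without a cluster.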
What we did not build (and why)
- Canary deployments with auto-rollback: planned, but our traffic is not high enough for metric-based decisions to be reliable. Instead we hold a 5-minute stability window and a human reviews the Sentry error rate.
- Multi-region failover automation: we run in one region (Hetzner FSN1). If we ever add another, failover still won't be automatic on day one.
- In-house container registry: GHCR works and it is free, so there is no reason to switch.
Numbers from last quarter
- 1,247 successful deploys
- 23 rollbacks (1.8%)
- 9.1-minute median end-to-end
- 0 lost deploys (thanks to webhook dedup)