From Click-Ops to GitOps: Rebuilding the Afrotomation Fleet on k3s
Nine days ago I wrote a retrospective about migrating 47 apps from Vercel to Coolify. I was proud of that sprint. Then I had to live with the result.
By the evening of April 23, 2026, I was back in front of a terminal typing this:
"I have migrated dozens of apps from Vercel to my own VPS setup in Coolify… however, I feel everything is very manual although Claude helped me with a lot of the heavy lifting. Help me design a system that uses Kubernetes and IaC."
Coolify wasn't broken. Coolify worked. The operating model broke me. So tonight I'm doing it again — only this time the destination is a self-hosted k3s cluster across the same three nodes, with ArgoCD driving deploys from Git, CloudNativePG holding the data, and SOPS holding the secrets. Oh, and I'm also pulling the databases off Neon in the same sprint, because why stop at half a migration.
This is the honest post. The gotchas are real. The compromises are real. The single point of failure is real.
Why rip out Coolify at all
Nothing crashed. That's the awkward part. By April 22 the 47-app Coolify fleet was green, domains resolved, TLS certs renewed. The problem was what it took to keep it green.
A partial list of the things I was doing by hand every week:
- Clicking through the Coolify UI to update an env var on app N, then re-triggering deploys.
- Re-syncing env var drift between my local .env files and Coolify's DB.
- Patching Traefik labels manually when a subdomain needed a new route.
- Chasing stale source_id references after the GitHub App permissions hiccuped.
- Updating ClickRise portfolio URLs manually after every batch of deploys.
- Keeping a mental model of what's deployed where, because there was no repo that said so.
The Coolify database itself had become the source of truth for my production fleet. That's fine for five apps. At forty-seven, it felt like a brittle monolith I couldn't diff, couldn't review, couldn't roll back, and couldn't rebuild from scratch if the VPS burned.
Eight days earlier (April 15), when I'd half-joked about going to Kubernetes, Claude had talked me out of it: "The real culprits are your build setup, your env-var handling, and your DNS. k3s won't fix those." That was correct advice at the time. Eight days of click-ops later, I overruled it.
Why k3s, not something else
Four answers to four questions, typed at 18:54:
"1. Tailscale. 2. [k3s]. 3. whichever is free and works well for small startup. 4. use your best judgement."
The reasoning behind the short answers:
| Option | Verdict |
| --- | --- |
| kubeadm | Too much etcd operational work for a one-person three-node cluster. |
| k3s | Single binary, ~100 MB control plane, embedded datastore, ARM64-native (critical for the Oracle Ampere box), first-class --flannel-iface=tailscale0 support. |
| Nomad | Great tool. But the ecosystem of Helm charts + operators + ArgoCD integrations I want for Postgres + monitoring + ingress is pure Kubernetes. |
| Stay on Coolify | See section above. |
k3s won on weight and on the fact that I already had a working private Tailscale tailnet joining all three nodes.
The bootstrap, 21:00 → 21:08
Three nodes, already on the tailnet, each tagged tag:prod-k8s:
| Node | Provider | Role | Specs |
| --- | --- | --- | --- |
| vps50 (Contabo VPS 50) | Contabo | k3s server + Postgres primary + ingress-nginx | 16 vCPU · 64 GB RAM · 600 GB SSD |
| ada (Oracle Cloud Ampere) | Oracle free tier | k3s agent + Postgres sync replica (RPO=0) + OpenClaw | 192 GB SSD |
| vps10 (Contabo VPS 10) | Contabo | k3s agent + Postgres async replica + Loki + pgBackRest target | 150 GB SSD |
The install script — committed to bootstrap/10-k3s-server.sh in afrotomation-infra — was the plain upstream installer with a handful of flags:
curl -sfL https://get.k3s.io | sh -s - \
--cluster-init \
--flannel-iface=tailscale0 \
--node-ip=$TAILSCALE_IP \
--advertise-address=$TAILSCALE_IP \
--disable=traefik \
--disable=servicelb \
--node-label=afrotomation.io/pg=primary
Two non-defaults worth calling out:
- --disable=traefik — k3s ships Traefik by default; I swapped it for ingress-nginx so I could reuse community annotations and cert-manager patterns already documented a thousand times.
- --disable=servicelb — klipper-lb on a Tailscale-flannel'd cluster with host networking was redundant. I pin the ingress controller to the public node directly.
First attempt at 21:00:23 failed with a curl 404 — the installer script itself was fine, but my own wrapper tried to pull helper files from the private afrotomation-infra repo without auth. Pivoted to streaming the script over SSH:
cat bootstrap/10-k3s-server.sh | ssh contabo50 "bash -s"
Join token captured at 21:02. Both agents (ada-oracle, contabo10) joined by 21:08. Total elapsed from git init in the infra repo to a three-node Ready cluster: ~25 minutes.
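For completeness, the agent side is symmetric. A minimal sketch of the join step, using the k3s defaults (the token path is the stock server location; $SERVER_TAILSCALE_IP and $JOIN_TOKEN are placeholders):
# On the server: the join token lives at the default k3s path
sudo cat /var/lib/rancher/k3s/server/node-token
# On each agent (ada, vps10): join over the tailnet, matching the server's flannel iface
curl -sfL https://get.k3s.io | K3S_URL=https://$SERVER_TAILSCALE_IP:6443 \
  K3S_TOKEN=$JOIN_TOKEN sh -s - agent \
  --flannel-iface=tailscale0 \
  --node-ip=$TAILSCALE_IP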
The GitOps stack
Everything downstream of k3s is managed by ArgoCD using the app-of-apps pattern:
clusters/production/
├── root-app.yaml # the one Application that syncs everything
└── apps/
├── cert-manager.yaml
├── ingress-nginx.yaml
├── external-dns.yaml
├── kube-prometheus-stack.yaml
├── cnpg-operator.yaml
├── sops-operator.yaml
└── image-updater.yaml
platform/
├── postgres/ # CNPG Cluster CRs
├── monitoring/ # Grafana / Prometheus values
└── secrets/
├── ghcr-pull.enc.yaml
└── rotate-ghcr-pat.sh
workloads/
├── afrotomation/
├── clickrise/
├── codeniserver/
└── ... (47 Vercel migrants + OpenClaw)
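For readers who haven't used the pattern: root-app.yaml is the only Application ever applied by hand; everything else is discovered from Git. A minimal sketch of what such a root Application can look like (the repo URL is illustrative, not copied from the actual file):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/codenificient/afrotomation-infra  # hypothetical URL
    targetRevision: main
    path: clusters/production/apps   # the child Application manifests above
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift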
Secrets: SOPS + age, personal age keys enumerated in .sops.yaml. Not sealed-secrets, not External Secrets Operator. SOPS because Git is the source of truth — including the encrypted blobs — and there's nothing to reconcile against an external vault I don't yet need.
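The whole secrets config fits in one file. A sketch of what a .sops.yaml creation rule looks like with age (the recipient key is a placeholder, not mine):
creation_rules:
  - path_regex: .*\.enc\.yaml$
    encrypted_regex: ^(data|stringData)$  # encrypt only Secret payloads, keep metadata diffable
    age: age1qqqq...placeholder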
Image pipeline: each app repo has a GitHub Actions workflow (docs/app-build-workflow-template.yml) that builds and pushes to GHCR. ArgoCD's image-updater polls GHCR, writes a new image tag into the Kustomization, and auto-syncs. Pull auth lives in platform/secrets/ghcr-pull.enc.yaml, rotated by rotate-ghcr-pat.sh.
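Concretely, the per-app wiring is a couple of annotations on the ArgoCD Application. A sketch, assuming a GHCR image named after the app (image name and alias are illustrative):
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=ghcr.io/codenificient/codeniserver  # alias=image
    argocd-image-updater.argoproj.io/app.update-strategy: latest  # track the most recently built tag
    argocd-image-updater.argoproj.io/write-back-method: git       # commit the new tag instead of patching live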
That rotate script ate 30 minutes of my night. First run at 22:13 failed with sops: error loading config: no matching creation rules found because I invoked it from the wrong cwd. Re-ran from the repo root at 22:18 — worked. Lesson: SOPS rules are resolved relative to cwd, not to the file being encrypted. Write defensive scripts.
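The defensive fix is one line at the top of the script: resolve the repo root before invoking sops, so the script behaves the same from any directory.
#!/usr/bin/env bash
set -euo pipefail
cd "$(git rev-parse --show-toplevel)"  # always run relative to the repo root, where .sops.yaml lives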
Per-app layout
Every app in workloads/<app>/ gets the same six files:
workloads/codeniserver/
├── namespace.yaml
├── deployment.yaml
├── service.yaml
├── ingress.yaml
├── secrets.enc.yaml # SOPS-encrypted env
└── kustomization.yaml
Ingress uses ingressClassName: nginx, TLS via cert-manager.io/cluster-issuer: letsencrypt-prod, DNS via external-dns.alpha.kubernetes.io/hostname so Cloudflare records get upserted automatically when I commit a manifest.
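Put together, the per-app ingress.yaml is boring by design. A trimmed sketch (hostname and service name are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: codeniserver
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    external-dns.alpha.kubernetes.io/hostname: codeniserver.afrotomation.com  # hypothetical hostname
spec:
  ingressClassName: nginx
  tls:
    - hosts: [codeniserver.afrotomation.com]
      secretName: codeniserver-tls   # cert-manager fills this in
  rules:
    - host: codeniserver.afrotomation.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: codeniserver
                port:
                  number: 80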
One controversial choice I'll defend below: ingress-nginx runs only on vps50, as a host-network Deployment pinned via nodeSelector. No MetalLB, no cloud LB, no DaemonSet across the three nodes. Cloudflare A records point to vps50's public IP, and the ingress answers directly on 80/443.
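In Helm-values terms, the pinning is a few lines against the upstream ingress-nginx chart (a sketch; the hostname label value assumes the node is literally named vps50):
controller:
  kind: Deployment
  replicaCount: 1
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet  # required when hostNetwork is on
  nodeSelector:
    kubernetes.io/hostname: vps50     # pin to the node Cloudflare points at
  service:
    enabled: false                    # no LB/NodePort; the controller answers on 80/443 directly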
First app through the pipeline: tioyedev2024, pilot deploy, HTTP 200 with a real Let's Encrypt prod cert by ~23:40. Batch 1 immediately after, on explicit direction: "codeniserver, clickrise, sahelprosperity, sahelfoods".
The database story — 17 Neon DBs, one cluster
I wasn't planning to touch the databases tonight. I was overruled by myself:
"not tomorrow. today. we still have 3 hours till midnight. let's migrate at least 30 more apps and we need to migrate a few databases away from Neon DB." (00:56)
The target: CloudNativePG (CNPG) running as an operator inside the k3s cluster, managing a 3-instance Postgres 17 topology:
# platform/postgres/cluster.yaml (excerpt)
instances: 3
imageName: ghcr.io/cloudnative-pg/postgresql:17.2
minSyncReplicas: 1
maxSyncReplicas: 1
# primary: vps50
# sync replica: ada (RPO=0, cross-provider)
# async replica: vps10
storage:
storageClass: local-path
size: 50Gi
walStorage:
storageClass: local-path
size: 10Gi
backup:
barmanObjectStore:
destinationPath: s3://afrotomation-pg-backups/
s3Credentials: # sops-encrypted
wal:
compression: gzip
data:
compression: gzip
retentionPolicy: "90d"
Postgres 17 because every Neon project I was pulling from was on 17. Exactly one database stayed on 16: Solaire. I could have upgraded it, but the point of the night was cutover, not an in-place major-version jump.
The migration flow per app, codified in docs/runbooks/migrate-app-db-from-neon-to-cnpg.md:
- Create the role + db inside the CNPG primary via kubectl exec.
- pg_dump --format=custom --no-owner --no-acl from Neon on my laptop.
- kubectl cp the dump into the primary pod.
- pg_restore --exit-on-error --no-owner --no-acl into the new db (commands sketched below).
- Rewrite DATABASE_URL in the app's SOPS-encrypted env Secret.
- Commit, push, ArgoCD syncs, pod restarts.
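A command-level sketch of the dump/copy/restore steps, with placeholder names (namespace, pod name, and $NEON_URL are illustrative; the -1 pod suffix follows CNPG's naming convention):
# Dump from Neon (custom format so pg_restore can reorder and skip objects)
pg_dump --format=custom --no-owner --no-acl \
  --dbname="$NEON_URL" --file=codeniserver.dump
# Copy into the CNPG primary pod
kubectl -n cnpg cp codeniserver.dump afrotomation-pg-1:/tmp/codeniserver.dump
# Restore into the freshly created database
kubectl -n cnpg exec afrotomation-pg-1 -- \
  pg_restore --exit-on-error --no-owner --no-acl \
  --dbname=codeniserver /tmp/codeniserver.dump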
Downtime per app: about a minute of writes. Reads never went down because CNPG exposes a -ro service fronted by the replicas.
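CNPG creates those services by convention: for a cluster named afrotomation-pg you get -rw (always the current primary), -ro (replicas only), and -r (any instance). Apps just pick a hostname (namespace and database name below are illustrative):
# read-write traffic: follows the primary, survives failover
postgres://app:***@afrotomation-pg-rw.cnpg.svc:5432/codeniserver
# read-only traffic: load-balanced across the replicas
postgres://app:***@afrotomation-pg-ro.cnpg.svc:5432/codeniserver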
I also pulled belt-and-suspenders cold backups to /Users/codenificient/Documents/GitHub/postgres/neonbackups before any cutover — 17 pg_dump archives timestamped 20260423-2203, covering bookshelf, bugginator, calificient, clickrise, codenalytics, codenibudget, codeninvest, codeninvoice, codeniscapes, codeniserver, codenitask, codeniwork, fructosahel, sahelaqua, sahelprosperity, tioyedev2024, and one more. Total across all of them: under 2 GB. That's the thing you learn when you migrate off a managed DB — 17 projects of "production" data fit on a modest thumb drive.
Local verification loop, once per dump:
brew services start postgresql@17
createdb verify_test
pg_restore --no-owner --no-acl --dbname=verify_test <dump>.dump
psql -d verify_test -c '\dt'
dropdb verify_test
Catches schema errors before they hit the cluster. Found one (permissions on a public schema grant), fixed it in the restore flags.
Gotchas — the long list
Every migration has a scar tissue chapter. Here's this one's.
CNPG admission webhook rejected raw synchronous_standby_names. CNPG wants you to declare replica intent via minSyncReplicas/maxSyncReplicas fields and will compute the standby list itself. Learned that about six minutes into my first kubectl apply -f cluster.yaml.
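For anyone hitting the same webhook error, the shape of the fix (not my exact manifest) is to drop the raw parameter and declare intent instead:
spec:
  # postgresql:
  #   parameters:
  #     synchronous_standby_names: "..."  # rejected: CNPG owns this parameter
  minSyncReplicas: 1  # declare intent; the operator computes the standby list
  maxSyncReplicas: 1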
postInitSQL fires before the application database exists. I had a block trying to CREATE EXTENSION pg_stat_statements that blew up with database "app" does not exist. Tried postInitApplicationSQL — same symptom on CNPG 1.24.1 running on ARM64. Gave up and ran the extensions by hand via kubectl exec post-bootstrap. Deferred the proper fix to a later sprint.
barman-cloud-check-wal-archive: Expected empty archive when I recreated the cluster after a config change. The B2 bucket wasn't empty any more. Workaround: bump serverName from afrotomation-pg to afrotomation-pg-v2. Old prefix kept for forensics.
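serverName lives under the object-store config; bumping it points WAL archiving at a fresh prefix without touching the bucket (excerpt in the same shape as the cluster.yaml above):
backup:
  barmanObjectStore:
    destinationPath: s3://afrotomation-pg-backups/
    serverName: afrotomation-pg-v2  # new prefix; the old afrotomation-pg/ tree stays for forensics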
Tailscale CLI died on macOS at 21:57, mid-bringup:
The Tailscale CLI failed to start: The operation couldn't be completed. (Tailscale.CLIError error 1.)
My first instinct was to rip out Tailscale and switch to Headscale. Claude pushed back: "Headscale probably doesn't solve your Mac issue. 90% of 'Tailscale broken on Mac' fixes are a logout + login. Don't rat-hole on this tonight — you have 47 apps to migrate." I accepted the push-back. Workaround: operate the cluster from inside ada over SSH, copy the kubeconfig to my Mac as ~/.kube/afrotomation-prod-public.yaml (public IPs in server: URLs, not tailnet IPs), and defer the tailnet-only kubeconfig to tomorrow-me.
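The stopgap kubeconfig was nothing fancy: pull k3s's generated file off the server node and rewrite the loopback address (paths are the k3s defaults; 203.0.113.10 is a placeholder for vps50's public IP):
ssh contabo50 "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/afrotomation-prod-public.yaml
# k3s writes server: https://127.0.0.1:6443; point it at the public IP instead (macOS sed)
sed -i '' 's#https://127.0.0.1:6443#https://203.0.113.10:6443#' ~/.kube/afrotomation-prod-public.yaml
export KUBECONFIG=~/.kube/afrotomation-prod-public.yaml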
DNS_PROBE_FINISHED_NXDOMAIN → Cloudflare error 1002 on grafana.afrotomation.com. Cloudflare was refusing to proxy the private tailnet IP I'd pointed the A record at. Swapped to vps50's public IP, proxied through Cloudflare, fine.
Grafana started with nothing. "Nothing is showing up, no sources on Grafana." Had to reset the admin password and re-wire the Prometheus datasource from scratch. The persistent volume was there; the provisioned datasource just hadn't survived a restart. Durable fix: codify the datasource in platform/monitoring/values.yaml.
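A sketch of that codified fix in kube-prometheus-stack values, so the datasource is provisioned on every boot instead of living only in Grafana's DB (the service URL assumes the chart's default Prometheus service name in a monitoring namespace):
grafana:
  additionalDataSources:
    - name: Prometheus
      type: prometheus
      url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
      isDefault: true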
Token leaks in chat. In the heat of the rebuild I pasted a live Cloudflare cfut_… token and an npm publish token directly into my terminal output. Flagged both for rotation. The k3s join token also leaked, but that one is gated by Tailscale ACLs (tag:prod-k8s only), so the blast radius is contained.
Coolify "partially down" is a useful state. When I killed Coolify's proxy container to stop it answering on 443, I discovered the app containers kept running. Which means docker inspect was still a valid source of truth for environment variables I had managed to lose from my local .env files. Extracted them, SOPS-encrypted them, committed them, moved on.
tioyedev2024 build failed at 23:10 with bunx prisma generate && bun run build exit 1 because DATABASE_URL wasn't present at build time. Fix: lazy-init the Prisma client so bun run build doesn't crack it open. Applied the same pattern across the fleet pre-emptively.
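The lazy-init pattern, sketched in TypeScript (module path and export name are my own; the fleet's actual files may differ):
// lib/db.ts: construct the client on first use, not at import time,
// so `bun run build` can evaluate the module without DATABASE_URL set
import { PrismaClient } from "@prisma/client";

let client: PrismaClient | undefined;

export function getPrisma(): PrismaClient {
  client ??= new PrismaClient(); // only here does Prisma read DATABASE_URL
  return client;
}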
Compromises — the things I'm not proud of
I'd rather be honest than aspirational, so here's the list of things a "proper" k8s setup would do that mine does not, as of tonight:
- Single-server control plane. --cluster-init was chosen specifically so I can grow to HA later without re-migrating. Today it's one etcd on vps50. If that node's SSD dies, the cluster is down. Postgres will fail over to the Oracle sync replica but the control plane won't.
- No service mesh, no mTLS, no NetworkPolicy. All pod-to-pod traffic is plaintext inside the Tailscale WireGuard tunnel. I'm trusting the tailnet as my security boundary.
- No HPA. Every workload is a fixed-replica Deployment. A viral traffic spike on one subdomain takes the whole node with it.
- Ingress-nginx pinned to one node. If vps50 dies, all 47 hostnames go dark — even though Postgres stays up. I accept this because my users are me plus a handful of beta testers, and because a DaemonSet + MetalLB would have doubled tonight's setup time.
- OpenClaw (the agent gateway) was left powered down. Its Docker Swarm deployment was "complicated" enough that I punted to a clean k8s version in a later sprint.
- Bassaweb + Solaire are on GitLab and deferred. ArgoCD's GitLab auth is fine but I didn't want to chase that token tonight. Mirror-to-GitHub is the follow-up.
- Uptime Kuma + Umami aren't reachable from the new cluster yet; monitoring for those apps is handled by Cloudflare analytics for the moment.
- Token rotation for the leaked Cloudflare + npm + k3s tokens is deferred to tomorrow morning's hygiene block.
- Auto-sync is aggressive. ArgoCD auto-syncs every Application every 3 minutes; a buggy commit hits production fast. No canary, no rollout controller, no feature flags. Again: one user, acceptable.
Lessons learned
In no particular order, because retrospectives are messy:
- Operating model matters more than the platform. Coolify was faster to adopt than k3s. It was slower to live with. If you're going to run something for a year, optimize for the year, not for the first day.
- GitOps disciplines you even when you're a team of one. Having every prod state visible in a single git log is worth the extra setup time by itself. I've already caught two "wait, what changed?" questions in the 90 minutes since the repo went live.
- SOPS + age > sealed-secrets for small teams. No cluster dependency, no round-trip, no operator to debug. Rotate age keys by changing .sops.yaml. Done.
- pg_dump the cold copies before you touch anything. The 17 dumps I made before any cutover were never needed, but the confidence they gave me — knowing I could rebuild any one of those databases from scratch in under a minute — paid for itself in decision speed.
- Claude pushing back when I was about to rat-hole on Headscale mid-migration is the Claude behavior I want. The same Claude that talked me out of k3s on April 15 also talked me out of rewriting the VPN at 22:00 on April 23. Both saves.
- "It's not a crash, I just don't like the operating model" is a valid reason to migrate. I felt guilty about it at first. I don't any more.
End-of-day state — April 23 → April 24
- Three-node k3s cluster: Ready on all nodes, tailnet tagged tag:prod-k8s.
- Workload directories scaffolded in afrotomation-infra/workloads/ for the full 47-app fleet — plus a stub for openclaw (the agent gateway that predates the Vercel era and has always been self-hosted on Ada).
- Pilot + Batch 1 live: tioye.dev, codeniserver, clickrise, sahelprosperity, sahelfoods — serving real traffic with real certs from the k3s ingress.
- CNPG cluster: 1 primary (vps50) + 1 sync replica (ada) + 1 async replica (vps10), nightly backups to Backblaze B2, PITR via WAL shipping.
- 17 Neon pg_dumps archived to ~/Documents/GitHub/postgres/neonbackups/ with timestamped filenames. Neon account queued for free-tier downgrade once the remaining projects are cut over (< 20 remaining).
- Grafana + Prometheus reachable and wired, admin password rotated.
- Coolify is still up on vps50 but no longer fronted by any DNS record; its proxy was stopped around 22:00. Its app containers stay up as a fallback source of truth for env vars I don't trust my local files on.
Roughly 20–25 apps actually cut over tonight. The remaining ~22 sit as committed manifests in the infra repo, waiting for their pg_dump + DNS flip tomorrow. That's the nice thing about having finished the scaffolding before midnight: the rest of the migration is pure reps, and reps don't need all-nighters.
Next post: the CNPG runbook in detail, and whether tearing down Coolify twice in nine days taught me anything that would have been impossible to learn on the first try.
Spoiler: I think it did. But I want to sleep on it.