Phase 4: K3s HA cluster (5-node production) #37

Closed
opened 2026-03-26 02:27:33 +00:00 by mik-tf · 2 comments

Goal

5-node K3s HA cluster on TFGrid matching freezone's production setup. Zero single points of failure.

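Depends on

  • https://forge.ourworld.tf/mycelium_code/home/issues/36 (Phase 3: Load testing, confirms the app handles concurrency)

Architecture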

Cloudflare (DNS + CDN + DDoS)
  → 3 TFGrid gateways (Let's Encrypt TLS, round-robin)
    → 5-node K3s cluster (WireGuard mesh)
      → 2 backend replicas (active-active, GlusterFS RWX)
      → 2 frontend replicas (stateless nginx + WASM)
      → 1 admin replica

Tasks

Infrastructure (Terraform/OpenTofu)

  • 4.1 Terraform: 3 K3s servers (control plane + etcd quorum) — 4 CPU, 8GB RAM, 50GB disk each
  • 4.2 Terraform: 2 K3s agents (worker nodes) — 4 CPU, 8GB RAM, 50GB disk each
  • 4.3 Terraform: 3 TFGrid gateways — FQDN proxies for projectmycelium.org subdomains (app, api, admin, www)
  • 4.4 Terraform: WireGuard mesh (10.1.0.0/16) + Mycelium IPv6 overlay for ops
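
As a rough illustration of task 4.4, the sketch below carves the 10.1.0.0/16 range into one subnet per node and counts the tunnels a full mesh would need. The node names, the /24-per-node layout, and the full-mesh topology are assumptions made for illustration; the issue only specifies the /16 range and the node counts.

```python
import ipaddress

# Names mirror the 3-server + 2-agent layout; the exact names are illustrative.
NODES = ["server-1", "server-2", "server-3", "agent-1", "agent-2"]
MESH_CIDR = "10.1.0.0/16"  # range given in task 4.4


def allocate_subnets(cidr, nodes, prefix=24):
    """Assign one subnet of the given prefix length to each node, in order.

    One /24 per node is an assumption, not something the issue specifies.
    """
    subnets = ipaddress.ip_network(cidr).subnets(new_prefix=prefix)
    return {node: next(subnets) for node in nodes}


def full_mesh_links(node_count):
    """Number of point-to-point tunnels in a full mesh of node_count peers."""
    return node_count * (node_count - 1) // 2


if __name__ == "__main__":
    for node, subnet in allocate_subnets(MESH_CIDR, NODES).items():
        print(f"{node}: {subnet}")
    print("tunnels in a full mesh:", full_mesh_links(len(NODES)))
```

A /16 holds 256 subnets of size /24, so this layout leaves ample room beyond the five nodes.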

Cluster Bootstrap

  • 4.5 setup-cluster.sh — server-1 cluster-init, server-2/3 join, agent-1/2 join
  • 4.6 setup-storage.sh — Kadalu GlusterFS Replica3 (40GB brick per server, ext4)
  • 4.7 StorageClass kadalu-replicated — RWX for backend PVC

Kubernetes Manifests

  • 4.8 Kustomize base — namespace, backend Deployment+Service+PVC, frontend Deployment+Service, admin Deployment+Service, Ingress
  • 4.9 Kustomize prod-ha overlay — 2 backend replicas (RollingUpdate), RWX PVC on kadalu-replicated, pod anti-affinity, image tag pinning, imagePullSecrets
  • 4.10 Ingress routing — projectmycelium.org, app.projectmycelium.org, admin.projectmycelium.org
  • 4.11 ConfigMap for branding.toml — mounted into backend pods
  • 4.12 Secrets for env vars (JWT secret, payment keys, etc.)
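
Task 4.9 asks for 2 backend replicas with a RollingUpdate strategy. The issue does not say which maxUnavailable and maxSurge values the overlay sets, so this sketch assumes the Kubernetes defaults of 25% each and works out how many pods stay up during a rollout. The function name is hypothetical.

```python
import math


def rollout_bounds(replicas, max_unavailable_pct=25, max_surge_pct=25):
    """Return (min_available, max_total) pod counts during a RollingUpdate.

    Kubernetes rounds a percentage maxUnavailable down and a percentage
    maxSurge up when converting them to absolute pod counts.
    """
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    max_surge = math.ceil(replicas * max_surge_pct / 100)
    return replicas - max_unavailable, replicas + max_surge


if __name__ == "__main__":
    min_available, max_total = rollout_bounds(2)
    print(f"min available: {min_available}, max total: {max_total}")
```

Under those assumed defaults, 2 replicas never drop below 2 available pods, but the rollout briefly needs room for a third pod. If the anti-affinity rule is a hard requirement, that third pod needs a node not already running a backend replica.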

Orchestration

  • 4.13 Makefile: make infra ENV=prod (OpenTofu apply)
  • 4.14 Makefile: make cluster ENV=prod (K3s + Kadalu)
  • 4.15 Makefile: make deploy ENV=prod (Kustomize apply)
  • 4.16 Makefile: make all ENV=prod (infra → cluster → deploy → test)
  • 4.17 Makefile: make ssh ENV=prod (Mycelium IPv6 SSH to nodes)

Verification

  • 4.18 Deploy marketplace to cluster
  • 4.19 Run full test suite against production URLs
  • 4.20 Verify pod anti-affinity (pods spread across nodes)
  • 4.21 Kill one node — verify zero downtime
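
One way to approach check 4.20 is to group pods by application label and node, then flag any node that hosts more than one replica of the same app. The sketch assumes pods carry an `app` label and that the input has the shape of `kubectl get pods -o json` output; the helper names and the sample placement are made up for illustration.

```python
from collections import defaultdict


def make_pod(app, node):
    """Build a minimal pod record with only the fields this check reads."""
    return {"metadata": {"labels": {"app": app}}, "spec": {"nodeName": node}}


def colocated_replicas(pod_list, label_key="app"):
    """Return {(app, node): count} for every node hosting 2+ pods of one app."""
    counts = defaultdict(int)
    for pod in pod_list.get("items", []):
        app = pod.get("metadata", {}).get("labels", {}).get(label_key)
        node = pod.get("spec", {}).get("nodeName")
        if app and node:  # skip unlabeled or not-yet-scheduled pods
            counts[(app, node)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical placement: node names are illustrative, replica counts
# follow the architecture in the issue (2 backend, 2 frontend, 1 admin).
SAMPLE = {
    "items": [
        make_pod("backend", "agent-1"),
        make_pod("backend", "agent-2"),
        make_pod("frontend", "agent-1"),
        make_pod("frontend", "server-3"),
        make_pod("admin", "server-2"),
    ]
}

if __name__ == "__main__":
    clashes = colocated_replicas(SAMPLE)
    print("anti-affinity holds" if not clashes else f"co-located replicas: {clashes}")
```

An empty result means no two replicas of the same app share a node. Pods of different apps on the same node, such as backend and frontend on agent-1 in the sample, are not flagged.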

Reference

  • Freezone K3s: znzfreezone_deploy/k3s-v2/
  • Freezone Terraform: znzfreezone_deploy/k3s-v2/tf/main.tf (5 nodes + 3 gateways)
  • Freezone Kustomize: znzfreezone_deploy/k8s/base/ + k8s/prod-ha/
  • Freezone scripts: znzfreezone_deploy/k3s-v2/scripts/setup-*.sh

Failure tolerance targets

| Failure | Impact |
|---------|--------|
| 1 server dies | etcd quorum intact (2/3), pods reschedule, 0s downtime |
| 1 agent dies | Pods reschedule, 0s downtime |
| 1 gateway dies | Cloudflare routes to remaining 2, 0s downtime |
| 2 servers die | Cluster degraded, manual recovery needed |
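
The server rows follow from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them reachable to accept writes. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def quorum(members):
    """Smallest number of etcd members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a majority remains."""
    return members - quorum(members)


def has_quorum(members, failed):
    """True if the surviving members still form a majority."""
    return members - failed >= quorum(members)


if __name__ == "__main__":
    servers = 3  # control-plane count from task 4.1
    print("quorum:", quorum(servers))
    print("failures tolerated:", failures_tolerated(servers))
    for failed in (1, 2):
        print(f"{failed} server(s) down, quorum held: {has_quorum(servers, failed)}")
```

With 3 servers the quorum is 2, so one failure is survivable and a second one loses the majority, which matches the "manual recovery needed" row.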

Signed-off-by: mik-tf


Deferred — focus on single-VM first

Phases 4+5 (K3s HA + backup) are deferred until single-VM dev passes 100% of tests. The freezone K3s setup is already proven and will be straightforward to replicate once the application layer is solid.

Current blocker: 8 Playwright tests fail on dev (5 SPA-only, 2 visual/routing, 1 SPA Buy Now).

— mik-tf


Superseded by mycelium_code/home#49 — detailed task breakdown with file-by-file audit against freezone k3s-v2 reference.

— mik-tf
