coopcloud_code/home

Phase 4: Update k3s/ + k8s/ for embedded OSIS architecture #49

Closed

opened 2026-03-27 14:53:25 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-03-27 14:53:25 +00:00

Member

Goal

Update the existing k3s/ and k8s/ directories in projectmycelium_marketplace_deploy to match the current v2.0.0 architecture (embedded OSIS, no external containers) and align with the freezone k3s-v2 production pattern.

Reference: znzfreezone_deploy/k3s-v2/ + znzfreezone_deploy/k8s/ (production-proven, 5-node HA).

Current state

The marketplace already has k3s/ and k8s/ directories but they're outdated — written before the embedded OSIS migration:

k8s/base/ — what's wrong

File	Problem
`hero-osis.yaml`	Remove — OSIS is now embedded in the backend binary
`marketplace.yaml`	Uses `APP_BACKEND=hero`, `HERO_OSIS_URL` — should be `APP_BACKEND=local` with no external deps
`marketplace.yaml`	Missing: liveness/readiness probes, resource limits, volume mount for `/app/data`
Missing	`frontend-deployment.yaml` — SPA frontend (nginx + WASM)
Missing	`frontend-service.yaml`
Missing	`admin-deployment.yaml` — admin dashboard
Missing	`admin-service.yaml`
Missing	`backend-pvc.yaml` — PersistentVolumeClaim for OSIS data
Missing	`backend-service.yaml` — separate from deployment
`kustomization.yaml`	References hero-osis, missing frontend/admin/PVC

k8s/overlays/ — what's wrong

File	Problem
`prod/kustomization.yaml`	3 replicas but no RWX volume, no anti-affinity, no image pinning
`dev/kustomization.yaml`	Basic but functional
Missing	`prod-ha/kustomization.yaml` — freezone-style HA overlay (2 replicas, Kadalu RWX, anti-affinity)

k3s/ — what's wrong

File	Problem
`tf/main.tf`	Only 1 server + 2 agents — freezone uses 3 servers (etcd quorum) + 2 agents
Missing	`scripts/setup-cluster.sh` — K3s HA bootstrap (join 3 servers)
Missing	`scripts/setup-velero.sh` — backup setup
Missing	`scripts/restore-data.sh` — disaster recovery
Missing	`scripts/migrate.sh` — data migration from single-VM

Tasks

k8s/base/ — rewrite for v2.0.0 architecture

4.1 Remove hero-osis.yaml (no longer needed)
4.2 Rewrite marketplace.yaml → backend-deployment.yaml:
- Image: projectmycelium_marketplace:TAG
- Env: remove HERO_OSIS_URL/HERO_LEDGER_RPC_URL, add MARKETPLACE_DB_PATH=/app/data/marketplace
- Probes: GET /api/health (liveness 10s), GET /api/ready (readiness 5s)
- Volume mount: /app/data from PVC
- Resource requests/limits
4.3 Add backend-pvc.yaml — 10Gi RWO (base), overridden to RWX + Kadalu in prod-ha
4.4 Add backend-service.yaml — port 8000
4.5 Add frontend-deployment.yaml + frontend-service.yaml — nginx + WASM SPA, port 80
4.6 Add admin-deployment.yaml + admin-service.yaml — admin proxy, port 9000, MARKETPLACE_RPC_URL env
4.7 Update ingress.yaml — 3 hosts (app, admin, API), TFGrid gateway handles TLS
4.8 Update kustomization.yaml — new file list, configMapGenerator for branding
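Tasks 4.2 to 4.4 could look roughly like the manifests below. This is a minimal sketch, not the committed files: the image, env vars, probe paths, mount path, service port and PVC size come from the task list, while the resource names (`marketplace-backend`, `backend-data`), the label scheme, the container port, the CPU/memory figures, and the reading of "10s"/"5s" as probe periods are assumptions.

```yaml
# backend-pvc.yaml (sketch): 10Gi ReadWriteOnce claim for the embedded OSIS data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backend-data                 # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce                  # overridden to ReadWriteMany in prod-ha
  resources:
    requests:
      storage: 10Gi
---
# backend-deployment.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: marketplace-backend          # hypothetical name
  labels:
    app: marketplace-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: marketplace-backend
  template:
    metadata:
      labels:
        app: marketplace-backend
    spec:
      containers:
        - name: backend
          image: projectmycelium_marketplace:TAG
          ports:
            - containerPort: 8000    # assumed to match the service port
          env:
            - name: APP_BACKEND
              value: local           # embedded OSIS, no external deps
            - name: MARKETPLACE_DB_PATH
              value: /app/data/marketplace
          livenessProbe:
            httpGet:
              path: /api/health
              port: 8000
            periodSeconds: 10        # assumption: "10s" is the probe period
          readinessProbe:
            httpGet:
              path: /api/ready
              port: 8000
            periodSeconds: 5         # assumption: "5s" is the probe period
          resources:                 # placeholder figures, tune per workload
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
          volumeMounts:
            - name: data
              mountPath: /app/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: backend-data
---
# backend-service.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: marketplace-backend
spec:
  selector:
    app: marketplace-backend
  ports:
    - port: 8000
      targetPort: 8000
```

Keeping the PVC and Service in their own files, as the task list asks, lets an overlay patch each resource independently.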

k8s/prod-ha/ — new overlay (freezone pattern)

4.9 Create prod-ha/kustomization.yaml:
- Backend: 2 replicas, RollingUpdate, RWX PVC (Kadalu), anti-affinity
- Frontend: 2 replicas, anti-affinity
- Admin: 1 replica
- Image pinning via kustomize images: block (SSOT for versions)
- Registry auth imagePullSecrets
- Strip TLS from ingress (TFGrid gateway handles it)
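One way the prod-ha overlay in 4.9 might be laid out is sketched below. It assumes the base defines Deployments named `marketplace-backend` and `marketplace-frontend` and a PVC named `backend-data`; those names, the `registry-auth` secret and the `kadalu.replica3` storage class are all hypothetical. The `v2.0.0` tag and the 30Gi size are the values reported in the follow-up comment. The frontend anti-affinity patch and the ingress TLS removal are left out for brevity.

```yaml
# k8s/prod-ha/kustomization.yaml (sketch, not the committed file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../base

# Single source of truth for image versions
images:
  - name: projectmycelium_marketplace
    newTag: v2.0.0

replicas:
  - name: marketplace-backend        # hypothetical Deployment name
    count: 2
  - name: marketplace-frontend       # hypothetical Deployment name
    count: 2

patches:
  # Strategic merge patch: spread backend pods across nodes, add registry auth
  - target:
      kind: Deployment
      name: marketplace-backend
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: marketplace-backend
      spec:
        strategy:
          type: RollingUpdate
        template:
          spec:
            imagePullSecrets:
              - name: registry-auth              # hypothetical secret name
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchLabels:
                        app: marketplace-backend
                    topologyKey: kubernetes.io/hostname
  # JSON 6902 patch: switch the base RWO claim to RWX on Kadalu
  - target:
      kind: PersistentVolumeClaim
      name: backend-data
    patch: |-
      - op: replace
        path: /spec/accessModes
        value: [ReadWriteMany]
      - op: add
        path: /spec/storageClassName
        value: kadalu.replica3                   # hypothetical class name
      - op: replace
        path: /spec/resources/requests/storage
        value: 30Gi
```

The rendered output can be inspected without touching the cluster via `kubectl kustomize k8s/prod-ha/`.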

k3s/tf/ — 5-node HA

4.10 Update main.tf — 3 servers (etcd quorum) + 2 agents + 3 gateways
- Follow freezone k3s-v2/tf/main.tf pattern
- Server: 4 CPU, 8GB RAM, 50GB disk
- Agent: 4 CPU, 8GB RAM, 50GB disk
- 3 gateways per subdomain (app, admin, API)
4.11 Update outputs.tf — server_ips, agent_ips, server_mycelium, agent_mycelium, URLs
4.12 Update variables.tf — server_node_ids (3), agent_node_ids (2), gateway_nodes (3+)

k3s/scripts/ — add missing scripts

4.13 Add setup-cluster.sh — bootstrap 3-server HA K3s (join servers 2+3 to server 1)
4.14 Update setup-server.sh — install K3s with --cluster-init on first server
4.15 Update setup-agent.sh — join agents to HA cluster
4.16 Update setup-storage.sh — Kadalu GlusterFS Replica3 on 3 servers
4.17 Add setup-velero.sh — Velero + MinIO for in-cluster backups
4.18 Add restore-data.sh — restore from Velero backup
4.19 Add migrate.sh — migrate data from single-VM Docker volume to K3s PVC
4.20 Update deploy-app.sh — kubectl apply -k k8s/prod-ha/
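The server bootstrap in 4.13 to 4.15 uses the `--cluster-init` flag. K3s also accepts the same options as keys in `/etc/rancher/k3s/config.yaml`, so the topology can be illustrated in that form. This is an equivalent sketch, not what the scripts necessarily do; the token value and the server address (picked from the 10.1.0.0/16 mesh) are placeholders.

```yaml
# /etc/rancher/k3s/config.yaml on server 1: initializes embedded etcd
cluster-init: true
token: REPLACE_WITH_SHARED_SECRET      # placeholder, shared by all nodes
---
# /etc/rancher/k3s/config.yaml on servers 2 and 3 (k3s server)
# and on both agents (k3s agent): join via server 1
server: https://10.1.0.2:6443          # hypothetical address of server 1
token: REPLACE_WITH_SHARED_SECRET
```

The two documents are separate files on separate machines; they are shown together only to contrast the initializing node with the joining nodes.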

k3s/Makefile — end-to-end automation

4.21 Update targets: make all ENV=prod runs infra → cluster → storage → deploy → test
4.22 Add make migrate ENV=prod for single-VM → K3s migration

Failure tolerance targets

Failure	Impact
1 server dies	etcd quorum intact (2/3), pods reschedule, 0s downtime
1 agent dies	Pods reschedule, 0s downtime
1 gateway dies	Cloudflare routes to remaining 2, 0s downtime
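The "2/3" in the first row follows from etcd's majority rule. As a short worked check, with $n$ etcd members, quorum $q$ and tolerated failures $f$:

```latex
\[
q(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1, \qquad f(n) = n - q(n)
\]
\[
q(3) = 2,\; f(3) = 1 \qquad\text{versus}\qquad q(1) = 1,\; f(1) = 0
\]
```

So the 3-server layout survives the loss of one server, while the previous 1-server layout has no tolerance. The 2 agents are not etcd members and do not count toward $n$.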

Architecture

Cloudflare (DNS + CDN + DDoS)
  → 3 TFGrid gateways (Let's Encrypt TLS, round-robin)
    → 5-node K3s cluster (WireGuard mesh 10.1.0.0/16)
      → 2 backend replicas (RollingUpdate, Kadalu GlusterFS RWX PVC)
      → 2 frontend replicas (stateless nginx + WASM)
      → 1 admin replica
      → Velero + MinIO (in-cluster backups)

Directory structure (target)

k3s/
  Makefile                    # make all ENV=prod
  tf/
    main.tf                   # 3 servers + 2 agents + 3 gateways
    variables.tf
    outputs.tf
  scripts/
    setup-server.sh           # K3s server with --cluster-init
    setup-cluster.sh          # Join servers 2+3 for HA
    setup-agent.sh            # Join agents
    setup-storage.sh          # Kadalu GlusterFS Replica3
    setup-velero.sh           # Velero + MinIO backups
    setup-dns.sh              # Cloudflare DNS (3 A records per subdomain)
    deploy-app.sh             # kubectl apply -k k8s/prod-ha/
    migrate.sh                # Single-VM → K3s data migration
    restore-data.sh           # Restore from Velero backup
  envs/
    prod/                     # Production env (cluster.env, tf state)

k8s/
  base/
    namespace.yaml
    backend-deployment.yaml   # Embedded OSIS, probes, volume
    backend-pvc.yaml          # 10Gi RWO (overridden in prod-ha)
    backend-service.yaml
    frontend-deployment.yaml  # nginx + WASM SPA
    frontend-service.yaml
    admin-deployment.yaml     # Admin proxy + WASM
    admin-service.yaml
    ingress.yaml              # 3 hosts (app, admin, API)
    kustomization.yaml
  overlays/
    dev/kustomization.yaml    # Dev namespace, :development tags
    prod/kustomization.yaml   # Prod namespace, :latest tags
  prod-ha/
    kustomization.yaml        # 2 replicas, Kadalu RWX, anti-affinity, image pinning

Acceptance criteria

make all ENV=prod provisions 5 VMs, bootstraps K3s HA, deploys app
2 backend replicas active-active on GlusterFS RWX
Health probes use /api/health and /api/ready
1 node failure = zero downtime
All 272 tests pass against the K3s deployment

Signed-off-by: mik-tf


mik-tf referenced this issue

2026-03-27 15:24:39 +00:00

Phase 4: K3s HA cluster (5-node production) #37

mik-tf commented

2026-03-27 15:34:07 +00:00

Author

Member

All 22 tasks complete

Commit: ea623ad on the development branch of projectmycelium_marketplace_deploy.

k8s/base/ (tasks 4.1-4.8)

Removed hero-osis.yaml
backend-deployment.yaml with /api/health + /api/ready probes
backend-pvc.yaml (10Gi RWO)
backend-service.yaml, frontend-deployment+service, admin-deployment+service
ingress.yaml (3 path rules: /api and /static → backend, / → frontend; plus a separate admin host)
kustomization.yaml updated

k8s/prod-ha/ (task 4.9)

2 backend replicas, RollingUpdate, Kadalu RWX 30Gi, anti-affinity
2 frontend replicas with anti-affinity
Image pinning (v2.0.0), registry auth, TLS stripped

k3s/tf/ (tasks 4.10-4.12)

3 servers (server_node_ids) + 2 agents (agent_node_ids) + 3 gateways
outputs: server_ips, server_mycelium, agent_ips, agent_mycelium, app_url, admin_url

k3s/scripts/ (tasks 4.13-4.20)

setup-cluster.sh — 3-server HA bootstrap + 2 agents + kubeconfig
setup-server.sh — supports cluster-init + join modes
setup-velero.sh — Velero + MinIO daily backups
restore-data.sh — restore from Velero
migrate.sh — single-VM Docker → K3s PVC migration
deploy-app.sh — kustomize overlay selection
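The daily backups set up by setup-velero.sh could be declared with a Velero `Schedule` resource along these lines. This is a sketch under assumptions: the schedule name, the `marketplace` namespace, the 02:00 run time and the 7-day retention are illustrative, and only the daily cadence comes from the summary above.

```yaml
# Velero Schedule (sketch): one backup per day of the app namespace
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: marketplace-daily            # hypothetical name
  namespace: velero                  # namespace where Velero is installed
spec:
  schedule: "0 2 * * *"              # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - marketplace                  # hypothetical app namespace
    ttl: 168h0m0s                    # keep each backup for 7 days (assumed)
```

A restore would then pick one of the backups this schedule produced, which is the path restore-data.sh covers.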

k3s/Makefile (tasks 4.21-4.22)

make all ENV=prod (infra → cluster → storage → deploy → test)
make migrate ENV=prod SOURCE_SSH=root@vm
make velero, make restore

Ready to provision when TFGrid node IDs are available.

— mik-tf


mik-tf closed this issue

2026-03-27 15:34:15 +00:00

mik-tf referenced this issue

2026-03-28 12:21:50 +00:00

Marketplace v2.0 — Master Tracker #40