[ops] Long-term: GitOps + immutable infra for Hero OS multi-VM / multi-tenant deploys #164

Closed
opened 2026-04-24 03:33:34 +00:00 by mik-tf · 3 comments
Owner

When this matters

Not today. This is the destination issue — open now so we don't forget the shape of the end state while we're deep in tactical fixes.

Trigger to start on this: the moment Hero is deployed to a second long-lived VM (production customer, multi-tenant SaaS, internal staging separate from demo). Until then, the Tier 1 DR + Tier 2 make demo path (#prev-two-issues) covers the need.

The gap it closes

Tier 2 (make demo) produces a deterministic fresh install from code. But:

  • No continuous reconciliation — if someone SSH'es into the VM and changes a hero_proc action env, that drift is invisible to git. Over months, the VM drifts from the declared state.
  • No trivial N-way scaling — spinning up a second VM means running make demo twice with separate env files, managing their differences manually.
  • No trivial rollback — "deploy commit X again" works if the code path is clean, but config-level changes (hero_proc env) aren't in git.
  • Secrets management is ad-hoc — today FORGEJO_TOKEN sits in ~/hero/cfg/env/env.sh. Works for one operator. Doesn't scale to a team.
  • Data backup is the only continuous safety net — runtime config is not versioned.

What "Tier 3" looks like for Hero

Everything declarative, reconciled from a git repo:

1. All runtime config moves into hero_proc action files committed to git

Right now hero_proc action set <file.json> is an imperative command. In Tier 3: a hero-config/ repo holds actions/<service>.yaml files; a reconciler (Argo-CD-style or a small custom daemon) reads them, diffs against the running hero_proc state, applies the delta. Git is the source of truth.

Consequence: to change hero_agent's HERO_AGENT_ROUTING_MODE, you open a PR editing actions/hero_agent_server.yaml, merge → reconciler applies. Fully audited, reviewable, rollback-able.
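As a concrete illustration (the filenames and keys here are hypothetical — the real action schema is whatever hero_proc defines), a committed `actions/hero_agent_server.yaml` might look like:

```yaml
# actions/hero_agent_server.yaml — hypothetical shape, for illustration only
name: hero_agent_server
script: ~/hero/bin/hero_agent_server
env:
  HERO_AGENT_ROUTING_MODE: local   # the PR-editable knob from the example above
restart: on-failure
```

The reconciler's loop is then three steps: read these files, query the live hero_proc action state, apply the diff.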

2. Service binaries come from a registry, not from source builds

Tier 2 builds on the VM. Tier 3: CI builds per-commit artifacts pushed to a Forgejo package registry. The action spec says script: ~/hero/bin/hero_agent_server@v1.2.3; the reconciler ensures the right binary is present.

Saves ~30 min per deploy (no in-place cargo build). Enables rollback (@v1.2.2) without rebuilding.
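A minimal sketch of the first step that reconciler would take — splitting a pinned spec like the one above into binary path and version. The `@vX.Y.Z` convention is this issue's proposal, not an existing hero_proc feature:

```rust
/// Split a pinned action spec ("~/hero/bin/hero_agent_server@v1.2.3")
/// into (binary path, version tag). Returns None when no version pin
/// is present, i.e. for a plain Tier-2-style path.
fn parse_binary_spec(spec: &str) -> Option<(&str, &str)> {
    let (path, version) = spec.rsplit_once('@')?;
    // Only treat it as a pin if it looks like a version tag.
    version.starts_with('v').then_some((path, version))
}
```

With the spec parsed, the reconciler checks whether that exact version is already on disk and fetches it from the registry only when it isn't.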

3. Seed data is migration-driven, not replay-driven

Tier 2's hero_zero_seed produces a clean initial state. Tier 3: treat seed data like database migrations — each commit adds forward-only migration files (migrations/2026-04-24-001_add_geomind_nitrograph.rhai). Running systems apply pending migrations; fresh systems apply all of them. No more "the old seed TOMLs don't match current schema."
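The apply logic stays small; a sketch, assuming migrations are tracked by filename and the `YYYY-MM-DD-NNN` prefix makes lexicographic order chronological:

```rust
use std::collections::BTreeSet;

/// Return the migrations a running system still needs to apply,
/// in the order they were authored. A fresh system passes an empty
/// `applied` set and gets the full sequence.
fn pending_migrations<'a>(all: &[&'a str], applied: &BTreeSet<&'a str>) -> Vec<&'a str> {
    let mut pending: Vec<&str> = all
        .iter()
        .copied()
        .filter(|m| !applied.contains(m))
        .collect();
    pending.sort_unstable(); // date-prefixed names sort chronologically
    pending
}
```

The `applied` set would live wherever hero_zero keeps its own state today — the only new requirement is that it persists across deploys.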

4. Secrets in a real secret store

Vault / Teleport / SOPS-encrypted-in-git / Forgejo encrypted env. Any of them. Point is: not plaintext shell files in operator homedirs.

5. Observability from day 1

Prom + Grafana or an equivalent. Every hero_proc service exposes /metrics (openmetrics). Alerts on: OSIS write failures, embedder query latency, agent.chat P99, aibroker error rate. Makes "is the AI broken?" answerable without SSH.
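For services that don't have a metrics crate wired in yet, the exposition format itself is trivial — a sketch of rendering one counter the way a `/metrics` endpoint would (the metric name in the usage note is invented for illustration):

```rust
/// Render one counter in OpenMetrics/Prometheus text exposition format.
/// Counters carry a `_total` suffix on the sample line; the HELP and
/// TYPE metadata lines use the bare metric name.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name}_total {value}\n")
}
```

Calling `render_counter("osis_write_failures", "OSIS write failures", 3)` yields the three-line block Prometheus scrapes; a real endpoint concatenates one such block per metric family.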

6. Runbooks live alongside code

Every known failure mode from home#122-160 becomes a documented runbook entry in docs_hero/ops/runbooks/. On-call finds a failing service, looks up the runbook, follows numbered steps, fixes it. Not tribal knowledge.

What this does NOT need

  • Kubernetes. TF Grid VMs + hero_proc are enough for the scale Hero targets. k8s adds operational surface area we don't need.
  • A service mesh. Unix sockets + hero_router are the mesh.
  • A separate CI platform. Forgejo Actions handles artifact builds, migration tests, config validation.

Pragmatic order of landing (when the time comes)

  1. Config-as-code first (actions/ committed to git; reconciler as a small Rust daemon).
  2. Binary registry second (Forgejo packages, action spec references by version).
  3. Migration-driven seed third (replaces hero_zero_seed's TOML corpus).
  4. Secrets fourth (pick Vault or SOPS depending on team preference).
  5. Observability + runbooks fifth (incremental — start with embedder + agent + aibroker, expand).

Each step is an independent landing, each delivers value on its own.

Estimated effort

  • Tier 3 full landing: 4-6 weeks engineering + 1 week ops runbook work.
  • Prerequisite: Tier 2 shipped and stable for ≥1 month so we know what the real drift patterns look like.
Related

  • #160 — demo state checkpoint
  • Prev ops issues: disaster recovery (Tier 1), hero_zero_seed formalization, make demo target (Tier 2)

Signed-off-by: mik-tf

Author
Owner

Resolved by lhumina_code/hero_skills@7c823d1 (PR lhumina_code/hero_skills#126).

Part of Phase 2 tracker #185.

mik-tf reopened this issue 2026-04-25 16:31:55 +00:00
Author
Owner

Reopening — closed in error earlier today. The hero_demo runbook §13 had this issue listed as the tracker for an unrelated deploy step (ONNX install for #162 / HERO_ROOTDIR override for #164), and I trusted the reference without checking the actual issue body. Apologies for the noise. The actual scope of this issue is unchanged from when it was filed.

The correct trackers for the work that just landed: ONNX install + HERO_ROOTDIR are covered directly by lhumina_code/hero_skills@7c823d1 and tracked under #185 (no separate sub-issues filed).

Author
Owner

Moved to hero_demo#32 — see lhumina_code/hero_demo#32
