[ops] Long-term: GitOps + immutable infra for Hero OS multi-VM / multi-tenant deploys #164

Closed
opened 2026-04-24 03:33:34 +00:00 by mik-tf · 3 comments
Owner

When this matters

Not today. This is the destination issue — open now so we don't forget the shape of the end state while we're deep in tactical fixes.

Trigger to start on this: the moment Hero is deployed to a second long-lived VM (production customer, multi-tenant SaaS, internal staging separate from demo). Until then, the Tier 1 DR + Tier 2 make demo path (#prev-two-issues) covers the need.

The gap it closes

Tier 2 (make demo) produces a deterministic fresh install from code. But:

  • No continuous reconciliation — if someone SSH'es into the VM and changes a hero_proc action env, that drift is invisible to git. Over months, the VM drifts from the declared state.
  • No trivial N-way scaling — spinning up a second VM means running make demo twice with separate env files, managing their differences manually.
  • No trivial rollback — "deploy commit X again" works if the code path is clean, but config-level changes (hero_proc env) aren't in git.
  • Secrets management is ad-hoc — today FORGEJO_TOKEN sits in ~/hero/cfg/env/env.sh. Works for one operator. Doesn't scale to a team.
  • Data backup is the only continuous safety net — runtime config is not versioned.

What "Tier 3" looks like for Hero

Everything declarative, reconciled from a git repo:

1. All runtime config moves into hero_proc action files committed to git

Right now hero_proc action set <file.json> is an imperative command. In Tier 3: a hero-config/ repo holds actions/<service>.yaml files; a reconciler (Argo-CD-style or a small custom daemon) reads them, diffs against the running hero_proc state, applies the delta. Git is the source of truth.

Consequence: to change hero_agent's HERO_AGENT_ROUTING_MODE, you open a PR editing actions/hero_agent_server.yaml, merge → reconciler applies. Fully audited, reviewable, rollback-able.
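As a concrete illustration (the filenames and keys here are hypothetical — the real action schema is whatever hero_proc defines), a committed `actions/hero_agent_server.yaml` might look like:

```yaml
# actions/hero_agent_server.yaml — hypothetical shape, for illustration only
name: hero_agent_server
script: ~/hero/bin/hero_agent_server
env:
  HERO_AGENT_ROUTING_MODE: local   # the PR-editable knob from the example above
restart: on-failure
```

The reconciler's loop is then three steps: read these files, query the live hero_proc action state, apply the diff.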

2. Service binaries come from a registry, not from source builds

Tier 2 builds on the VM. Tier 3: CI builds per-commit artifacts pushed to a Forgejo package registry. The action spec says script: ~/hero/bin/hero_agent_server@v1.2.3; the reconciler ensures the right binary is present.

Saves ~30 min per deploy (no in-place cargo build). Enables rollback (@v1.2.2) without rebuilding.
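A minimal sketch of the first step that reconciler would take — splitting a pinned spec like the one above into binary path and version. The `@vX.Y.Z` convention is this issue's proposal, not an existing hero_proc feature:

```rust
/// Split a pinned action spec ("~/hero/bin/hero_agent_server@v1.2.3")
/// into (binary path, version tag). Returns None when no version pin
/// is present, i.e. for a plain Tier-2-style path.
fn parse_binary_spec(spec: &str) -> Option<(&str, &str)> {
    let (path, version) = spec.rsplit_once('@')?;
    // Only treat it as a pin if it looks like a version tag.
    version.starts_with('v').then_some((path, version))
}
```

With the spec parsed, the reconciler checks whether that exact version is already on disk and fetches it from the registry only when it isn't.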

3. Seed data is migration-driven, not replay-driven

Tier 2's hero_zero_seed produces a clean initial state. Tier 3: treat seed data like database migrations — each commit adds forward-only migration files (migrations/2026-04-24-001_add_geomind_nitrograph.rhai). Running systems apply pending migrations; fresh systems apply all of them. No more "the old seed TOMLs don't match current schema."
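The apply logic stays small; a sketch, assuming migrations are tracked by filename and the `YYYY-MM-DD-NNN` prefix makes lexicographic order chronological:

```rust
use std::collections::BTreeSet;

/// Return the migrations a running system still needs to apply,
/// in the order they were authored. A fresh system passes an empty
/// `applied` set and gets the full sequence.
fn pending_migrations<'a>(all: &[&'a str], applied: &BTreeSet<&'a str>) -> Vec<&'a str> {
    let mut pending: Vec<&str> = all
        .iter()
        .copied()
        .filter(|m| !applied.contains(m))
        .collect();
    pending.sort_unstable(); // date-prefixed names sort chronologically
    pending
}
```

The `applied` set would live wherever hero_zero keeps its own state today — the only new requirement is that it persists across deploys.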

4. Secrets in a real secret store

Vault / Teleport / SOPS-encrypted-in-git / Forgejo encrypted env. Any of them. Point is: not plaintext shell files in operator homedirs.

5. Observability from day 1

Prom + Grafana or an equivalent. Every hero_proc service exposes /metrics (openmetrics). Alerts on: OSIS write failures, embedder query latency, agent.chat P99, aibroker error rate. Makes "is the AI broken?" answerable without SSH.
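For services that don't have a metrics crate wired in yet, the exposition format itself is trivial — a sketch of rendering one counter the way a `/metrics` endpoint would (the metric name in the usage note is invented for illustration):

```rust
/// Render one counter in OpenMetrics/Prometheus text exposition format.
/// Counters carry a `_total` suffix on the sample line; the HELP and
/// TYPE metadata lines use the bare metric name.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name}_total {value}\n")
}
```

Calling `render_counter("osis_write_failures", "OSIS write failures", 3)` yields the three-line block Prometheus scrapes; a real endpoint concatenates one such block per metric family.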

6. Runbooks live alongside code

Every known failure mode from home#122-160 becomes a documented runbook entry in docs_hero/ops/runbooks/. On-call finds a failing service, looks up the runbook, follows numbered steps, fixes it. Not tribal knowledge.

What this does NOT need

  • Kubernetes. TF Grid VMs + hero_proc are enough for the scale Hero targets. k8s adds operational surface area we don't need.
  • A service mesh. Unix sockets + hero_router are the mesh.
  • A separate CI platform. Forgejo Actions handles artifact builds, migration tests, config validation.

Pragmatic order of landing (when the time comes)

  1. Config-as-code first (actions/ committed to git; reconciler as a small Rust daemon).
  2. Binary registry second (Forgejo packages, action spec references by version).
  3. Migration-driven seed third (replaces hero_zero_seed's TOML corpus).
  4. Secrets fourth (pick Vault or SOPS depending on team preference).
  5. Observability + runbooks fifth (incremental — start with embedder + agent + aibroker, expand).

Each step is an independent landing, each delivers value on its own.

Estimated effort

  • Tier 3 full landing: 4-6 weeks engineering + 1 week ops runbook work.
  • Prerequisite: Tier 2 shipped and stable for ≥1 month so we know what the real drift patterns look like.
Related

  • #160 — demo state checkpoint
  • Prev ops issues: disaster recovery (Tier 1), hero_zero_seed formalization, make demo target (Tier 2)

Signed-off-by: mik-tf

Author
Owner

Resolved by lhumina_code/hero_skills@7c823d1 (PR lhumina_code/hero_skills#126).

Part of Phase 2 tracker #185.

mik-tf reopened this issue 2026-04-25 16:31:55 +00:00
Author
Owner

Reopening — closed in error earlier today. The hero_demo runbook §13 had this issue listed as the tracker for an unrelated deploy step (ONNX install for #162 / HERO_ROOTDIR override for #164), and I trusted the reference without checking the actual issue body. Apologies for the noise. The actual scope of this issue is unchanged from when it was filed.

The correct trackers for the work that just landed: ONNX install + HERO_ROOTDIR are covered directly by lhumina_code/hero_skills@7c823d1 and tracked under #185 (no separate sub-issues filed).

Author
Owner

Moved to hero_demo#32 — see lhumina_code/hero_demo#32
