[ops] Disaster recovery for heronu demo VM — runtime state dump + /data snapshot to object storage #29

Open
opened 2026-04-28 12:21:01 +00:00 by mik-tf · 0 comments

Why

The heronu demo VM (TF Grid freefarm node 1) holds ~12 hours of hand-patching and demo data. If the VM dies, the Mycelium route flaps for days, or the node is reclaimed, recovery today means redoing the entire patch cycle from scratch — even with the development_mik_nu_demo branches now pushed (see #160).

Runtime state that IS NOT in git:

  • hero_proc action get <name> --format yaml for every action — env vars, script paths, restart policies, health checks. Some carry the AIBROKER_API_ENDPOINT, HERO_AGENT_ROUTING_MODE, FORGEJO_TOKEN config that makes the demo work.
  • /home/driver/hero/var/** — OSIS data (business, calendar, projects, media, identity across 5 contexts), hero_books namespaces (hero=163 docs, geomind=1733+ docs indexing), embedder HNSW indexes, hero_foundry webdav content.
  • /home/driver/code/docs_* — 4 cloned doc libraries (~800 MB).
  • Seeded OSIS records with schema-migrated data (30 projects + all business + calendar + media across contexts).
  • ~/hero/var/agent/mcp.json trimmed to hero_books only.

None of this is reconstructible from the code alone — it's either the result of out-of-band operator commands (hero_proc action set) or of a destructive seed-migration flow.

What Tier 1 disaster recovery looks like

Two artifacts, generated periodically, stored outside the VM:

1. Runtime state dump (weekly or post-deploy, ~1 MB)

A single JSON blob capturing:

{
  "heronu_snapshot": "2026-04-24T03:30:00Z",
  "hero_proc": {
    "<action>": { "env": {...}, "script": "...", "health_checks": [...] },
    ...
  },
  "libraries_txt": "hero https://...\ngeomind https://...\n...",
  "mcp_json": { ... },
  "context_list": ["default", "geomind", "incubaid", "root", "threefold"],
  "patches_applied": {
    "<repo>": { "branch": "development_mik_nu_demo", "head_commit": "<sha>" },
    ...
  }
}
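
A minimal sketch of how dump_demo_state.sh could assemble the hero_proc portion of that blob, assuming jq and yq (v4) are installed and that hero_proc exposes an action list subcommand — only action get --format yaml is confirmed above:

#!/usr/bin/env bash
# Sketch only: `hero_proc action list` is an ASSUMED subcommand; this issue
# only confirms `hero_proc action get <name> --format yaml`.
set -euo pipefail

proc_json='{}'
for a in $(hero_proc action list); do
  # Convert each action's YAML definition to JSON and merge it in.
  entry=$(hero_proc action get "$a" --format yaml | yq -o=json .)
  proc_json=$(jq --arg a "$a" --argjson e "$entry" '.[$a] = $e' <<<"$proc_json")
done

# Other keys (libraries_txt, mcp_json, context_list, patches_applied) would
# be gathered the same way; omitted here for brevity.
jq -n --arg ts "$(date -u +%FT%TZ)" --argjson hp "$proc_json" \
  '{heronu_snapshot: $ts, hero_proc: $hp}' \
  > "heronu-state-$(date -u +%F).json"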

Commit this to lhumina_code/demo_state (a new repo) or push it to an S3-compatible bucket. A recovery script reads it, runs hero_proc action set for each entry, and checks out the recorded commits on the new VM.
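
A sketch of that replay step, assuming the state-file layout above; the --from-file flag on hero_proc action set and the ~/code checkout layout are assumptions, to be adapted to hero_proc's real interface:

#!/usr/bin/env bash
# Sketch of restore_demo.sh's replay step. The `--from-file` flag and the
# ~/code repo layout are ASSUMPTIONS, not confirmed hero_proc API.
set -euo pipefail
STATE=${1:?usage: restore_demo.sh <state.json>}

# Pin every patched repo to its recorded commit.
jq -r '.patches_applied | to_entries[] | "\(.key) \(.value.head_commit)"' "$STATE" |
while read -r repo sha; do
  git -C "$HOME/code/$repo" fetch --all --quiet
  git -C "$HOME/code/$repo" checkout --quiet "$sha"
done

# Replay every recorded hero_proc action definition.
jq -r '.hero_proc | keys[]' "$STATE" | while read -r action; do
  jq --arg a "$action" '.hero_proc[$a]' "$STATE" > "/tmp/${action}.json"
  hero_proc action set "$action" --from-file "/tmp/${action}.json"
done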

2. Data tarball (daily, ~2-5 GB compressed)

tar -czf heronu-data-<date>.tar.gz -C /data/home/driver \
  hero/var/osisdb hero/var/embedder/data hero/var/books \
  hero/var/hero_foundry/webdav hero/var/agent/workspace

Push to TF Grid QSFS (preferred, native) or an S3-compatible bucket reachable via Mycelium. The size will drop significantly once the embedder HNSW indexes are excluded, since they can be regenerated from source.
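
For the S3 path, one possibility is rclone against an S3-compatible remote; the remote name (demo-backup) and the 14-day retention below are assumptions, not decided in this issue:

#!/usr/bin/env bash
# Sketch of backup_demo_data.sh. The rclone remote name and the retention
# window are ASSUMPTIONS.
set -euo pipefail
stamp=$(date -u +%F)
tarball="/tmp/heronu-data-${stamp}.tar.gz"

tar -czf "$tarball" -C /data/home/driver \
  hero/var/osisdb hero/var/embedder/data hero/var/books \
  hero/var/hero_foundry/webdav hero/var/agent/workspace

rclone copy "$tarball" demo-backup:heronu-backups/
rclone delete --min-age 14d demo-backup:heronu-backups/  # prune old tarballs
rm -f "$tarball"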

3. Recovery runbook (in docs_hero/ops/disaster_recovery.md)

Step-by-step: new VM via existing Terraform, apt install system deps from a fixed list, git clone all repos at recorded commits, download + extract data tarball, replay hero_proc action dump, verify with smoke tests.

Why this is cheap and high-leverage

  • Effort: ~4-6 hours to write the dump script, the store/retrieve tooling, and the runbook. Under a day.
  • Ongoing cost: zero after automation via cron (see the crontab sketch after this list).
  • Bus factor: drops from 1 (the current operator) to any engineer who can read the runbook.
  • Doesn't preclude Tier 2/3 later: state dump and data tarball remain useful even once declarative IaC lands.
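
For reference, the cron wiring could look like this; schedules and script paths are assumptions:

# Sketch crontab for the driver user. Times and paths are ASSUMPTIONS.
30 3 * * *  /home/driver/code/scripts/ops/backup_demo_data.sh   # daily tarball
0  4 * * 0  /home/driver/code/scripts/ops/dump_demo_state.sh    # weekly state dump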

Concrete deliverables

  • scripts/ops/dump_demo_state.sh — generates the state JSON described above
  • scripts/ops/backup_demo_data.sh — daily tar+push of /data/home/driver/hero/var/
  • scripts/ops/restore_demo.sh — idempotent replay onto a fresh VM
  • docs_hero/ops/disaster_recovery.md — operator runbook
  • Verify end-to-end by running restore on a sacrificial TF Grid VM and confirming AI Assistant grounds on docs_hero queries

Related

  • #160 — consolidated demo state and remaining work
  • #148 — nu-demo architecture index

Signed-off-by: mik-tf


Originally filed as home#161 on 2026-04-24 by mik-tf — moved to hero_demo as part of consolidating issue tracking.
