[meta] Hero instance state portability — snapshot, restore, per-context migration #226

Open
opened 2026-05-06 21:34:39 +00:00 by mik-tf · 0 comments

## Goal

**A Hero instance is a portable unit.** Bare VM + binaries (deterministic via home#212) + config (deterministic via home#225) + **state archive** = restored running instance. Same shape whether the instance has 1 app or 17 domains. Same shape whether it's a personal sovereignty export, a tenant migration between TFGrid nodes, or a backup/restore for safety.

This completes the actual promise from CLAUDE.md design principle 1: "Sovereignty first. All core data stays on the user's machine by default." Without state portability, sovereignty is aspirational — you can't take your Hero instance with you.

## Why now

The deploy story has three layers, only two of which are getting first-class treatment:

| Layer | Tracker | Status |
|---|---|---|
| Binaries | #212 | in flight, 22/29 |
| Config (secrets/env) | #225 | filed, post-binaries |
| **State (data + per-context)** | **this issue** | this issue |

Today's implementation is the `~/heronu-backups/herodemo-backup-*.tar.gz` whole-VM tarball — a hack, not a contract. No per-service backup, no per-context export, no manifest, no schema check on restore.

## Affected state surfaces (initial — full inventory pending)

| Service | Persists | Per-context? | Source-of-truth or derived? |
|---|---|---|---|
| `hero_db` | encrypted Redis-backed (vector/graph/stream/ontology) | unclear, audit | source-of-truth (AI memory) |
| `hero_osis` | per-domain stores (×17 domain sockets) | yes | source-of-truth |
| `hero_foundry` | per-context webdav files + git repos | yes | source-of-truth (user files) |
| `hero_collab` | collab docs DB | likely | source-of-truth |
| `hero_biz` | CRM data | yes (geomind/threefold/incubaid/default) | likely osis-backed |
| `hero_books` | library content | yes | source-of-truth + indexer-derived |
| `hero_indexer` / `hero_embedder` | embeddings, indexes | yes | derived (rebuildable from source) |
| `hero_voice` | mostly transient | n/a | n/a |
| `hero_proc` | secrets + service config | n/a (instance-wide) | source-of-truth |

First task of this issue: full inventory + classification.
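Once classified, the inventory could be committed in a machine-readable form so backup tooling drives itself from it rather than hard-coding paths. A minimal sketch, assuming a flat whitespace-separated declaration (the file name, columns, and classification labels are illustrative, not an existing format):

```shell
# Hypothetical state inventory: service, per-context flag, classification.
# Column names and values are assumptions for illustration.
cat > state-inventory.txt <<'EOF'
hero_osis yes source-of-truth
hero_indexer yes derived
hero_voice no transient
EOF

# Backup tooling archives only source-of-truth surfaces; derived state
# (indexes, embeddings) is rebuilt on restore instead of being shipped.
awk '$3 == "source-of-truth" {print $1}' state-inventory.txt
# prints: hero_osis
```

The payoff of the classification is exactly this filter: derived and transient surfaces never enter the archive, which keeps it small and avoids restoring stale indexes.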

## Scope (contract to define)

1. **State-surface inventory** — declare per service: where it persists, per-context vs cross-context, source-of-truth vs derived (rebuildable).
2. **Snapshot contract** — atomic + cross-service consistent. Either quiesce writes via `hero_proc service pause` (doesn't exist yet — would need to be added) or rely on filesystem snapshots (LVM/btrfs/ZFS). Probably FS snapshots for v1, per-service `--quiesce` for v2.
3. **Archive shape** — single tarball with per-service subdirs + a manifest declaring versions/schemas. Encrypted at the wrapper level (hero_db already encrypts; foundry files = user content, encrypt at wrapper).
4. **Per-context backup/restore** — `service X backup --context geomind` produces a portable per-context archive, restorable into any other Hero instance. The killer feature for sovereignty.
5. **Restore pre-flight** — manifest version check, schema migration if needed, context-name collision detection, secret-list completeness check (cross-references home#225).
6. **Multi-tenant story** — each tenant gets a clean instance via `hero_launcher`. Restore boots that tenant's archive into a freshly-deployed binary+secret stack. Wraps to `service all install --download && hero_proc secret set ... && service backup restore <archive>`.
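The archive shape from item 3 above can be sketched concretely. Everything here is an assumption for illustration: the directory layout, the manifest field names, the version numbers, and the archive file name are all hypothetical, not a defined format:

```shell
# Sketch of the proposed archive layout: per-service subdirs + a root manifest.
mkdir -p stage/hero_osis stage/hero_foundry

# The manifest declares what the restore pre-flight will later verify:
# service versions, schema versions, and which contexts are included.
# All field names and values below are hypothetical.
cat > stage/manifest.json <<'EOF'
{
  "archive_version": 1,
  "created": "2026-05-06T21:34:39Z",
  "services": {
    "hero_osis":    {"version": "0.4.2", "schema": 3, "contexts": ["geomind", "default"]},
    "hero_foundry": {"version": "0.2.0", "schema": 1, "contexts": ["geomind"]}
  }
}
EOF

echo "placeholder state" > stage/hero_osis/data.bin

# Single tarball. Wrapper-level encryption (e.g. age or gpg over the
# tarball) would be applied here; omitted from the sketch.
tar -czf hero-state.tar.gz -C stage .
tar -tzf hero-state.tar.gz | sort
```

Keeping the manifest at the archive root means the pre-flight can inspect it without unpacking the service payloads first.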

## Sequencing

**After home#212 (binaries) + home#225 (config) land.** State restore depends on both being deterministic, since the archive references service versions and secret names that those layers stabilize. Order in the queue:

  1. home#212 binary rollout completes (currently 22/29)
  2. home#225 META compliance (config from secrets, not env)
  3. this issue (state portability)
  4. (separately planned) auth arc
  5. (one session) hero_router exposes install surface — bootstrap collapses to 2 binaries

## Acceptance criteria

- [ ] State-surface inventory committed (table per service in `docs_hero` or this issue body).
- [ ] Snapshot contract decision filed under `decisions/D-NN-state-snapshot-contract.md`.
- [ ] Per-service `service X backup --to <archive>` and `service X restore --from <archive>` verbs in `dispatcher.nu`.
- [ ] `service backup all --to <archive>` umbrella verb that produces a self-consistent multi-service archive.
- [ ] Per-context backup/restore working end-to-end: take archive on instance A, restore on bare instance B, verify all per-context data isolated and intact.
- [ ] Restore pre-flight enforces manifest version + schema compatibility.
- [ ] Smoke: bare TFGrid VM + state archive → working restored instance in <15 minutes.
- [ ] `hero_demo/docs/ops/DEPLOYMENT.md` documents the snapshot/restore runbook.
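The pre-flight criterion can be sketched as a standalone check against the archive manifest. This is a minimal sketch under assumptions: the `archive_version` field, the supported-version constant, and the sed-based extraction are all illustrative (a real implementation would use a proper JSON parser and also check per-service schemas and context collisions):

```shell
# Hypothetical restore pre-flight: refuse an archive whose declared
# version is newer than what this instance understands.
SUPPORTED_ARCHIVE_VERSION=1

# Stand-in manifest as it would be read from the archive root.
cat > manifest.json <<'EOF'
{"archive_version": 1}
EOF

# Crude field extraction for the sketch; real code would parse JSON.
archive_version=$(sed -n 's/.*"archive_version": *\([0-9]*\).*/\1/p' manifest.json)

if [ "$archive_version" -gt "$SUPPORTED_ARCHIVE_VERSION" ]; then
  echo "pre-flight: archive version $archive_version unsupported (max $SUPPORTED_ARCHIVE_VERSION)" >&2
  exit 1
fi
echo "pre-flight: ok (archive version $archive_version)"
# prints: pre-flight: ok (archive version 1)
```

Failing closed on an unknown version is the point of the manifest: a restore should never guess at a schema it has not seen.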

## End-state vision (depends on this + auth + router-installer)

Once this lands plus the auth arc and a one-session add to hero_router, the full deploy contract becomes:

```
# Bare TFGrid VM with hero_router + hero_proc only.
# User opens https://<instance> in a browser, authenticates as owner.
# Router serves a built-in install panel — picks "hero_os" + apps.
# Router → hero_proc → svc_install_download (binaries home#212).
# Router prompts for required secrets → hero_proc secret set (config home#225).
# State archive optional: upload to restore prior instance (this issue).
# All apps + hero_os shell installed in <5 min on a fast link.
```

That's the actual end-user product loop. This issue is the third leg of the tripod that makes it shippable.

## References

- #212 — binary rollout (blocking parent)
- #225 — META compliance umbrella (blocking parent)
- `~/.claude/skills/hero_db/SKILL.md` — encrypted DB backend
- `~/.claude/skills/hero_router/SKILL.md` — single TCP entry + service discovery
- CLAUDE.md design principle 1 — sovereignty contract
- `~/heronu-backups/herodemo-backup-*.tar.gz` — current crude implementation (tarball hack)