hero_os_tfgrid_deployer integration: methods we'll consume + small gaps #116


Open

opened 2026-05-20 21:41:39 +00:00 by mik-tf · 0 comments


hero_os_tfgrid_deployer integration: methods we'll consume + small gaps

The new admin tool hero_os_tfgrid_deployer (scope under discussion at hero_os_tfgrid_deployer#1) will consume ComputeService OpenRPC (currently in crates/my_compute_zos_server/src/cloud/openrpc.json) as its only VM-lifecycle backend.

Reviewed the spec — most of what we need is already there. Filing this issue to (a) confirm intended usage so we don't drift, and (b) surface a few small gaps that would make the deployer's flow easier.

Methods the deployer will call

For each demo user we provision:

1. Inject SSH key → ComputeService.inject_ssh_keys: the deployer generates a per-user ED25519 key pair, registers the public half via this method, and keeps the private half in its sqlite database so it can SSH back in later.
2. Deploy VM → ComputeService.deploy_vm with spec { cpu: 16, memory: 8 GB, disk: 200 GB, rootfs: 16 GB, flist: "ubuntu-24.04-latest", publicip: false, node_id: <pinned> }. Today's s132 work shows this spec is sufficient (16 CPUs is an overcommit for an 8 GB VM, but it matches what's in flight via the OpenTofu path).
3. Wait until VM is reachable → poll ComputeService.get_vm until mycelium_ip appears and the VM accepts connections at that address.
4. Deploy gateway → ComputeService.deploy_webgateway mapping <user>.<node>.grid.tf → http://<vm_ip>:9988 (where hero_router listens).
5. Run bootstrap script → currently via SSH from the deployer to the VM (proven path in hero_demo/deploy/single-vm/scripts/setup-binaries.sh). Alternative: pipe it through ComputeService.vm_exec if that handles long-running scripts cleanly; see G2 below.
6. Track + manage lifecycle → ComputeService.list_vms / get_vm / vm_stats for the admin UI's per-user state view.
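The flow above can be sketched as a sequence of JSON-RPC 2.0 request envelopes. This is only an illustration: the method names come from this issue, but every parameter name (`public_keys`, `spec`, `vm_id`, `fqdn`, `backend`), the `_gb` unit suffixes, the `node_label` argument and the `ComputeService.` prefix on the wire are assumptions, not taken from `openrpc.json`. Step 5 (bootstrap over SSH) is not an RPC call and is left out.

```python
import json


def provisioning_plan(user, node_label, node_id, public_key):
    """Return the ComputeService calls for one demo user, in the order listed above."""
    # Values come from the issue; key names and units are assumptions.
    spec = {
        "cpu": 16, "memory_gb": 8, "disk_gb": 200, "rootfs_gb": 16,
        "flist": "ubuntu-24.04-latest", "publicip": False, "node_id": node_id,
    }
    calls = [
        ("inject_ssh_keys", {"public_keys": [public_key]}),
        ("deploy_vm", {"spec": spec}),
        ("get_vm", {"vm_id": "<vm_id>"}),  # polled until mycelium_ip appears
        ("deploy_webgateway", {
            "fqdn": f"{user}.{node_label}.grid.tf",
            "backend": "http://<vm_ip>:9988",  # port where hero_router listens
        }),
        ("list_vms", {}),
    ]
    return [
        {"jsonrpc": "2.0", "id": i, "method": f"ComputeService.{name}", "params": params}
        for i, (name, params) in enumerate(calls, start=1)
    ]


if __name__ == "__main__":
    # "alice", "node7" and 42 are placeholder values.
    plan = provisioning_plan("alice", "node7", 42, "ssh-ed25519 AAAA... demo")
    print(json.dumps(plan, indent=2))
```

If the answer to C4 is that keys must be injected after `deploy_vm`, the first two entries simply swap and `inject_ssh_keys` would presumably also take a VM identifier.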

Confirmation questions (low-cost — flag any "yes" / "no" / "TBD")

- C1: Is deploy_vm ready for production use on TFGrid mainnet? (s132 used OpenTofu directly against TFGrid, which works. We want to switch to deploy_vm once it's stable for our flow.)
- C2: Does deploy_vm return synchronously once the VM is fully reachable (SSH-able), or does it return early and require polling get_vm? Documenting the behaviour in the method's OpenRPC summary field would settle this for every caller.
- C3: Is there a "metadata" or "tag" field on the VM spec? The deployer would store { user: "<forge_id>", profile: "demo", provisioned_at: ... } per VM so the admin UI can join VMs back to users without round-tripping its own sqlite.
- C4: inject_ssh_keys: should it be called before or after deploy_vm? The order matters for our deployer flow.
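To make C3 concrete, here is a minimal sketch of the join the admin UI could do if VM records carried such a field. The `metadata` key, the `id` key and the record shape are hypothetical; whether anything like them exists in the spec is exactly what C3 asks.

```python
from collections import defaultdict


def vms_by_user(vm_records):
    """Group VM ids by the user tag stored in each record's (hypothetical) metadata."""
    grouped = defaultdict(list)
    for vm in vm_records:
        # Records without metadata, or without a user tag, land in one bucket.
        user = (vm.get("metadata") or {}).get("user")
        grouped[user or "<untagged>"].append(vm["id"])
    return dict(grouped)


# Made-up records shaped like a possible list_vms result.
sample = [
    {"id": "vm-1", "metadata": {"user": "forge-101", "profile": "demo"}},
    {"id": "vm-2", "metadata": {"user": "forge-101", "profile": "demo"}},
    {"id": "vm-3"},  # deployed without tags
]
print(vms_by_user(sample))
```

Without a field like this, the deployer would have to keep the VM-to-user mapping in its own sqlite and look it up for every entry returned by `list_vms`.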

Small gaps (what would help us)

- G1: A ComputeService.wait_vm_ready(vm_id, timeout) method that blocks until the VM is SSH-able (or the timeout expires). Today we'd poll get_vm from the deployer, which works, but every caller ends up reimplementing the same readiness logic. Not a blocker; nice-to-have.
- G2: Clarity on vm_exec: does it stream stdout incrementally (good for our setup-binaries.sh, which prints ~1500 lines of lab build progress over 5 to 30 min) or buffer until the command exits? If buffered, we keep the SSH path; if streamed, we can drop the SSH dependency on the deployer side entirely.
- G3: deploy_webgateway: does it return the publicly resolvable FQDN immediately, or does DNS propagation need an extra wait? s132 saw the gateway resolve within ~30 s of tofu apply completing; if hero_compute mirrors that, no action is needed.
- G4: Auth model for the deployer → hero_compute connection. Currently the deployer is "admin-only" (us). Is the existing ComputeService socket reachable only locally, or does it expect bearer-token auth over the network? The deployer's host (where the deployer admin UI runs) is not on the same machine as hero_compute.
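For reference, the readiness logic that G1 would move server-side looks roughly like the sketch below. It assumes a client wrapper `get_vm(vm_id)` that returns a dict with an optional `mycelium_ip` key; that wrapper and the record shape are hypothetical. It also treats "has a mycelium_ip" as the readiness signal, which is weaker than "SSH-able".

```python
import time


def wait_until(probe, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or timeout_s elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        sleep(interval_s)


def vm_ready_probe(get_vm, vm_id):
    """Build a probe that returns the VM's mycelium_ip once get_vm reports one."""
    def probe():
        vm = get_vm(vm_id)  # hypothetical wrapper around ComputeService.get_vm
        return vm.get("mycelium_ip")
    return probe


if __name__ == "__main__":
    # Fake backend: the address shows up on the third poll.
    answers = iter([{}, {"mycelium_ip": None}, {"mycelium_ip": "400::1"}])
    ip = wait_until(vm_ready_probe(lambda _vm_id: next(answers), "vm-1"),
                    timeout_s=60, sleep=lambda _s: None)
    print(ip)
```

The same `wait_until` helper could wrap a DNS lookup of the gateway FQDN for G3, so a server-side `wait_vm_ready` would mostly save callers from copying this loop.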

None of these are blockers — happy to file separate issues for any of them if that's easier. Mostly this is a tee-up for the deployer work that starts in the next few sessions (current plan in hero_os_tfgrid_deployer#1 and the follow-up scope issues we're about to file there).

What's NOT a gap

- VM lifecycle methods: complete. deploy_vm / start_vm / stop_vm / restart_vm / delete_vm / list_vms / get_vm are all present.
- Web gateway: complete. deploy_webgateway / list_webgateways / get_webgateway / delete_webgateway are all present.
- SSH key injection: present (inject_ssh_keys).
- VM diagnostics: present (vm_logs, vm_stats, vm_exec).
- Nice surprise: migrate_secret, list_images and attach_hypervisor are also there, beyond what we need immediately but useful later.
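A check like the one above can be automated against the spec file. The sketch below relies only on the OpenRPC convention that a document has a top-level `methods` array whose entries carry a `name`; whether names in `openrpc.json` are prefixed with `ComputeService.` is not known here, so the prefix is stripped before comparing. The `REQUIRED` list is simply the set of methods this issue says the deployer will call.

```python
import json

# Methods the deployer plans to call, per this issue.
REQUIRED = [
    "inject_ssh_keys", "deploy_vm", "get_vm", "deploy_webgateway",
    "vm_exec", "list_vms", "vm_stats",
]


def missing_methods(openrpc_doc, required=REQUIRED):
    """Return the required method names that the OpenRPC document does not define."""
    present = {
        m["name"].rsplit(".", 1)[-1]  # tolerate an optional "Service." prefix
        for m in openrpc_doc.get("methods", [])
        if "name" in m  # skip $ref entries, which have no inline name
    }
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Inline stand-in for json.load(open(".../openrpc.json")).
    doc = json.loads('{"openrpc": "1.2.6", "methods": [{"name": "ComputeService.deploy_vm"}]}')
    print(missing_methods(doc))
```

Running this in the deployer's CI against the vendored spec would flag a renamed or removed method before it breaks provisioning.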

Context

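- VM bootstrap proof point (s132): hero_demo setup-binaries.sh, 34/34 PASS on lab download/install, with hero_proc + hero_router GREEN on a fresh TFGrid VM (https://forge.ourworld.tf/lhumina_code/hero_demo/src/branch/development/deploy/single-vm/scripts/setup-binaries.sh).
- Cockpit spec: hero_cockpit#1 (https://forge.ourworld.tf/lhumina_code/hero_cockpit/issues/1).
- Meeting notes umbrella: hero_os_tfgrid_deployer#1 (https://forge.ourworld.tf/lhumina_code/hero_os_tfgrid_deployer/issues/1).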

cc @mahmoud: no rush, answers can come incrementally.
