[deployer/cockpit] Admin VM manages and updates itself through its own Cockpit (admin service bundle, machine/fleet split) #282

Closed
opened 2026-06-12 01:20:13 +00:00 by mik-tf · 1 comment
Owner

Every Hero machine runs Cockpit as its machine surface: a tester VM's Cockpit manages that tester's services with update buttons, release channels, and installed-build receipts. The admin VM is the same kind of machine (it has its own full Cockpit) but today its most important services cannot be updated from any UI: the Cockpit upgrade map covers the shared engines but not hero_tfgrid_deployer or the my_compute_zos chain daemons, so the control machine of the whole sandbox is the only machine still updated by hand over SSH. The deployer stays out of tester bundles on purpose (fleet controls on a tester machine would mislead testers); the change here is only about the admin machine managing itself. Scope:

  • Deployer admin navbar gets a "This machine" link to the admin VM's own Cockpit services page (landed, integration 6791cf4)
  • Cockpit gains an admin machine profile (an explicit setting on the admin VM only, e.g. a hero_proc secret slot or env read at start; never set on tester VMs) that enables an Admin bundle in the catalog: hero_tfgrid_deployer (server + admin), the my_compute_zos chain daemons, and keeps the engines it already covers. With the profile off, catalogs are byte-identical to today, so tester VMs are untouched
  • With that in place the admin VM updates its own deployer, daemons, and engines from Cockpit Services with channels and receipts, same as testers; SSH fast-patching becomes the emergency path only
  • Admin Cockpit shows a "Fleet" link back to the deployer admin when the admin profile is on (the reverse of the navbar link above)
  • Deployer admin Control page slims to shared resources only (engine health, tokens, links); machine management points at Cockpit

Done means: on the admin VM, Cockpit Services lists and successfully upgrades the deployer and a chain daemon with a written receipt, a tester VM's catalog is proven unchanged, and the two surfaces link to each other.


Implementation appendix (everything needed to execute)

1. Machine role flag

  • Add an [[env]] block to hero_cockpit/crates/hero_cockpit_server/service.toml (and to the web crate's service.toml if the navbar needs it): var = "COCKPIT_MACHINE_ROLE", default = "tester", desc explaining that the value admin enables the admin machine bundle. Every [[env]] block must carry a default or the service panics at startup on the manifest schema.
  • CRITICAL: adding a NEW [[env]] block to an already registered service requires re-registering it (lab service hero_cockpit_server --start); a binary swap plus restart is NOT enough, because the supervisor injects env from the stored service definition.
  • Rollout: on the admin VM only, hero_proc secret set --context cockpit COCKPIT_MACHINE_ROLE admin, then re-register and restart both cockpit services. Tester VMs get nothing; the default applies and behavior is unchanged.
  • Parse in hero_cockpit_server/src/main.rs at startup into state (enum Tester | Admin; anything that is not exactly admin means Tester, so a missing or unrecognized value falls back to Tester).
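
The parsing rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code that shipped: the type name MachineRole and the helper names parse and from_env are assumptions; only the variable name COCKPIT_MACHINE_ROLE and the "exactly admin, otherwise Tester" rule come from the text above.

```rust
/// Role of the machine this Cockpit instance runs on.
/// (Type and method names are hypothetical, chosen for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MachineRole {
    Tester,
    Admin,
}

impl MachineRole {
    /// Only the exact string "admin" selects Admin. An unset variable,
    /// an empty value, different casing, or surrounding whitespace all
    /// fall back to Tester, so a typo can never enable the admin bundle.
    pub fn parse(raw: Option<&str>) -> Self {
        match raw {
            Some("admin") => MachineRole::Admin,
            _ => MachineRole::Tester,
        }
    }

    /// Reads COCKPIT_MACHINE_ROLE from the process environment once,
    /// intended to be called at startup and stored in server state.
    pub fn from_env() -> Self {
        let value = std::env::var("COCKPIT_MACHINE_ROLE").ok();
        Self::parse(value.as_deref())
    }
}
```

Keeping parse separate from from_env lets the fallback rule be unit-tested without touching the process environment.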

2. Catalog and upgrade map (hero_cockpit_server)

  • src/catalog.rs is a static CatalogEntry array with a completeness test (around line 290) pinning each app's binary set. Refactor to a function returning the base entries always, plus admin entries only when the role is Admin. Keep a test pin asserting the Tester catalog is byte-identical to today's, and add a second pin for the Admin catalog.
  • Admin entries:
    • app hero_tfgrid_deployer, repo hero_os_tfgrid_deployer, binaries hero_tfgrid_deployer_server + hero_tfgrid_deployer_admin. Release assets verified present on latest-integration: hero_tfgrid_deployer_server-linux-musl-x86_64, hero_tfgrid_deployer_admin-linux-musl-x86_64.
    • app hero_compute, repo hero_compute, downloadable binary my_compute_zos_server (asset verified: my_compute_zos_server-linux-musl-x86_64). CAUTION: on the admin VM this ONE binary backs THREE registered services: my_compute_zos_server (qa) plus my_compute_zos_main_server and my_compute_zos_testnet_server (wrappers with their own env; the mainnet one sets a separate compute config context and socket path). lab build hero_compute --download --bin my_compute_zos_main_server would fail because there is no such asset. Recommended: extend the entry model with restart_only service names (download the binaries list once, then restart the binaries plus the restart_only services). Acceptable v1 fallback: bundle only the qa service and document that the wrapper daemons need a manual restart after an upgrade.
    • engines: hero_embedder_provider (server + admin) is already in the upgrade map in src/repos.rs; verify hero_voice_provider (single service, runs on the admin VM) is mapped too, and add catalog entries for both so they render as bundles.
  • src/repos.rs service_repo(): add hero_tfgrid_deployer_server | hero_tfgrid_deployer_admin to hero_os_tfgrid_deployer; my_compute_zos_server (plus wrapper names if restart_only is chosen) to hero_compute.
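
A rough sketch of the role-aware catalog, the restart_only field, and the repo mapping described in this section. The struct layout, the function signatures, and the single base entry (example_app) are assumptions made for illustration; the real CatalogEntry and service_repo() in hero_cockpit_server may look different. The admin app, repo, and service names are taken from the bullets above. The engine entries are left out because their exact binary names are not given here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MachineRole {
    Tester,
    Admin,
}

/// Hypothetical entry shape; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CatalogEntry {
    pub app: &'static str,
    pub repo: &'static str,
    /// Services that have a release asset and are downloaded on upgrade.
    pub binaries: &'static [&'static str],
    /// Services backed by one of `binaries` with no asset of their own.
    pub restart_only: &'static [&'static str],
}

impl CatalogEntry {
    /// Download `binaries` once, then restart binaries plus restart_only.
    pub fn services_to_restart(&self) -> Vec<&'static str> {
        self.binaries
            .iter()
            .chain(self.restart_only.iter())
            .copied()
            .collect()
    }
}

/// Placeholder for today's tester catalog (hypothetical entry).
const BASE: &[CatalogEntry] = &[CatalogEntry {
    app: "example_app",
    repo: "example_repo",
    binaries: &["example_app_server"],
    restart_only: &[],
}];

/// Admin-only entries, using the names listed in this section.
const ADMIN: &[CatalogEntry] = &[
    CatalogEntry {
        app: "hero_tfgrid_deployer",
        repo: "hero_os_tfgrid_deployer",
        binaries: &["hero_tfgrid_deployer_server", "hero_tfgrid_deployer_admin"],
        restart_only: &[],
    },
    CatalogEntry {
        app: "hero_compute",
        repo: "hero_compute",
        binaries: &["my_compute_zos_server"],
        restart_only: &["my_compute_zos_main_server", "my_compute_zos_testnet_server"],
    },
];

/// Base entries always; admin entries only when the role is Admin.
pub fn catalog(role: MachineRole) -> Vec<CatalogEntry> {
    let mut entries = BASE.to_vec();
    if role == MachineRole::Admin {
        entries.extend_from_slice(ADMIN);
    }
    entries
}

/// Maps a service name to the repo whose release it upgrades from.
pub fn service_repo(service: &str) -> Option<&'static str> {
    match service {
        "hero_tfgrid_deployer_server" | "hero_tfgrid_deployer_admin" => {
            Some("hero_os_tfgrid_deployer")
        }
        "my_compute_zos_server"
        | "my_compute_zos_main_server"
        | "my_compute_zos_testnet_server" => Some("hero_compute"),
        _ => None,
    }
}
```

Because the Tester path returns BASE untouched, a pin test can compare catalog(MachineRole::Tester) against the pre-change entries directly, and a second pin can do the same for the Admin result.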

3. Fleet backlink (hero_cockpit_web)

  • When the role is admin, the navbar renders a Fleet link to /hero_tfgrid_deployer/admin/ (the deployer admin path through this machine's router). With the flag off the templates render byte-identical to today.

4. Control page slimming (deployer admin, optional polish)

  • Control keeps the shared engine tiles and links; machine management wording points at the Cockpit. The navbar "This machine" link is already live (integration 6791cf4).

5. Caveats that will bite

  • Cockpit upgrading or restarting ITSELF still loses the start half of the restart (hero_proc issue 149): after any cockpit self-upgrade, run hero_proc service start hero_cockpit_server once, until that issue is fixed.
  • The deployer binaries on the admin VM are currently AHEAD of their release assets (live fast patches). The first Cockpit-driven upgrade would downgrade them. Before relying on Cockpit updates for the deployer, publish current integration so the release equals the branch head. The release upload now replaces stale assets correctly (md5 compared), but the CI builder image is still blocked by the registry push timeout (home issue 280), so publish locally with a current lab build until that clears.
  • Supervisor status dots can drift from reality; trust the Build column and receipts (written by the install path) over the status dot.

6. Done means (verification)

  • Admin VM: Cockpit Services shows the admin bundles; cockpit.upgrade_service for the deployer completes, writes a receipt (tag, commit, md5), and the deployer restarts and answers RPC; same for the compute daemon with all three chain daemons running afterwards.
  • Tester VM: cockpit.list_services output and the catalog are proven unchanged (flag unset) on a live tester before and after the cockpit rollout.
  • Gates: fmt, clippy, server tests including both catalog pins, musl release builds for cockpit server and web.
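
The tester check ("proven unchanged before and after the rollout") comes down to a set comparison of service names. A small sketch, assuming the names have already been extracted from cockpit.list_services output into string slices; the CatalogDiff type and diff_services function are hypothetical helpers, not part of Cockpit.

```rust
use std::collections::BTreeSet;

/// Names that appeared or disappeared between two catalog snapshots.
#[derive(Debug, PartialEq, Eq)]
pub struct CatalogDiff {
    pub added: Vec<String>,
    pub removed: Vec<String>,
}

impl CatalogDiff {
    /// True when both snapshots contain exactly the same names.
    pub fn is_unchanged(&self) -> bool {
        self.added.is_empty() && self.removed.is_empty()
    }
}

/// Compares two snapshots as sets, so ordering differences are ignored.
/// Results are sorted because BTreeSet iterates in key order.
pub fn diff_services(before: &[&str], after: &[&str]) -> CatalogDiff {
    let before: BTreeSet<&str> = before.iter().copied().collect();
    let after: BTreeSet<&str> = after.iter().copied().collect();
    CatalogDiff {
        added: after.difference(&before).map(|s| s.to_string()).collect(),
        removed: before.difference(&after).map(|s| s.to_string()).collect(),
    }
}
```

Any non-empty added or removed list on a tester VM then needs an explanation unrelated to the admin profile before the rollout can be called clean.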

Signed-by: mik-tf <mik-tf@noreply.invalid>

Author
Owner

All scope boxes are done and proven live.

Shipped as hero_cockpit b8998d1 on integration, published to latest-integration: COCKPIT_MACHINE_ROLE on both cockpit services (default tester, only the exact value admin enables the profile, anything else falls back to the tester surface), role-aware catalog with the admin bundle (deployer server plus admin, hero_compute with the new restart_only entry field for the one-binary, three-services chain daemons, embedder and voice provider), the repo mappings for the upgrade gate, and the Fleet navbar link to the deployer admin. Two test pins lock the tester catalog to its pre-change set and the admin bundle to the audited admin VM registrations.

Live proof on the admin VM:

  • cockpit.list_catalog shows 21 apps with all four admin bundles marked installed.
  • Deployer upgraded from Cockpit: both binaries now md5-equal to the release assets, receipts written (tag, commit, md5), list_nodes answers with all three chain daemons healthy.
  • hero_compute upgraded from the my_compute_zos_main_server row: one asset downloaded, all three chain daemons restarted, 9 of 9 nodes online across main, test, and qa afterward.
  • Embedder engine upgraded from Cockpit after refreshing its release (assets had gone stale while the tag moved; the stale musl assets were removed because this repo is glibc-only and lab prefers musl when both exist): binary md5-equal to the fresh build, models loaded, overlay auth gate fail-closed (401), memory at the expected baseline.

Tester proof on the sandboxfull VM: upgraded its cockpit bundle to the same build; the role key is absent from the process env, the catalog lists exactly the 17 tester apps, no admin bundles, no Fleet link. The only catalog change is hero_memory now listing hero_memory_ui, which is the earlier memory bundle fix arriving with the newer build, unrelated to the profile.

Operational notes:

  1. Setting the role through hero_proc secret set alone does not work: manifest env defaults shadow context secrets at spawn, filed as hero_proc#151 (https://forge.ourworld.tf/lhumina_code/hero_proc/issues/151). The working rollout is COCKPIT_MACHINE_ROLE=admin lab service hero_cockpit_server --start (same for web) on the admin VM. A plain re-registration without the var reverts the machine to the tester surface; re-run with the var exported if that happens.
  2. Before trusting an Upgrade click on any newly bundled service, confirm the release is at least as new as the running binaries (tag ref plus asset timestamps and md5, never target_commitish). Deployer, cockpit, and embedder releases were republished fresh today. The voice provider asset predates its branch head but the head commit is CI-only, so the click is harmless there; republish on the next real voice change.
  3. After any cockpit self-upgrade the server stays stopped until a manual hero_proc service start (hero_proc#149, https://forge.ourworld.tf/lhumina_code/hero_proc/issues/149), confirmed again on the tester walk.
  4. Minor receipt wart on glibc-only repos: the receipt records the musl asset name with an empty md5 because the host priority list guesses the asset label instead of recording the one actually installed. This is cosmetic: the installed binary itself was md5-verified.
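
Note 1 is easier to picture as a merge-order problem. The toy model below is not hero_proc code and makes no claim about its internals. It only contrasts the behavior observed in hero_proc#151 (the manifest default ends up winning over the context secret) with the precedence the secret-based rollout had assumed. All function names are hypothetical.

```rust
use std::collections::HashMap;

type Env = HashMap<String, String>;

/// Builds an Env from string pairs (helper for this sketch only).
pub fn env_of(pairs: &[(&str, &str)]) -> Env {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Observed behavior: defaults are applied last, so a manifest default
/// overwrites a context secret that uses the same key.
pub fn spawn_env_observed(defaults: &Env, secrets: &Env) -> Env {
    let mut env = secrets.clone();
    env.extend(defaults.clone());
    env
}

/// Precedence the original rollout plan assumed: secrets are applied
/// last and override manifest defaults.
pub fn spawn_env_secret_wins(defaults: &Env, secrets: &Env) -> Env {
    let mut env = defaults.clone();
    env.extend(secrets.clone());
    env
}
```

Under the observed order, a default of tester and a secret of admin resolve to tester, which matches the failed rollout. Exporting the variable when re-registering sidesteps the secret path entirely, which is why that rollout works.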

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/home#282