[deployer/admin] Admin VM operability: daemon health on Nodes, logs tab, meta admin links, gated terminal

mik-tf commented

2026-06-11 23:59:44 +00:00

Owner

Tonight a compute daemon on the admin VM lost its RPC socket while its process kept running. The deployer dashboard showed "0 of 0 nodes online, no dedicated nodes configured" (an empty fleet) instead of a backend failure, and the diagnosis plus the restart had to be done over SSH, even though every needed log line was already captured by hero_proc and the admin VM's own Cockpit could have restarted the daemon from the browser. The pieces exist in the platform; they are not wired into the admin surfaces. Scope:

- [x] Nodes page: when a chain daemon is unreachable, say so and show its last fleet error instead of rendering an empty node list ("qa daemon unreachable, restart it from Services" beats "no dedicated nodes configured")
- [x] Deployer admin: add a Logs tab using the shared logs viewer component (talks to hero_proc logs.filter), pre-filtered to the deployer server and the my_compute_zos daemons
- [x] Control page: add an "Admin VM services" tile linking to the admin Cockpit Services page, so any admin VM service can be restarted from the browser (today Control only lists shared engines)
- [ ] Expose the hero_router terminal (PTY sessions run as hero_proc jobs) on the admin VM behind the existing admin SSO allowlist; admins only, never on tester VMs; needs a nod from the router/proxy owners before wiring
- [x] Users page: relabel the release channel dropdown as the default for first installs and services without a recorded channel, and fix the page subtitle; updates already preserve each service bundle's recorded channel, the copy just says otherwise
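
For the Nodes page item, a rough sketch of the intended behaviour: report a backend failure before ever falling back to the "empty fleet" wording. The `DaemonStatus` shape and the `nodes_banner` helper are hypothetical names for illustration, not the deployer's actual types; only the two quoted messages come from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DaemonStatus:
    """Reachability of one chain daemon (hypothetical shape, not the real type)."""
    chain: str
    ok: bool
    node_count: int = 0
    last_error: Optional[str] = None


def nodes_banner(statuses: list[DaemonStatus]) -> str:
    """Pick the Nodes page headline, checking for unreachable daemons first."""
    down = [s for s in statuses if not s.ok]
    if down:
        # A dead daemon is a backend failure, so it wins over any node count.
        return "; ".join(
            f"{s.chain} daemon unreachable ({s.last_error or 'no error recorded'}), "
            "restart it from Services"
            for s in down
        )
    total = sum(s.node_count for s in statuses)
    if total == 0:
        # Only claim an empty fleet when every daemon actually answered.
        return "no dedicated nodes configured"
    return f"{total} nodes reported by {len(statuses)} daemon(s)"
```

The point of the ordering is that "no dedicated nodes configured" is only shown once every daemon has answered, so a lost RPC socket can no longer pass for an empty fleet.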

Done means: all items live on the admin VM (https://hcockpit.gent01.qa.grid.tf/hero_tfgrid_deployer/admin/) and verified in the browser, then this issue closes.

Signed-by: mik-tf <mik-tf@noreply.invalid>


mik-tf commented

2026-06-12 00:29:07 +00:00

Author

Owner

Four of five items are live on the admin VM (hero_os_tfgrid_deployer integration commits 8d172d3, ff3f35b, aa49126; binaries deployed and verified over the admin socket).

What changed: deployer.list_nodes now reports ok/error per chain daemon plus a daemons_unreachable count, and the Nodes page and overview strip render an unreachable daemon as a red warning with the daemon's last error and a link to the admin services page, instead of an empty fleet. Verified live by briefly stopping the qa compute daemon: the response showed qa ok=false with the socket error and daemons_unreachable=1, then 3 of 3 nodes after restart. The new Logs page (navbar) embeds the shared logs viewer against a read-only relay to the supervisor's log store with quick-pick source buttons; write methods are refused by the relay. Control has an "Admin VM services" tile opening the admin Cockpit services page. The Users page channel selector is now labeled "Default for new installs" and the subtitle says updates preserve each service's recorded channel.
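
The read-only relay mentioned above could be as small as a method allowlist in front of the log store. This is a sketch under assumptions, not the deployed code: it assumes JSON-RPC 2.0 style requests, `relay` and `forward` are hypothetical names, and `logs.filter` is the only method named in this issue.

```python
from typing import Callable

# Only logs.filter is named in this issue; anything else is treated as a write.
READ_ONLY_METHODS = frozenset({"logs.filter"})


def relay(request: dict, forward: Callable[[dict], dict]) -> dict:
    """Forward allowlisted read methods; refuse everything else locally."""
    method = request.get("method")
    if method not in READ_ONLY_METHODS:
        # Refused here, so the request never reaches the supervisor's log store.
        # -32601 is the JSON-RPC 2.0 "Method not found" code, reused for refusals.
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": f"method not allowed: {method}"},
        }
    return forward(request)
```

An allowlist rather than a denylist means a write method added to the log store later is refused by default until someone opts it in.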

Remaining: the gated hero_router terminal exposure, which needs a nod from the router and proxy owners before wiring.

Signed-by: mik-tf <mik-tf@noreply.invalid>


mik-tf referenced this issue

2026-06-12 02:37:19 +00:00

[META] Hero OS sandbox demo, functional readiness: onboarding pipeline + per-app verification #239


[deployer/admin] Admin VM operability: daemon health on Nodes, logs tab, meta admin links, gated terminal #281