[deployer/admin] Admin VM operability: daemon health on Nodes, logs tab, meta admin links, gated terminal #281

Open
opened 2026-06-11 23:59:44 +00:00 by mik-tf · 1 comment
Owner

Tonight a compute daemon on the admin VM lost its RPC socket while its process kept running. The deployer dashboard reported "0 of 0 nodes online, no dedicated nodes configured", which reads as an empty fleet rather than a backend failure. Diagnosis and restart both had to be done over SSH, even though hero_proc had already captured every needed log line and the admin VM's own Cockpit could have restarted the daemon from the browser. The pieces exist in the platform; they are not wired into the admin surfaces. Scope:

  • [x] Nodes page: when a chain daemon is unreachable, say so and show its last fleet error instead of rendering an empty node list ("qa daemon unreachable, restart it from Services" beats "no dedicated nodes configured")
  • [x] Deployer admin: add a Logs tab using the shared logs viewer component (talks to hero_proc logs.filter), pre-filtered to the deployer server and the my_compute_zos daemons
  • [x] Control page: add an "Admin VM services" tile linking to the admin Cockpit Services page, so any admin VM service can be restarted from the browser (today Control only lists shared engines)
  • [ ] Expose the hero_router terminal (PTY sessions run as hero_proc jobs) on the admin VM behind the existing admin SSO allowlist; admins only, never on tester VMs; needs a nod from the router/proxy owners before wiring
  • [x] Users page: relabel the release channel dropdown as the default for first installs and services without a recorded channel, and fix the page subtitle; updates already preserve each service bundle's recorded channel, the copy just says otherwise
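
The channel rule in the last item can be sketched as below. This is only an illustration of the described behaviour, not the deployer's code: the function name `resolve_channel` and the channel names in the usage lines are hypothetical.

```python
from typing import Optional


def resolve_channel(recorded: Optional[str], default_channel: str) -> str:
    """Pick the release channel for an install or update (hypothetical helper).

    A service bundle that already has a recorded channel keeps it on update.
    The dropdown value only applies to first installs and to services that
    have no recorded channel.
    """
    return recorded if recorded else default_channel


# Hypothetical channel names, for illustration only.
first_install = resolve_channel(None, "stable")
existing_service = resolve_channel("dev", "stable")
```

Under this reading, changing the dropdown never moves an already-installed service to another channel, which is what the relabeled copy is meant to convey.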

Done means: all items live on the admin VM (https://hcockpit.gent01.qa.grid.tf/hero_tfgrid_deployer/admin/) and verified in the browser, then this issue closes.

Signed-by: mik-tf <mik-tf@noreply.invalid>

Author
Owner

Four of five items are live on the admin VM (hero_os_tfgrid_deployer integration commits 8d172d3, ff3f35b, aa49126; binaries deployed and verified over the admin socket).

What changed:

  • Nodes: deployer.list_nodes now reports ok/error per chain daemon plus a daemons_unreachable count. The Nodes page and overview strip render an unreachable daemon as a red warning with the daemon's last error and a link to the admin services page, instead of an empty fleet. Verified live by briefly stopping the qa compute daemon: the response showed qa ok=false with the socket error and daemons_unreachable=1, then 3 of 3 nodes after restart.
  • Logs: the new Logs page (navbar) embeds the shared logs viewer against a read-only relay to the supervisor's log store with quick-pick source buttons; write methods are refused by the relay.
  • Control: there is now an "Admin VM services" tile that opens the admin Cockpit services page.
  • Users: the channel selector is now labeled "Default for new installs" and the subtitle says updates preserve each service's recorded channel.
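
A minimal sketch of how a per-daemon result could be turned into what the Nodes page renders. Only `ok`, the per-daemon error, and `daemons_unreachable` come from the description above; the `DaemonStatus` type, the `nodes` and `online` fields, and the `summarize_fleet` helper are assumptions made for illustration, not the actual deployer.list_nodes schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DaemonStatus:
    """Per-chain-daemon result. Field names are illustrative, not the real schema."""
    chain: str
    ok: bool
    error: Optional[str] = None
    nodes: list = field(default_factory=list)  # assumed shape: {"id": ..., "online": bool}


def summarize_fleet(daemons):
    """Build the headline and warnings a Nodes page could render (hypothetical)."""
    reachable = [d for d in daemons if d.ok]
    unreachable = [d for d in daemons if not d.ok]
    nodes = [n for d in reachable for n in d.nodes]
    online = sum(1 for n in nodes if n.get("online"))
    warnings = [
        f"{d.chain} daemon unreachable: {d.error or 'no error recorded'}"
        for d in unreachable
    ]
    return {
        "daemons_unreachable": len(unreachable),
        "headline": f"{online} of {len(nodes)} nodes online",
        "warnings": warnings,
        # Only report an empty fleet when every daemon actually answered.
        "empty_fleet": not nodes and not unreachable,
    }
```

The point of the `empty_fleet` flag is the distinction this issue is about: zero nodes from a daemon that answered is an empty fleet, while zero nodes because a daemon did not answer is a backend failure and should be shown as a warning.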

Remaining: the gated hero_router terminal exposure, which needs a nod from the router and proxy owners before wiring.

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/home#281