service.restart leaves the service stopped when the requesting client dies mid-call (self-restart case) #149

New issue

Closed

opened 2026-06-11 20:50:34 +00:00 by mik-tf · 3 comments

mik-tf commented

2026-06-11 20:50:34 +00:00

Owner

When a service asks hero_proc to restart itself over RPC, the stop half runs but the start half is lost, leaving the service inactive. Reproduced today on the sandboxfull tester: hero_cockpit_server calls service.restart for each binary of its own bundle after a self-upgrade; the admin binary restarted fine, but the restart of hero_cockpit_server itself killed the calling process mid-RPC and the service stayed stopped until a manual hero_proc service start hero_cockpit_server (its rpc.sock was gone and hero_proc reported inactive). The admin-side tester update flow already tolerates the dropped connection, but it assumes hero_proc finishes the restart it accepted. Expected behavior: once service.restart is accepted, the stop+start sequence should run to completion in a detached job that does not depend on the requesting client connection staying alive.

Signed-by: mik-tf mik-tf@noreply.invalid

When a service asks hero_proc to restart itself over RPC, the stop half runs but the start half is lost, leaving the service inactive. Reproduced today on the sandboxfull tester: hero_cockpit_server calls service.restart for each binary of its own bundle after a self-upgrade; the admin binary restarted fine, but the restart of hero_cockpit_server itself killed the calling process mid-RPC and the service stayed stopped until a manual `hero_proc service start hero_cockpit_server` (its rpc.sock was gone and hero_proc reported inactive). The admin-side tester update flow already tolerates the dropped connection, but it assumes hero_proc finishes the restart it accepted. Expected behavior: once service.restart is accepted, the stop+start sequence should run to completion in a detached job that does not depend on the requesting client connection staying alive. Signed-by: mik-tf <mik-tf@noreply.invalid>

mik-tf referenced this issue

2026-06-12 00:51:20 +00:00

All service.* RPCs wedge when one supervised service stops answering its socket #150

mik-tf referenced this issue from lhumina_code/home

2026-06-12 02:35:07 +00:00

[deployer/cockpit] Admin VM manages and updates itself through its own Cockpit (admin service bundle, machine/fleet split) #282

mik-tf referenced this issue from a commit

2026-06-12 13:29:18 +00:00

fix(services): survive severed connections during upgrade/restart/install

mik-tf commented

2026-06-12 13:40:07 +00:00

Author

Owner

Root cause found while debugging the same failure shape on the tester cockpit. service.restart in hero_proc_server (crates/hero_proc_server/src/rpc/service.rs, handle_restart) runs handle_stop await then handle_start await inline in the RPC handler. When the RPC client connection drops between those two awaits, the HTTP server drops the in-flight handler future, so the service is stopped but never started again. That is exactly what happens when the restart target is the process carrying the request itself: cockpit restarting hero_cockpit_server (this issue), or the web gateway bundle restarting hero_proxy_server, which took the whole tester web surface down until a manual start over SSH. The cockpit now runs its upgrade and restart work on a detached task so its side of the chain survives a dropped browser connection (hero_cockpit commit lhumina_code/hero_cockpit@d5aed67), but that cannot cover the cockpit restarting itself, because the process hosting the detached task dies with it. The structural fix belongs here: make service.restart immune to client disconnect, for example by spawning the stop+start sequence as a supervised job (or detached task) and returning the job id, so a half-finished restart can never strand a service stopped.

Root cause found while debugging the same failure shape on the tester cockpit. service.restart in hero_proc_server (crates/hero_proc_server/src/rpc/service.rs, handle_restart) runs handle_stop await then handle_start await inline in the RPC handler. When the RPC client connection drops between those two awaits, the HTTP server drops the in-flight handler future, so the service is stopped but never started again. That is exactly what happens when the restart target is the process carrying the request itself: cockpit restarting hero_cockpit_server (this issue), or the web gateway bundle restarting hero_proxy_server, which took the whole tester web surface down until a manual start over SSH. The cockpit now runs its upgrade and restart work on a detached task so its side of the chain survives a dropped browser connection (hero_cockpit commit https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/d5aed67), but that cannot cover the cockpit restarting itself, because the process hosting the detached task dies with it. The structural fix belongs here: make service.restart immune to client disconnect, for example by spawning the stop+start sequence as a supervised job (or detached task) and returning the job id, so a half-finished restart can never strand a service stopped.

mik-tf referenced this issue from lhumina_code/home

2026-06-12 21:36:16 +00:00

[meta] Hero Sandbox Review #275

mik-tf referenced this issue from a commit

2026-06-14 12:48:03 +00:00

fix(service): detach service.restart so a self-restart cannot strand the service

mik-tf commented

2026-06-14 12:59:41 +00:00

Author

Owner

Fix pushed to the integration branch as commit 1d297d4: service.restart now runs the stop+start on a detached task so a client that dies mid-call (a service restarting its own bundle, e.g. the cockpit upgrading itself) can no longer cancel the start half and leave the service stopped. The services page already polls and reloads once the backend answers again, so no cockpit change was needed. Keeping this open until the new hero_proc is deployed to the tester VMs and a UI-driven cockpit self-upgrade is verified live (a supervised step, since it restarts the supervisor).

Signed-by: mik-tf mik-tf@noreply.invalid

Fix pushed to the integration branch as commit 1d297d4: service.restart now runs the stop+start on a detached task so a client that dies mid-call (a service restarting its own bundle, e.g. the cockpit upgrading itself) can no longer cancel the start half and leave the service stopped. The services page already polls and reloads once the backend answers again, so no cockpit change was needed. Keeping this open until the new hero_proc is deployed to the tester VMs and a UI-driven cockpit self-upgrade is verified live (a supervised step, since it restarts the supervisor). Signed-by: mik-tf <mik-tf@noreply.invalid>

mik-tf commented

2026-06-14 13:29:48 +00:00

Author

Owner

Fixed and proven live. The fix (commit 1d297d4 on integration) detaches the stop+start in service.restart so the supervisor finishes the restart even when the requesting client dies mid-call.

Verified on a tester VM: triggered the cockpit to restart its own backend (the exact self-restart case). The cockpit process died mid-call as expected, and hero_proc completed the start on its own; the backend came back with a new pid and answered RPCs again within about a second. Before this fix the same action left the backend stopped until a manual SSH start.

Deployed to all 8 current tester VMs (supervisor restarted via the standard kill-and-relaunch sequence, with the supervised services preserved). Promotion to main follows the normal release cadence.

Signed-by: mik-tf mik-tf@noreply.invalid

Fixed and proven live. The fix (commit 1d297d4 on integration) detaches the stop+start in service.restart so the supervisor finishes the restart even when the requesting client dies mid-call. Verified on a tester VM: triggered the cockpit to restart its own backend (the exact self-restart case). The cockpit process died mid-call as expected, and hero_proc completed the start on its own; the backend came back with a new pid and answered RPCs again within about a second. Before this fix the same action left the backend stopped until a manual SSH start. Deployed to all 8 current tester VMs (supervisor restarted via the standard kill-and-relaunch sequence, with the supervised services preserved). Promotion to main follows the normal release cadence. Signed-by: mik-tf <mik-tf@noreply.invalid>

mik-tf closed this issue

2026-06-14 13:29:51 +00:00