service.restart leaves the service stopped when the requesting client dies mid-call (self-restart case) #149

Closed
opened 2026-06-11 20:50:34 +00:00 by mik-tf · 3 comments
Owner

When a service asks hero_proc to restart itself over RPC, the stop half runs but the start half is lost, leaving the service inactive. Reproduced today on the sandboxfull tester: hero_cockpit_server calls service.restart for each binary of its own bundle after a self-upgrade; the admin binary restarted fine, but the restart of hero_cockpit_server itself killed the calling process mid-RPC and the service stayed stopped until a manual hero_proc service start hero_cockpit_server (its rpc.sock was gone and hero_proc reported inactive). The admin-side tester update flow already tolerates the dropped connection, but it assumes hero_proc finishes the restart it accepted. Expected behavior: once service.restart is accepted, the stop+start sequence should run to completion in a detached job that does not depend on the requesting client connection staying alive.

Signed-by: mik-tf mik-tf@noreply.invalid

When a service asks hero_proc to restart itself over RPC, the stop half runs but the start half is lost, leaving the service inactive. Reproduced today on the sandboxfull tester: hero_cockpit_server calls service.restart for each binary of its own bundle after a self-upgrade; the admin binary restarted fine, but the restart of hero_cockpit_server itself killed the calling process mid-RPC and the service stayed stopped until a manual `hero_proc service start hero_cockpit_server` (its rpc.sock was gone and hero_proc reported inactive). The admin-side tester update flow already tolerates the dropped connection, but it assumes hero_proc finishes the restart it accepted. Expected behavior: once service.restart is accepted, the stop+start sequence should run to completion in a detached job that does not depend on the requesting client connection staying alive. Signed-by: mik-tf <mik-tf@noreply.invalid>
Author
Owner

Root cause found while debugging the same failure shape on the tester cockpit. service.restart in hero_proc_server (crates/hero_proc_server/src/rpc/service.rs, handle_restart) runs handle_stop await then handle_start await inline in the RPC handler. When the RPC client connection drops between those two awaits, the HTTP server drops the in-flight handler future, so the service is stopped but never started again. That is exactly what happens when the restart target is the process carrying the request itself: cockpit restarting hero_cockpit_server (this issue), or the web gateway bundle restarting hero_proxy_server, which took the whole tester web surface down until a manual start over SSH. The cockpit now runs its upgrade and restart work on a detached task so its side of the chain survives a dropped browser connection (hero_cockpit commit lhumina_code/hero_cockpit@d5aed67), but that cannot cover the cockpit restarting itself, because the process hosting the detached task dies with it. The structural fix belongs here: make service.restart immune to client disconnect, for example by spawning the stop+start sequence as a supervised job (or detached task) and returning the job id, so a half-finished restart can never strand a service stopped.

Root cause found while debugging the same failure shape on the tester cockpit. service.restart in hero_proc_server (crates/hero_proc_server/src/rpc/service.rs, handle_restart) runs handle_stop await then handle_start await inline in the RPC handler. When the RPC client connection drops between those two awaits, the HTTP server drops the in-flight handler future, so the service is stopped but never started again. That is exactly what happens when the restart target is the process carrying the request itself: cockpit restarting hero_cockpit_server (this issue), or the web gateway bundle restarting hero_proxy_server, which took the whole tester web surface down until a manual start over SSH. The cockpit now runs its upgrade and restart work on a detached task so its side of the chain survives a dropped browser connection (hero_cockpit commit https://forge.ourworld.tf/lhumina_code/hero_cockpit/commit/d5aed67), but that cannot cover the cockpit restarting itself, because the process hosting the detached task dies with it. The structural fix belongs here: make service.restart immune to client disconnect, for example by spawning the stop+start sequence as a supervised job (or detached task) and returning the job id, so a half-finished restart can never strand a service stopped.
Author
Owner

Fix pushed to the integration branch as commit 1d297d4: service.restart now runs the stop+start on a detached task so a client that dies mid-call (a service restarting its own bundle, e.g. the cockpit upgrading itself) can no longer cancel the start half and leave the service stopped. The services page already polls and reloads once the backend answers again, so no cockpit change was needed. Keeping this open until the new hero_proc is deployed to the tester VMs and a UI-driven cockpit self-upgrade is verified live (a supervised step, since it restarts the supervisor).

Signed-by: mik-tf mik-tf@noreply.invalid

Fix pushed to the integration branch as commit 1d297d4: service.restart now runs the stop+start on a detached task so a client that dies mid-call (a service restarting its own bundle, e.g. the cockpit upgrading itself) can no longer cancel the start half and leave the service stopped. The services page already polls and reloads once the backend answers again, so no cockpit change was needed. Keeping this open until the new hero_proc is deployed to the tester VMs and a UI-driven cockpit self-upgrade is verified live (a supervised step, since it restarts the supervisor). Signed-by: mik-tf <mik-tf@noreply.invalid>
Author
Owner

Fixed and proven live. The fix (commit 1d297d4 on integration) detaches the stop+start in service.restart so the supervisor finishes the restart even when the requesting client dies mid-call.

Verified on a tester VM: triggered the cockpit to restart its own backend (the exact self-restart case). The cockpit process died mid-call as expected, and hero_proc completed the start on its own; the backend came back with a new pid and answered RPCs again within about a second. Before this fix the same action left the backend stopped until a manual SSH start.

Deployed to all 8 current tester VMs (supervisor restarted via the standard kill-and-relaunch sequence, with the supervised services preserved). Promotion to main follows the normal release cadence.

Signed-by: mik-tf mik-tf@noreply.invalid

Fixed and proven live. The fix (commit 1d297d4 on integration) detaches the stop+start in service.restart so the supervisor finishes the restart even when the requesting client dies mid-call. Verified on a tester VM: triggered the cockpit to restart its own backend (the exact self-restart case). The cockpit process died mid-call as expected, and hero_proc completed the start on its own; the backend came back with a new pid and answered RPCs again within about a second. Before this fix the same action left the backend stopped until a manual SSH start. Deployed to all 8 current tester VMs (supervisor restarted via the standard kill-and-relaunch sequence, with the supervised services preserved). Promotion to main follows the normal release cadence. Signed-by: mik-tf <mik-tf@noreply.invalid>
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#149
No description provided.