fix(supervisor): stop service no longer leaves orphan processes

omarz commented

2026-04-30 11:48:21 +00:00

Member

Repeating service.start / service.stop on a multi-action service (e.g.
hero_whiteboard, with separate _server and _ui actions) accumulated
orphan processes parented by hero_proc_server itself, with their DB
jobs marked cancelled (pid=0) but the OS processes alive.

Two races in the stop path:

1. cancel_job killed the child first, then wrote phase=Cancelled. While
   kill_process_tree was blocking in its SIGTERM grace wait, the
   executor task's child.wait() returned and apply_exit_status read
   the still-Running phase, saw a non-zero exit + retry_policy slots
   left, and wrote phase=Retrying. The 500 ms supervisor poll then
   respawned a fresh child. cancel_job finally wrote Cancelled,
   pid=None, but a live PID was now untracked.

2. handle_stop / handle_start(replace_existing) / handle_kill /
   handle_stop_all only iterated Running/Retrying jobs. Pending jobs
   (created by service.start but not yet picked up by the next 500 ms
   poll tick) were skipped at stop, then spawned later with no parent
   intent.
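
The first race is easier to follow in a single-threaded model that replays the two orderings step by step. The sketch below is an illustration only, not hero_proc's code: the `Job` struct, the `Succeeded` and `Failed` phases, the `retries_left` field and both `cancel_*` helpers are assumptions made for the example. Only the Running, Retrying and Cancelled phases and the name `apply_exit_status` are taken from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Running,
    Retrying,
    Cancelled,
    Succeeded, // assumed name for a clean exit
    Failed,    // assumed name for "no retries left"
}

impl Phase {
    fn is_terminal(self) -> bool {
        matches!(self, Phase::Cancelled | Phase::Succeeded | Phase::Failed)
    }
}

#[derive(Debug)]
struct Job {
    phase: Phase,
    retries_left: u32, // stands in for the retry_policy slots
}

/// Models the executor's exit handler: a terminal-phase guard first,
/// then the retry decision for a non-zero exit.
fn apply_exit_status(job: &mut Job, exit_code: i32) {
    if job.phase.is_terminal() {
        return; // already Cancelled/Succeeded/Failed: leave it alone
    }
    job.phase = if exit_code == 0 {
        Phase::Succeeded
    } else if job.retries_left > 0 {
        job.retries_left -= 1;
        Phase::Retrying
    } else {
        Phase::Failed
    };
}

/// Old ordering: kill, then write Cancelled. The child's exit is handled
/// while the phase is still Running. Returns true if a poll tick in that
/// window would have seen Retrying and respawned the job.
fn cancel_kill_first(job: &mut Job) -> bool {
    apply_exit_status(job, 1); // child dies during the grace wait
    let respawned = job.phase == Phase::Retrying;
    job.phase = Phase::Cancelled; // written too late
    respawned
}

/// Reversed ordering: write Cancelled, then kill. The guard in
/// apply_exit_status now sees a terminal phase and does nothing.
fn cancel_write_first(job: &mut Job) -> bool {
    job.phase = Phase::Cancelled;
    apply_exit_status(job, 1);
    job.phase == Phase::Retrying
}
```

In this model both orderings leave the job in Cancelled, which is why the database looked correct; only the kill-first ordering opens the window in which a respawn can happen.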

Fixes:

- cancel_job writes Cancelled BEFORE killing. apply_exit_status's
  existing terminal-phase guard now fires and skips the Retrying
  write.
- run_job_regular and run_job_pty re-check the phase between
  cmd.spawn() and the Running write. If terminal, kill the
  freshly-spawned child and bail; this closes the narrower window
  where stop arrives between spawn and the DB update.
- Added service_non_terminal_jobs (Pending/Waiting/Running/Retrying)
  alongside service_running_jobs and switched the four cancellation
  call sites to it. service_running_jobs is kept for status handlers,
  where the narrower meaning is correct.
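
The second and third fixes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual hero_proc implementation: the `Job` struct, the slice-based signatures, the `spawn_unless_cancelled` helper and its `read_phase` closure are hypothetical, and only Cancelled is modelled as a terminal phase. The names `service_running_jobs` and `service_non_terminal_jobs` and the phase names come from the list above. The process part assumes a Unix-like system for the usage example.

```rust
use std::io;
use std::process::{Child, Command};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Pending,
    Waiting,
    Running,
    Retrying,
    Cancelled,
}

impl Phase {
    fn is_terminal(self) -> bool {
        // Only Cancelled is modelled here; a real system has more terminal phases.
        matches!(self, Phase::Cancelled)
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Job {
    id: u32,
    service: String,
    phase: Phase,
}

/// Narrow view, suitable for status reporting: Running or Retrying only.
fn service_running_jobs<'a>(jobs: &'a [Job], service: &str) -> Vec<&'a Job> {
    jobs.iter()
        .filter(|j| j.service == service)
        .filter(|j| matches!(j.phase, Phase::Running | Phase::Retrying))
        .collect()
}

/// Wide view, suitable for cancellation: every job that could still
/// run or be spawned, including Pending and Waiting.
fn service_non_terminal_jobs<'a>(jobs: &'a [Job], service: &str) -> Vec<&'a Job> {
    jobs.iter()
        .filter(|j| j.service == service)
        .filter(|j| !j.phase.is_terminal())
        .collect()
}

/// Spawns the command, then re-reads the phase before the caller records
/// Running. If a stop landed in between, the new child is killed and
/// reaped and Ok(None) is returned.
fn spawn_unless_cancelled(
    cmd: &mut Command,
    read_phase: impl Fn() -> Phase,
) -> io::Result<Option<Child>> {
    let mut child = cmd.spawn()?;
    if read_phase().is_terminal() {
        child.kill()?;
        child.wait()?; // reap so no zombie is left behind
        return Ok(None);
    }
    Ok(Some(child))
}
```

A caller might pass `Command::new("sleep").arg("5")` together with a closure that reads the job's phase from the database. A stop handler would iterate the wide view, while a status handler would keep using the narrow one.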

Verified: 5 sequential and 10 rapid (300 ms) start/stop cycles, plus
5 cycles via hero_whiteboard --start/--stop (the SDK restart_service
path). Zero orphans in all runs; final state always matches the DB.

omarz added 1 commit

2026-04-30 11:48:21 +00:00

fix(supervisor): stop service no longer leaves orphan processes

Tests / test (pull_request) Successful in 3m51s

Build and Test / build (pull_request) Successful in 4m30s

de1c28bc14

omarz commented

2026-04-30 11:52:35 +00:00

Author

Member

fix #61

fix https://forge.ourworld.tf/lhumina_code/hero_proc/issues/61

omarz merged commit 2266b1c0ba into development

2026-04-30 11:52:52 +00:00

omarz deleted branch development_supervisor_orphan_fix

2026-04-30 11:52:52 +00:00

fix(supervisor): stop service no longer leaves orphan processes #79