fix(supervisor): stop service no longer leaves orphan processes #79

Merged
omarz merged 1 commit from development_supervisor_orphan_fix into development 2026-04-30 11:52:52 +00:00
Member

Repeating service.start / service.stop on a multi-action service (e.g.
hero_whiteboard, with separate _server and _ui actions) accumulated
orphan processes parented by hero_proc_server itself, with their DB
jobs marked cancelled (pid=0) but the OS processes alive.

Two races in the stop path:

  1. cancel_job killed the child first, then wrote phase=Cancelled. While
    kill_process_tree was blocking in its SIGTERM grace wait, the
    executor task's child.wait() returned and apply_exit_status read
    the still-Running phase, saw a non-zero exit with retry_policy
    slots left, and wrote phase=Retrying. The 500 ms supervisor poll
    then respawned a fresh child. cancel_job finally wrote Cancelled
    with pid=None, but by then a live PID was untracked.

  2. handle_stop / handle_start(replace_existing) / handle_kill /
    handle_stop_all only iterated Running/Retrying jobs. Pending jobs
    (created by service.start but not yet picked up by the next 500 ms
    poll tick) were skipped at stop, then spawned later even though
    nothing still intended them to run.
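
For readers who want to see the first race in isolation, here is a minimal, deterministic sketch of the two orderings. It is not the hero_proc code: JobStore, the Succeeded and Failed phases, the function signatures, and cancel_kill_first / cancel_write_first are hypothetical stand-ins, and the interleaving that happens across tasks in the real supervisor is replayed here as sequential calls.

```rust
use std::collections::HashMap;

/// Job phases. Running, Retrying and Cancelled are named in this PR;
/// Succeeded and Failed are assumptions added to make the sketch complete.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Running,
    Retrying,
    Succeeded,
    Failed,
    Cancelled,
}

impl Phase {
    pub fn is_terminal(self) -> bool {
        matches!(self, Phase::Succeeded | Phase::Failed | Phase::Cancelled)
    }
}

/// Hypothetical in-memory stand-in for the job table.
#[derive(Default)]
pub struct JobStore {
    phases: HashMap<u64, Phase>,
}

impl JobStore {
    pub fn set(&mut self, id: u64, phase: Phase) {
        self.phases.insert(id, phase);
    }
    pub fn get(&self, id: u64) -> Option<Phase> {
        self.phases.get(&id).copied()
    }
}

/// Exit handler with a terminal-phase guard: a job that is already in a
/// terminal phase is never moved back to Retrying.
pub fn apply_exit_status(store: &mut JobStore, id: u64, exit_code: i32, retries_left: u32) {
    let Some(phase) = store.get(id) else { return };
    if phase.is_terminal() {
        return;
    }
    let next = if exit_code == 0 {
        Phase::Succeeded
    } else if retries_left > 0 {
        Phase::Retrying
    } else {
        Phase::Failed
    };
    store.set(id, next);
}

/// Old ordering: kill, then write Cancelled. The exit handler runs during
/// the grace wait and still sees Running. Returns true if a retry (and
/// therefore a respawn) was scheduled before Cancelled was written.
pub fn cancel_kill_first(store: &mut JobStore, id: u64) -> bool {
    apply_exit_status(store, id, 1, 1); // child dies, executor's wait returns
    let respawn_scheduled = store.get(id) == Some(Phase::Retrying);
    store.set(id, Phase::Cancelled); // written too late
    respawn_scheduled
}

/// New ordering: write Cancelled, then kill. The guard now fires.
pub fn cancel_write_first(store: &mut JobStore, id: u64) -> bool {
    store.set(id, Phase::Cancelled);
    apply_exit_status(store, id, 1, 1); // child dies, guard skips the write
    store.get(id) == Some(Phase::Retrying)
}
```

In this model both orderings end with the job in Cancelled, which is why the DB looked correct; only the kill-first ordering passes through Retrying on the way, and that is the window in which the poll loop can respawn a child.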

Fixes:

  • cancel_job writes Cancelled BEFORE killing. apply_exit_status's
    existing terminal-phase guard now fires and skips the Retrying
    write.
  • run_job_regular and run_job_pty re-check the phase between
    cmd.spawn() and the Running write. If terminal, kill the freshly-
    spawned child and bail — closes the narrower window where stop
    arrives between spawn and the DB update.
  • Added service_non_terminal_jobs (Pending/Waiting/Running/Retrying)
    alongside service_running_jobs and switched the four cancellation
    call sites to it. service_running_jobs is kept for status handlers,
    where the narrower meaning is correct.
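
The second and third fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the Job struct, the slice-based signatures, the Succeeded and Failed phases, and the finish_spawn helper are hypothetical. Only the function names service_running_jobs and service_non_terminal_jobs and the phase sets they cover come from the description above.

```rust
/// Pending, Waiting, Running, Retrying and Cancelled are named in this PR;
/// Succeeded and Failed are assumed terminal phases.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pending,
    Waiting,
    Running,
    Retrying,
    Succeeded,
    Failed,
    Cancelled,
}

impl Phase {
    /// The narrower set the cancellation paths used to iterate.
    pub fn is_running(self) -> bool {
        matches!(self, Phase::Running | Phase::Retrying)
    }
    /// Everything that may still produce or own an OS process.
    pub fn is_non_terminal(self) -> bool {
        matches!(
            self,
            Phase::Pending | Phase::Waiting | Phase::Running | Phase::Retrying
        )
    }
}

/// Hypothetical job row.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Job {
    pub id: u64,
    pub service: String,
    pub phase: Phase,
}

/// Narrow query, still appropriate for status handlers.
pub fn service_running_jobs<'a>(jobs: &'a [Job], service: &str) -> Vec<&'a Job> {
    jobs.iter()
        .filter(|j| j.service == service && j.phase.is_running())
        .collect()
}

/// Wider query for the cancellation call sites: also catches jobs that
/// have been created but not yet picked up by the poll tick.
pub fn service_non_terminal_jobs<'a>(jobs: &'a [Job], service: &str) -> Vec<&'a Job> {
    jobs.iter()
        .filter(|j| j.service == service && j.phase.is_non_terminal())
        .collect()
}

#[derive(Debug, PartialEq, Eq)]
pub enum SpawnOutcome {
    Started,
    KilledAfterSpawn,
}

/// Post-spawn re-check: `phase_after_spawn` is the phase read back after
/// the child was spawned but before Running is written. If a stop landed
/// in that window, kill the fresh child instead of recording it.
pub fn finish_spawn(phase_after_spawn: Phase, kill_child: impl FnOnce()) -> SpawnOutcome {
    if phase_after_spawn.is_non_terminal() {
        SpawnOutcome::Started
    } else {
        kill_child();
        SpawnOutcome::KilledAfterSpawn
    }
}
```

With a Pending and a Running job for the same service, service_running_jobs returns only the Running one, while service_non_terminal_jobs returns both, so a stop issued before the next poll tick also cancels the Pending job.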

Verified: 5 sequential and 10 rapid (300 ms) start/stop cycles, plus
5 cycles via hero_whiteboard --start/--stop (the SDK restart_service
path). Zero orphans in all runs; final state always matches the DB.

fix(supervisor): stop service no longer leaves orphan processes
All checks were successful
Tests / test (pull_request) Successful in 3m51s
Build and Test / build (pull_request) Successful in 4m30s
de1c28bc14
Author
Member

fix #61

fix https://forge.ourworld.tf/lhumina_code/hero_proc/issues/61
omarz merged commit 2266b1c0ba into development 2026-04-30 11:52:52 +00:00
omarz deleted branch development_supervisor_orphan_fix 2026-04-30 11:52:52 +00:00
Reference
lhumina_code/hero_proc!79