fix(supervisor): stop service no longer leaves orphan processes #79
Repeating service.start / service.stop on a multi-action service (e.g.
hero_whiteboard, with separate _server and _ui actions) accumulated
orphan processes parented by hero_proc_server itself: their DB jobs
were marked cancelled (pid=0) while the OS processes stayed alive.
Two races in the stop path:

1. cancel_job killed the child first, then wrote phase=Cancelled. While
   kill_process_tree was blocking in its SIGTERM grace wait, the
   executor task's child.wait() returned and apply_exit_status read the
   still-Running phase, saw a non-zero exit with retry_policy slots
   left, and wrote phase=Retrying. The 500 ms supervisor poll then
   respawned a fresh child. cancel_job finally wrote Cancelled,
   pid=None, but a live PID was now untracked.

2. handle_stop / handle_start(replace_existing) / handle_kill /
   handle_stop_all only iterated Running/Retrying jobs. Pending jobs
   (created by service.start but not yet picked up by the next 500 ms
   poll tick) were skipped at stop, then spawned later with no parent
   intent.
Fixes:

1. cancel_job now writes phase=Cancelled before killing the process
   tree, so the existing terminal-phase guard fires and skips the
   Retrying write.
2. The executor re-checks the job phase between cmd.spawn() and the
   Running write. If terminal, it kills the freshly-spawned child and
   bails, closing the narrower window where stop arrives between spawn
   and the DB update.
3. A broader cancellation query that also includes Pending jobs was
   added alongside service_running_jobs, and the four cancellation
   call sites were switched to it. service_running_jobs is kept for
   status handlers, where the narrower meaning is correct.
Verified: 5 sequential and 10 rapid (300 ms) start/stop cycles, plus
5 cycles via hero_whiteboard --start/--stop (the SDK restart_service
path). Zero orphans in all runs; final state always matches the DB.
Fixes #61