Orphan supervised processes accumulate after stop/restart cycles #61
Observed
On a long-running multi-user box (138.201.206.39, 2026-04-29),
`pgrep -af` for supervised binaries returns multiple PIDs per service — far more than hero_proc's `proc service list` knows about. For user salma (current snapshot):
That's 7 hero_collab orphans + 2 hero_aibroker_server orphans for one user. Confirmed not collab-specific — pattern affects multiple service types.
Pattern
PIDs cluster around restart events (1462843/1462875/1463221 grouped, then 1527585, then 1556820, then 1558829/1558830). Each restart spawned new processes but didn't terminate the old ones.
Hypothesis
`proc service stop` and/or `proc service restart` doesn't reliably kill the entire process tree before spawning a new one.

Why this matters
`hero_collab_ui` processes hold open `~/hero/var/sockets/hero_collab/ui.sock`, blocking new instances from binding.
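To make the binding conflict concrete, a minimal Python sketch (the socket path is from above; the unlink-if-dead policy is an assumption, not hero_proc's actual behavior). A dead orphan only leaves a stale file, which can be unlinked; a live orphan still accepting on the path is what actually blocks a new instance:

```python
import os
import socket

SOCK = os.path.expanduser("~/hero/var/sockets/hero_collab/ui.sock")

def uds_is_live(path: str) -> bool:
    """True if some process (possibly an orphan) still accepts on the UDS."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return True
    except (ConnectionRefusedError, FileNotFoundError):
        return False
    finally:
        s.close()

# Stale file from a dead process: safe to unlink before bind().
# Live orphan: bind() will fail until the orphan is killed.
if os.path.exists(SOCK) and not uds_is_live(SOCK):
    os.unlink(SOCK)
```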
Repro / diagnostic

1. `pgrep -af` the service binary
2. `proc service restart <name>`
3. `pgrep -af` again — sometimes both the old and the new PID coexist

The trigger is intermittent on this box — sometimes restart is clean, sometimes not. A stress-test loop while watching `pgrep` should isolate the conditions, as in the sketch below.

Suggested fix direction
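A minimal version of that loop in Python (the service name is a placeholder; assumes `pgrep` and the `proc` CLI from above are on PATH):

```python
import subprocess
import time

SERVICE = "hero_collab"  # placeholder; any supervised singleton works

def live_pids(pattern: str) -> set[str]:
    # pgrep exits non-zero on no match; treat that as an empty set
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    return set(out.stdout.split())

for cycle in range(200):
    before = live_pids(SERVICE)
    subprocess.run(["proc", "service", "restart", SERVICE], check=True)
    time.sleep(2)  # let the restart settle before sampling again
    survivors = before & live_pids(SERVICE)
    if survivors:
        print(f"cycle {cycle}: old PIDs survived restart: {sorted(survivors)}")
```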
Before spawning a replacement, the supervisor should:

- TERM the entire process group
- confirm exit with `waitpid` or a `/proc/<pid>` check
- escalate to KILL on timeout
- only then mark the slot free
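A sketch of that stop sequence, assuming the supervisor spawns each service as its direct child in a dedicated process group (e.g. `start_new_session=True` / `setsid` at spawn time); the 10 s timeout is illustrative:

```python
import os
import signal
import time

def stop_tree(lead_pid: int, timeout: float = 10.0) -> None:
    """TERM the service's process group, wait for the direct child to
    exit, escalate to KILL on timeout, and only then free the slot."""
    pgid = os.getpgid(lead_pid)
    os.killpg(pgid, signal.SIGTERM)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pid, _ = os.waitpid(lead_pid, os.WNOHANG)  # non-blocking reap
        if pid == lead_pid:
            break
        time.sleep(0.1)
    else:
        os.killpg(pgid, signal.SIGKILL)  # timeout: escalate
        os.waitpid(lead_pid, 0)          # reap the killed child

    # Grandchildren are not our children, so waitpid can't see them;
    # probe the group (signal 0 = existence check) before marking the
    # slot free.
    try:
        os.killpg(pgid, 0)
        raise RuntimeError(f"process group {pgid} still has members")
    except ProcessLookupError:
        pass  # tree fully gone
```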
For already-orphaned processes, a `proc service reap` command (or auto-reap-on-startup based on UDS ownership) would help recover stuck boxes; a sketch of such a sweep follows.
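What a reap pass could scan for, sketched in Python; the registered-binary set and the supervisor PID lookup are assumptions. Note the kernel truncates `comm` to 15 characters, which is why `hero_livekit_se` appears in the evidence below:

```python
import os

# Hypothetical registry: comm values (15-char truncated) of service binaries
REGISTERED = {"hero_collab", "hero_aibroker_s", "hero_livekit_se"}

def orphaned_pids(supervisor_pid: int):
    """Yield PIDs whose comm matches a registered service binary but
    whose parent is no longer the supervisor (i.e. leaked orphans)."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
            with open(f"/proc/{entry}/stat") as f:
                # format: pid (comm) state ppid ...; comm may contain
                # spaces, so split on the *last* ") "
                fields = f.read().rsplit(") ", 1)[1].split()
            ppid = int(fields[1])
        except OSError:
            continue  # process exited mid-scan; skip it
        if comm in REGISTERED and ppid != supervisor_pid:
            yield int(entry)
```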
Confirmed in the wild — heavy accumulation on a long-running box

Hit this hard on one of my hosts. After ~8+ days of restarts, `ps` shows multiple live generations of nearly every singleton service still resident, plus a textbook `SIGCHLD`/`waitpid` leak. Posting evidence in case it helps narrow down where in the supervisor the bug lives.

Duplicate live parents (newest PID is the one currently bound to the UDS; the rest are orphans)
`hero_code_serve`, `hero_db_ui`, `hero_db_server`, `hero_osis`/`hero_osis_ui`, `hero_slides_*`, `hero_whiteboard`, `hero_collab_*`, `hero_logic_*`, `hero_books_*`, `hero_indexer_*`, `hero_aibroker_*`, `hero_biz*`, `hero_voice_*`, `hero_agent_*`, `hero_foundry_*`, `hero_embedder_*`, `hero_proxy_ui`, `hero_livekit_*`, `hero_os_*`, `hero_codescaler`

That's roughly 2–3 generations of most services. Aggregate RSS of the orphan generations is ~1.5 GB.
Smoking gun: an actual zombie
Parent `405098` is an old `hero_livekit_se` (the current live one is `1855273`). So the old supervisor child:

- never reaped its `livekit-server` subprocess (`SIGCHLD` ignored / no `waitpid`), and
- kept running after the replacement `hero_livekit_se` was spawned.

Both halves of #61 in one PID pair.
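Two standard countermeasures for exactly this pair of leaks, sketched here (not hero_proc code; the `livekit-server` invocation is only an example). The reaping handler prevents zombies like the one above; `PR_SET_PDEATHSIG` makes a grandchild exit when the process that spawned it dies:

```python
import ctypes
import os
import signal
import subprocess

# 1. Reap any exited children so they can't linger as zombies.
def _reap(_signum, _frame):
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, _reap)

# 2. Tie a grandchild's lifetime to its parent: PR_SET_PDEATHSIG (== 1)
# delivers the given signal when the spawning thread dies.
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def _set_pdeathsig():
    libc.prctl(1, signal.SIGTERM)  # PR_SET_PDEATHSIG = 1

proc = subprocess.Popen(["livekit-server"], preexec_fn=_set_pdeathsig)
```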
`hero_code_serve` worker pools are doubled

The two live parents each carry a full ~30-worker pool:

- `404145` → children `404162–404193` (worker RSS ~1.4 MB, paged out)
- `1740421` → children `1740438–1740469` (worker RSS ~11 MB, hot)

Two complete pools fighting for the same UDS — exactly the symptom predicted in the issue body.
Implication for the fix
The proposed direction (TERM the process group, `waitpid` until exit, escalate to KILL on timeout, then mark the slot free) matches what the evidence shows is missing. Two extra things worth folding in based on this data:

- A periodic sweep of `/proc` for processes whose `comm` matches a registered service binary but whose `ppid != hero_proc` would catch these.
- A `SIGCHLD` handler / `waitpid` loop in the supervisor children, or `prctl(PR_SET_PDEATHSIG, SIGTERM)` on grandchildren — the livekit zombie shows the leak isn't only at the supervisor → service boundary, it's also at service → grandchild.

@omarz has already fixed it