Orphan supervised processes accumulate after stop/restart cycles #61

Open
opened 2026-04-29 03:32:46 +00:00 by sameh-farouk · 2 comments
Member

Observed

On a long-running multi-user box (138.201.206.39, 2026-04-29), pgrep -af for supervised binaries returns multiple PIDs per service — far more than hero_proc's proc service list knows about.

For user salma (current snapshot):

913095  /home/salma/hero/bin/hero_collab_ui
1462843 /home/salma/hero/bin/hero_collab_ui
1462875 /home/salma/hero/bin/hero_collab_server
1463221 /home/salma/hero/bin/hero_aibroker_server
1527585 /home/salma/hero/bin/hero_collab_ui
1556820 /home/salma/hero/bin/hero_collab_ui
1558779 /home/salma/hero/bin/hero_aibroker_server
1558829 /home/salma/hero/bin/hero_collab_ui
1558830 /home/salma/hero/bin/hero_collab_server

That's 7 hero_collab orphans + 2 hero_aibroker_server orphans for one user. Confirmed not collab-specific — pattern affects multiple service types.

Pattern

PIDs cluster around restart events (1462843/1462875/1463221 grouped, then 1527585, then 1556820, then 1558829/1558830). Each restart spawned new processes but didn't terminate the old ones.

Hypothesis

proc service stop and/or proc service restart doesn't reliably kill the entire process tree before spawning a new one. Possibilities:

  • SIGTERM sent to the parent PID alone is never propagated to its children, so unless the supervisor signals the whole process group, workers and double-forked grandchildren survive (a minimal demonstration follows this list)
  • Retry policy spawns a new instance before the previous instance fully exits, and the supervisor loses track of the original PID
  • Race: SIGTERM is sent, the retry policy spawns a replacement, the original's exit signal is lost
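
To make the first bullet concrete, here is a minimal, self-contained demonstration in plain POSIX C (not hero_proc code; every name in it is illustrative): a TERM delivered to a single PID terminates only that process, while its children are silently re-parented and keep running.

```c
/* Demonstration only: SIGTERM sent to one PID is not propagated to that
 * process's children. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t service = fork();
    if (service == 0) {            /* plays the supervised service */
        if (fork() == 0) {         /* plays a worker spawned by the service */
            sleep(30);             /* outlives its parent by ~30 s */
            _exit(0);
        }
        pause();                   /* wait until the "supervisor" stops us */
        _exit(0);
    }
    sleep(1);                      /* let the small process tree come up */
    kill(service, SIGTERM);        /* what a naive "stop" does */
    waitpid(service, NULL, 0);
    printf("service %d exited, but its worker is still alive\n", (int)service);
    /* pgrep -af would still show the orphaned worker at this point. */
    return 0;
}
```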

Why this matters

  • Memory waste: ~50–150 MB per orphan × 9 orphans ≈ 0.5–1.3 GB on this user alone
  • UDS conflicts: orphan hero_collab_ui processes hold open ~/hero/var/sockets/hero_collab/ui.sock, blocking new instances from binding
  • Stale state: orphans hold cached config in memory (e.g., the livekit.secret cache issue tracked separately) — the new "supervised" instance reads fresh config but the orphan keeps serving stale traffic if it owns the socket
  • Cumulative drift: 8+ days of restarts on this box left this user with 9 orphans

Repro / diagnostic

  1. Start a hero_proc-managed service
  2. Note its PID via pgrep -af
  3. proc service restart <name>
  4. Compare pgrep -af — sometimes both old and new PID coexist

The trigger is intermittent on this box — sometimes restart is clean, sometimes not. A stress-test loop while watching pgrep should isolate the conditions.

Suggested fix direction

Before spawning a replacement, the supervisor should:

  1. Send SIGTERM to the entire process group (negative PID), not just the parent
  2. Wait for actual exit (or kill timeout) — confirmed via waitpid or /proc/<pid> check
  3. Only then mark the slot available for a new spawn (see the sketch below)
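
A minimal sketch of that stop sequence in plain POSIX C, assuming the service was launched with setpgid(0, 0) in the child so the whole tree shares a process group whose ID equals the service PID; the function name and timeout constant are illustrative, not hero_proc's actual API:

```c
/* Sketch only: TERM the group, wait for the direct child to really exit,
 * escalate to KILL on timeout, and only then free the slot. */
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define STOP_TIMEOUT_SEC 10   /* grace period before SIGKILL (illustrative) */

static bool stop_service_tree(pid_t pid) {
    kill(-pid, SIGTERM);                        /* 1. TERM the whole group */

    for (int i = 0; i < STOP_TIMEOUT_SEC * 10; i++) {
        pid_t r = waitpid(pid, NULL, WNOHANG);  /* 2. wait for a real exit */
        if (r == pid || (r == -1 && errno == ECHILD))
            return true;                        /* 3. slot may be reused now */
        struct timespec ts = {0, 100 * 1000 * 1000};
        nanosleep(&ts, NULL);                   /* poll every 100 ms */
    }

    kill(-pid, SIGKILL);                        /* escalate after the timeout */
    return waitpid(pid, NULL, 0) == pid;        /* 3. then, and only then, respawn */
}
```

Using waitpid() on the direct child (rather than only checking /proc) also guarantees the exit status is consumed, so the old PID cannot linger as a zombie.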

For already-orphaned processes, a proc service reap command (or auto-reap-on-startup based on UDS ownership) would help recover stuck boxes.
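
As a rough illustration of what such a reaper could do at startup, here is a sketch that walks /proc and terminates anything running a registered service binary that is not a child of the supervisor. It assumes the supervisor knows the absolute binary paths; a UDS-ownership variant would work the same way but key on the socket instead. None of these names come from hero_proc itself.

```c
/* Sketch of an orphan reaper; assumes per-user operation, so /proc/<pid>/exe
 * is readable for the processes we care about. */
#include <dirent.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Terminate every process whose /proc/<pid>/exe resolves to `binary`
 * and whose parent is not `supervisor_pid`. */
static void reap_orphans(const char *binary, pid_t supervisor_pid) {
    DIR *proc = opendir("/proc");
    struct dirent *de;
    if (!proc)
        return;

    while ((de = readdir(proc)) != NULL) {
        pid_t pid = (pid_t)atoi(de->d_name);
        if (pid <= 0)
            continue;

        char link[64], exe[PATH_MAX];
        snprintf(link, sizeof link, "/proc/%d/exe", (int)pid);
        ssize_t n = readlink(link, exe, sizeof exe - 1);
        if (n <= 0)
            continue;
        exe[n] = '\0';
        if (strcmp(exe, binary) != 0)
            continue;

        char statp[64];
        snprintf(statp, sizeof statp, "/proc/%d/stat", (int)pid);
        FILE *f = fopen(statp, "r");
        if (!f)
            continue;
        int ppid = 0;
        /* naive parse of field 4; acceptable here because these comm
         * values contain no spaces */
        fscanf(f, "%*d %*s %*c %d", &ppid);
        fclose(f);

        if ((pid_t)ppid != supervisor_pid) {
            fprintf(stderr, "reaping orphan %d (%s)\n", (int)pid, binary);
            kill(pid, SIGTERM);   /* escalate to SIGKILL after a grace period */
        }
    }
    closedir(proc);
}
```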

mahmoud self-assigned this 2026-04-30 10:41:26 +00:00
mahmoud added this to the ACTIVE project 2026-04-30 10:41:30 +00:00
mahmoud added this to the now milestone 2026-04-30 10:41:33 +00:00
Owner

Confirmed in the wild — heavy accumulation on a long-running box

Hit this hard on one of my hosts. After ~8+ days of restarts, ps shows multiple live generations of nearly every singleton service still resident, plus a textbook SIGCHLD/waitpid leak. Posting evidence in case it helps narrow down where in the supervisor the bug lives.

Duplicate live parents (newest PID is the one currently bound to the UDS; the rest are orphans)

| Service | Live parent PIDs | Expected |
|---|---|---|
| hero_code_serve | 404145, 1740421 | 1 |
| hero_db_ui | 404738, 1533338, 1740117 | 1 |
| hero_db_server | 404756, 1533546, 1740321 | 1 |
| hero_osis / hero_osis_ui | 404963/4, 811011/2, 1803707/8 | 1 each |
| hero_slides_* | 405688/9, 1787034/5, 1804901/2 | 1 each |
| hero_whiteboard | 405777/8, 1533319/28, 1805176/7 | 2 (ui+server) |
| hero_collab_*, hero_logic_*, hero_books_*, hero_indexer_*, hero_aibroker_*, hero_biz*, hero_voice_*, hero_agent_*, hero_foundry_*, hero_embedder_*, hero_proxy_ui, hero_livekit_*, hero_os_*, hero_codescaler | 2× each | 1 each |

That's roughly 2–3 generations of most services. Aggregate RSS of the orphan generations is ~1.5 GB.

Smoking gun: an actual zombie

PID    │ PPID   │ command        │ state  │ RSS
405150 │ 405098 │ livekit-server │ Zombie │ 0 B

Parent 405098 is an old hero_livekit_se (the current live one is 1855273). So the old supervisor child:

  1. never reaped its own livekit-server subprocess (SIGCHLD ignored / no waitpid), and
  2. was itself never killed when the new hero_livekit_se was spawned.

Both halves of #61 in one PID pair.

hero_code_serve worker pools are doubled

The two live parents each carry a full ~30-worker pool:

  • old parent 404145 → children 404162–404193 (worker RSS ~1.4 MB, paged out)
  • new parent 1740421 → children 1740438–1740469 (worker RSS ~11 MB, hot)

Two complete pools fighting for the same UDS — exactly the symptom predicted in the issue body.

Implication for the fix

The proposed direction (TERM the process group, waitpid until exit, escalate to KILL on timeout, then mark slot free) matches what the evidence shows is missing. Two extra things worth folding in based on this data:

  1. Reaper for already-orphaned PIDs at startup — boxes that have been running through the buggy version need a way to clean up without a manual sweep. Walking /proc for processes whose comm matches a registered service binary but whose ppid != hero_proc would catch these.
  2. Explicit SIGCHLD handler / waitpid loop in the supervisor children, or prctl(PR_SET_PDEATHSIG, SIGTERM) on grandchildren — the livekit zombie shows the leak isn't only at the supervisor → service boundary, it's also at service → grandchild. A sketch of both follows.
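
A rough sketch of both mitigations in plain Linux C (illustrative only, not hero_proc's actual code; PR_SET_PDEATHSIG is Linux-specific):

```c
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

/* (a) In the supervisor child that spawns grandchildren (e.g. the process
 *     that execs livekit-server): reap every exited grandchild so none of
 *     them can linger as a zombie. */
static void sigchld_handler(int sig) {
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;                          /* reap all exited children, non-blocking */
}

static void install_reaper(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);
}

/* (b) In the grandchild, right after fork() and before exec(): ask the
 *     kernel to send us SIGTERM if our direct parent dies, so a leaked
 *     grandchild cannot outlive a restart of its parent. */
static void die_with_parent(void) {
    prctl(PR_SET_PDEATHSIG, SIGTERM);
    if (getppid() == 1)            /* parent already gone before prctl ran */
        raise(SIGTERM);
}
```

PDEATHSIG complements, rather than replaces, group-wide signalling on stop, since it only helps for grandchildren whose parent actually exits.
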
mahmoud removed their assignment 2026-04-30 11:22:40 +00:00
Owner

@omarz has already fixed it
