fix(BrowserPool): idle reaper + Drop cleanup + startup sweep + Chrome tree-kill #19

Closed
sameh-farouk wants to merge 1 commit from fix/browser-pool-lifecycle into development

What

Closes #18.

The original PR (commit db98124) added an idle-timeout reaper, a startup sweep, a BrowserInstance::Drop, and per-browser last_activity tracking. End-to-end testing with a real Chrome on macOS surfaced a deeper leak: the chromelambda-{id} directory reappears ~10 ms after we delete it and stays alive indefinitely.

This PR keeps the original four mechanisms and adds a fifth — a properly engineered Chrome process-tree teardown — that closes the leak.

Diagnosis (verified with traces, not guessed)

tokio::process::Child::kill().await (which chromiumoxide::Browser::kill resolves to) only signals the launcher PID, not Chrome's helpers (renderer, GPU, utility). See rust-lang/rust#115241. chromiumoxide's spawn Command does not call process_group or pre_exec, so when the launcher dies the helpers reparent to init and keep running long enough to recreate chromelambda-{id}/Default/Cache/Cache_Data/index-dir/ and re-write the-real-index.

The trace, run against the original code:

[TRACE close remove_dir_all]   before_existed=true  result=Ok  after_exists=false
[TRACE drop remove_dir_all]    before=false  result=Err(NotFound)  after=false
[TRACE close postpoll] +10ms   exists=true  contents=Some(1)
[TRACE close postpoll] +500ms  exists=true  contents=Some(1)

Cleanup succeeded. Then a helper recreated the dir 10 ms later and it persisted indefinitely.

Engineering options considered

| Approach | Correctness | Cost |
|---|---|---|
| UserDataDirGuard + field-drop order + 100 ms sleep | Passes locally, but racy under load — helpers can outlive the fixed 100 ms sleep | Small |
| A. Spawn Chrome ourselves with Command::process_group(0) + Browser::connect, kill via killpg(pgid, SIGKILL) | Cleanest | Forces us to replicate chromiumoxide's private launch-arg construction (DEFAULT_ARGS, --remote-debugging-port, sandbox/headless flags) since BrowserConfig::launch is not customizable — high maintenance |
| B. Walk Chrome's descendant tree post-spawn, SIGKILL each, retry until empty (this PR) | Standard Unix idiom for tree-killing when process_group wasn't set at spawn — see chromedp / Aymeric Beaumet and the puppeteer ecosystem's tree-kill solutions | Confined to our wrapper code; chromiumoxide stays a black box |

Going with B. A is the right architectural fix long-term — I'll file a follow-up suggesting upstream chromiumoxide expose BrowserConfig::process_group(...) so we can migrate.
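For context on why A is attractive, here is a minimal, self-contained sketch of the process-group approach on Unix, using sleep as a stand-in for Chrome since chromiumoxide owns the real spawn. Nothing in it is this PR's code; it only illustrates what killpg buys once process_group(0) is set at spawn:

```rust
// Approach A sketch: put the child in its own process group at spawn, then
// tear the whole group down with killpg(2). `sleep` stands in for Chrome.
use std::os::unix::process::CommandExt; // process_group(), stable since Rust 1.64
use std::process::Command;

fn main() -> std::io::Result<()> {
    let mut child = Command::new("sleep")
        .arg("30")
        .process_group(0) // new group; its pgid equals the child's pid
        .spawn()?;

    // Anything the child forks inherits the group, so one killpg reaches
    // every helper; no tree walk is needed.
    let pgid = child.id() as libc::pid_t;
    unsafe { libc::killpg(pgid, libc::SIGKILL) };

    child.wait()?; // reap the direct child
    Ok(())
}
```

The cost column above is exactly what this sketch hides: getting Chrome to that point means reproducing chromiumoxide's private launch-arg construction ourselves.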

What changed

BrowserInstance:

  • browser: tokio::sync::RwLock<Option<Browser>> instead of bare Browser — close() takes the write lock and take()s the value out; page ops (new_page) hold a read lock concurrently (see the sketch after this list).
  • chrome_pid: Option<u32> captured at construction time (browser.get_mut_child().and_then(|c| c.inner.id())) so close() can walk descendants without &mut Browser.
  • close() ordering, exact:
    1. Close pages.
    2. let Some(mut browser) = guard.take() else { return Ok(()); } — idempotent.
    3. kill_descendant_tree(pid) — bounded retry, SIGKILLs every descendant.
    4. browser.kill().await — reaps the launcher.
    5. remove_dir_all(user_data_dir).
    6. verify_dir_stays_dead: poll for ~200 ms; if a helper survived step 3 and recreated the dir, repeat.
  • Drop is now sync remove_dir_all only — full tree-kill needs an await and lives in close().
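
To make the locking and idempotence concrete, here is a minimal sketch of the take-under-write-lock shape with a placeholder Browser type; the step comments mirror the list above, but the bodies are illustrative rather than this PR's code:

```rust
use tokio::sync::RwLock;

struct Browser; // stand-in for chromiumoxide::Browser

impl Browser {
    async fn kill(&mut self) { /* step 4: reap the launcher */ }
}

struct BrowserInstance {
    browser: RwLock<Option<Browser>>,
}

impl BrowserInstance {
    async fn close(&self) {
        let mut guard = self.browser.write().await;
        // Step 2: take the Browser out; a second close() finds None and is a no-op.
        let Some(mut browser) = guard.take() else { return };
        // Step 3 (kill_descendant_tree) would run here, before reaping the launcher.
        browser.kill().await; // step 4
        // Steps 5-6 (remove_dir_all, verify_dir_stays_dead) would follow.
    }

    async fn is_closed(&self) -> bool {
        // Page ops would hold this same read lock and run concurrently.
        self.browser.read().await.is_none()
    }
}

#[tokio::main]
async fn main() {
    let inst = BrowserInstance { browser: RwLock::new(Some(Browser)) };
    inst.close().await;
    inst.close().await; // idempotent
    assert!(inst.is_closed().await);
}
```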

kill_descendant_tree is #[cfg(target_os)]-split: Linux walks /proc/<pid>/status PPid (no extra deps, no subprocess); macOS uses pgrep -P (ships with the OS, only matters for dev tests). Production target is Linux.
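
A minimal sketch of what the Linux branch amounts to: snapshot /proc, map each PPid to its children, collect the launcher's descendants breadth-first, SIGKILL each, and retry until the snapshot comes back empty. Function names and retry bounds here are assumptions, not the exact helper in this PR:

```rust
use std::collections::HashMap;
use std::fs;

/// Snapshot every live descendant of `root` by parsing the PPid line of
/// /proc/<pid>/status for each running process.
fn list_descendants(root: u32) -> Vec<u32> {
    let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
    for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
        let Ok(pid) = entry.file_name().to_string_lossy().parse::<u32>() else { continue };
        let Ok(status) = fs::read_to_string(format!("/proc/{pid}/status")) else { continue };
        if let Some(ppid) = status
            .lines()
            .find_map(|line| line.strip_prefix("PPid:"))
            .and_then(|v| v.trim().parse::<u32>().ok())
        {
            children.entry(ppid).or_default().push(pid);
        }
    }
    // Breadth-first walk from the launcher PID.
    let (mut out, mut queue) = (Vec::new(), vec![root]);
    while let Some(pid) = queue.pop() {
        if let Some(kids) = children.get(&pid) {
            for &kid in kids {
                out.push(kid);
                queue.push(kid);
            }
        }
    }
    out
}

/// SIGKILL the tree below `root`, retrying so a helper forked between the
/// snapshot and the kill is caught on the next pass.
fn kill_descendant_tree(root: u32) {
    for _ in 0..10 {
        let descendants = list_descendants(root);
        if descendants.is_empty() {
            return;
        }
        for pid in descendants {
            unsafe { libc::kill(pid as libc::pid_t, libc::SIGKILL) };
        }
        std::thread::sleep(std::time::Duration::from_millis(20));
    }
}
```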

BrowserPool::Drop becomes best-effort fire-and-forget — spawns a cleanup task on the current runtime if there is one. Canonical shutdown is pool.destroy_all().await.
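
The shape of that Drop, sketched with an illustrative user_data_dirs field standing in for the pool's real state (the actual pool holds BrowserInstances, not a plain list of paths):

```rust
use std::path::PathBuf;
use tokio::runtime::Handle;

struct BrowserPool {
    user_data_dirs: Vec<PathBuf>, // illustrative stand-in for the real browsers map
}

impl Drop for BrowserPool {
    fn drop(&mut self) {
        // Move out whatever the cleanup task needs; the task cannot borrow self.
        let dirs = std::mem::take(&mut self.user_data_dirs);
        if let Ok(handle) = Handle::try_current() {
            // Best-effort: the task may be cut short if the runtime shuts down first.
            handle.spawn(async move {
                for dir in dirs {
                    let _ = tokio::fs::remove_dir_all(&dir).await;
                }
            });
        }
        // No runtime in scope: nothing async we can do; callers should prefer
        // pool.destroy_all().await for deterministic shutdown.
    }
}
```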

libc = "0.2" added as a Unix-only dep for kill(2). No other dep additions.

How to verify

cargo run -p hero_browser_core --example test_lifecycle from the workspace root.

The example spawns real Chrome (mac path hard-coded; Linux auto-detects) and asserts on both sides of the bug:

  • Cause-side (process count via pgrep -f {uuid}): pre-flight asserts each browser has ≥ 2 Chrome processes (proves we're actually exercising the descendant-kill path), then asserts 0 processes after cleanup, holding through 1 s of polls.
  • Symptom-side (filesystem): user_data_dir is gone after cleanup and stays gone for 1 s of polls (the original bug-mode where helpers recreated it ~10 ms later); the poll loop is sketched below.
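
The symptom-side assertion reduces to a bounded poll; a minimal sketch, with the helper name, window, and interval as assumptions rather than the example's exact values:

```rust
use std::path::Path;
use std::time::Duration;

/// Fail if `dir` reappears at any point during `window` (the original bug-mode
/// was a reappearance ~10 ms after deletion).
async fn assert_dir_stays_gone(dir: &Path, window: Duration) {
    let deadline = tokio::time::Instant::now() + window;
    while tokio::time::Instant::now() < deadline {
        assert!(!dir.exists(), "user_data_dir reappeared: {}", dir.display());
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
}
```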

Five invariants, all covered:

  1. pool.destroy_all().await kills the survivor's whole process tree (in my macOS run: 8 Chrome procs → 0) and removes its dir; both stay clean for 1 s.
  2. get_browser bumps last_activity (active browser survives 6 s of pings with idle_timeout=3s).
  3. The idle-timeout reaper destroys idle browsers within the [3 s, 5 s] window.
  4. Startup sweep removes pre-existing chromelambda-* dirs.
  5. After reap of idle browser: 0 Chrome processes belonging to it, dir stays gone for 1 s.

Sample output ends with === all five lifecycle tests PASSED ===.

Test honesty

This test is a regression check, not exhaustive proof. Known coverage gaps:

  • Runs only on macOS (the pgrep -P branch of list_descendants); the Linux /proc walk has not been run end-to-end. Production target is Linux.
  • Single-spawn, ~30 s runtime — not a stress test. The 15-day kristof5 leak accumulated under real RPC traffic.
  • The verify_dir_stays_dead retry loop in close() is bounded (~200 ms) and the post-cleanup poll is bounded (1 s). A pathologically slow helper could in principle outlive both.

Not in this PR

  • LRU eviction at capacity — kept separate from the leak fix for reviewability.
  • Upstream chromiumoxide PR adding BrowserConfig::process_group(...) — would let us delete the tree-walk in favor of killpg. Filing as a follow-up.
fix(BrowserPool): add idle-timeout reaper, Drop-based cleanup, startup sweep, and fix Chrome helper leak
Closes lhumina_code/hero_browser#18.

Without these, hero_browser_server accumulates Chrome processes, RAM,
and /tmp/chromelambda-* directories indefinitely — every client that
crashes or forgets to call browser_destroy leaks one session
permanently. On kristof5, 15 days of accumulation reached 128 Chrome
processes / ~17 GB RSS.

The fix is layered:

1. Idle-timeout reaper. Per-browser last_activity: Mutex<Instant> on
   BrowserInstance, bumped automatically by BrowserPool::get_browser
   (so any RPC handler that touches a browser counts as activity).
   PoolConfig gains idle_timeout: Option<Duration> (default 30 min)
   and reap_check_interval: Duration (default 60s). When a tokio
   runtime is available, BrowserPool::new spawns a reaper task that
   destroys idle browsers under the write lock with a TOCTOU re-check.
   The task holds only a Weak<...> to the browsers map and is also
   aborted explicitly on BrowserPool::Drop.

2. Startup sweep. BrowserPool::new calls
   sweep_stale_chromelambda_dirs() to remove pre-existing
   /tmp/chromelambda-* dirs. Safe because by design only one
   hero_browser_server runs per uid (uid-fixed Unix socket + TCP
   port singletons in hero_browser_server::main), so any leftover
   dir owned by us is a crash-leftover. PermissionDenied on other
   users' dirs is silently skipped.

3. Tree-killing the Chrome process group on close().

   Diagnosis (verified with traces against the original code, see
   commit history): Chrome forks helper subprocesses (renderer, GPU,
   utility) under the launcher. chromiumoxide's Browser::kill resolves
   to tokio's Child::kill, which only sends SIGKILL to the launcher
   PID — see rust-lang/rust#115241 — and chromiumoxide does not
   configure a separate process group on the spawn Command. The
   helpers reparent to init when the launcher dies and continue
   writing to --user-data-dir, recreating the directory ~10 ms after
   the original close()/Drop completed. The trace showed dir reappear
   with Default/Cache/Cache_Data/index-dir/the-real-index and persist
   indefinitely.

   The engineered fix in BrowserInstance:
     - Wrap chromiumoxide::Browser in tokio::sync::RwLock<Option<_>>
       so close() can take it out under a write lock; page ops
       (new_page) take a read lock and run concurrently.
     - Capture the launcher PID at construction (browser.get_mut_child
       requires &mut Browser, only available before we wrap it).
     - close() runs in this exact order:
         1. Close pages.
         2. Take the Browser out under the write lock.
         3. SIGKILL every descendant of the launcher via
            kill_descendant_tree (Linux: walk /proc; macOS: pgrep -P;
            bounded retry until tree empty).
         4. browser.kill().await — reaps the launcher.
         5. remove_dir_all on user_data_dir.
         6. verify_dir_stays_dead: poll for ~200 ms; if a helper
            survived step 3 and recreated the dir, repeat.

   This is the standard Unix idiom for killing a process tree when
   process_group(0) wasn't set at spawn. Process-group-at-spawn
   (Fix A in the design discussion) would be marginally cleaner but
   requires duplicating chromiumoxide's private launch-arg logic
   (DEFAULT_ARGS, --remote-debugging-port handling, sandbox/headless
   flags) since BrowserConfig::launch is not customizable. Filing a
   follow-up to upstream chromiumoxide adding process_group support
   would let us migrate cleanly.

4. BrowserPool::Drop becomes best-effort fire-and-forget. The
   canonical async shutdown is the existing destroy_all().await; if
   the pool is dropped without that, Drop spawns a cleanup task on
   the current runtime so helpers don't leak past process exit.
   BrowserInstance::Drop is now a sync remove_dir_all only; the full
   tree-kill needs an await and lives in close().

5. libc added as a unix-only dependency for kill(2); no other deps.

6. New examples/test_lifecycle.rs spawns real Chrome (macOS dev path
   hard-codes the standard install) and asserts five invariants:
   startup sweep removes pre-existing dirs; get_browser bumps
   last_activity (active browser survives); idle timeout reaps;
   reaped dir stays gone for 1 s (the original bug repro); and
   destroy_all cleans up the survivor with the same 1 s persistence
   check.

Not in this PR:
- LRU eviction at capacity (kept separate from the leak fix).
- Upstream chromiumoxide PR for BrowserConfig::process_group(...) to
  let us drop the tree-walk in favor of killpg.