fix(BrowserPool): use chromiumoxide's documented graceful shutdown (supersedes #19) #20

Merged
sameh-farouk merged 3 commits from fix/browser-pool-graceful-close into development 2026-05-06 12:51:01 +00:00

What

Closes #18. Supersedes #19 (which can stay open as a safety net until this lands).

PR #19 fixed the leak with a tree-walk SIGKILL approach: enumerate Chrome's descendants via /proc (Linux) or pgrep -P (macOS), SIGKILL each, retry until empty, then poll the dir for ~200 ms in case a helper recreated it. It worked, but it was treating a symptom — the leak only existed in the first place because we were using chromiumoxide wrong.

The actual diagnosis

chromiumoxide's README example shows the documented shutdown sequence:

drop(page);
browser.close().await?;
handle.abort();

Our BrowserInstance::close() was doing none of these. We just dropped the Browser, which triggered chromiumoxide's kill_on_drop SIGKILL on the launcher PID only. SIGKILL bypasses Chrome's own shutdown logic — renderer/GPU/utility helpers reparent to init and continue writing to --user-data-dir, recreating the directory we'd just deleted (verified by tracing: dir reappeared at +10 ms and persisted indefinitely).

The fix isn't a stronger SIGKILL. It's not SIGKILLing in the first place. Browser::close().await sends the CDP Browser.close command; Chrome runs its own shutdown, kills its own helpers, flushes pending writes, and exits the launcher cleanly. Browser::wait().await reaps the launcher. Then the dir is quiescent and safe to remove.
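
For orientation, here is a minimal sketch of what close() looks like under this approach. The struct shape, field names, timeout values, and error handling are illustrative assumptions, not the exact code; only the chromiumoxide calls (close, wait, kill) and their ordering mirror what the PR does.

    use std::{path::PathBuf, time::Duration};
    use chromiumoxide::Browser;
    use tokio::{sync::RwLock, task::JoinHandle, time::timeout};

    // Illustrative shape only; the real BrowserInstance carries more fields.
    struct BrowserInstance {
        browser: RwLock<Option<Browser>>, // Option so close() can take the Browser out
        handler_task: JoinHandle<()>,     // drives chromiumoxide's CDP Handler
        user_data_dir: PathBuf,
    }

    impl BrowserInstance {
        async fn close(&self) -> anyhow::Result<()> {
            // Bounded write-lock acquisition: a wedged new_page holding a read lock
            // can't block the reaper or shutdown indefinitely.
            let mut guard = timeout(Duration::from_secs(2), self.browser.write()).await?;
            if let Some(mut browser) = guard.take() {
                // Graceful path: CDP Browser.close, then reap the launcher.
                let graceful = timeout(Duration::from_secs(10), async {
                    let closed = browser.close().await.is_ok();
                    if closed {
                        let _ = browser.wait().await; // reap the launcher process
                    }
                    closed
                })
                .await;
                if !matches!(graceful, Ok(true)) {
                    // Fallback only if the graceful path failed or timed out:
                    // SIGKILL the launcher and abort the now-orphaned handler task.
                    let _ = browser.kill().await;
                    self.handler_task.abort();
                }
            }
            // Chrome has torn down its own helper tree; the dir is quiescent now.
            std::fs::remove_dir_all(&self.user_data_dir)?;
            Ok(())
        }
    }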

What changed vs PR #19

  • Bumped chromiumoxide 0.7 → 0.9.1 (latest, Feb 2026). The tokio-runtime feature was removed in 0.8 (tokio is now the only runtime); dropped from Cargo.toml.
  • BrowserInstance::close() now does the documented sequence with bounded timeouts (lock acquisition, graceful close, launcher reap), falling back to kill().await only if the graceful path fails or times out.
  • Wrapped Browser in tokio::sync::RwLock<Option<Browser>> because Browser::close and Browser::wait take &mut self. Page ops (new_page, &self) take read locks concurrently; only close() takes the write lock briefly.
  • Deleted entirely: kill_descendant_tree, verify_dir_stays_dead, list_descendants, the per-platform cfg(target_os) splits, the chrome_pid field, the libc Unix-only dependency.
  • Net diff vs PR #19's tip: −144 lines (and one less dep).

Kept from PR #19: RwLock<Option<Browser>> ownership wrapper, idle-timeout reaper, last_activity tracking, startup sweep of stale chromelambda-* dirs, BrowserPool::Drop fire-and-forget cleanup.
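
The kept idle-timeout reaper works roughly as sketched below. Type and field names are assumptions for illustration; the real reaper destroys idle browsers via close() under the write lock rather than just dropping them.

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex, Weak};
    use std::time::{Duration, Instant};
    use tokio::sync::RwLock;

    struct BrowserInstance {
        last_activity: Mutex<Instant>, // bumped by BrowserPool::get_browser
        // ... browser handle, user_data_dir, etc.
    }

    type BrowserMap = RwLock<HashMap<String, Arc<BrowserInstance>>>;

    fn spawn_reaper(
        browsers: &Arc<BrowserMap>,
        idle_timeout: Duration,
        check_interval: Duration,
    ) -> tokio::task::JoinHandle<()> {
        // Hold only a Weak so the reaper can never keep the pool's map alive.
        let weak: Weak<BrowserMap> = Arc::downgrade(browsers);
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(check_interval);
            loop {
                tick.tick().await;
                let Some(map) = weak.upgrade() else { break }; // pool dropped: exit
                let mut guard = map.write().await;
                // Re-check idleness under the write lock (TOCTOU guard) before removal;
                // the real reaper calls close() on each victim, not just a drop.
                guard.retain(|_, b| b.last_activity.lock().unwrap().elapsed() < idle_timeout);
            }
        })
    }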

Regressions surfaced by independent review (commit 2)

After commit 1 (the actual fix), an independent regression-focused code review caught seven issues that the macOS-only lifecycle test would never have surfaced. Commit 2 addresses all of them:

Critical

  1. chromiumoxide 0.9 silently mangles every .arg("--foo") call. The new Arg struct's From<&str> treats the entire string as a flag key and prepends -- itself, so .arg("--no-sandbox") emits ----no-sandbox and Chrome ignores it. On Linux production where Chrome's setuid sandbox isn't always available, every BrowserInstance::new would fail at launch — a complete service outage on every Hero OS dev box. The macOS-only lifecycle test couldn't catch this because macOS Chrome doesn't need --no-sandbox.
    Fixed by using the .no_sandbox() builder shortcut, dropping leading --, using the (&str, &str) tuple form for --key=value flags, and removing args that chromiumoxide's DEFAULT_ARGS already includes.

  2. hero_browser_server's SIGTERM/SIGINT path returned without ever calling pool.destroy_all().await. BrowserPool::Drop only does fire-and-forget tokio::spawn, but during runtime teardown the spawned task is aborted before it runs — leaving Chrome helpers reparented to init and burning RAM/CPU on the host until the next reboot. The PR #19 tree-walk approach masked this because its SIGKILL was synchronous and runtime-independent. Fixed by adding explicit pool.destroy_all().await before run_server returns, with a 30 s timeout cap.
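
To make these two concrete, here are hedged sketches; flag names, paths, and the surrounding wiring are illustrative, not the project's exact code. For item 1: plain keys go in without the leading --, key=value flags use the tuple form, and the sandbox flags come from the builder shortcut.

    use chromiumoxide::BrowserConfig;

    // Sketch only; the actual flag set lives in BrowserInstance::new.
    let config = BrowserConfig::builder()
        .no_sandbox()                   // emits the sandbox flags correctly
        .arg("disable-gpu")             // plain key: 0.9's Arg prepends the -- itself
        .arg(("lang", "en-US"))         // hypothetical --key=value flag, tuple form
        .user_data_dir("/tmp/chromelambda-example") // hypothetical path
        .build()
        .expect("valid browser config");

For item 2, the explicit shutdown added before run_server returns (signal wiring simplified; the real handler also listens for SIGTERM):

    use std::time::Duration;

    // ... inside run_server, after the shutdown signal resolves:
    tokio::signal::ctrl_c().await?;
    // Give the pool a bounded chance to close every Chrome gracefully before the
    // runtime is torn down; relying on BrowserPool::Drop's fire-and-forget spawn
    // would lose the race against runtime shutdown.
    if tokio::time::timeout(Duration::from_secs(30), pool.destroy_all())
        .await
        .is_err()
    {
        eprintln!("destroy_all timed out; stale Chrome trees will be removed by the next startup sweep");
    }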

Important

  3. close() now bounds the write-lock acquisition (2 s) and browser.wait().await (3 s) so a wedged new_page can't block the reaper or shutdown indefinitely.

  4. handler_task is now explicitly aborted in close()'s kill-fallback paths and in Drop. chromiumoxide's Handler only resolves cleanly after a successful CDP Browser.close roundtrip; without the abort, SIGKILL'd browsers leak the handler task on the runtime forever. Renamed _handler_task → handler_task.

  5. BrowserInstance::Drop now drops the chromiumoxide Browser before remove_dir_all, so kill_on_drop fires before the dir-removal race window. Prior order was the same helper-recreates-dir bug we're fixing, just on the Drop fallback path.

  6. BrowserPool::destroy_all is now concurrent (futures::future::join_all) with errors logged, not propagated. Prior serial for loop with ? short-circuit could have stalled shutdown for ~8 minutes worst case (50 browsers × 10 s timeout) and silently leaked half the pool if any browser errored. (See the sketch after this list.)

  7. (covered by 6)
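
A sketch of the concurrent destroy_all from item 6, assuming the pool keeps its instances in a tokio::sync::RwLock<HashMap<String, Arc<BrowserInstance>>> (names illustrative):

    pub async fn destroy_all(&self) {
        // Drain the map atomically so no new work can reach these browsers.
        let drained: Vec<_> = self.browsers.write().await.drain().collect();
        // Close everything concurrently: overall latency is bounded by the slowest
        // single browser, not the sum across the pool.
        let results = futures::future::join_all(
            drained.into_iter().map(|(id, b)| async move { (id, b.close().await) }),
        )
        .await;
        // Errors are logged, never propagated; one bad browser can't leak the rest.
        for (id, res) in results {
            if let Err(e) = res {
                eprintln!("failed to close browser {id}: {e}");
            }
        }
    }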

How to verify

cargo run -p hero_browser_core --example test_lifecycle from the workspace root.

The example spawns real Chrome (mac path hard-coded; Linux auto-detects) and asserts on both sides of the bug:

  • Cause-side (process count via pgrep -f {uuid}): pre-flight asserts each browser has ≥ 2 Chrome processes (proves we're actually exercising the helper-tree shutdown), then asserts 0 processes after cleanup, holding through 1 s of polls.
  • Symptom-side (filesystem): user_data_dir is gone after cleanup and stays gone for 1 s of polls (the original bug-mode where helpers recreated it ~10 ms later).
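
The cause-side check boils down to something like the following hypothetical helper (not necessarily the example's exact code); the uuid shows up in each Chrome process's command line, presumably via the --user-data-dir path:

    // Count Chrome processes whose command line contains this browser's uuid.
    fn chrome_proc_count(uuid: &str) -> usize {
        std::process::Command::new("pgrep")
            .args(["-f", uuid])
            .output()
            .map(|o| String::from_utf8_lossy(&o.stdout).lines().count())
            .unwrap_or(0)
    }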

Five invariants, all covered:

  1. pool.destroy_all().await cleans up the survivor. In my macOS run: 7 → 0 Chrome procs, dir gone, both stay clean for 1 s.
  2. get_browser bumps last_activity (active browser survives 6 s of pings with idle_timeout=3s).
  3. The idle-timeout reaper destroys idle browsers within the [3 s, 5 s] window.
  4. Startup sweep removes pre-existing chromelambda-* dirs.
  5. After reap of idle browser: 0 Chrome processes belonging to it, dir stays gone for 1 s.
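
Invariant 4 exercises the startup sweep, which amounts to roughly the following (a sketch; the real sweep_stale_chromelambda_dirs lives in BrowserPool::new and its exact error handling may differ):

    // Remove leftover /tmp/chromelambda-* dirs from a previous crashed run.
    // Errors (e.g. PermissionDenied on another uid's dir) are silently skipped.
    fn sweep_stale_chromelambda_dirs() {
        let Ok(entries) = std::fs::read_dir(std::env::temp_dir()) else { return };
        for entry in entries.flatten() {
            if entry.file_name().to_string_lossy().starts_with("chromelambda-") {
                let _ = std::fs::remove_dir_all(entry.path());
            }
        }
    }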

Sample output ends with === all five lifecycle tests PASSED ===.

Honest test limitations

This is a regression check, not exhaustive proof:

  • Linux /proc walk path is no longer present, so the platform-split issue from PR #19 is gone. The graceful close path is identical on Linux and macOS — both just send CDP Browser.close and wait. So unlike PR #19, "passes on macOS" is a much stronger signal here.
  • Single-spawn, ~30 s runtime — not a stress test. The 15-day kristof5 leak accumulated under real RPC traffic.
  • The kill-fallback path is rare in practice but still sub-optimal: it only kills the launcher PID, not the helper tree. If you hit it (timeout > 10 s on graceful close), the original leak class re-emerges briefly until the next startup sweep.

Not in this PR

  • LRU eviction at capacity — kept separate from the leak fix for reviewability.
  • A fix for the rare kill-fallback path leaving helpers alive — would require either upstream chromiumoxide cooperation or rewriting our launch path. Filing as a follow-up; the graceful path is the dominant case.
Closes lhumina_code/hero_browser#18.

Without these, hero_browser_server accumulates Chrome processes, RAM,
and /tmp/chromelambda-* directories indefinitely — every client that
crashes or forgets to call browser_destroy leaks one session
permanently. On kristof5, 15 days of accumulation reached 128 Chrome
processes / ~17 GB RSS.

The fix is layered:

1. Idle-timeout reaper. Per-browser last_activity: Mutex<Instant> on
   BrowserInstance, bumped automatically by BrowserPool::get_browser
   (so any RPC handler that touches a browser counts as activity).
   PoolConfig gains idle_timeout: Option<Duration> (default 30 min)
   and reap_check_interval: Duration (default 60s). When a tokio
   runtime is available, BrowserPool::new spawns a reaper task that
   destroys idle browsers under the write lock with a TOCTOU re-check.
   The task holds only a Weak<...> to the browsers map and is also
   aborted explicitly on BrowserPool::Drop.

2. Startup sweep. BrowserPool::new calls
   sweep_stale_chromelambda_dirs() to remove pre-existing
   /tmp/chromelambda-* dirs. Safe because by design only one
   hero_browser_server runs per uid (uid-fixed Unix socket + TCP
   port singletons in hero_browser_server::main), so any leftover
   dir owned by us is a crash-leftover. PermissionDenied on other
   users' dirs is silently skipped.

3. Tree-killing the Chrome process group on close().

   Diagnosis (verified with traces against the original code, see
   commit history): Chrome forks helper subprocesses (renderer, GPU,
   utility) under the launcher. chromiumoxide's Browser::kill resolves
   to tokio's Child::kill, which only sends SIGKILL to the launcher
   PID — see rust-lang/rust#115241 — and chromiumoxide does not
   configure a separate process group on the spawn Command. The
   helpers reparent to init when the launcher dies and continue
   writing to --user-data-dir, recreating the directory ~10 ms after
   the original close()/Drop completed. The trace showed the dir reappear
   with Default/Cache/Cache_Data/index-dir/the-real-index and persist
   indefinitely.

   The engineered fix in BrowserInstance:
     - Wrap chromiumoxide::Browser in tokio::sync::RwLock<Option<_>>
       so close() can take it out under a write lock; page ops
       (new_page) take a read lock and run concurrently.
     - Capture the launcher PID at construction (browser.get_mut_child
       requires &mut Browser, only available before we wrap it).
     - close() runs in this exact order:
         1. Close pages.
         2. Take the Browser out under the write lock.
         3. SIGKILL every descendant of the launcher via
            kill_descendant_tree (Linux: walk /proc; macOS: pgrep -P;
            bounded retry until tree empty).
         4. browser.kill().await — reaps the launcher.
         5. remove_dir_all on user_data_dir.
         6. verify_dir_stays_dead: poll for ~200 ms; if a helper
            survived step 3 and recreated the dir, repeat.

   This is the standard Unix idiom for killing a process tree when
   process_group(0) wasn't set at spawn. Process-group-at-spawn
   (Fix A in the design discussion) would be marginally cleaner but
   requires duplicating chromiumoxide's private launch-arg logic
   (DEFAULT_ARGS, --remote-debugging-port handling, sandbox/headless
   flags) since BrowserConfig::launch is not customizable. Filing a
   follow-up to upstream chromiumoxide adding process_group support
   would let us migrate cleanly.

4. BrowserPool::Drop becomes best-effort fire-and-forget. The
   canonical async shutdown is the existing destroy_all().await; if
   the pool is dropped without that, Drop spawns a cleanup task on
   the current runtime so helpers don't leak past process exit.
   BrowserInstance::Drop is now a sync remove_dir_all only; the full
   tree-kill needs an await and lives in close().

5. libc added as a unix-only dependency for kill(2); no other deps.

6. New examples/test_lifecycle.rs spawns real Chrome (macOS dev path
   hard-codes the standard install) and asserts five invariants:
   startup sweep removes pre-existing dirs; get_browser bumps
   last_activity (active browser survives); idle timeout reaps;
   reaped dir stays gone for 1 s (the original bug repro); and
   destroy_all cleans up the survivor with the same 1 s persistence
   check.

Not in this PR:
- LRU eviction at capacity (kept separate from the leak fix).
- Upstream chromiumoxide PR for BrowserConfig::process_group(...) to
  let us drop the tree-walk in favor of killpg.
Closes lhumina_code/hero_browser#18.

This replaces the SIGKILL-tree-walk approach with the chromiumoxide-correct
sequence shown in the library's README:

    browser.close().await   // CDP Browser.close — Chrome shuts down its own tree
    browser.wait().await    // reap the launcher
    remove_dir_all(...)     // dir is now quiescent

# What was wrong before

The original code (and the prior fix in this PR series) never called
Browser::close().await or Browser::wait().await. We just dropped the Browser,
which triggered chromiumoxide's kill_on_drop SIGKILL on the launcher PID only.
SIGKILL bypasses Chrome's own shutdown sequence — the renderer/GPU/utility
helpers reparent to init and continue writing to --user-data-dir, recreating
the directory we'd just deleted (verified by tracing: dir reappeared at +10ms
and persisted indefinitely).

The bug was using the library wrong, not a library bug. chromiumoxide's
README example explicitly shows `browser.close().await?;` as the shutdown
sequence; we never adopted it because Browser::close requires &mut self and
our BrowserInstance.browser was a bare field. Once that's wrapped in a lock
that allows mutable access (RwLock<Option<Browser>>), the correct sequence
falls into place.

# What changed vs the previous fix attempt

  - Bumped chromiumoxide 0.7 → 0.9.1 (latest, Feb 2026). The tokio-runtime
    feature was removed in 0.8 (tokio is now the only runtime); dropped from
    Cargo.toml.
  - close() now does the documented sequence with a 10-second timeout
    fallback to kill+wait if Browser.close hangs (defense for a truly stuck
    Chrome).
  - Deleted: kill_descendant_tree, verify_dir_stays_dead, list_descendants
    (Linux /proc walk and macOS pgrep -P fallback), the chrome_pid field, the
    libc unix-only dependency, and all the cfg(target_os) splits.
  - Kept: tokio::sync::RwLock<Option<Browser>> wrapper (still needed because
    Browser::close and Browser::wait take &mut self), the idle-timeout
    reaper, last_activity tracking, startup sweep, and BrowserPool::Drop's
    fire-and-forget cleanup.

# Result

Net -144 lines vs the tree-walk approach. No platform-specific code, no
external process spawns (pgrep), no syscall-level retry loops, no magic
numbers. Same five lifecycle tests pass with both cause-side (process count
via pgrep -f {uuid}) and symptom-side (dir existence) checks: 8/7 Chrome
procs pre-cleanup → 0/0 post-cleanup, holding through 1s of polls.
fix: regressions surfaced by review of the graceful-close bump
5d9cb094b1
Independent regression review of the chromiumoxide 0.7→0.9 bump caught
issues that the macOS-only lifecycle test would never have surfaced.

# Critical

1. chromiumoxide 0.9's `arg()` mangles every `--foo` string we passed.
   The new `Arg` struct's `From<&str>` impl treats the entire string as a
   flag *key* and prepends `--` itself, so `.arg("--no-sandbox")` emitted
   `----no-sandbox` (which Chrome silently ignores). On Linux production
   where Chrome's setuid sandbox isn't always available, this would have
   caused every `BrowserInstance::new` to fail at launch — a complete
   service outage on every Hero OS dev box. macOS Chrome doesn't need
   --no-sandbox so the lifecycle test passed regardless.
   Fixed by using the `.no_sandbox()` builder shortcut (which emits both
   --no-sandbox and --disable-setuid-sandbox correctly), passing flag
   keys without the leading `--`, and using the `(&str, &str)` tuple
   form for `--key=value` flags. Also dropped the args that
   chromiumoxide's DEFAULT_ARGS already includes
   (`disable-dev-shm-usage`, `no-first-run`).

2. hero_browser_server's SIGTERM/SIGINT path returned without ever
   calling `pool.destroy_all().await`. `BrowserPool::Drop` only does
   fire-and-forget `tokio::spawn(close_all)`, but during runtime
   teardown the spawned task is aborted before it runs — leaving
   Chrome helpers reparented to init and consuming RAM/CPU on the
   host until the box reboots. The previous tree-walk fix masked
   this because its SIGKILL was synchronous and runtime-independent.
   Now `run_server` calls `pool.destroy_all().await` explicitly, with
   a 30-second timeout cap, before returning.

# Important

3. `close()` now bounds the write-lock acquisition (2 s) so a
   wedged `new_page` holding a read lock against an unresponsive
   Chrome can't block the reaper or shutdown indefinitely. Also
   bounds `browser.wait().await` (3 s).

4. `handler_task` is now explicitly aborted in close()'s kill-fallback
   paths and in `Drop`. chromiumoxide's `Handler` only resolves cleanly
   after a successful `Browser.close` CDP roundtrip; without the abort,
   SIGKILL'd browsers leak the handler task on the runtime forever.
   Renamed the field from `_handler_task` to `handler_task` since it
   is now actively used.

5. `BrowserInstance::Drop` now drops the chromiumoxide Browser
   *before* `remove_dir_all`, so the kill_on_drop SIGKILL fires
   before the dir-removal racing window. Prior order was the same
   helper-recreates-dir bug we're fixing, just on the Drop fallback
   path.

6. `BrowserPool::destroy_all` now drains the map atomically and runs
   per-browser closes via `futures::future::join_all` (concurrent)
   instead of a serial `for` loop with `?` short-circuit. With the
   default `max_browsers: 50` and the 10s per-browser close timeout,
   the prior serial path could stall shutdown for ~8 minutes; concurrent
   closes mean overall latency is bounded by the slowest single browser
   (10 s + slack), not their sum. Errors are logged but no longer
   short-circuit the rest.

All five lifecycle tests still pass. Diff vs the bump-only commit:
+137/-54.
sameh-farouk merged commit 4647269639 into development 2026-05-06 12:51:01 +00:00