fix(BrowserPool): use chromiumoxide's documented graceful shutdown (supersedes #19) #20

Merged
sameh-farouk merged 3 commits from fix/browser-pool-graceful-close into development 2026-05-06 12:51:01 +00:00

What

Closes #18. Supersedes #19 (which can stay open as a safety net until this lands).

PR #19 fixed the leak with a tree-walk SIGKILL approach: enumerate Chrome's descendants via /proc (Linux) or pgrep -P (macOS), SIGKILL each, retry until empty, then poll the dir for ~200 ms in case a helper recreated it. It worked, but it was treating a symptom — the leak only existed in the first place because we were using chromiumoxide wrong.

The actual diagnosis

chromiumoxide's README example shows the documented shutdown sequence:

drop(page);
browser.close().await?;
handle.abort();

Our BrowserInstance::close() was doing none of these. We just dropped the Browser, which triggered chromiumoxide's kill_on_drop SIGKILL on the launcher PID only. SIGKILL bypasses Chrome's own shutdown logic — renderer/GPU/utility helpers reparent to init and continue writing to --user-data-dir, recreating the directory we'd just deleted (verified by tracing: dir reappeared at +10 ms and persisted indefinitely).

The fix isn't a stronger SIGKILL. It's not SIGKILLing in the first place. Browser::close().await sends the CDP Browser.close command; Chrome runs its own shutdown, kills its own helpers, flushes pending writes, and exits the launcher cleanly. Browser::wait().await reaps the launcher. Then the dir is quiescent and safe to remove.
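
For orientation, here is a minimal sketch of what close() looks like under this approach. The struct shape, field names, timeout values, and error handling are illustrative assumptions, not the exact code; only the chromiumoxide calls (close, wait, kill) and their ordering mirror what the PR does.

    use std::{path::PathBuf, time::Duration};
    use chromiumoxide::Browser;
    use tokio::{sync::RwLock, task::JoinHandle, time::timeout};

    // Illustrative shape only; the real BrowserInstance carries more fields.
    struct BrowserInstance {
        browser: RwLock<Option<Browser>>, // Option so close() can take the Browser out
        handler_task: JoinHandle<()>,     // drives chromiumoxide's CDP Handler
        user_data_dir: PathBuf,
    }

    impl BrowserInstance {
        async fn close(&self) -> anyhow::Result<()> {
            // Bounded write-lock acquisition: a wedged new_page holding a read lock
            // can't block the reaper or shutdown indefinitely.
            let mut guard = timeout(Duration::from_secs(2), self.browser.write()).await?;
            if let Some(mut browser) = guard.take() {
                // Graceful path: CDP Browser.close, then reap the launcher.
                let graceful = timeout(Duration::from_secs(10), async {
                    let closed = browser.close().await.is_ok();
                    if closed {
                        let _ = browser.wait().await; // reap the launcher process
                    }
                    closed
                })
                .await;
                if !matches!(graceful, Ok(true)) {
                    // Fallback only if the graceful path failed or timed out:
                    // SIGKILL the launcher and abort the now-orphaned handler task.
                    let _ = browser.kill().await;
                    self.handler_task.abort();
                }
            }
            // Chrome has torn down its own helper tree; the dir is quiescent now.
            std::fs::remove_dir_all(&self.user_data_dir)?;
            Ok(())
        }
    }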

What changed vs PR #19

  • Bumped chromiumoxide 0.7 → 0.9.1 (latest, Feb 2026). The tokio-runtime feature was removed in 0.8 (tokio is now the only runtime); dropped from Cargo.toml.
  • BrowserInstance::close() now does the documented sequence with bounded timeouts (lock acquisition, graceful close, launcher reap), falling back to kill().await only if the graceful path fails or times out.
  • Wrapped Browser in tokio::sync::RwLock<Option<Browser>> because Browser::close and Browser::wait take &mut self. Page ops (new_page, &self) take read locks concurrently; only close() takes the write lock briefly.
  • Deleted entirely: kill_descendant_tree, verify_dir_stays_dead, list_descendants, the per-platform cfg(target_os) splits, the chrome_pid field, the libc Unix-only dependency.
  • Net diff vs PR #19's tip: −144 lines (and one less dep).

Kept from PR #19: RwLock<Option<Browser>> ownership wrapper, idle-timeout reaper, last_activity tracking, startup sweep of stale chromelambda-* dirs, BrowserPool::Drop fire-and-forget cleanup.
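
The kept idle-timeout reaper works roughly as sketched below. Type and field names are assumptions for illustration; the real reaper destroys idle browsers via close() under the write lock rather than just dropping them.

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex, Weak};
    use std::time::{Duration, Instant};
    use tokio::sync::RwLock;

    struct BrowserInstance {
        last_activity: Mutex<Instant>, // bumped by BrowserPool::get_browser
        // ... browser handle, user_data_dir, etc.
    }

    type BrowserMap = RwLock<HashMap<String, Arc<BrowserInstance>>>;

    fn spawn_reaper(
        browsers: &Arc<BrowserMap>,
        idle_timeout: Duration,
        check_interval: Duration,
    ) -> tokio::task::JoinHandle<()> {
        // Hold only a Weak so the reaper can never keep the pool's map alive.
        let weak: Weak<BrowserMap> = Arc::downgrade(browsers);
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(check_interval);
            loop {
                tick.tick().await;
                let Some(map) = weak.upgrade() else { break }; // pool dropped: exit
                let mut guard = map.write().await;
                // Re-check idleness under the write lock (TOCTOU guard) before removal;
                // the real reaper calls close() on each victim, not just a drop.
                guard.retain(|_, b| b.last_activity.lock().unwrap().elapsed() < idle_timeout);
            }
        })
    }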

Regressions surfaced by independent review (commit 2)

After commit 1 (the actual fix), an independent regression-focused code review caught seven issues that the macOS-only lifecycle test would never have surfaced. Commit 2 addresses all of them:

Critical

  1. chromiumoxide 0.9 silently mangles every .arg("--foo") call. The new Arg struct's From<&str> treats the entire string as a flag key and prepends -- itself, so .arg("--no-sandbox") emits ----no-sandbox and Chrome ignores it. On Linux production where Chrome's setuid sandbox isn't always available, every BrowserInstance::new would fail at launch — a complete service outage on every Hero OS dev box. The macOS-only lifecycle test couldn't catch this because macOS Chrome doesn't need --no-sandbox.
    Fixed by using the .no_sandbox() builder shortcut, dropping leading --, using the (&str, &str) tuple form for --key=value flags, and removing args that chromiumoxide's DEFAULT_ARGS already includes.

  2. hero_browser_server's SIGTERM/SIGINT path returned without ever calling pool.destroy_all().await. BrowserPool::Drop only does fire-and-forget tokio::spawn, but during runtime teardown the spawned task is aborted before it runs — leaving Chrome helpers reparented to init and burning RAM/CPU on the host until the next reboot. The PR #19 tree-walk approach masked this because its SIGKILL was synchronous and runtime-independent. Fixed by adding explicit pool.destroy_all().await before run_server returns, with a 30 s timeout cap.
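
To make these two concrete, here are hedged sketches; flag names, paths, and the surrounding wiring are illustrative, not the project's exact code. For item 1: plain keys go in without the leading --, key=value flags use the tuple form, and the sandbox flags come from the builder shortcut.

    use chromiumoxide::BrowserConfig;

    // Sketch only; the actual flag set lives in BrowserInstance::new.
    let config = BrowserConfig::builder()
        .no_sandbox()                   // emits the sandbox flags correctly
        .arg("disable-gpu")             // plain key: 0.9's Arg prepends the -- itself
        .arg(("lang", "en-US"))         // hypothetical --key=value flag, tuple form
        .user_data_dir("/tmp/chromelambda-example") // hypothetical path
        .build()
        .expect("valid browser config");

For item 2, the explicit shutdown added before run_server returns (signal wiring simplified; the real handler also listens for SIGTERM):

    use std::time::Duration;

    // ... inside run_server, after the shutdown signal resolves:
    tokio::signal::ctrl_c().await?;
    // Give the pool a bounded chance to close every Chrome gracefully before the
    // runtime is torn down; relying on BrowserPool::Drop's fire-and-forget spawn
    // would lose the race against runtime shutdown.
    if tokio::time::timeout(Duration::from_secs(30), pool.destroy_all())
        .await
        .is_err()
    {
        eprintln!("destroy_all timed out; stale Chrome trees will be removed by the next startup sweep");
    }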

Important

  3. close() now bounds the write-lock acquisition (2 s) and browser.wait().await (3 s) so a wedged new_page can't block the reaper or shutdown indefinitely.

  4. handler_task is now explicitly aborted in close()'s kill-fallback paths and in Drop. chromiumoxide's Handler only resolves cleanly after a successful CDP Browser.close roundtrip; without the abort, SIGKILL'd browsers leak the handler task on the runtime forever. Renamed _handler_task → handler_task.

  5. BrowserInstance::Drop now drops the chromiumoxide Browser before remove_dir_all, so kill_on_drop fires before the dir-removal race window. Prior order was the same helper-recreates-dir bug we're fixing, just on the Drop fallback path.

  6. BrowserPool::destroy_all is now concurrent (futures::future::join_all) with errors logged, not propagated. Prior serial for loop with ? short-circuit could have stalled shutdown for ~8 minutes worst case (50 browsers × 10 s timeout) and silently leaked half the pool if any browser errored. (See the sketch after this list.)

  7. (covered by 6)
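
A sketch of the concurrent destroy_all from item 6, assuming the pool keeps its instances in a tokio::sync::RwLock<HashMap<String, Arc<BrowserInstance>>> (names illustrative):

    pub async fn destroy_all(&self) {
        // Drain the map atomically so no new work can reach these browsers.
        let drained: Vec<_> = self.browsers.write().await.drain().collect();
        // Close everything concurrently: overall latency is bounded by the slowest
        // single browser, not the sum across the pool.
        let results = futures::future::join_all(
            drained.into_iter().map(|(id, b)| async move { (id, b.close().await) }),
        )
        .await;
        // Errors are logged, never propagated; one bad browser can't leak the rest.
        for (id, res) in results {
            if let Err(e) = res {
                eprintln!("failed to close browser {id}: {e}");
            }
        }
    }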

How to verify

cargo run -p hero_browser_core --example test_lifecycle from the workspace root.

The example spawns real Chrome (mac path hard-coded; Linux auto-detects) and asserts on both sides of the bug:

  • Cause-side (process count via pgrep -f {uuid}): pre-flight asserts each browser has ≥ 2 Chrome processes (proves we're actually exercising the helper-tree shutdown), then asserts 0 processes after cleanup, holding through 1 s of polls.
  • Symptom-side (filesystem): user_data_dir is gone after cleanup and stays gone for 1 s of polls (the original bug-mode where helpers recreated it ~10 ms later).
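
The cause-side check boils down to something like the following hypothetical helper (not necessarily the example's exact code); the uuid shows up in each Chrome process's command line, presumably via the --user-data-dir path:

    // Count Chrome processes whose command line contains this browser's uuid.
    fn chrome_proc_count(uuid: &str) -> usize {
        std::process::Command::new("pgrep")
            .args(["-f", uuid])
            .output()
            .map(|o| String::from_utf8_lossy(&o.stdout).lines().count())
            .unwrap_or(0)
    }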

Five invariants, all covered:

  1. pool.destroy_all().await cleans up the survivor. In my macOS run: 7 → 0 Chrome procs, dir gone, both stay clean for 1 s.
  2. get_browser bumps last_activity (active browser survives 6 s of pings with idle_timeout=3s).
  3. The idle-timeout reaper destroys idle browsers within the [3 s, 5 s] window.
  4. Startup sweep removes pre-existing chromelambda-* dirs.
  5. After reap of idle browser: 0 Chrome processes belonging to it, dir stays gone for 1 s.
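
Invariant 4 exercises the startup sweep, which amounts to roughly the following (a sketch; the real sweep_stale_chromelambda_dirs lives in BrowserPool::new and its exact error handling may differ):

    // Remove leftover /tmp/chromelambda-* dirs from a previous crashed run.
    // Errors (e.g. PermissionDenied on another uid's dir) are silently skipped.
    fn sweep_stale_chromelambda_dirs() {
        let Ok(entries) = std::fs::read_dir(std::env::temp_dir()) else { return };
        for entry in entries.flatten() {
            if entry.file_name().to_string_lossy().starts_with("chromelambda-") {
                let _ = std::fs::remove_dir_all(entry.path());
            }
        }
    }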

Sample output ends with === all five lifecycle tests PASSED ===.

Honest test limitations

This is a regression check, not exhaustive proof:

  • Linux /proc walk path is no longer present, so the platform-split issue from PR #19 is gone. The graceful close path is identical on Linux and macOS — both just send CDP Browser.close and wait. So unlike PR #19, "passes on macOS" is a much stronger signal here.
  • Single-spawn, ~30 s runtime — not a stress test. The 15-day kristof5 leak accumulated under real RPC traffic.
  • The kill-fallback path is rare in practice but still sub-optimal: it only kills the launcher PID, not the helper tree. If you hit it (timeout > 10 s on graceful close), the original leak class re-emerges briefly until the next startup sweep.

Not in this PR

  • LRU eviction at capacity — kept separate from the leak fix for reviewability.
  • A fix for the rare kill-fallback path leaving helpers alive — would require either upstream chromiumoxide cooperation or rewriting our launch path. Filing as a follow-up; the graceful path is the dominant case.
Closes lhumina_code/hero_browser#18.

Without these, hero_browser_server accumulates Chrome processes, RAM,
and /tmp/chromelambda-* directories indefinitely — every client that
crashes or forgets to call browser_destroy leaks one session
permanently. On kristof5, 15 days of accumulation reached 128 Chrome
processes / ~17 GB RSS.

The fix is layered:

1. Idle-timeout reaper. Per-browser last_activity: Mutex<Instant> on
   BrowserInstance, bumped automatically by BrowserPool::get_browser
   (so any RPC handler that touches a browser counts as activity).
   PoolConfig gains idle_timeout: Option<Duration> (default 30 min)
   and reap_check_interval: Duration (default 60s). When a tokio
   runtime is available, BrowserPool::new spawns a reaper task that
   destroys idle browsers under the write lock with a TOCTOU re-check.
   The task holds only a Weak<...> to the browsers map and is also
   aborted explicitly on BrowserPool::Drop.

2. Startup sweep. BrowserPool::new calls
   sweep_stale_chromelambda_dirs() to remove pre-existing
   /tmp/chromelambda-* dirs. Safe because by design only one
   hero_browser_server runs per uid (uid-fixed Unix socket + TCP
   port singletons in hero_browser_server::main), so any leftover
   dir owned by us is a crash-leftover. PermissionDenied on other
   users' dirs is silently skipped.

3. Tree-killing the Chrome process group on close().

   Diagnosis (verified with traces against the original code, see
   commit history): Chrome forks helper subprocesses (renderer, GPU,
   utility) under the launcher. chromiumoxide's Browser::kill resolves
   to tokio's Child::kill, which only sends SIGKILL to the launcher
   PID — see rust-lang/rust#115241 — and chromiumoxide does not
   configure a separate process group on the spawn Command. The
   helpers reparent to init when the launcher dies and continue
   writing to --user-data-dir, recreating the directory ~10 ms after
   the original close()/Drop completed. The trace showed the dir reappear
   with Default/Cache/Cache_Data/index-dir/the-real-index and persist
   indefinitely.

   The engineered fix in BrowserInstance:
     - Wrap chromiumoxide::Browser in tokio::sync::RwLock<Option<_>>
       so close() can take it out under a write lock; page ops
       (new_page) take a read lock and run concurrently.
     - Capture the launcher PID at construction (browser.get_mut_child
       requires &mut Browser, only available before we wrap it).
     - close() runs in this exact order:
         1. Close pages.
         2. Take the Browser out under the write lock.
         3. SIGKILL every descendant of the launcher via
            kill_descendant_tree (Linux: walk /proc; macOS: pgrep -P;
            bounded retry until tree empty).
         4. browser.kill().await — reaps the launcher.
         5. remove_dir_all on user_data_dir.
         6. verify_dir_stays_dead: poll for ~200 ms; if a helper
            survived step 3 and recreated the dir, repeat.

   This is the standard Unix idiom for killing a process tree when
   process_group(0) wasn't set at spawn. Process-group-at-spawn
   (Fix A in the design discussion) would be marginally cleaner but
   requires duplicating chromiumoxide's private launch-arg logic
   (DEFAULT_ARGS, --remote-debugging-port handling, sandbox/headless
   flags) since BrowserConfig::launch is not customizable. Filing a
   follow-up to upstream chromiumoxide adding process_group support
   would let us migrate cleanly.

4. BrowserPool::Drop becomes best-effort fire-and-forget. The
   canonical async shutdown is the existing destroy_all().await; if
   the pool is dropped without that, Drop spawns a cleanup task on
   the current runtime so helpers don't leak past process exit.
   BrowserInstance::Drop is now a sync remove_dir_all only; the full
   tree-kill needs an await and lives in close().

5. libc added as a unix-only dependency for kill(2); no other deps.

6. New examples/test_lifecycle.rs spawns real Chrome (macOS dev path
   hard-codes the standard install) and asserts five invariants:
   startup sweep removes pre-existing dirs; get_browser bumps
   last_activity (active browser survives); idle timeout reaps;
   reaped dir stays gone for 1 s (the original bug repro); and
   destroy_all cleans up the survivor with the same 1 s persistence
   check.

Not in this PR:
- LRU eviction at capacity (kept separate from the leak fix).
- Upstream chromiumoxide PR for BrowserConfig::process_group(...) to
  let us drop the tree-walk in favor of killpg.
Closes lhumina_code/hero_browser#18.

This replaces the SIGKILL-tree-walk approach with the chromiumoxide-correct
sequence shown in the library's README:

    browser.close().await   // CDP Browser.close — Chrome shuts down its own tree
    browser.wait().await    // reap the launcher
    remove_dir_all(...)     // dir is now quiescent

# What was wrong before

The original code (and the prior fix in this PR series) never called
Browser::close().await or Browser::wait().await. We just dropped the Browser,
which triggered chromiumoxide's kill_on_drop SIGKILL on the launcher PID only.
SIGKILL bypasses Chrome's own shutdown sequence — the renderer/GPU/utility
helpers reparent to init and continue writing to --user-data-dir, recreating
the directory we'd just deleted (verified by tracing: dir reappeared at +10ms
and persisted indefinitely).

The bug was using the library wrong, not a library bug. chromiumoxide's
README example explicitly shows `browser.close().await?;` as the shutdown
sequence; we never adopted it because Browser::close requires &mut self and
our BrowserInstance.browser was a bare field. Once that's wrapped in a lock
that allows mutable access (RwLock<Option<Browser>>), the correct sequence
falls into place.

# What changed vs the previous fix attempt

  - Bumped chromiumoxide 0.7 → 0.9.1 (latest, Feb 2026). The tokio-runtime
    feature was removed in 0.8 (tokio is now the only runtime); dropped from
    Cargo.toml.
  - close() now does the documented sequence with a 10-second timeout
    fallback to kill+wait if Browser.close hangs (defense for a truly stuck
    Chrome).
  - Deleted: kill_descendant_tree, verify_dir_stays_dead, list_descendants
    (Linux /proc walk and macOS pgrep -P fallback), the chrome_pid field, the
    libc unix-only dependency, and all the cfg(target_os) splits.
  - Kept: tokio::sync::RwLock<Option<Browser>> wrapper (still needed because
    Browser::close and Browser::wait take &mut self), the idle-timeout
    reaper, last_activity tracking, startup sweep, and BrowserPool::Drop's
    fire-and-forget cleanup.

# Result

Net -144 lines vs the tree-walk approach. No platform-specific code, no
external process spawns (pgrep), no syscall-level retry loops, no magic
numbers. Same five lifecycle tests pass with both cause-side (process count
via pgrep -f {uuid}) and symptom-side (dir existence) checks: 8/7 Chrome
procs pre-cleanup → 0/0 post-cleanup, holding through 1s of polls.
fix: regressions surfaced by review of the graceful-close bump
5d9cb094b1
Independent regression review of the chromiumoxide 0.7→0.9 bump caught
issues that the macOS-only lifecycle test would never have surfaced.

# Critical

1. chromiumoxide 0.9's `arg()` mangles every `--foo` string we passed.
   The new `Arg` struct's `From<&str>` impl treats the entire string as a
   flag *key* and prepends `--` itself, so `.arg("--no-sandbox")` emitted
   `----no-sandbox` (which Chrome silently ignores). On Linux production
   where Chrome's setuid sandbox isn't always available, this would have
   caused every `BrowserInstance::new` to fail at launch — a complete
   service outage on every Hero OS dev box. macOS Chrome doesn't need
   --no-sandbox so the lifecycle test passed regardless.
   Fixed by using the `.no_sandbox()` builder shortcut (which emits both
   --no-sandbox and --disable-setuid-sandbox correctly), passing flag
   keys without the leading `--`, and using the `(&str, &str)` tuple
   form for `--key=value` flags. Also dropped the args that
   chromiumoxide's DEFAULT_ARGS already includes
   (`disable-dev-shm-usage`, `no-first-run`).

2. hero_browser_server's SIGTERM/SIGINT path returned without ever
   calling `pool.destroy_all().await`. `BrowserPool::Drop` only does
   fire-and-forget `tokio::spawn(close_all)`, but during runtime
   teardown the spawned task is aborted before it runs — leaving
   Chrome helpers reparented to init and consuming RAM/CPU on the
   host until the box reboots. The previous tree-walk fix masked
   this because its SIGKILL was synchronous and runtime-independent.
   Now `run_server` calls `pool.destroy_all().await` explicitly, with
   a 30-second timeout cap, before returning.

# Important

3. `close()` now bounds the write-lock acquisition (2 s) so a
   wedged `new_page` holding a read lock against an unresponsive
   Chrome can't block the reaper or shutdown indefinitely. Also
   bounds `browser.wait().await` (3 s).

4. `handler_task` is now explicitly aborted in close()'s kill-fallback
   paths and in `Drop`. chromiumoxide's `Handler` only resolves cleanly
   after a successful `Browser.close` CDP roundtrip; without the abort,
   SIGKILL'd browsers leak the handler task on the runtime forever.
   Renamed the field from `_handler_task` to `handler_task` since it
   is now actively used.

5. `BrowserInstance::Drop` now drops the chromiumoxide Browser
   *before* `remove_dir_all`, so the kill_on_drop SIGKILL fires
   before the dir-removal racing window. Prior order was the same
   helper-recreates-dir bug we're fixing, just on the Drop fallback
   path.

6. `BrowserPool::destroy_all` now drains the map atomically and runs
   per-browser closes via `futures::future::join_all` (concurrent)
   instead of a serial `for` loop with `?` short-circuit. With the
   default `max_browsers: 50` and the 10s per-browser close timeout,
   the prior serial path could stall shutdown for ~8 minutes; concurrent
   closes mean overall latency is bounded by the slowest single browser
   (10 s + slack), not their sum. Errors are logged but no longer
   short-circuit the rest.

All five lifecycle tests still pass. Diff vs the bump-only commit:
+137/-54.
sameh-farouk merged commit 4647269639 into development 2026-05-06 12:51:01 +00:00