fix(BrowserPool): use chromiumoxide's documented graceful shutdown (supersedes #19) #20
No reviewers
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_browser!20
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "fix/browser-pool-graceful-close"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Closes #18. Supersedes #19 (which can stay open as a safety net until this lands).
PR #19 fixed the leak with a tree-walk SIGKILL approach: enumerate Chrome's descendants via
/proc(Linux) orpgrep -P(macOS), SIGKILL each, retry until empty, then poll the dir for ~200 ms in case a helper recreated it. It worked, but it was treating a symptom — the leak only existed in the first place because we were using chromiumoxide wrong.The actual diagnosis
chromiumoxide's README example shows the documented shutdown sequence:
Our
BrowserInstance::close()was doing none of these. We just dropped theBrowser, which triggered chromiumoxide'skill_on_dropSIGKILL on the launcher PID only. SIGKILL bypasses Chrome's own shutdown logic — renderer/GPU/utility helpers reparent to init and continue writing to--user-data-dir, recreating the directory we'd just deleted (verified by tracing: dir reappeared at +10 ms and persisted indefinitely).The fix isn't a stronger SIGKILL. It's not SIGKILLing in the first place.
Browser::close().awaitsends the CDPBrowser.closecommand; Chrome runs its own shutdown, kills its own helpers, flushes pending writes, and exits the launcher cleanly.Browser::wait().awaitreaps the launcher. Then the dir is quiescent and safe to remove.What changed vs PR #19
chromiumoxide0.7 → 0.9.1 (latest, Feb 2026). Thetokio-runtimefeature was removed in 0.8 (tokio is now the only runtime); dropped fromCargo.toml.BrowserInstance::close()now does the documented sequence with bounded timeouts (lock acquisition, graceful close, launcher reap), falling back tokill().awaitonly if the graceful path fails or times out.Browserintokio::sync::RwLock<Option<Browser>>becauseBrowser::closeandBrowser::waittake&mut self. Page ops (new_page,&self) take read locks concurrently; onlyclose()takes the write lock briefly.kill_descendant_tree,verify_dir_stays_dead,list_descendants, the per-platformcfg(target_os)splits, thechrome_pidfield, thelibcUnix-only dependency.Kept from PR #19:
RwLock<Option<Browser>>ownership wrapper, idle-timeout reaper,last_activitytracking, startup sweep of stalechromelambda-*dirs,BrowserPool::Dropfire-and-forget cleanup.Regressions surfaced by independent review (commit 2)
After commit 1 (the actual fix), an independent regression-focused code review caught seven issues that the macOS-only lifecycle test would never have surfaced. Commit 2 addresses all of them:
Critical
chromiumoxide 0.9silently mangles every.arg("--foo")call. The newArgstruct'sFrom<&str>treats the entire string as a flag key and prepends--itself, so.arg("--no-sandbox")emits----no-sandboxand Chrome ignores it. On Linux production where Chrome's setuid sandbox isn't always available, everyBrowserInstance::newwould fail at launch — a complete service outage on every Hero OS dev box. The macOS-only lifecycle test couldn't catch this because macOS Chrome doesn't need--no-sandbox.Fixed by using the
.no_sandbox()builder shortcut, dropping leading--, using the(&str, &str)tuple form for--key=valueflags, and removing args that chromiumoxide'sDEFAULT_ARGSalready includes.hero_browser_server's SIGTERM/SIGINT path returned without ever callingpool.destroy_all().await.BrowserPool::Droponly does fire-and-forgettokio::spawn, but during runtime teardown the spawned task is aborted before it runs — leaving Chrome helpers reparented to init and burning RAM/CPU on the host until the next reboot. The PR #19 tree-walk approach masked this because its SIGKILL was synchronous and runtime-independent. Fixed by adding explicitpool.destroy_all().awaitbeforerun_serverreturns, with a 30 s timeout cap.Important
close()now bounds the write-lock acquisition (2 s) andbrowser.wait().await(3 s) so a wedgednew_pagecan't block the reaper or shutdown indefinitely.handler_taskis now explicitly aborted in close()'s kill-fallback paths and inDrop. chromiumoxide'sHandleronly resolves cleanly after a successful CDPBrowser.closeroundtrip; without the abort, SIGKILL'd browsers leak the handler task on the runtime forever. Renamed_handler_task→handler_task.BrowserInstance::Dropnow drops the chromiumoxideBrowserbeforeremove_dir_all, sokill_on_dropfires before the dir-removal race window. Prior order was the same helper-recreates-dir bug we're fixing, just on the Drop fallback path.BrowserPool::destroy_allis now concurrent (futures::future::join_all) with errors logged, not propagated. Prior serialforloop with?short-circuit could have stalled shutdown for ~8 minutes worst case (50 browsers × 10 s timeout) and silently leaked half the pool if any browser errored.(covered by 6)
How to verify
cargo run -p hero_browser_core --example test_lifecyclefrom the workspace root.The example spawns real Chrome (mac path hard-coded; Linux auto-detects) and asserts on both sides of the bug:
pgrep -f {uuid}): pre-flight asserts each browser has ≥ 2 Chrome processes (proves we're actually exercising the helper-tree shutdown), then asserts 0 processes after cleanup, holding through 1 s of polls.Five invariants, all covered:
pool.destroy_all().awaitcleans up the survivor. In my macOS run: 7 → 0 Chrome procs, dir gone, both stay clean for 1 s.get_browserbumpslast_activity(active browser survives 6 s of pings withidle_timeout=3s).chromelambda-*dirs.Sample output ends with
=== all five lifecycle tests PASSED ===.Honest test limitations
This is a regression check, not exhaustive proof:
/procwalk path is no longer present, so the platform-split issue from PR #19 is gone. The graceful close path is identical on Linux and macOS — both just send CDPBrowser.closeand wait. So unlike PR #19, "passes on macOS" is a much stronger signal here.Not in this PR
Independent regression review of the chromiumoxide 0.7→0.9 bump caught issues that the macOS-only lifecycle test would never have surfaced. # Critical 1. chromiumoxide 0.9's `arg()` mangles every `--foo` string we passed. The new `Arg` struct's `From<&str>` impl treats the entire string as a flag *key* and prepends `--` itself, so `.arg("--no-sandbox")` emitted `----no-sandbox` (which Chrome silently ignores). On Linux production where Chrome's setuid sandbox isn't always available, this would have caused every `BrowserInstance::new` to fail at launch — a complete service outage on every Hero OS dev box. macOS Chrome doesn't need --no-sandbox so the lifecycle test passed regardless. Fixed by using the `.no_sandbox()` builder shortcut (which emits both --no-sandbox and --disable-setuid-sandbox correctly), passing flag keys without the leading `--`, and using the `(&str, &str)` tuple form for `--key=value` flags. Also dropped the args that chromiumoxide's DEFAULT_ARGS already includes (`disable-dev-shm-usage`, `no-first-run`). 2. hero_browser_server's SIGTERM/SIGINT path returned without ever calling `pool.destroy_all().await`. `BrowserPool::Drop` only does fire-and-forget `tokio::spawn(close_all)`, but during runtime teardown the spawned task is aborted before it runs — leaving Chrome helpers reparented to init and consuming RAM/CPU on the host until the box reboots. The previous tree-walk fix masked this because its SIGKILL was synchronous and runtime-independent. Now `run_server` calls `pool.destroy_all().await` explicitly, with a 30-second timeout cap, before returning. # Important 3. `close()` now bounds the write-lock acquisition (2 s) so a wedged `new_page` holding a read lock against an unresponsive Chrome can't block the reaper or shutdown indefinitely. Also bounds `browser.wait().await` (3 s). 4. `handler_task` is now explicitly aborted in close()'s kill-fallback paths and in `Drop`. chromiumoxide's `Handler` only resolves cleanly after a successful `Browser.close` CDP roundtrip; without the abort, SIGKILL'd browsers leak the handler task on the runtime forever. Renamed the field from `_handler_task` to `handler_task` since it is now actively used. 5. `BrowserInstance::Drop` now drops the chromiumoxide Browser *before* `remove_dir_all`, so the kill_on_drop SIGKILL fires before the dir-removal racing window. Prior order was the same helper-recreates-dir bug we're fixing, just on the Drop fallback path. 6. `BrowserPool::destroy_all` now drains the map atomically and runs per-browser closes via `futures::future::join_all` (concurrent) instead of a serial `for` loop with `?` short-circuit. With the default `max_browsers: 50` and the 10s per-browser close timeout, the prior serial path could stall shutdown for ~8 minutes; concurrent closes mean overall latency is bounded by the slowest single browser (10 s + slack), not their sum. Errors are logged but no longer short-circuit the rest. All five lifecycle tests still pass. Diff vs the bump-only commit: +137/-54.