fix(BrowserPool): idle reaper + Drop cleanup + startup sweep + Chrome tree-kill #19
Reference: lhumina_code/hero_browser!19
What
Closes #18.
The original PR (commit db98124) added an idle-timeout reaper, a startup sweep, a `BrowserInstance::Drop`, and per-browser `last_activity` tracking. End-to-end testing with a real Chrome on macOS surfaced a deeper leak: the `chromelambda-{id}` directory reappears ~10 ms after we delete it and then stays alive indefinitely. This PR keeps the original four mechanisms and adds a fifth, a properly engineered Chrome process-tree teardown, which closes the leak.
Diagnosis (verified with traces, not guessed)
`tokio::process::Child::kill().await` (which `chromiumoxide::Browser::kill` resolves to) only signals the launcher PID, not Chrome's helpers (renderer, GPU, utility). See rust-lang/rust#115241. chromiumoxide's spawn `Command` does not call `process_group` or `pre_exec`, so when the launcher dies the helpers reparent to init and keep running long enough to recreate `chromelambda-{id}/Default/Cache/Cache_Data/index-dir/` and re-write `the-real-index`. The trace, run against the original code:
Cleanup succeeded. Then a helper recreated the dir 10 ms later and it persisted indefinitely.
Engineering options considered
- `UserDataDirGuard` + field-drop order + a 100 ms sleep.
- (A) `Command::process_group(0)` + `Browser::connect`, kill via `killpg(pgid, SIGKILL)`; requires replicating the launch setup ourselves (`DEFAULT_ARGS`, `--remote-debugging-port`, sandbox/headless flags) since `BrowserConfig::launch` is not customizable: high maintenance.
- (B) Explicit process-tree kill, since `process_group` wasn't set at spawn; see chromedp / Aymeric Beaumet and the puppeteer ecosystem's `tree-kill` solutions.

Going with B. A is the right architectural fix long-term; I'll file a follow-up suggesting upstream chromiumoxide expose `BrowserConfig::process_group(...)` so we can migrate.

What changed
`BrowserInstance`:

- `browser: tokio::sync::RwLock<Option<Browser>>` instead of a bare `Browser`. `close()` takes the write lock and `take()`s the value out; page ops (`new_page`) hold a read lock concurrently.
- `chrome_pid: Option<u32>`, captured at construction time (`browser.get_mut_child().and_then(|c| c.inner.id())`), so `close()` can walk descendants without `&mut Browser`.

`close()` ordering, exact:

1. `let Some(mut browser) = guard.take() else { return Ok(()); }`: idempotent.
2. `kill_descendant_tree(pid)`: bounded retry, SIGKILLs every descendant.
3. `browser.kill().await`: reaps the launcher.
4. `remove_dir_all(user_data_dir)`.
5. `verify_dir_stays_dead`: poll for ~200 ms; if a helper survived the tree-kill and recreated the dir, repeat.

Other changes:

- `Drop` is now sync `remove_dir_all` only; the full tree-kill needs an `await` and lives in `close()`.
- `kill_descendant_tree` is `#[cfg(target_os)]`-split: Linux walks `/proc/<pid>/status` `PPid` (no extra deps, no subprocess); macOS uses `pgrep -P` (ships with the OS, only matters for dev tests). Production target is Linux.
- `BrowserPool::Drop` becomes best-effort fire-and-forget: it spawns a cleanup task on the current runtime if there is one. Canonical shutdown is `pool.destroy_all().await`.
- `libc = "0.2"` added as a Unix-only dep for `kill(2)`. No other dep additions.

How to verify
`cargo run -p hero_browser_core --example test_lifecycle` from the workspace root. The example spawns real Chrome (mac path hard-coded; Linux auto-detects) and asserts on both sides of the bug:
- Process side (via `pgrep -f {uuid}`): pre-flight asserts each browser has ≥ 2 Chrome processes (proves we're actually exercising the descendant-kill path), then asserts 0 processes after cleanup, holding through 1 s of polls.

Five invariants, all covered:
- `pool.destroy_all().await` kills the survivor's whole process tree (in my macOS run: 8 Chrome procs → 0) and removes its dir; both stay clean for 1 s.
- `get_browser` bumps `last_activity` (the active browser survives 6 s of pings with `idle_timeout=3s`).
- No leftover `chromelambda-*` dirs remain.

Sample output ends with `=== all five lifecycle tests PASSED ===`.

Test honesty
This test is a regression check, not exhaustive proof. Known coverage gaps:
- The macOS run only exercises the `pgrep -P` branch of `list_descendants`; the Linux `/proc` walk has not been run end-to-end. Production target is Linux.
- The `verify_dir_stays_dead` retry loop in `close()` is bounded (~200 ms) and the post-cleanup poll is bounded (1 s). A pathologically slow helper could in principle outlive both.

Not in this PR
Upstream `BrowserConfig::process_group(...)`: would let us delete the tree-walk in favor of `killpg`. Filing as a follow-up.

Pull request closed