BrowserPool: no automatic lifecycle management → unbounded session leak (kristof5: 128 Chrome procs, ~17 GB RSS over 15 days) #18
## Summary

`BrowserPool` has no automatic lifecycle management. A browser session is only ever reclaimed if a client explicitly calls the `browser_destroy` RPC. If the client crashes, drops the connection, or simply forgets to call it, the session leaks permanently — the Chrome process tree, RAM, the `/tmp/chromelambda-<uuid>/user-data` directory, and a slot in the pool's `HashMap`. On long-running deployments this accumulates until the host runs out of memory and stops accepting SSH logins (a reboot is currently the only recovery).

Live evidence and full investigation: lhumina_code/home#205.

## Where in the code

All references are against `f250e1a` (May 1).

- `crates/hero_browser_core/src/browser/pool.rs`: `BrowserPool` itself.
- `crates/hero_browser_core/src/browser/browser.rs`: `BrowserInstance`. There is no `impl Drop for BrowserInstance` anywhere in the workspace (verified with `grep -rn "impl Drop" crates/`). If an `Arc<BrowserInstance>` is dropped without `close()` being called explicitly first, the Chrome process and its tempdir are orphaned.
- `crates/hero_browser_server/src/server.rs:386`: the `browser_destroy` RPC handler is the only path that calls `pool.destroy_browser`.
- There is no `tokio::time::interval`-based reaper anywhere in `hero_browser_core` or `hero_browser_server` (the existing `tokio::spawn` calls are for the per-browser CDP WebSocket handler and for SIGINT/SIGTERM handling — none of them sweep idle sessions).

## Live evidence on kristof5 (sampled during investigation)

- `hero_browser_server` PID 3752859, owner `despiegk`, uptime 15 days 8 hours.
- 128 Chrome processes totalling ~17 GB RSS (the figures in the issue title).
- 10 `/tmp/chromelambda-<uuid>/user-data` directories — i.e. 10 unreclaimed sessions on a single service instance.
- The same code is running on kristof4 with just 1 active session right now; the kristof5 numbers are what 15 days of accumulated leaks look like.

## Why a config-side workaround isn't enough

There is no idle timeout to configure, and `max_browsers` only causes new requests to error rather than reclaiming old sessions. Operators currently have no way to bound resource usage other than restarting the service — which is precisely what we want to avoid.

## Recommended engineered fix

The fix needs to address four orthogonal failure modes, in the order they fail:
1. Idle-session reaper (the core fix — addresses leak in steady state)

- Add a `last_activity: AtomicI64` (or `Mutex<Instant>`) field to `BrowserInstance`, stamped on every successful CDP round-trip from `BrowserInstance` and every page-level operation.
- Extend `PoolConfig` with `idle_timeout: Option<Duration>` (default `Some(Duration::from_secs(30 * 60))` = 30 minutes, opt-out via `None`, documented).
- In `BrowserPool::new`, if `idle_timeout.is_some()`, spawn a single tokio task that wakes on `tokio::time::interval(Duration::from_secs(60))`, snapshots the IDs of `BrowserInstance`s whose `last_activity` is older than `idle_timeout`, and calls `self.destroy_browser` on each (see the sketch after this list). A `tokio::sync::Notify` (or a task abort handle stored on the pool) shuts the reaper down with the pool.
- Log each reap at `info!` with the elapsed-idle duration so operators can see what's happening.
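
A minimal sketch of that reaper, under stated assumptions: the `browsers` map, the `Mutex<Instant>` stamp, the `destroy_browser` signature, and the `tracing::info!` call are illustrative stand-ins, not the real `pool.rs` API.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::{Mutex, Notify};

struct BrowserInstance {
    // Stamped on every successful CDP round-trip and page-level operation.
    last_activity: Mutex<Instant>,
}

struct BrowserPool {
    browsers: Mutex<HashMap<String, Arc<BrowserInstance>>>,
    idle_timeout: Option<Duration>, // PoolConfig::idle_timeout, default 30 min
    shutdown: Arc<Notify>,          // notified when the pool shuts down
}

impl BrowserPool {
    /// Called once from BrowserPool::new with a clone of the pool Arc.
    fn spawn_reaper(self: Arc<Self>) {
        let Some(idle_timeout) = self.idle_timeout else { return };
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(Duration::from_secs(60));
            loop {
                tokio::select! {
                    _ = self.shutdown.notified() => break,
                    _ = tick.tick() => {
                        // Snapshot idle IDs first, then destroy outside the
                        // map lock so destroy_browser can re-lock it.
                        let now = Instant::now();
                        let mut idle = Vec::new();
                        {
                            let browsers = self.browsers.lock().await;
                            for (id, b) in browsers.iter() {
                                let idle_for = now - *b.last_activity.lock().await;
                                if idle_for > idle_timeout {
                                    idle.push((id.clone(), idle_for));
                                }
                            }
                        }
                        for (id, idle_for) in idle {
                            tracing::info!(%id, ?idle_for, "reaping idle browser session");
                            self.destroy_browser(&id).await;
                        }
                    }
                }
            }
        });
    }

    async fn destroy_browser(&self, id: &str) {
        // Placeholder: the real path closes CDP, kills Chrome and removes
        // the user-data dir before removing the map entry.
        self.browsers.lock().await.remove(id);
    }
}
```
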
2. `impl Drop for BrowserInstance` (safety net — addresses leaks from bugs / panics)

- On drop, kill the Chrome process and remove `user_data_dir` (sketched below).
- `close()` stays the preferred async path; Drop is for the case where `close()` was never reached.
- Use `tokio::runtime::Handle::try_current` + `block_in_place` if a runtime is available, or fall back to `std::process::Child::kill` + `std::fs::remove_dir_all`.
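
A possible shape for that safety net, assuming `BrowserInstance` holds the `std::process::Child` and the profile path directly (field names are guesses). The sketch stays with the synchronous fallback; the `Handle::try_current` + `block_in_place` variant from the bullet above could be layered on top for async-aware cleanup.

```rust
use std::path::PathBuf;
use std::process::Child;
use std::sync::Mutex;

struct BrowserInstance {
    // Set to None once close() has already reaped the process.
    child: Mutex<Option<Child>>,
    user_data_dir: PathBuf,
}

impl Drop for BrowserInstance {
    fn drop(&mut self) {
        // close() remains the preferred async path; this only does anything
        // when close() was never reached (panic, forgotten browser_destroy).
        if let Ok(slot) = self.child.get_mut() {
            if let Some(mut child) = slot.take() {
                let _ = child.kill(); // best-effort kill of the Chrome root process
                let _ = child.wait(); // reap it so no zombie is left behind
            }
        }
        // Best-effort removal of /tmp/chromelambda-<uuid>/user-data.
        let _ = std::fs::remove_dir_all(&self.user_data_dir);
    }
}
```
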
3. Startup sweep of stale `/tmp/chromelambda-*` (addresses leftover state from a prior crash)

- In `BrowserPool::new`, scan `std::env::temp_dir()` for `chromelambda-*` entries that don't correspond to any browser the pool is about to create, and `remove_dir_all` them (sketched below). Process-crash recovery; no behavior change in steady state.
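
A sketch of the sweep, assuming it runs from `BrowserPool::new` before any session exists (so every matching entry is leftover state); the `chromelambda-` prefix comes from the paths observed on kristof5, and the free-function shape is illustrative.

```rust
use std::fs;

fn sweep_stale_profiles() {
    let tmp = std::env::temp_dir();
    let Ok(entries) = fs::read_dir(&tmp) else { return };
    for entry in entries.flatten() {
        let is_stale = entry
            .file_name()
            .to_str()
            .map(|name| name.starts_with("chromelambda-"))
            .unwrap_or(false);
        if is_stale {
            // A fresh pool owns no sessions yet, so every existing profile
            // dir is leftover state from a previous process.
            if let Err(err) = fs::remove_dir_all(entry.path()) {
                eprintln!("failed to remove stale profile {:?}: {err}", entry.path());
            }
        }
    }
}
```
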
4. LRU eviction at capacity (replaces the current error)

- When `create_browser` would exceed `max_browsers`, evict the least-recently-used browser instead of returning `Error::Pool("Maximum browser limit reached")` (sketched below). Bounded resource use without ever rejecting clients. An optional `PoolConfig::on_capacity: CapacityPolicy { Reject | EvictLru }` lets callers keep the old behavior.
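
A sketch of the capacity policy as a standalone helper; `CapacityPolicy`, `ensure_capacity`, and the `Session` stand-in are hypothetical names, and real code would call `destroy_browser` on the evicted ID rather than just removing the map entry.

```rust
use std::collections::HashMap;
use std::time::Instant;

#[derive(Clone, Copy)]
enum CapacityPolicy {
    Reject,   // current behavior: Error::Pool("Maximum browser limit reached")
    EvictLru, // proposed: evict the least-recently-used session instead
}

struct Session {
    last_activity: Instant,
}

/// Make room for one more browser, or reject, depending on policy.
fn ensure_capacity(
    sessions: &mut HashMap<String, Session>,
    max_browsers: usize,
    policy: CapacityPolicy,
) -> Result<(), String> {
    if sessions.len() < max_browsers {
        return Ok(());
    }
    match policy {
        CapacityPolicy::Reject => Err("Maximum browser limit reached".to_string()),
        CapacityPolicy::EvictLru => {
            // Find the session with the oldest last_activity and drop it
            // before the new browser is launched.
            if let Some(lru_id) = sessions
                .iter()
                .min_by_key(|(_, s)| s.last_activity)
                .map(|(id, _)| id.clone())
            {
                sessions.remove(&lru_id);
            }
            Ok(())
        }
    }
}
```
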
## What this leaves to the caller

Clients should still call `browser_destroy` when they're done — it's cheaper than waiting for the reaper. The fix is about not depending on them for correctness.

## Cross-reference

- Umbrella: lhumina_code/home#205
- Sibling: hero_skills issue (nu login-shell orphans) — to be filed.