BrowserPool: no automatic lifecycle management → unbounded session leak (kristof5: 128 Chrome procs, ~17 GB RSS over 15 days) #18

Closed
opened 2026-05-04 10:40:48 +00:00 by sameh-farouk · 0 comments

## Summary

BrowserPool has no automatic lifecycle management. A browser session is only ever reclaimed if a client explicitly calls the browser_destroy RPC. If the client crashes, drops the connection, or simply forgets to call it, the session leaks permanently — Chrome process tree, RAM, the /tmp/chromelambda-<uuid>/ user-data directory, and a slot in the pool's HashMap.

On long-running deployments this accumulates until the host runs out of memory and stops accepting SSH logins (reboot is currently the only recovery).

Live evidence and full investigation: **[lhumina_code/home#205](https://forge.ourworld.tf/lhumina_code/home/issues/205)**.

## Where in the code

All references are against commit `f250e1a` (May 1).

`crates/hero_browser_core/src/browser/pool.rs`:

```rust
// PoolConfig (lines 57-65)
pub struct PoolConfig {
    pub max_browsers: usize,           // default: 50
    pub default_options: BrowserOptions,
}
// — no idle_timeout, no eviction policy.

// BrowserPool (lines 107-110)
pub struct BrowserPool {
    config: PoolConfig,
    browsers: Arc<RwLock<HashMap<String, Arc<BrowserInstance>>>>,
}
// — no last-activity tracking, no background reaper task.

// create_browser (line 154) — errors out when at capacity instead of evicting:
if browser_count >= self.config.max_browsers {
    return Err(Error::Pool(format!("Maximum browser limit ({}) reached", ...)));
}

// destroy_browser (line 209) — only called from the RPC handler.
// destroy_all (line 221) — only called from server shutdown paths.
```

`crates/hero_browser_core/src/browser/browser.rs`:

```rust
// line 166 — tempdir created on browser launch
let user_data_dir = std::env::temp_dir().join(format!("chromelambda-{}", browser_id));

// lines 348-360 — close() works correctly when called
pub async fn close(&self) -> Result<()> {
    // closes pages, then:
    if let Some(ref dir) = self.user_data_dir {
        let _ = std::fs::remove_dir_all(dir);
    }
    Ok(())
}
```

But there is **no `impl Drop for BrowserInstance`** anywhere in the workspace (verified with `grep -rn "impl Drop" crates/`). If an `Arc<BrowserInstance>` is dropped without `close()` being called explicitly first, the Chrome process and its tempdir are orphaned.

`crates/hero_browser_server/src/server.rs:386` — the `browser_destroy` RPC handler is the **only** path that calls `pool.destroy_browser`.

There is no `tokio::time::interval`-based reaper anywhere in `hero_browser_core` or `hero_browser_server` (the existing `tokio::spawn` calls are for the per-browser CDP WebSocket handler and for SIGINT/SIGTERM handling — none of them sweep idle sessions).

## Live evidence on kristof5 (sampled during investigation)

- `hero_browser_server` PID 3752859, owner `despiegk`, uptime **15 days 8 hours**.
- **128 Chrome processes**, combined RSS **~16.9 GB** — i.e. essentially all of the user's 18 GB footprint on the box.
- 10 distinct `chromelambda-<uuid>/` user-data directories — i.e. 10 unreclaimed sessions on a single service instance.

The same code is running on kristof4 with just 1 active session right now; the kristof5 numbers are what 15 days of accumulated leaks look like.

## Why a config-side workaround isn't enough

There is no idle timeout to configure, and `max_browsers` only causes new requests to error rather than reclaiming old sessions. Operators currently have no way to bound resource usage other than restarting the service — which is precisely what we want to avoid.

## Recommended engineered fix

The fix needs to address four orthogonal failure modes, in the order they fail:

### 1. Idle-session reaper (the core fix — addresses leak in steady state)

- Add `last_activity: AtomicI64` (or `Mutex<Instant>`) to `BrowserInstance`, stamped on every successful CDP round-trip from `BrowserInstance` and every page-level operation.
- Extend `PoolConfig` with `idle_timeout: Option<Duration>` (default `Some(Duration::from_secs(30 * 60))` = 30 minutes, opt-out via `None`, documented).
- In `BrowserPool::new`, if `idle_timeout.is_some()`, spawn a single tokio task that wakes on `tokio::time::interval(Duration::from_secs(60))`, snapshots the IDs of `BrowserInstance`s whose `last_activity` is older than `idle_timeout`, and calls `self.destroy_browser` on each (sketched below). A `tokio::sync::Notify` (or a task abort handle stored on the pool) shuts the reaper down with the pool.
- Log every reap at `info!` with the elapsed-idle duration so operators can see what's happening.
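
A minimal sketch of the wiring, using stand-in types that mirror `pool.rs`; `touch`, `idle_for`, and `spawn_reaper` are illustrative names, and the real reaper would call `destroy_browser` rather than just dropping the map entry:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::{Mutex, RwLock};

// Stand-in mirroring pool.rs; field and method names are illustrative.
struct BrowserInstance {
    last_activity: Mutex<Instant>, // the Mutex<Instant> option from the list above
}

impl BrowserInstance {
    /// Stamp on every successful CDP round-trip and page-level operation.
    async fn touch(&self) {
        *self.last_activity.lock().await = Instant::now();
    }

    async fn idle_for(&self) -> Duration {
        self.last_activity.lock().await.elapsed()
    }
}

type Browsers = Arc<RwLock<HashMap<String, Arc<BrowserInstance>>>>;

/// Spawned once from BrowserPool::new when idle_timeout is Some; the pool
/// keeps the JoinHandle and aborts it on shutdown.
fn spawn_reaper(browsers: Browsers, idle_timeout: Duration) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_secs(60));
        loop {
            tick.tick().await;
            // Snapshot stale IDs under the read lock, then release it
            // before destroying anything.
            let mut stale = Vec::new();
            for (id, browser) in browsers.read().await.iter() {
                let idle = browser.idle_for().await;
                if idle >= idle_timeout {
                    stale.push((id.clone(), idle));
                }
            }
            for (id, idle) in stale {
                // Real code: self.destroy_browser(&id).await, which kills
                // Chrome and removes the tempdir; here we just drop the entry.
                if browsers.write().await.remove(&id).is_some() {
                    tracing::info!(browser_id = %id, idle_secs = idle.as_secs(), "reaped idle browser");
                }
            }
        }
    })
}
```

The `AtomicI64` variant from the first bullet avoids the lock in `touch`, at the cost of storing a unix timestamp instead of an `Instant`.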

### 2. `impl Drop for BrowserInstance` (safety net — addresses leak from bugs / panics)

- Synchronously kills the underlying Chrome process and best-effort removes `user_data_dir` (see the sketch after this list).
- Has to be sync (`Drop` can't await), so `close()` stays the preferred async path; `Drop` covers the case where `close()` was never reached. Use `tokio::runtime::Handle::try_current` + `block_in_place` if a runtime is available, or fall back to `std::process::Child::kill` + `std::fs::remove_dir_all`.
- This is the single most important defensive change: it converts a "leaks forever" bug into "self-heals on next pool drop or process exit."
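
A sketch of the sync fallback path only, assuming the instance holds a `std::process::Child` behind a hypothetical `chrome` field; the real struct may wrap the process differently:

```rust
use std::path::PathBuf;
use std::process::Child;
use std::sync::Mutex;

// Hypothetical shape; the real BrowserInstance's fields will differ.
struct BrowserInstance {
    chrome: Mutex<Option<Child>>, // taken (set to None) by close() once it has reaped Chrome
    user_data_dir: Option<PathBuf>,
}

impl Drop for BrowserInstance {
    fn drop(&mut self) {
        // Best-effort synchronous cleanup: Drop can't await, so this only
        // does work when the preferred async close() was never reached.
        if let Ok(slot) = self.chrome.get_mut() {
            if let Some(mut child) = slot.take() {
                let _ = child.kill(); // SIGKILL the root Chrome process
                let _ = child.wait(); // reap it so no zombie lingers
            }
        }
        if let Some(ref dir) = self.user_data_dir {
            let _ = std::fs::remove_dir_all(dir); // best-effort, same as close()
        }
    }
}
```

The `Handle::try_current` + `block_in_place` route from the second bullet would let `Drop` delegate to the async `close()` when a runtime is available; this sketch covers only the no-runtime fallback.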

### 3. Startup sweep of stale `/tmp/chromelambda-*` (addresses leftover state from a prior crash)

- On `BrowserPool::new`, scan `std::env::temp_dir()` for `chromelambda-*` entries that don't correspond to any browser the pool is about to create, and `remove_dir_all` them (sketched below). Process-crash recovery, no behavior change in steady state.
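
A sketch under the assumption of one `hero_browser_server` instance per host; with several instances sharing `/tmp`, the sweep would have to skip directories owned by live siblings:

```rust
use std::fs;

/// Called once from BrowserPool::new. A fresh pool owns no browsers yet, so
/// every chromelambda-* entry that already exists is stale by definition.
fn sweep_stale_profiles() {
    let tmp = std::env::temp_dir();
    let Ok(entries) = fs::read_dir(&tmp) else { return };
    for entry in entries.flatten() {
        if entry.file_name().to_string_lossy().starts_with("chromelambda-") {
            // Best-effort: a failure here must not abort pool construction.
            let _ = fs::remove_dir_all(entry.path());
        }
    }
}
```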

### 4. LRU eviction at capacity (replaces the current error)

- When `create_browser` would exceed `max_browsers`, evict the least-recently-used browser instead of returning `Error::Pool("Maximum browser limit reached")`. This bounds resource use without ever rejecting clients (victim selection sketched below). An optional `PoolConfig::on_capacity: CapacityPolicy { Reject | EvictLru }` keeps the old behavior available for callers that want it.
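
Victim selection is the only genuinely new logic; a self-contained sketch, where `CapacityPolicy` and `lru_victim` are proposed names rather than existing API:

```rust
use std::time::Duration;

/// Proposed policy knob; the name and variants come from this issue, not
/// from existing code.
pub enum CapacityPolicy {
    Reject,   // today's behavior: Error::Pool at max_browsers
    EvictLru, // proposed default: reclaim the LRU session and proceed
}

/// Pick the eviction victim from an (id, time-since-last-activity) snapshot
/// taken under the pool's read lock: the most-idle browser is the LRU one.
fn lru_victim(snapshot: &[(String, Duration)]) -> Option<&str> {
    snapshot
        .iter()
        .max_by_key(|(_, idle)| *idle)
        .map(|(id, _)| id.as_str())
}

fn main() {
    let snapshot = vec![
        ("b1".into(), Duration::from_secs(40)),
        ("b2".into(), Duration::from_secs(1700)), // ~28 min idle: the victim
    ];
    assert_eq!(lru_victim(&snapshot), Some("b2"));
}
```

In `create_browser`, `lru_victim` would run over a snapshot taken while holding the read lock, followed by the same `destroy_browser` call the reaper uses.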

### What this leaves to the caller

Clients should still call `browser_destroy` when they're done — it's cheaper than waiting for the reaper. The fix is about not depending on them for correctness.

## Cross-reference

Umbrella: [lhumina_code/home#205](https://forge.ourworld.tf/lhumina_code/home/issues/205)
Sibling: hero_skills issue (nu login-shell orphans) — to be filed.
