story: resolve mem leakage #205

Open
opened 2026-05-03 14:51:35 +00:00 by sameh-farouk · 2 comments
Member

# resolve mem leakage

- make sure the new user flow resolves this by design (see specs as defined)
  - memory limit on each user (e.g. via a systemd user-slice drop-in; a sketch follows this list)
  - no more nushell as the default shell (install script has been adjusted)
  - no more processes running as root for hero_...
- egypt: make sure we have a template user nicely prepared, and can create users from codescalers in the driver
- kristof: validate the user creation flow on the dev boxes (previous step)
- peter: work with each developer on each of the hetzner machines; ask them to commit their work and re-install their user sessions. Only do this once we have tested the flow (kristof has seen it)
- egypt: add a feature to the browser MCP that catches stale, non-active sessions and kills them
- egypt: add a feature to hero_proc that catches zombie processes or left-over nu shells and kills them, same for rust-analyzers, ... basically a cleanup/protect feature
- validate that all issues below are resolved
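A minimal sketch of the per-user memory limit, assuming systemd with cgroup v2; the drop-in mechanism is stock systemd, but the 3G/4G values are illustrative placeholders, not numbers from the spec:

```ini
# Hypothetical drop-in: /etc/systemd/system/user-.slice.d/50-memory.conf
# Applies to every per-user slice (user-<uid>.slice). Values are examples.
[Slice]
MemoryHigh=3G   # soft ceiling: the kernel throttles/reclaims above this
MemoryMax=4G    # hard ceiling: processes in the slice get OOM-killed above this
```

Run `sudo systemctl daemon-reload` afterwards; existing sessions may need a fresh login to pick the limits up.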

## Symptom (as observed on kristof5 / 168.119.77.253)

Over time the box becomes unable to accept new SSH logins until it is rebooted. Two independent leaks accumulate concurrently, each touching a different resource:

- One CPU core per stuck nushell REPL (`State R`, ~100 % CPU forever) — multiplies with every interrupted SSH session.
- ~1.7 GB of RAM per leaked headless-Chrome session held by `hero_browser_server` — multiplies with every browser-RPC client that doesn't tear down.

When the host approaches its sshd resource limits (LoginGraceTime + MaxStartups + accept queue, plus userspace exhaustion), ssh connections start hanging at the prompt or failing outright; reboot is the only consistent recovery path that's been used so far.
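For reference, these are the stock OpenSSH knobs involved (the values shown are the OpenSSH defaults, not a recommended change); raising them only delays the symptom, it does not fix either leak:

```
# /etc/ssh/sshd_config : defaults shown for orientation only
LoginGraceTime 120        # unauthenticated connections are dropped after 2 min
MaxStartups 10:30:100     # start randomly dropping at 10 pending, always at 100
MaxSessions 10            # max sessions multiplexed over one TCP connection
```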

This issue captures both root causes and points at the upstream fixes.


## Issue 1 — orphan nushell REPLs after SSH disconnect

**Repo:** `lhumina_code/hero_skills` — login-shell configuration
**Severity:** keeps consuming a full core per orphan, indefinitely

### What we see right now on kristof5

Two `nu` processes from user despiegk, parent 1 (orphaned), 1–4 KB RSS, both `State R`, each pegging a core. CPU burned per process: ~290 hours.

| PID | argv | Started | session-NNN.scope | Origin |
|---|---|---|---|---|
| 318380 | `/home/despiegk/hero/bin/nu` | Mon Apr 20 10:35:46 | `session-15039.scope` (still `active (abandoned)`) | sshd from 130.117.88.68, session closed cleanly Apr 20 13:06:31 |
| 2189278 | `/home/despiegk/hero/bin/nu -l -i` | Tue Apr 21 13:07:12 | `session-15369.scope` (still `active (abandoned)`) | sshd from 212.163.32.114, ended `Broken pipe` Apr 21 13:24:54 |

Both `loginuid=1001` (despiegk), both confirmed via `loginctl show-session` to be abandoned-but-alive systemd-logind scopes. Neither has a controlling tty; signals from the dead sshd never reached them.
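For triage, the abandoned scopes can be confirmed and reclaimed with stock systemd tooling (session ID 15039 taken from the table above):

```bash
# Confirm the scope outlived its SSH connection
loginctl show-session 15039 -p State -p TTY -p Remote
systemctl status session-15039.scope        # reports "active (abandoned)"

# Reclaim it: kills every process still in the scope, including the spinning nu
sudo loginctl terminate-session 15039

# Or flush all of a user's sessions at once (note: also ends live sessions)
sudo loginctl terminate-user despiegk
```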

### Root cause in hero_skills

`tools/modules/installers/multiuser.nu` deliberately sets the user's login shell directly to nushell:

```nushell
# tools/modules/installers/multiuser.nu:578
let shell = $"($hero_dir)/bin/nu"
...
# :609
^sudo usermod --shell $shell $username
# :611
^sudo useradd --no-create-home --home-dir $homedir --shell $shell $username
```

The doc-comment on `multi_user_add` (line 543) makes this explicit: *"login shell set directly to ~/hero/bin/nu (no namespace wrapper)"*.

This means `sshd` execs nushell directly as the login shell — there is no `tmux`/`zellij`/`mosh` wrapper holding the pty. When the SSH connection dies (clean disconnect, broken pipe, network drop, laptop sleep), the pty is yanked out from under the running `nu` process.

### Why nushell then spins forever

Nushell's REPL doesn't handle "controlling tty/IO source has vanished" cleanly — the read loop returns `EIO` (or short reads), nushell prints an error, and re-enters the read; repeat at 100 % CPU. This is a known upstream bug class:

- nushell/nushell#6455 — exact symptom: infinite "Input/output error" loop, high CPU; Ctrl+C does not break it, only a kill from another terminal works (originally reported for `flatpak enter`; the same pty-vanishes trigger applies to SSH disconnect).
- nushell/nushell#17964 — recent (2025) report of one core pegged when nu is the system shell.
- Adjacent: #9876, #10219, #9497, #5029, #7938.

Notably, hero_skills' own `tools/install.sh` (lines 747–781) explicitly **reverts** the calling user's login shell back to bash if it finds it set to nu — i.e. the project already treats nu-as-login-shell as undesirable for the install user, yet `multi_user_add` does the opposite for every other user. This asymmetry is the bug.

### Suggested fix in hero_skills

1. Don't set `nu` as the user's login shell in `multi_user_add`. Two viable options (a sketch of the first follows this list):
   - Set the login shell to `/bin/bash` and have the user's `~/.bashrc` `exec` into `tmux new-session -A -s hero` (or `zellij attach -c hero`), which then runs `nu`. The multiplexer holds the pty, so SSH death no longer reaches `nu`; tmux/zellij is also the reattach-after-disconnect infrastructure users actually want.
   - Or use a small bash wrapper script as the login shell that forks `nu` as a child. When the parent (the wrapper) gets SIGHUP from sshd, both go down cleanly.
2. Update `multi_user_del` to also run `loginctl terminate-user` / `loginctl kill-user` to flush stale abandoned scopes, so deleting a leaky user doesn't leave orphans behind.
3. Optionally file a nushell/nushell issue with a minimal SSH-disconnect repro; #6455 isn't tagged for that trigger and looks closed without a documented fix.
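A minimal sketch of the first option, assuming tmux is installed and `nu` lives at `~/hero/bin/nu`; the session name `hero` is illustrative:

```bash
# Hypothetical snippet for the managed user's ~/.bashrc (option 1 above).
# Only exec into tmux for interactive SSH logins, never from inside an
# existing tmux pane (avoids recursion) and never for scp/sftp/non-interactive
# remote commands (bash sources ~/.bashrc for those too).
if [[ $- == *i* && -n ${SSH_CONNECTION-} && -z ${TMUX-} ]]; then
    # tmux's server owns the pty that nu reads from; an SSH disconnect only
    # kills the tmux *client*, and the next login reattaches via -A.
    exec tmux new-session -A -s hero "$HOME/hero/bin/nu -l -i"
fi
```

With `-A`, a reconnecting user reattaches to the existing session instead of stacking new ones; `multi_user_add` would then pass `/bin/bash` to `useradd`/`usermod` instead of `$shell`.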

## Issue 2 — `hero_browser_server` never reclaims idle browser sessions

**Repo:** `lhumina_code/hero_browser` — pool / lifecycle
**Severity:** ~1.7 GB RAM per leaked session, accumulates over service uptime

### What we see right now

On kristof5:

- `hero_browser_server` (PID 3752859, owner despiegk) running 15 days 8 hours.
- 128 Chrome processes / **16.9 GB RSS** combined.
- 10 distinct `chromelambda-<uuid>` user-data-dirs, i.e. 10 unreclaimed browser sessions.

On kristof4 (138.201.206.39, for comparison):

- Same binary, owner `salma`, running ~4 days.
- 16 Chrome processes / 1.56 GB RSS, 1 chromelambda dir, plus 10 *leftover* `/tmp/chromelambda-*` directories from previous sessions whose Chrome processes are gone but whose tempdirs were never cleaned.

The pattern is identical on both boxes; the leak rate just differs by usage.

### Root cause in hero_browser

`crates/hero_browser_core/src/browser/pool.rs` (`BrowserPool`) holds sessions in a plain map with no lifecycle policy:

```rust
// pool.rs:107
pub struct BrowserPool {
    browsers: Arc<RwLock<HashMap<String, Arc<BrowserInstance>>>>,
    ...
}
```

The pool exposes:

- `create_browser` (pool.rs:154)
- `destroy_browser` (pool.rs:209) — explicit, RPC-driven only
- `destroy_all` (pool.rs:221)
- `browser_count` / `max_browsers` / `list_browsers` / `get_browser`

There is **no idle timeout, no TTL, no last-activity tracking, no background reaper, no `impl Drop for BrowserInstance`**, and no eviction when `max_browsers` is exceeded (creation just errors). The cleanup code itself works (`BrowserInstance::close()` at `browser.rs:348` calls `remove_dir_all` on the tempdir) — but nothing schedules it.

The only way a session is reclaimed is if a client explicitly calls the `browser_destroy` RPC (`crates/hero_browser_server/src/server.rs:386`). Any client that crashes, drops the connection, or simply forgets to call `browser_destroy` permanently leaks one Chrome session — process tree, RAM, `/tmp/chromelambda-*` dir, and slot in the `HashMap`.

### Suggested fix in hero_browser

1. Track a per-browser last-activity timestamp; spawn a single tokio interval task that destroys browsers idle for longer than `PoolConfig::idle_timeout` (new field, default e.g. 30 min). A sketch follows this list.
2. Add `impl Drop for BrowserInstance` that kills the Chrome process and removes the tempdir, so accidental drops also clean up.
3. On server startup, sweep `/tmp/chromelambda-*` dirs that don't correspond to a live `BrowserInstance` (handles leftovers from a previous run, like the 10 currently sitting on kristof4).
4. When `max_browsers` is hit, evict the least-recently-used session instead of erroring on `create_browser`.
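A minimal sketch of fixes 1 and 2, using the names proposed above; `BrowserInstance` here is a stripped-down stand-in, not hero_browser's actual type, and the kill/cleanup work in `Drop` is stubbed:

```rust
// Sketch only: simplified stand-in types, not hero_browser's real API.
// Cargo: tokio = { version = "1", features = ["full"] }
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

struct BrowserInstance {
    id: String,
    // Refreshed by every RPC that touches this browser (prerequisite for fix 1).
    last_used: RwLock<Instant>,
}

impl BrowserInstance {
    async fn touch(&self) {
        *self.last_used.write().await = Instant::now();
    }
}

// Fix 2: even an accidentally dropped instance reclaims its resources.
impl Drop for BrowserInstance {
    fn drop(&mut self) {
        // Real code: kill the Chrome process tree, remove_dir_all(tempdir).
        eprintln!("reaped browser {}", self.id);
    }
}

struct BrowserPool {
    browsers: Arc<RwLock<HashMap<String, Arc<BrowserInstance>>>>,
    idle_timeout: Duration, // the new PoolConfig::idle_timeout, e.g. 30 min
}

impl BrowserPool {
    // Fix 1: one background task that evicts sessions idle past the timeout.
    fn spawn_reaper(self: Arc<Self>) {
        tokio::spawn(async move {
            let mut tick = tokio::time::interval(Duration::from_secs(60));
            loop {
                tick.tick().await;
                let now = Instant::now();
                let mut map = self.browsers.write().await;
                let mut stale = Vec::new();
                for (id, b) in map.iter() {
                    if now.duration_since(*b.last_used.read().await) > self.idle_timeout {
                        stale.push(id.clone());
                    }
                }
                for id in stale {
                    // Removing the last Arc runs Drop above and frees the slot.
                    map.remove(&id);
                }
            }
        });
    }
}

#[tokio::main]
async fn main() {
    let pool = Arc::new(BrowserPool {
        browsers: Arc::new(RwLock::new(HashMap::new())),
        idle_timeout: Duration::from_secs(30 * 60),
    });
    pool.clone().spawn_reaper();
    // ... serve RPCs; each browser operation calls instance.touch().
}
```

Fixes 3 and 4 slot into the same structure: a startup pass over `/tmp/chromelambda-*` before the reaper starts, and an LRU scan of `last_used` inside `create_browser` when the pool is full.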

## Evidence (build-IDs, commits) — for reproducibility

| Box | Service | PID | Build-ID | Source HEAD |
|---|---|---|---|---|
| kristof5 | `hero_codescalers_server` (running) | 2179825 | `eac56e30…b2f846b` | older than `1e0a829` (Apr 30) — running binary on disk replaced; new build sitting unused at `1e0a829` |
| kristof5 | `hero_browser_server` (running) | 3752859 | `d77f1fbe…79b1f7f` | unknown, ≥9 commits behind `hero_browser` `f250e1a` (May 1); on-disk binary deleted, source tree under despiegk is incomplete |
| kristof4 | `hero_browser_server` (running) | 1970566 | `791d235b…aef78248` | ~4 commits behind `hero_browser` `f250e1a` |
| kristof4 | `hero_codescalers_server` (running == on-disk) | 1307655 | `3b3d76db…e4e885fc` | `c0d4ef6` (Apr 27) — 14 commits behind kristof5's mirror of `origin/development` |

Repo state at investigation time (2026-05-03): kristof5's hero_codescalers and hero_skills checkouts are at origin tip; hero_browser is unbuilt because despiegk's checkout is missing/incomplete. kristof4's /root/hero/code0/hero_codescalers has not been pulled since Apr 27.


## Cross-references

- This is the umbrella issue. Sub-issues to file:
  - `lhumina_code/hero_skills`: don't make `nu` the login shell in `multi_user_add`; use a multiplexer wrapper.
  - `lhumina_code/hero_browser`: add an idle-timeout / TTL / Drop-based reaper to `BrowserPool`.
- Existing hero_skills evidence that nu-as-login-shell is known-bad: `tools/install.sh:747-781` already reverts nu→bash for the install user.
- Upstream bug class for the nushell symptom: nushell/nushell#6455, #17964.
sameh-farouk (Author, Member)

Sub-issue filed: lhumina_code/hero_browser#18 (BrowserPool lifecycle).
sameh-farouk (Author, Member)

Sub-issue filed: lhumina_code/hero_skills#199 (nu login-shell orphans).
despiegk changed title from "kristof5: nushell login-shell + hero_browser session leak — recurring resource starvation, SSH login blocked until reboot" to "story: resolve mem leakage" on 2026-05-05 03:53:03 +00:00
despiegk added this to the ACTIVE project on 2026-05-05 03:59:36 +00:00
despiegk added the due date 2026-05-05 on 2026-05-05 03:59:44 +00:00