story: resolve mem leakage #205
resolve mem leakage
Symptom (as observed on kristof5 / 168.119.77.253)
Over time the box becomes unable to accept new SSH logins until it is rebooted. Two independent leaks accumulate concurrently, each touching a different resource:
- Orphan `nu` REPL processes (State R, ~100 % CPU forever) — multiplies with every interrupted SSH session.
- Leaked Chrome sessions under `hero_browser_server` — multiplies with every browser-RPC client that doesn't tear down.

When the host approaches its sshd resource limits (LoginGraceTime + MaxStartups + accept queue, plus userspace exhaustion), ssh connections start hanging at the prompt or failing outright; reboot is the only consistent recovery path that has been used so far.

This issue captures both root causes and points at the upstream fixes.
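For triage, the orphan REPLs can be spotted with standard tools. A minimal sketch under the assumptions in this report (process named `nu`, re-parented to PID 1, stuck runnable); the helper name is ours:

```shell
# find_orphan_nus: given `ps -eo pid,ppid,user,stat,comm` output, print the
# PIDs of processes that were re-parented to init (PPID 1), are stuck
# runnable (state starts with R), and are named nu.
find_orphan_nus() {
  awk '$2 == 1 && $4 ~ /^R/ && $5 == "nu" { print $1 }'
}

# Live usage (output depends on the host):
#   ps -eo pid,ppid,user,stat,comm | find_orphan_nus
# Each reported PID can then be cross-checked with `loginctl show-session`
# for an abandoned-but-alive scope, and killed from another terminal.
```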
Issue 1 — orphan nushell REPLs after SSH disconnect
Repo: `lhumina_code/hero_skills` — login-shell configuration
Severity: keeps consuming a full core per orphan, indefinitely
What we see right now on kristof5
Two `nu` processes from user `despiegk`, parent 1 (orphaned), 1–4 KB RSS, both State R, each pegging a core. CPU burned per process: ~290 hours.

- `/home/despiegk/hero/bin/nu` — `session-15039.scope` (still active (abandoned)), from 130.117.88.68, session closed cleanly Apr 20 13:06:31
- `/home/despiegk/hero/bin/nu -l -i` — `session-15369.scope` (still active (abandoned)), from 212.163.32.114, ended with Broken pipe Apr 21 13:24:54

Both have loginuid=1001 (despiegk), and both are confirmed via `loginctl show-session` to be abandoned-but-alive systemd-logind scopes. Neither has a controlling tty; signals from the dead sshd never reached them.

Root cause in hero_skills
`tools/modules/installers/multiuser.nu` deliberately sets the user's login shell directly to nushell. The doc-comment on `multi_user_add` (line 543) makes this explicit: "login shell set directly to ~/hero/bin/nu (no namespace wrapper)".

This means sshd execs nushell directly as the login shell — there is no tmux/zellij/mosh wrapper holding the pty. When the SSH connection dies (clean disconnect, broken pipe, network drop, laptop sleep) the pty is yanked out from under the running `nu` process.

Why nushell then spins forever
Nushell's REPL doesn't handle "controlling tty/IO source has vanished" cleanly — the read loop returns EIO (or short reads), nushell prints an error and re-enters the read, repeating at 100 % CPU. This is a known upstream bug class:

- nushell/nushell#6455 — exact symptom: infinite "Input/output error" loop, high CPU, Ctrl+C does not break it, only a kill from another terminal works (originally reported for `flatpak enter`; the same pty-vanishes trigger applies to SSH disconnect).
- nushell/nushell#17964 — recent (2025) report of one pegged core when nu is the system shell.
- Related: #9876, #10219, #9497, #5029, #7938.

Notably, hero_skills' own `tools/install.sh` (lines 747–781) explicitly reverts the calling user's login shell back to bash if it finds it set to nu — i.e. the project already treats nu-as-login-shell as undesirable for the install user, but `multi_user_add` does the opposite for every other user. This asymmetry is the bug.

Suggested fix in hero_skills
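A minimal sketch of the bash-plus-multiplexer approach (assumptions: tmux is installed and nu lives at `~/hero/bin/nu` as in this report; the session name `hero` is illustrative):

```shell
# Illustrative ~/.bashrc fragment: on interactive SSH logins that are not
# already inside tmux, replace bash with a tmux session that runs nu.
# tmux holds the pty, so an SSH disconnect HUPs tmux's client rather than
# nu, and the session is reattachable with the same command on reconnect.
if [ -n "$SSH_CONNECTION" ] && [ -z "$TMUX" ] && command -v tmux >/dev/null 2>&1; then
  exec tmux new-session -A -s hero "$HOME/hero/bin/nu -l -i"
fi
```

`new-session -A` attaches if the `hero` session already exists and creates it otherwise, which is exactly the reattach-after-disconnect behavior users want.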
- Don't set `nu` as the user's login shell in `multi_user_add`. Two viable options:
  - Set the login shell to `/bin/bash` and have the user's `~/.bashrc` exec into `tmux new-session -A -s hero` (or `zellij attach -c hero`), which then runs `nu`. The multiplexer holds the pty, so SSH death no longer reaches `nu`. tmux/zellij is also necessary infrastructure for the reattach-after-disconnect that users actually want.
  - Use a thin wrapper as the login shell that runs `nu` as a child. When the parent (the wrapper) gets SIGHUP from sshd, both go down cleanly.
- Extend `multi_user_del` to also `loginctl kill-user` / `loginctl terminate-user`, flushing stale abandoned scopes so deleting a leaky user doesn't leave orphans behind.
- File a new `nushell/nushell` issue with a minimal SSH-disconnect repro; #6455 isn't tagged for that trigger and looks closed without a documented fix.

Issue 2 — hero_browser_server never reclaims idle browser sessions

Repo: `lhumina_code/hero_browser` — pool / lifecycle
Severity: ~1.7 GB RAM per leaked session, accumulates over service uptime
What we see right now
On kristof5:

- `hero_browser_server` (PID 3752859, owner despiegk) running 15 days 8 hours.
- 10 `chromelambda-<uuid>` user-data-dirs in `/tmp`, i.e. 10 unreclaimed browser sessions.

On kristof4 (138.201.206.39, for comparison):

- `hero_browser_server` owned by `salma`, running ~4 days.
- Leftover `/tmp/chromelambda-*` directories from previous sessions whose Chrome processes are gone but whose tempdirs were never cleaned.

The pattern is identical on both boxes; the leak rate just differs by usage.
Root cause in hero_browser
`crates/hero_browser_core/src/browser/pool.rs` (`BrowserPool`) holds sessions in a plain map with no lifecycle policy. The pool exposes:

- `create_browser` (pool.rs:154)
- `destroy_browser` (pool.rs:209) — explicit, RPC-driven only
- `destroy_all` (pool.rs:221)
- `browser_count` / `max_browsers` / `list_browsers` / `get_browser`

There is no idle timeout, no TTL, no last-activity tracking, no background reaper, no `impl Drop for BrowserInstance`, and no eviction when `max_browsers` is exceeded (creation just errors). The cleanup code itself works (`BrowserInstance::close()` at browser.rs:348 calls `remove_dir_all` on the tempdir) — but nothing schedules it.

The only way a session is reclaimed is if a client explicitly calls the `browser_destroy` RPC (`crates/hero_browser_server/src/server.rs:386`). Any client that crashes, drops the connection, or simply forgets to call `browser_destroy` permanently leaks one Chrome session — process tree, RAM, `/tmp/chromelambda-*` dir, and slot in the `HashMap`.

Suggested fix in hero_browser
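The startup-sweep part of the fix can be prototyped outside Rust. A hedged shell sketch (the `chromelambda-` prefix comes from this report; the helper and its file-based "live set" input are illustrative, not the real `BrowserPool` API):

```shell
# sweep_stale_chromelambda: delete /tmp/chromelambda-* dirs that are not in
# the set of user-data-dirs owned by live BrowserInstances.
# $1 = file listing one live user-data-dir path per line.
sweep_stale_chromelambda() {
  live_list="$1"
  for d in /tmp/chromelambda-*; do
    [ -d "$d" ] || continue                  # glob may match nothing
    if ! grep -qxF "$d" "$live_list"; then
      rm -rf "$d"                            # stale: no live session owns it
    fi
  done
}
```

In the server itself the same logic would run once at startup, before the pool accepts sessions, with the live set taken from the freshly constructed (empty or restored) pool.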
- Add `PoolConfig::idle_timeout` (new field, default e.g. 30 min) plus last-activity tracking and a background reaper that closes sessions exceeding it.
- Add an `impl Drop for BrowserInstance` that kills the Chrome process and removes the tempdir, so accidental drops also clean up.
- On startup, sweep `/tmp/chromelambda-*` dirs that don't correspond to a live `BrowserInstance` (handles leftovers from a previous run, like the 10 currently sitting on kristof4).
- When `max_browsers` is hit, evict the least-recently-used session instead of erroring on `create_browser`.

Evidence (build-IDs, commits) — for reproducibility
| Binary | Build-ID | Source commit | Notes |
|---|---|---|---|
| `hero_codescalers_server` (running) | `eac56e30…b2f846b` | `1e0a829` (Apr 30) | running binary on disk replaced; new build sitting unused at `1e0a829` |
| `hero_browser_server` (running) | `d77f1fbe…79b1f7f` | hero_browser `f250e1a` (May 1) | on-disk binary deleted; source tree under despiegk is incomplete |
| `hero_browser_server` (running) | `791d235b…aef78248` | hero_browser `f250e1a` | |
| `hero_codescalers_server` (running == on-disk) | `3b3d76db…e4e885fc` | `c0d4ef6` (Apr 27) | 14 commits behind kristof5's mirror of origin/development |

Repo state at investigation time (2026-05-03): kristof5's `hero_codescalers` and `hero_skills` checkouts are at origin tip; `hero_browser` is unbuilt because despiegk's checkout is missing/incomplete. kristof4's `/root/hero/code0/hero_codescalers` has not been pulled since Apr 27.

Cross-references
- `lhumina_code/hero_skills` — don't make `nu` the login shell in `multi_user_add`; use a multiplexer wrapper.
- `lhumina_code/hero_browser` — add an idle-timeout / TTL / Drop-based reaper to `BrowserPool`.
- `tools/install.sh:747-781` already reverts nu → bash for the install user.
- nushell/nushell#6455, #17964.
- Sub-issue filed: lhumina_code/hero_browser#18 (BrowserPool lifecycle).
- Sub-issue filed: lhumina_code/hero_skills#199 (nu login-shell orphans).
Linked issue: kristof5: nushell login-shell + hero_browser session leak — recurring resource starvation, SSH login blocked until reboot