multi_user_add: nu set as login shell → SSH disconnect leaves orphan nushell REPLs at 100% CPU #199

Open
opened 2026-05-04 10:56:46 +00:00 by sameh-farouk · 0 comments
Member

Summary

multi_user_add sets each user's login shell directly to ~/hero/bin/nu. sshd therefore attaches a bare nushell REPL to the user's pty, with no multiplexer in between. When the SSH connection dies (clean disconnect, broken pipe, network drop, laptop sleep), the pty is torn down under the running nu, and nushell's REPL hits an upstream bug class: the read loop gets EIO on every read and spins at 100 % CPU indefinitely. The process ends up orphaned to PID 1 in an abandoned systemd-logind scope, where neither Ctrl+C nor sshd's SIGHUP can reach it.

Each interrupted SSH session leaves another core permanently consumed. Live evidence and full investigation: lhumina_code/home#205.

Where in the code

tools/modules/installers/multiuser.nu:

# line 543 (doc-comment, makes the design choice explicit)
#   - login shell set directly to ~/hero/bin/nu (no namespace wrapper)

# line 578
let shell = $"($hero_dir)/bin/nu"

# line 609 (existing user — update path)
^sudo usermod --shell $shell $username

# line 611 (new user — create path)
^sudo useradd --no-create-home --home-dir $homedir --shell $shell $username

The project already treats nu-as-login-shell as undesirable for the install user: tools/install.sh:747-781 explicitly reverts the calling user's shell to bash (Linux) or zsh (macOS) if it finds it set to nu. multi_user_add is inconsistent with that policy.

Why nushell spins forever after SSH disconnect

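sshd execs the login shell directly. With nu as login shell, there is no tmux/zellij/mosh between sshd and nu to hold the pty. When SSH dies, the pty is closed under nu's feet. Nushell's REPL loop doesn't handle that case cleanly: the read returns EIO (or short reads), nu prints an error, retries the read, and repeats at 100 % CPU. Upstream issues confirming the bug class:

  • nushell/nushell#6455 (https://github.com/nushell/nushell/issues/6455): the exact symptom, an infinite "Input/output error" loop with high CPU that Ctrl+C does not break (originally reported for flatpak enter; same pty-vanishes trigger).
  • nushell/nushell#17964 (https://github.com/nushell/nushell/issues/17964): recent (2025) one-core-pegged report when nu is the system shell.
  • Adjacent: nushell/nushell#9876, #10219, #9497, #5029, #7938.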

Ctrl+C cannot break out (no controlling tty), and SIGHUP from the dead sshd never arrives (its sshd process exited before sending it). The only ways out are kill -9 from another session or loginctl kill-session <id>. Two such orphans on kristof5 right now have burned ~290 hours of CPU each.
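For spotting these orphans from another session, a minimal sketch follows. It assumes the stuck processes appear with command name nu and parent PID 1, as described above; the function name and the CPU threshold are illustrative and not part of the project.

```shell
# Hypothetical helper: reads `ps -eo pid=,ppid=,pcpu=,comm=` output on stdin
# and prints the PIDs of processes named "nu" whose parent is PID 1 and whose
# CPU usage is at least the given percentage (default 50).
filter_orphan_nu() {
    awk -v min="${1:-50}" \
        '$2 == 1 && $4 == "nu" && $3 + 0 >= min + 0 { print $1 }'
}
```

A possible invocation is `ps -eo pid=,ppid=,pcpu=,comm= | filter_orphan_nu 90`. The resulting PIDs are worth inspecting by hand before passing them to `sudo kill -9`.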

Note: multi_user_del already calls ^sudo killall -u $username (line 1056), which sends SIGTERM, so the deletion path probably reaps these orphans. The leak is in the steady-state running path — every disconnected SSH session that didn't terminate cleanly leaves an orphan that survives until the user is deleted or the box is rebooted.

Recommended engineered fix

The proper fix is to put a multiplexer between sshd and nu. This both prevents the bug (nu never sees its tty disappear) and gives users session reattach-after-disconnect, which they want anyway.

1. Set the login shell to /bin/bash, not ~/hero/bin/nu

In multi_user_add, drop the nu shell assignment and use the system shell:

# multiuser.nu:578 → /bin/bash
let shell = "/bin/bash"

# multiuser.nu:609, 611 — unchanged otherwise
^sudo usermod --shell $shell $username
^sudo useradd --no-create-home --home-dir $homedir --shell $shell $username
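
Users created before this change keep nu as their login shell until the usermod path runs for them again. A small sketch for auditing that, assuming the shell field in passwd ends in /hero/bin/nu (that is, that hero_dir resolves to ~/hero); the function name is hypothetical:

```shell
# Hypothetical audit helper: reads passwd(5)-format lines on stdin and prints
# the login names whose shell (field 7) ends in /hero/bin/nu.
users_with_nu_shell() {
    awk -F: '$7 ~ /\/hero\/bin\/nu$/ { print $1 }'
}
```

It could be fed with `getent passwd | users_with_nu_shell`; each name it prints is a candidate for `sudo usermod --shell /bin/bash <name>`.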

2. Auto-attach to a multiplexer on interactive SSH

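The template's ~/.bashrc (which multi_user_template_create already provisions) gets an exec-into-tmux block, gated so it only fires for interactive SSH and not nested invocations. sshd starts bash as a login shell, which reads ~/.bash_profile or ~/.profile rather than ~/.bashrc, so this assumes the template's profile sources ~/.bashrc (as the stock Debian/Ubuntu ~/.profile does):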

# ~/.bashrc — append
if [[ -n "$SSH_CONNECTION" && $- == *i* && -z "$TMUX" && -z "$ZELLIJ" ]]; then
    exec tmux new-session -A -s hero "$HOME/hero/bin/nu"
fi

Behavior:

  • Interactive SSH login → bash → exec tmux → tmux runs nu in the hero session, attaching if it already exists.
  • Non-interactive SSH (ssh user@host nu -c …, scp, rsync, ForceCommand-style admin tooling) → $- == *i* is false → block is skipped, command runs normally under bash.
  • Nested shells (tmux split, manual bash from inside nu) → $TMUX is set → block is skipped, no recursion.
  • The multiplexer keeps a live pty across SSH disconnects, so nu never sees its tty go away.
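
The cases above can be checked without a live SSH session by factoring the gate into a function that takes the relevant values as arguments. This is a sketch for exercising the condition only: hero_should_attach is a hypothetical name, and the real ~/.bashrc block reads $SSH_CONNECTION, $-, $TMUX and $ZELLIJ directly.

```shell
# Hypothetical test double for the ~/.bashrc gate.
# Args: SSH_CONNECTION value, shell flags ($-), TMUX value, ZELLIJ value.
# Returns 0 when the wrapper would exec tmux, 1 otherwise.
hero_should_attach() {
    [ -n "$1" ] || return 1                  # not reached over SSH
    case $2 in *i*) ;; *) return 1 ;; esac   # shell is not interactive
    [ -z "$3" ] || return 1                  # already inside tmux
    [ -z "$4" ] || return 1                  # already inside zellij
    return 0
}

# Sample values are made up for illustration.
hero_should_attach "10.0.0.5 51000 10.0.0.9 22" "himBH" "" "" \
    && echo "interactive SSH login: attach"
hero_should_attach "10.0.0.5 51000 10.0.0.9 22" "hBc" "" "" \
    || echo "ssh user@host <command>: skip"
hero_should_attach "10.0.0.5 51000 10.0.0.9 22" "himBH" "/tmp/tmux-1000/default,42,0" "" \
    || echo "nested inside tmux: skip"
```

Keeping the condition in one function would also give the HERO_NO_TMUX variant a single place to add its extra check.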

3. Provide a HERO_NO_TMUX escape hatch

if [[ -n "$SSH_CONNECTION" && $- == *i* && -z "$TMUX" && -z "$ZELLIJ" && -z "$HERO_NO_TMUX" ]]; then
    exec tmux new-session -A -s hero "$HOME/hero/bin/nu"
fi

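Lets operators bypass the wrapper for debugging without editing the user's home (ssh user@host -o SetEnv='HERO_NO_TMUX=1'). sshd only passes through client-sent variables that match an AcceptEnv entry, so this also requires AcceptEnv HERO_NO_TMUX in the server's sshd_config; otherwise the variable is silently dropped and the wrapper still fires.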
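
Whether a given host will honour the variable can be checked against the effective sshd configuration, because sshd only accepts client-sent variables that match an AcceptEnv entry. The sketch below reads `sshd -T` output on stdin; the function name is hypothetical, and it only matches literal entries, not wildcard patterns such as LC_*.

```shell
# Hypothetical check: succeeds if the sshd configuration dump on stdin
# contains an AcceptEnv entry that literally equals the given variable name.
accepts_env() {
    awk -v name="$1" '
        tolower($1) == "acceptenv" {
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

One way to use it: `sudo sshd -T | accepts_env HERO_NO_TMUX || echo "AcceptEnv HERO_NO_TMUX is missing from sshd_config"`.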

4. Ensure tmux is installed by the installer

tools/install.sh should install tmux as a hard dependency (it's already standard on every distro the project targets). Same line of reasoning as the recent install_nushell || die change at install.sh:744.
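
A minimal sketch of such a hard-dependency check, modelled on the install_nushell || die pattern mentioned above. require_cmd is a hypothetical helper; the definition of install.sh's own die is not shown in this issue, so the sketch simply returns non-zero and leaves the abort to the caller.

```shell
# Hypothetical helper: returns 0 if the named command is on PATH,
# otherwise prints a message to stderr and returns 1.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing required command: $1" >&2
        return 1
    fi
}
```

In install.sh this might read `require_cmd tmux || die` after the package-install step. Without tmux on the box, the exec in ~/.bashrc fails and the user is left at a plain bash prompt, since an interactive bash does not exit on a failed exec.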

What this does NOT fix

  • The underlying nushell bug — that's an upstream fix in nushell/nushell and out of scope here. Worth filing a minimal SSH-disconnect repro upstream against #6455 once the workaround is in place.
  • Orphans created before the fix lands — operators will still need to clean up existing stuck-nu processes once (kristof5 has two right now).

Cross-reference

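Umbrella: lhumina_code/home#205 (https://forge.ourworld.tf/lhumina_code/home/issues/205)
Sibling: lhumina_code/hero_browser#18 (https://forge.ourworld.tf/lhumina_code/hero_browser/issues/18), BrowserPool lifecycle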

Reference
lhumina_code/hero_skills#199