fix(terminal): abort sibling PTY tasks on disconnect to avoid task leak (#50) #56

Merged
rawdaGastan merged 2 commits from development_fix_pty_task_leak into development 2026-04-26 08:15:57 +00:00

Summary

pty_proxy spawns four concurrent tasks (browser_to_server, bytes_bridge, forwarder, server_to_browser) and races them with tokio::select!. When one task finished, the old select! tail simply returned and the other three JoinHandles were dropped — which does NOT cancel the tasks. Under rapid reconnect, three tasks piled up per dead session, each holding a WebSocket sink, buffers, and Arc references.

This PR pins the handles, aborts the non-winners on each arm, and awaits the aborted handles so their captured stack frames drop before pty_proxy returns. No new crate dependency (tokio_util::sync::CancellationToken was considered and rejected for the 4-task case).

Closes #50.

Changes

crates/hero_router/src/server/terminal.rs (+50 / -5)

  • Replaced the tokio::select! tail of pty_proxy with:
    • tokio::pin!(browser_to_server, bytes_bridge, forwarder, server_to_browser);
    • Four select! arms, each of shape _ = &mut X => { abort(others); let _ = others.await; }.
  • Kept the attached().remove(&name) block AFTER the select. inject_reply (called from the terminal.reply RPC) and the internal resize handler both read the map entry during normal session operation — moving the remove earlier would silently break CPR/DA reply forwarding and resize updates.

Correctness notes

  • Double-poll avoided. The winning handle of each arm has already been polled to completion by select!. Awaiting it again can deadlock or panic depending on the tokio version. This PR only awaits the three non-winners in each arm.
  • Drop ordering preserved. The sequential let _ = h.await after the aborts ensures each cancelled task finishes unwinding before pty_proxy returns, so captured SplitSinks, sources, and channel senders drop promptly.
  • No new dependencies. Cargo.toml is untouched.

Why no unit test

The four tasks close over WebSocket sinks / SplitSink<WebSocketStream<..>> types that are not trivially mockable, and attached() is a process-global OnceLock. A meaningful end-to-end test would require an in-process axum WebSocket harness plus a tokio-tungstenite echo server — heavier than the fix itself. The change is verifiable by inspection of the four arms.

Verification

  • cargo build --workspace: pass.
  • cargo test --workspace: 71 passed / 0 failed / 0 ignored.
  • cargo clippy --workspace --all-targets -- -D warnings: clean.
  • cargo fmt --all --check: clean.

Suggested manual smoke test for reviewer

```
for i in $(seq 50); do
  (timeout 1 websocat "ws://127.0.0.1:9988/api/terminal/foo${i}/ws" >/dev/null &)
done
wait
# /proc/<pid>/task count should NOT grow monotonically across sessions.
```
fix(terminal): abort sibling PTY tasks on disconnect to avoid task leak
All checks were successful
Build & Test / check (push) Successful in 2m27s
Build & Test / check (pull_request) Successful in 1m44s
965b3967c3
#50

pty_proxy spawns four concurrent tasks (browser_to_server, bytes_bridge,
forwarder, server_to_browser) and races them with tokio::select!. When
one finished, the select returned, but the other three JoinHandles were
just dropped — which does NOT cancel the underlying tasks. They
continued running until their channels naturally drained, holding
WebSocket sinks, Vec buffers, and Arc refs across rapid reconnects.

- tokio::pin! the four handles so they can be selected-on by reference
  and later aborted.
- In each select! arm, abort the three non-winners via
  JoinHandle::abort(), then let _ = h.await each of those three so
  their captured stack frames drop before pty_proxy returns.
- The winning handle is NOT awaited a second time — it was already
  polled to completion by select! and a double-poll can deadlock or
  panic depending on the tokio version.
- attached().remove(&name) is kept after the select intentionally:
  inject_reply (via terminal.reply RPC) and the internal resize
  handler both read the map entry during normal session operation.
  Moving the remove earlier would silently break CPR/DA reply
  forwarding and resize updates.

No new crate dependencies. tokio_util::sync::CancellationToken was
considered and rejected for the 4-task case.
Merge branch 'development' into development_fix_pty_task_leak
All checks were successful
Build & Test / check (pull_request) Successful in 2m15s
Build & Test / check (push) Successful in 2m28s
7892ae371b
rawdaGastan merged commit a41c72c142 into development 2026-04-26 08:15:57 +00:00
rawdaGastan deleted branch development_fix_pty_task_leak 2026-04-26 08:15:57 +00:00