PTY WebSocket select! leaks three of four spawned tasks on disconnect #50
Reference: lhumina_code/hero_router#50
Category: resource / low-medium
What
The PTY WebSocket proxy spawns four concurrent tasks and races them with
`tokio::select!`. When one completes (e.g. the browser closes the WebSocket), the `select!` returns, but the other three tasks are not aborted — dropping a `JoinHandle` does not cancel the task. They continue running until their channels naturally drain.

Where
`crates/hero_router/src/server/terminal.rs:618-632`

The four tasks are:

- `browser_to_server` — exits on `None`/`Err`
- `bytes_bridge` — exits when the `bytes_tx` senders are dropped
- `forwarder` — exits when `msg_tx` is dropped (owns `server_sink`)
- `server_to_browser` — exits on `None`/`Err`

The channels are shared between tasks, so even after one task exits the others may keep their senders alive — e.g. if `browser_to_server` dies first, `forwarder` still holds `server_sink` and keeps writing to hero_proc's WS socket until `msg_tx` is dropped. Since `msg_tx` is captured by `browser_to_server` (which is finishing) and `bytes_bridge`, both must drop it before `forwarder` unblocks.

Why it's wrong
- Each task captures an `Arc` into the session map. The map entry is removed on L628 via `map.remove(&name)`, but the tasks keep live references to their captured variables, so they continue operating on a session that no longer exists in the registry.
- A resize message or `inject_reply()` racing with cleanup can briefly find the entry gone and fail fast.
- `server_to_browser` can block forever on `.next().await` with no way to unstick it.

Reproduction
Open 50 terminal sessions in a loop, close each immediately after connecting:
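A hedged sketch of such a loop, assuming `websocat` is installed and the proxy exposes a hypothetical `ws://localhost:8080/terminal/<session>` endpoint (adjust host, port, and path to the real route):

```shell
#!/bin/sh
# Hypothetical endpoint — substitute the actual hero_router host/port/path.
URL_BASE="ws://localhost:8080/terminal"

for i in $(seq 1 50); do
  # EOF on stdin makes websocat close the WebSocket right after connecting.
  websocat "${URL_BASE}/session-${i}" </dev/null >/dev/null 2>&1 || true
done

echo "opened and closed 50 sessions"
```

While this runs, watch the router's task count (e.g. via `tokio-console`); with the leak, lingering tasks accumulate across the 50 reconnects.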
Suggested fix

Collect the `JoinHandle`s, then on the winning arm `.abort()` the rest.

Or use a `CancellationToken` shared across all four tasks (cleaner if the task count grows).

Either way, explicitly drop `msg_tx`/`bytes_tx` before the `select!` (whichever side doesn't need to send anymore) so the other side naturally unblocks too.

Implementation Spec for Issue #50
Objective

Abort the three remaining `pty_proxy` tasks when any one of the four completes, so WebSocket handles, channel senders, and captured buffers are released promptly on disconnect instead of lingering across session churn.

Requirements

- When any of `browser_to_server`, `bytes_bridge`, `forwarder`, or `server_to_browser` finishes, the other three MUST be aborted.
- After `select!` returns, the four handles must be awaited (results ignored) so each task's captured stack frame — including `server_sink`, `browser_sink`, `browser_source`, `server_source`, and the `msg_tx`/`reply_tx` clones — drops before `pty_proxy` returns. This prevents the tasks from outliving the function.
- The `attached().remove(&name)` call must stay after the `select!` (and can run concurrently with or just before the final awaits). It must NOT be moved before the `select!` — the map entry is actively read during the session by `inject_reply` (external, via the `terminal.reply` RPC) and by `browser_to_server` (internal, for resize state at line 493). Moving the remove earlier would silently break CPR/DA reply forwarding.
- Use `tokio::pin!` + `JoinHandle::abort()`. Confirmed: `tokio_util` (which would provide `CancellationToken`) is not a dependency of `hero_router`.
- `pty_proxy`'s return type (`Result<()>`), error handling in the caller (`pty_handler`), and the `JOB_NAME_PREFIX`/session-registry contract are all unchanged.

Files to Modify/Create
- `crates/hero_router/src/server/terminal.rs` — rewrite the `tokio::select!` block at roughly lines 617-623 to pin handles and abort siblings on the winning arm; follow with sequential `.await` on all four handles so captured resources drop; keep the `attached().remove(&name)` block where it is today (just before the trailing `Ok(())`).

Implementation Plan
Step 1: Rewrite the select-and-cleanup tail of `pty_proxy`

Files: `crates/hero_router/src/server/terminal.rs`

Current shape (lines 617-631):
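(The original code block was lost in transfer; the following is a pseudo-code reconstruction from the task list above — task bodies and exact types are elided, not the real source.)

```rust
// Pseudo-code sketch — task bodies elided; not the actual source.
let browser_to_server = tokio::spawn(async move { /* browser WS -> msg_tx */ });
let bytes_bridge      = tokio::spawn(async move { /* raw byte bridge      */ });
let forwarder         = tokio::spawn(async move { /* msg_rx -> server_sink */ });
let server_to_browser = tokio::spawn(async move { /* server WS -> browser  */ });

// BUG: when one arm wins, the other three JoinHandles are dropped,
// which detaches the tasks but does NOT cancel them.
tokio::select! {
    _ = browser_to_server => {}
    _ = bytes_bridge => {}
    _ = forwarder => {}
    _ = server_to_browser => {}
}

attached().remove(&name);
Ok(())
```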
Replace with:
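(Reconstructed pseudo-code — the original block was lost in transfer. Note this is the spec's original proposal; the "Spec correction" comment further down revises the trailing awaits.)

```rust
tokio::pin!(browser_to_server, bytes_bridge, forwarder, server_to_browser);

tokio::select! {
    _ = &mut browser_to_server => {
        bytes_bridge.abort();
        forwarder.abort();
        server_to_browser.abort();
    }
    // ... symmetric arms for the other three tasks ...
}

// As originally specced: await all four handles, results ignored.
let _ = (&mut browser_to_server).await; // double-polls the winner (see correction below)
let _ = (&mut bytes_bridge).await;
let _ = (&mut forwarder).await;
let _ = (&mut server_to_browser).await;

attached().remove(&name);
Ok(())
```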
Subtasks:

- Replace the tail of `pty_proxy` in `crates/hero_router/src/server/terminal.rs` with the block above.
- No new imports needed (`tokio::pin!` is a macro re-export from `tokio`, which is already fully-featured in this crate).
- Do NOT move the `attached().remove(&name)` call to before the `select!`.
- Do NOT use `tokio_util::sync::CancellationToken` or add any dependency to `Cargo.toml`.

Dependencies: none.
Step 2: Tests
Files: none (no new test file is practical).
Rationale: the four tasks close over WebSocket sinks / `SplitSink<WebSocketStream<..>>` types that are not trivially mockable, and `attached()` is a process-global `OnceLock`. A meaningful unit test would require an in-process axum WebSocket harness plus a tokio-tungstenite echo server, which is heavier than the fix itself. The change is verifiable by inspection — each `select!` arm demonstrably aborts exactly the three sibling handles.

For the reviewer, a light manual smoke test: rapid reconnects, as in the reproduction section above.
Dependencies: Step 1.
Acceptance Criteria

- Each of the four `select!` arms calls `.abort()` on the other three `JoinHandle`s.
- The handles are awaited (`let _ = ...await`) so captured resources (`server_sink`, `browser_sink`, `browser_source`, `server_source`, and all sender clones) drop before `pty_proxy` returns.
- `attached().remove(&name)` stays AFTER the `select!` — `inject_reply` and the resize handler continue to see the session during its full lifetime.
- `Cargo.toml` is unchanged (no new crate dependencies).
- `cargo test --workspace` passes.
- `cargo clippy --workspace --all-targets -- -D warnings` passes.
- The only code change is the tail of `pty_proxy` (lines 617-631). No changes elsewhere.

Notes
- `JoinHandle::abort()` is synchronous and idempotent. Calling it on a handle whose task already completed normally (the winner) is a no-op. Awaiting an aborted task returns `Err(JoinError)` with `is_cancelled() == true`; we discard via `let _ = ...await`.
- Explicit `drop(reply_tx)`/`drop(msg_tx)` on the parent stack before the `select!` was considered. With `.abort()` in place, these drops are redundant — aborts forcibly terminate downstream tasks regardless of channel state, and the parent senders are dropped implicitly when `pty_proxy` returns. Skipping the explicit drops keeps the diff minimal.
- On `tokio_util::sync::CancellationToken`: the dependency would be new to this crate, and the 4-task case is cleanly expressed with `tokio::pin!` + `.abort()`. If the task count grows or nested cancellation is needed in the future, migration to `CancellationToken` is straightforward.
- The `inject_reply` race window is unchanged by this fix — it matches today's behavior. Once the aborts land and `bytes_bridge` is cancelled, any in-flight `inject_reply` that already cloned the sender will have its send silently dropped when the unbounded mpsc receiver is gone. `inject_reply` already returns `Err("no active PTY for session '...'")` when the map entry is missing, which callers (the `terminal.reply` RPC) already handle as a soft error.

Spec correction — avoid double-poll of the winning handle
The pseudo-code in the previous comment had the `.await` calls AFTER the `select!`. Awaiting a `JoinHandle` that was already polled to completion by `tokio::select!` is not safe — the output has already been consumed, so the second `.await` either deadlocks (returns `Pending` forever) or panics depending on the tokio version.
Corrected pattern

Move the `.await`s INSIDE each arm, awaiting only the three non-winners. Each arm aborts and then awaits exactly the three sibling handles (never the winner). All other parts of the spec (`pin!`, map-remove ordering after select, no new deps) stay as before.
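(Reconstructed pseudo-code sketch — the original block was lost in transfer; names follow the issue, task bodies elided.)

```rust
tokio::pin!(browser_to_server, bytes_bridge, forwarder, server_to_browser);

tokio::select! {
    _ = &mut browser_to_server => {
        bytes_bridge.abort();
        forwarder.abort();
        server_to_browser.abort();
        // Await only the three non-winners; the winner was already
        // polled to completion by select! and must not be re-awaited.
        let _ = (&mut bytes_bridge).await;
        let _ = (&mut forwarder).await;
        let _ = (&mut server_to_browser).await;
    }
    _ = &mut bytes_bridge => { /* symmetric: abort + await the other three */ }
    _ = &mut forwarder => { /* symmetric */ }
    _ = &mut server_to_browser => { /* symmetric */ }
}

attached().remove(&name); // stays AFTER the select!
Ok(())
```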
Test Results
Lint & build

- `cargo clippy --workspace --all-targets -- -D warnings`: pass
- `cargo fmt --all --check`: pass
- `cargo build --workspace`: pass

No new tests added
The four tasks close over WebSocket sinks / `SplitSink<WebSocketStream<..>>` types that are not trivially mockable, and `attached()` is a process-global `OnceLock`. A meaningful end-to-end test would require an in-process axum WebSocket harness plus a tokio-tungstenite echo server, which is heavier than the fix itself. The change is verifiable by inspection:

- Each of the four `select!` arms aborts exactly the three sibling `JoinHandle`s.
- Only the non-winners are awaited, never the winning `JoinHandle`.
- The `attached().remove(&name)` block is kept AFTER the select so `inject_reply` and the internal resize handler continue to see the session during its full lifetime.
- No `Cargo.toml` changes.

Recommended manual smoke test: rapid reconnects, as in the reproduction section above.
Implementation Summary
Closing #50 via PR (linked in a follow-up comment).
Changes
`crates/hero_router/src/server/terminal.rs` (+50 / -5)

Rewrote the `tokio::select!` block at the tail of `pty_proxy` to abort the three non-winning tasks when any one of the four completes, and to await only those three handles (never the winner).

Before: the plain `select!` over the four handles (siblings leaked on disconnect).

After:

- `tokio::pin!(browser_to_server, bytes_bridge, forwarder, server_to_browser);`
- Each arm: `_ = &mut X => { abort(others); let _ = others.await; }`
- `attached().remove(&name)` kept AFTER the select so `inject_reply` and the resize handler continue to see the session for its full lifetime.

Why this pattern
- `.abort()` is required: dropping a `JoinHandle` does not cancel the underlying task. Each `select!` arm now explicitly aborts the three sibling handles.
- `.await` of non-winners only: awaiting a `JoinHandle` that was already polled to completion by `select!` is not safe (can deadlock or panic depending on tokio version). The winner of each arm is never awaited again.
- `.await` after `.abort()`: ensures each aborted task finishes unwinding before `pty_proxy` returns, so captured `SplitSink`s, sources, and channel senders drop promptly instead of lingering across session churn.
- Map-remove stays after the select: `inject_reply` (called from the `terminal.reply` RPC) and the internal resize handler both read the map entry during normal session operation. Moving the remove earlier would silently break CPR/DA reply forwarding and resize updates. The teardown-time race window for `inject_reply` is unchanged from today.
- `tokio::pin!` + `JoinHandle::abort()` only: `tokio_util`'s `CancellationToken` was considered and rejected — the 4-task case is cleanly expressed in ~40 lines without adding a dependency.

Verification
- `cargo build --workspace`: pass
- `cargo test --workspace`: 71 passed / 0 failed / 0 ignored
- `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `cargo fmt --all --check`: clean

Review focus
Manual smoke test: rapid reconnects, as in the reproduction section of the issue.

Code review should confirm:

- `attached().remove(&name)` is still after the `select!`.
- No `Cargo.toml` changes.

Pull request opened: #56
This PR implements the changes discussed in this issue.