Application-level WebSocket heartbeat (A2 — WS refactor follow-up) #16
Reference: lhumina_code/hero_collab#16
Follow-up to #13. Closes gap A2 from the post-refactor architectural assessment.

**Problem:** we rely on TCP keepalive for dead-peer detection on `/ws/user/{user_id}`. On a quiet channel, a dead socket can linger for 2+ minutes before the next outbound event tries to write and fails. Slack/Discord detect this in ~35s via an application-level ping/pong cycle.

**Solution:** the server sends `{"type":"ping","ts":...}` every 25s; the client echoes `{"type":"pong"}`. If no pong arrives within 10s after a ping, close the socket (the client's existing backoff reconnects). Restructures `handle_user_ws` around an mpsc single-writer pattern so the heartbeat and event-forwarder can both push frames without contending on the WebSocket's split sink.

**Docs:** `plan/feature-ws-heartbeat.md`, `plan/impl-ws-heartbeat.md`

**Size:** ~85 lines production + ~150 lines tests. 3 tasks.
**Branch:** lands on `feat/ws-refactor`. Execution depends on #13's P6.2 (`handle_user_ws`) being in place; runs before #B1 (session resume), which extends the same handler.

**Implementation landed**
Three commits on `feat/ws-refactor`:

- `030ac25` feat(ui): 25s WebSocket heartbeat with single-writer mpsc pattern

  Restructures `handle_user_ws` around an mpsc channel that both the event-forwarder and the heartbeat task push onto; a single writer task drains to `ws_tx`. `tokio::select!` now has 4 arms (was 2): `event_forward`, `heartbeat`, `writer`, `recv_loop`. `WS_HEARTBEAT_INTERVAL` = 25s, `WS_HEARTBEAT_TIMEOUT` = 10s. Dead-peer detection in ≤35s (vs 2+ min via TCP keepalive).

- `4b8afd6` feat(ui): client responds to server ping with pong (heartbeat)

  The client's `handleWsEvent` gains a `case 'ping':` at the top of the switch that echoes `{type:'pong'}` if the socket is OPEN.

- `8415247` test(ui): heartbeat test scaffold — fixture TODO, manual dogfood primary gate

  3 `#[ignore]`'d integration tests documenting intent. The full fixture (spawn `hero_collab_ui` + connect a tokio-tungstenite WebSocket client) is deferred: it would be the first integration-test harness this crate needs, and the payoff-to-effort ratio is disproportionate for a single feature. A shared fixture will amortize the cost when B1's `session_resume` tests need the same setup.

**Architectural notes**
- Why mpsc: having each task call `ws_tx.send()` in its own send loop would require a `Mutex` around `ws_tx` with real contention. The mpsc channel makes this clean.
- `pong_received` uses `Ordering::Relaxed`. Correct — eventual visibility across a 10s grace window is sufficient; no read-before-write ordering is required. The ping task RESETS the flag before sending, so a pong racing the next ping can't false-positive.
- Handling pong before dispatch (ahead of `handle_inbound`) prevents `handle_inbound` from warn-logging "dropped unexpected inbound WS type" on every pong.
- A heartbeat timeout falls through `tokio::select!` to teardown, which fires the `presence.mark_connection` offline RPC from #17.

**Manual dogfood gate**
Primary verification: DevTools → Network → Offline → wait ~35s → connection status flips to "disconnected" → re-enable → client reconnects via existing backoff. This is more decisive than the ignored unit tests would be.
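The `pong_received` discipline from the architectural notes can be isolated in a few lines. `Heartbeat` and its method names are invented for illustration; the point is the reset-before-ping ordering that keeps a stale pong from masking a dead peer:

```rust
// Focused sketch of the pong_received flag discipline; names are assumed.
use std::sync::atomic::{AtomicBool, Ordering};

struct Heartbeat {
    pong_received: AtomicBool,
}

impl Heartbeat {
    fn new() -> Self {
        Self { pong_received: AtomicBool::new(false) }
    }

    /// Heartbeat task, each cycle, BEFORE queueing the ping: reset first,
    /// so a pong left over from the previous cycle cannot satisfy this one.
    fn begin_cycle(&self) {
        self.pong_received.store(false, Ordering::Relaxed);
    }

    /// Recv loop, on a {"type":"pong"} frame. Relaxed suffices: the store
    /// only needs to become visible somewhere inside a 10s grace window,
    /// and no other memory access is ordered against it.
    fn note_pong(&self) {
        self.pong_received.store(true, Ordering::Relaxed);
    }

    /// Heartbeat task, after the grace window elapses.
    fn peer_alive(&self) -> bool {
        self.pong_received.load(Ordering::Relaxed)
    }
}

fn main() {
    let hb = Heartbeat::new();
    hb.begin_cycle(); // cycle 1: ping queued
    hb.note_pong();   // pong arrives in time
    assert!(hb.peer_alive());
    hb.begin_cycle(); // cycle 2: flag reset BEFORE the next ping,
                      // so cycle 1's pong can't mask a dead peer
    assert!(!hb.peer_alive());
    println!("ok");
}
```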
**Tests impact**

`cargo test -p hero_collab_ui`: 8 unit tests (unchanged) + 3 ignored (new heartbeat scaffold). Zero regressions.
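A plausible shape for the ignored scaffold, for reference only: the test names, bodies, and the bound constants below are assumptions, not the committed file.

```rust
// Illustrative scaffold shape; names and bodies are assumed.
// Constants mirror the documented 25s interval + 10s timeout.
const WS_HEARTBEAT_INTERVAL_SECS: u64 = 25;
const WS_HEARTBEAT_TIMEOUT_SECS: u64 = 10;
/// Worst-case dead-peer detection bound the tests would assert against.
const MAX_DETECT_SECS: u64 = WS_HEARTBEAT_INTERVAL_SECS + WS_HEARTBEAT_TIMEOUT_SECS;

#[test]
#[ignore = "needs integration fixture: spawn hero_collab_ui + tokio-tungstenite client"]
fn server_pings_within_interval() {
    // Intent: connect, read frames for ~WS_HEARTBEAT_INTERVAL_SECS,
    // assert a {"type":"ping","ts":...} frame arrives.
    todo!("blocked on shared WS integration fixture (see B1)");
}

#[test]
#[ignore = "needs integration fixture"]
fn silent_client_closed_within_grace() {
    // Intent: connect, never send pong, assert the server closes the
    // socket within MAX_DETECT_SECS (plus scheduling slack).
    todo!("blocked on shared WS integration fixture (see B1)");
}

fn main() {
    assert_eq!(MAX_DETECT_SECS, 35); // matches the ≤35s detection claim
    println!("detection bound: {}s", MAX_DETECT_SECS);
}
```

Ignored tests keep the intended assertions visible in `cargo test` output without blocking the branch on the missing harness.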