Application-level WebSocket heartbeat (A2 — WS refactor follow-up) #16

Closed
opened 2026-04-23 19:40:55 +00:00 by sameh-farouk · 1 comment
Member

Follow-up to #13. Closes gap A2 from the post-refactor architectural assessment.

Problem: we rely on TCP keepalive for dead-peer detection on /ws/user/{user_id}. On a quiet channel, a dead socket can linger for 2+ minutes before the next outbound event tries to write and fails. Slack/Discord detect this in ~35s via an application-level ping/pong cycle.

Solution: server sends {"type":"ping","ts":...} every 25s; client echoes {"type":"pong"}. If no pong arrives within 10s after a ping, close the socket (client's existing backoff reconnects). Restructures handle_user_ws around an mpsc single-writer pattern so the heartbeat and event-forwarder can both push frames without contending on the WebSocket's split-sink.

Docs:

Size: ~85 lines production + ~150 lines tests. 3 tasks.

Branch: lands on feat/ws-refactor. Execution depends on #13's P6.2 (handle_user_ws) being in place; runs before #B1 (session resume) which extends the same handler.

Follow-up to #13. Closes gap A2 from the post-refactor architectural assessment. **Problem:** we rely on TCP keepalive for dead-peer detection on `/ws/user/{user_id}`. On a quiet channel, a dead socket can linger for 2+ minutes before the next outbound event tries to write and fails. Slack/Discord detect this in ~35s via an application-level ping/pong cycle. **Solution:** server sends `{"type":"ping","ts":...}` every 25s; client echoes `{"type":"pong"}`. If no pong arrives within 10s after a ping, close the socket (client's existing backoff reconnects). Restructures `handle_user_ws` around an mpsc single-writer pattern so the heartbeat and event-forwarder can both push frames without contending on the WebSocket's split-sink. **Docs:** - Spec: [`plan/feature-ws-heartbeat.md`](../src/branch/feat/ws-refactor/plan/feature-ws-heartbeat.md) - Impl plan: [`plan/impl-ws-heartbeat.md`](../src/branch/feat/ws-refactor/plan/impl-ws-heartbeat.md) **Size:** ~85 lines production + ~150 lines tests. 3 tasks. **Branch:** lands on `feat/ws-refactor`. Execution depends on #13's P6.2 (`handle_user_ws`) being in place; runs before #B1 (session resume) which extends the same handler.
Author
Member

Implementation landed

Three commits on feat/ws-refactor:

  • 030ac25 feat(ui): 25s WebSocket heartbeat with single-writer mpsc pattern
    Restructures handle_user_ws around an mpsc channel that both the event-forwarder and heartbeat task push onto; a single writer task drains to ws_tx. tokio::select! now has 4 arms (was 2): event_forward, heartbeat, writer, recv_loop. WS_HEARTBEAT_INTERVAL = 25s, WS_HEARTBEAT_TIMEOUT = 10s. Dead-peer detection in ≤35s (vs 2+ min via TCP keepalive).
  • 4b8afd6 feat(ui): client responds to server ping with pong (heartbeat)
    Client handleWsEvent gains a case 'ping': at the top of the switch that echoes {type:'pong'} if the socket is OPEN.
  • 8415247 test(ui): heartbeat test scaffold — fixture TODO, manual dogfood primary gate
    3 #[ignore]'d integration tests documenting intent. Full fixture (spawn hero_collab_ui + connect tokio-tungstenite WebSocket) is deferred — it's the first integration-test harness this crate would need, and the payoff:effort ratio is disproportionate for a single feature. Shared fixture will amortize when B1's session_resume tests need the same setup.

Architectural notes

  • mpsc single-writer pattern is load-bearing for B1 (#18) session resume. The ring-buffer replay in B1 needs to inject frames into the outbound stream alongside live events; a direct ws_tx.send()-in-send-loop pattern would require a Mutex around ws_tx with real contention. The mpsc channel makes this clean.
  • pong_received uses Ordering::Relaxed. Correct — eventual visibility across a 10s grace window is sufficient; no read-before-write ordering required. The ping task RESETS the flag before sending, so a pong racing the next ping can't false-positive.
  • Fast-path pong in recv-loop (before routing to handle_inbound) prevents handle_inbound from warn-logging "dropped unexpected inbound WS type" on every pong.
  • 25s interval is below the typical 30s NAT timeout and 60s ALB timeout, so pings also keep intermediary state alive.
  • Socket close on missed pong cascades via tokio::select! to teardown, which fires the presence.mark_connection offline RPC from #17.

Manual dogfood gate

Primary verification: DevTools → Network → Offline → wait ~35s → connection status flips to "disconnected" → re-enable → client reconnects via existing backoff. This is more decisive than the ignored unit tests would be.

Tests impact

cargo test -p hero_collab_ui: 8 unit (unchanged) + 3 ignored (new heartbeat scaffold). Zero regressions.

## Implementation landed Three commits on `feat/ws-refactor`: - [`030ac25`](../commit/030ac25) `feat(ui): 25s WebSocket heartbeat with single-writer mpsc pattern` Restructures `handle_user_ws` around an mpsc channel that both the event-forwarder and heartbeat task push onto; a single writer task drains to `ws_tx`. `tokio::select!` now has 4 arms (was 2): `event_forward`, `heartbeat`, `writer`, `recv_loop`. `WS_HEARTBEAT_INTERVAL` = 25s, `WS_HEARTBEAT_TIMEOUT` = 10s. Dead-peer detection in ≤35s (vs 2+ min via TCP keepalive). - [`4b8afd6`](../commit/4b8afd6) `feat(ui): client responds to server ping with pong (heartbeat)` Client `handleWsEvent` gains a `case 'ping':` at the top of the switch that echoes `{type:'pong'}` if the socket is OPEN. - [`8415247`](../commit/8415247) `test(ui): heartbeat test scaffold — fixture TODO, manual dogfood primary gate` 3 `#[ignore]`'d integration tests documenting intent. Full fixture (spawn hero_collab_ui + connect tokio-tungstenite WebSocket) is deferred — it's the first integration-test harness this crate would need, and the payoff:effort ratio is disproportionate for a single feature. Shared fixture will amortize when B1's session_resume tests need the same setup. ## Architectural notes - **mpsc single-writer pattern is load-bearing for B1 (#18) session resume.** The ring-buffer replay in B1 needs to inject frames into the outbound stream alongside live events; a direct `ws_tx.send()`-in-send-loop pattern would require a Mutex around `ws_tx` with real contention. The mpsc channel makes this clean. - **`pong_received` uses `Ordering::Relaxed`.** Correct — eventual visibility across a 10s grace window is sufficient; no read-before-write ordering required. The ping task RESETS the flag before sending, so a pong racing the next ping can't false-positive. - **Fast-path pong in recv-loop** (before routing to `handle_inbound`) prevents `handle_inbound` from warn-logging "dropped unexpected inbound WS type" on every pong. - **25s interval is below the typical 30s NAT timeout** and 60s ALB timeout, so pings also keep intermediary state alive. - Socket close on missed pong cascades via `tokio::select!` to teardown, which fires the `presence.mark_connection` offline RPC from #17. ## Manual dogfood gate Primary verification: DevTools → Network → Offline → wait ~35s → connection status flips to "disconnected" → re-enable → client reconnects via existing backoff. This is more decisive than the ignored unit tests would be. ## Tests impact `cargo test -p hero_collab_ui`: 8 unit (unchanged) + 3 ignored (new heartbeat scaffold). Zero regressions.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_collab#16
No description provided.