Application-level WebSocket heartbeat (A2 — WS refactor follow-up) #16

New issue

Closed

opened 2026-04-23 19:40:55 +00:00 by sameh-farouk · 1 comment

sameh-farouk commented

2026-04-23 19:40:55 +00:00

Member

Follow-up to #13. Closes gap A2 from the post-refactor architectural assessment.

Problem: we rely on TCP keepalive for dead-peer detection on /ws/user/{user_id}. On a quiet channel, a dead socket can linger for 2+ minutes before the next outbound event tries to write and fails. Slack/Discord detect this in ~35s via an application-level ping/pong cycle.

Solution: server sends {"type":"ping","ts":...} every 25s; client echoes {"type":"pong"}. If no pong arrives within 10s after a ping, close the socket (client's existing backoff reconnects). Restructures handle_user_ws around an mpsc single-writer pattern so the heartbeat and event-forwarder can both push frames without contending on the WebSocket's split-sink.

Docs:

Spec: plan/feature-ws-heartbeat.md
Impl plan: plan/impl-ws-heartbeat.md

Size: ~85 lines production + ~150 lines tests. 3 tasks.

Branch: lands on feat/ws-refactor. Execution depends on #13's P6.2 (handle_user_ws) being in place; runs before #B1 (session resume) which extends the same handler.

Follow-up to #13. Closes gap A2 from the post-refactor architectural assessment. **Problem:** we rely on TCP keepalive for dead-peer detection on `/ws/user/{user_id}`. On a quiet channel, a dead socket can linger for 2+ minutes before the next outbound event tries to write and fails. Slack/Discord detect this in ~35s via an application-level ping/pong cycle. **Solution:** server sends `{"type":"ping","ts":...}` every 25s; client echoes `{"type":"pong"}`. If no pong arrives within 10s after a ping, close the socket (client's existing backoff reconnects). Restructures `handle_user_ws` around an mpsc single-writer pattern so the heartbeat and event-forwarder can both push frames without contending on the WebSocket's split-sink. **Docs:** - Spec: [`plan/feature-ws-heartbeat.md`](../src/branch/feat/ws-refactor/plan/feature-ws-heartbeat.md) - Impl plan: [`plan/impl-ws-heartbeat.md`](../src/branch/feat/ws-refactor/plan/impl-ws-heartbeat.md) **Size:** ~85 lines production + ~150 lines tests. 3 tasks. **Branch:** lands on `feat/ws-refactor`. Execution depends on #13's P6.2 (`handle_user_ws`) being in place; runs before #B1 (session resume) which extends the same handler.

sameh-farouk added the

type_story

label

2026-04-23 19:40:55 +00:00

sameh-farouk referenced this issue

2026-04-23 19:42:05 +00:00

Refactor realtime transport: one WS per user + server-side fanout (Slack/Mattermost model) #13

sameh-farouk referenced this issue

2026-04-23 20:52:27 +00:00

Refactor realtime transport: one WS per user + server-side fanout (Slack/Mattermost model) #13

sameh-farouk commented

2026-04-23 20:58:47 +00:00

Author

Member

Implementation landed

Three commits on feat/ws-refactor:

030ac25 feat(ui): 25s WebSocket heartbeat with single-writer mpsc pattern
Restructures handle_user_ws around an mpsc channel that both the event-forwarder and heartbeat task push onto; a single writer task drains to ws_tx. tokio::select! now has 4 arms (was 2): event_forward, heartbeat, writer, recv_loop. WS_HEARTBEAT_INTERVAL = 25s, WS_HEARTBEAT_TIMEOUT = 10s. Dead-peer detection in ≤35s (vs 2+ min via TCP keepalive).
4b8afd6 feat(ui): client responds to server ping with pong (heartbeat)
Client handleWsEvent gains a case 'ping': at the top of the switch that echoes {type:'pong'} if the socket is OPEN.
8415247 test(ui): heartbeat test scaffold — fixture TODO, manual dogfood primary gate
3 #[ignore]'d integration tests documenting intent. Full fixture (spawn hero_collab_ui + connect tokio-tungstenite WebSocket) is deferred — it's the first integration-test harness this crate would need, and the payoff:effort ratio is disproportionate for a single feature. Shared fixture will amortize when B1's session_resume tests need the same setup.

Architectural notes

mpsc single-writer pattern is load-bearing for B1 (#18) session resume. The ring-buffer replay in B1 needs to inject frames into the outbound stream alongside live events; a direct ws_tx.send()-in-send-loop pattern would require a Mutex around ws_tx with real contention. The mpsc channel makes this clean.
pong_received uses Ordering::Relaxed. Correct — eventual visibility across a 10s grace window is sufficient; no read-before-write ordering required. The ping task RESETS the flag before sending, so a pong racing the next ping can't false-positive.
Fast-path pong in recv-loop (before routing to handle_inbound) prevents handle_inbound from warn-logging "dropped unexpected inbound WS type" on every pong.
25s interval is below the typical 30s NAT timeout and 60s ALB timeout, so pings also keep intermediary state alive.
Socket close on missed pong cascades via tokio::select! to teardown, which fires the presence.mark_connection offline RPC from #17.

Manual dogfood gate

Primary verification: DevTools → Network → Offline → wait ~35s → connection status flips to "disconnected" → re-enable → client reconnects via existing backoff. This is more decisive than the ignored unit tests would be.

Tests impact

cargo test -p hero_collab_ui: 8 unit (unchanged) + 3 ignored (new heartbeat scaffold). Zero regressions.

## Implementation landed Three commits on `feat/ws-refactor`: - [`030ac25`](../commit/030ac25) `feat(ui): 25s WebSocket heartbeat with single-writer mpsc pattern` Restructures `handle_user_ws` around an mpsc channel that both the event-forwarder and heartbeat task push onto; a single writer task drains to `ws_tx`. `tokio::select!` now has 4 arms (was 2): `event_forward`, `heartbeat`, `writer`, `recv_loop`. `WS_HEARTBEAT_INTERVAL` = 25s, `WS_HEARTBEAT_TIMEOUT` = 10s. Dead-peer detection in ≤35s (vs 2+ min via TCP keepalive). - [`4b8afd6`](../commit/4b8afd6) `feat(ui): client responds to server ping with pong (heartbeat)` Client `handleWsEvent` gains a `case 'ping':` at the top of the switch that echoes `{type:'pong'}` if the socket is OPEN. - [`8415247`](../commit/8415247) `test(ui): heartbeat test scaffold — fixture TODO, manual dogfood primary gate` 3 `#[ignore]`'d integration tests documenting intent. Full fixture (spawn hero_collab_ui + connect tokio-tungstenite WebSocket) is deferred — it's the first integration-test harness this crate would need, and the payoff:effort ratio is disproportionate for a single feature. Shared fixture will amortize when B1's session_resume tests need the same setup. ## Architectural notes - **mpsc single-writer pattern is load-bearing for B1 (#18) session resume.** The ring-buffer replay in B1 needs to inject frames into the outbound stream alongside live events; a direct `ws_tx.send()`-in-send-loop pattern would require a Mutex around `ws_tx` with real contention. The mpsc channel makes this clean. - **`pong_received` uses `Ordering::Relaxed`.** Correct — eventual visibility across a 10s grace window is sufficient; no read-before-write ordering required. The ping task RESETS the flag before sending, so a pong racing the next ping can't false-positive. - **Fast-path pong in recv-loop** (before routing to `handle_inbound`) prevents `handle_inbound` from warn-logging "dropped unexpected inbound WS type" on every pong. - **25s interval is below the typical 30s NAT timeout** and 60s ALB timeout, so pings also keep intermediary state alive. - Socket close on missed pong cascades via `tokio::select!` to teardown, which fires the `presence.mark_connection` offline RPC from #17. ## Manual dogfood gate Primary verification: DevTools → Network → Offline → wait ~35s → connection status flips to "disconnected" → re-enable → client reconnects via existing backoff. This is more decisive than the ignored unit tests would be. ## Tests impact `cargo test -p hero_collab_ui`: 8 unit (unchanged) + 3 ignored (new heartbeat scaffold). Zero regressions.

sameh-farouk referenced this issue

2026-04-23 20:58:48 +00:00

Session resume with sequence numbers + ring buffer (B1 — WS refactor follow-up) #18

sameh-farouk referenced this issue

2026-04-23 21:01:10 +00:00

Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish #19

sameh-farouk closed this issue

2026-04-24 01:00:09 +00:00