Session resume with sequence numbers + ring buffer (B1 — WS refactor follow-up) #18

Closed
opened 2026-04-23 19:40:56 +00:00 by sameh-farouk · 1 comment
Member

Follow-up to #13. Closes gap B1 from the post-refactor architectural assessment.

Problem: on a brief WS drop, the refactor's onWsReconnected catch-up refetches unread counts + mentions + current-channel's last 100 messages. But events in other channels during the drop only surface on the next 300s pollNotifications tick; non-message events (reactions, pins, presence flips, channel.added) aren't in any catch-up RPC and are just lost until the next natural refresh.

Solution: per-user monotonic sequence numbers + in-memory ring buffer (500 events per user). On reconnect, client sends ?resume_from=N; server replays everything with seq > N from the buffer. Buffer miss (too old, or session was destroyed) returns a resume.failed frame; client falls through to the existing cold catch-up.
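
To make the handshake concrete, here is a minimal sketch of the replay decision the server would make for ?resume_from=N, assuming a per-user VecDeque ring buffer; the type and function names are illustrative, not the final API:

```rust
use std::collections::VecDeque;

// Illustrative only: BufferedEvent / ResumeOutcome / plan_resume are
// hypothetical names, not the implementation's.
#[derive(Clone)]
struct BufferedEvent {
    seq: u64,
    payload: String, // serialized event JSON
}

enum ResumeOutcome {
    Replay(Vec<BufferedEvent>), // everything with seq > resume_from, oldest first
    Failed(&'static str),       // "too_old" or "no_buffer": client does cold catch-up
}

fn plan_resume(buffer: &VecDeque<BufferedEvent>, resume_from: u64) -> ResumeOutcome {
    let Some(oldest) = buffer.front().map(|e| e.seq) else {
        return ResumeOutcome::Failed("no_buffer"); // session exists but holds no events
    };
    // The first event the client still needs (resume_from + 1) has already been
    // evicted from the ring buffer, so a gap-free replay is impossible.
    if resume_from + 1 < oldest {
        return ResumeOutcome::Failed("too_old");
    }
    ResumeOutcome::Replay(buffer.iter().filter(|e| e.seq > resume_from).cloned().collect())
}
```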

Session lifetime = "at least one connected tab" — matches Slack's model. Full-laptop-reopen falls through to cold catch-up (acceptable).

Docs:

  • Spec: plan/feature-ws-session-resume.md
  • Impl plan: plan/impl-ws-session-resume.md

Size: ~450 lines production + ~250 lines tests. 5 tasks. Biggest of the five follow-ups.

Branch: lands on feat/ws-refactor. Must run LAST of the follow-ups — extends handle_user_ws's mpsc-writer restructure from #A2 and the presence hooks from #B2.

Explicitly out of scope (deferred to C-tier work):

  • Cross-server session resume (multi-instance pub/sub)
  • Disk-persistent buffer across server restarts
  • Per-tab sequence counters (we do per-user)
  • Client-side localStorage persistence of lastSeq
Author
Member

Implementation landed

Five commits on feat/ws-refactor:

  • 2faaba1 feat(ui): add SeqEvent wrapper for per-user sequence numbers
    Adds SeqEvent { seq: u64, #[serde(flatten)] event: UserEvent } to events.rs. Flatten preserves the existing client's data.type dispatch unchanged; data.seq is the new top-level field. Round-trip test asserts wire shape.
  • 3f57565 feat(ui): UserSession with seq counter + ring buffer (prep for resume)
    Replaces the UserWsMap value type from Sender<UserEvent> to Arc<UserSession>, where UserSession { sender: broadcast::Sender<SeqEvent>, next_seq: AtomicU64, buffer: Mutex<VecDeque<SeqEvent>> }. RESUME_BUFFER_SIZE = 500. fanout_to_users rewritten to assign the seq, append to the buffer (evicting FIFO on overflow), and dispatch. 3 new unit tests (monotonic seq, eviction, unknown-user noop). handle_user_ws session-acquire block adapted to Arc<UserSession>. A compressed sketch of SeqEvent, UserSession, and the fanout path follows this list.
  • dc7a17f feat(ui): handle_user_ws replays session buffer on ?resume_from
    user_ws_handler gains Query<ResumeQuery> extractor. handle_user_ws gains resume_from: Option<u64> param + replay block BEFORE tokio::select! (subscribe to sender first so live events queue during replay). too_old detection via from + 1 < oldest. no_buffer detection on empty buffer. On failure, sends {"type":"resume.failed","reason":...} frame and continues to live subscription (client falls through to cold catch-up).
  • fa4219f feat(ui): client tracks lastSeq + requests resume on reconnect
    state.lastSeq = 0 init. connectWebSocket appends ?resume_from=${state.lastSeq} when wsHasConnectedBefore && lastSeq > 0. handleWsEvent updates lastSeq on every frame that carries a seq field, BEFORE the switch (so every seq-bearing frame is tracked regardless of type, while seq-less frames such as ping/pong and resume.failed stay untracked). New case 'resume.failed': resets lastSeq and calls onWsReconnected().
  • 7c636d2 test(ui): session-resume test scaffold — fixture TODO, unit tests + dogfood primary
    3 #[ignore]'d integration tests (replay with second-tab-alive, too_old failure, no_buffer failure). Same fixture-deferral strategy as A2.3 heartbeat; shared hero_collab_ui integration harness is a future task.
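
For orientation, here is a compressed sketch of the two core pieces the first two commits add (the SeqEvent wire wrapper and the per-user UserSession) plus the fanout step that stamps the sequence number and appends to the ring buffer. The UserEvent variant and its tagging are assumptions made to keep the example self-contained; the real enum lives in events.rs.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

use serde::Serialize;
use tokio::sync::broadcast;

const RESUME_BUFFER_SIZE: usize = 500;

// Wire wrapper: #[serde(flatten)] keeps the client's existing data.type
// dispatch untouched; seq is the only new top-level field.
#[derive(Clone, Serialize)]
struct SeqEvent {
    seq: u64,
    #[serde(flatten)]
    event: UserEvent,
}

// Stand-in for the real UserEvent enum (assumed internally tagged on "type").
#[derive(Clone, Serialize)]
#[serde(tag = "type")]
enum UserEvent {
    #[serde(rename = "message.created")]
    MessageCreated { channel_id: String },
    // ... other variants elided
}

struct UserSession {
    sender: broadcast::Sender<SeqEvent>,
    next_seq: AtomicU64,
    buffer: Mutex<VecDeque<SeqEvent>>,
}

impl UserSession {
    // Fanout path: assign the next seq, append to the ring buffer (FIFO
    // eviction once RESUME_BUFFER_SIZE is reached), then dispatch to all tabs.
    fn push(&self, event: UserEvent) {
        let seq = self.next_seq.fetch_add(1, Ordering::Relaxed) + 1; // first event is seq 1
        let seq_event = SeqEvent { seq, event };

        let mut buf = self.buffer.lock().unwrap();
        if buf.len() == RESUME_BUFFER_SIZE {
            buf.pop_front();
        }
        buf.push_back(seq_event.clone());
        drop(buf);

        // Err only means no tab is currently subscribed; the event stays buffered.
        let _ = self.sender.send(seq_event);
    }
}
```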

Architectural decisions pinned

  • Per-user, not per-tab, sequence counter. Slack uses per-session (≈ per-connection); Discord uses per-session with a session_id. Per-user collapses to our existing user_ws map structure. All tabs for one user agree on the seq ordering — simpler state model.
  • Session lifetime = "at least one connected tab". Matches Slack. When the last tab disconnects, the user_ws map entry is removed (current post-refactor behavior from P6.2), buffer + counter die. Fresh session on next connect starts at seq 1. Full-laptop-reopen falls through to cold catch-up via resume.failed.
  • Buffer size = 500 per user. At 10 events/sec (pessimistic), ~50s of history — enough for any tab-level blip. Memory: ~250 KB per user worst case. Tunable via the RESUME_BUFFER_SIZE constant.
  • rx.subscribe() ordering is critical. Must happen BEFORE reading the buffer for replay, so any live events dispatched during replay queue on rx and deliver AFTER replay completes. The relevant session.sender.subscribe() call is at line ~506 of routes.rs; the replay block is at ~541. Order preserved. Client-side seq tracking dedups any edge-case overlap (sketched after this list).
  • #[serde(flatten)] on SeqEvent preserves the pre-existing wire contract — data.type still dispatches correctly via the client's switch. data.seq is purely additive.
  • u64 seq is effectively infinite (585k years at 1M events/sec/user) — no wrap handling.
  • In-memory only on the client. state.lastSeq is NOT persisted to localStorage — per-tab-session scope. A full reload = fresh session + cold catch-up on first message list fetch.
  • Single-tab blip accepted. If a user has ONE tab and it blips (loses network for 5s), the map entry is destroyed on teardown, so buffer is lost. Reconnect → resume.failed: no_buffer → cold catch-up. Could be improved with a 30s "lingering session" timer on teardown (see spec §Single-tab blip case), but accepted for MVP — single-tab users tolerate a catch-up round-trip on blip.
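
A simplified sketch of that subscribe-then-replay ordering, reusing the UserSession and SeqEvent types sketched earlier (the real handler in routes.rs also drives ping/pong, client frames, and teardown via tokio::select!):

```rust
use std::sync::Arc;

use axum::extract::ws::{Message, WebSocket};

async fn handle_user_ws(mut socket: WebSocket, session: Arc<UserSession>, resume_from: Option<u64>) {
    // 1. Subscribe BEFORE touching the buffer: anything fanned out while we
    //    replay queues on rx and is delivered after the replay completes.
    let mut rx = session.sender.subscribe();

    // 2. Replay everything newer than resume_from from the ring buffer.
    //    (too_old / no_buffer detection and the resume.failed frame omitted.)
    if let Some(from) = resume_from {
        let snapshot: Vec<SeqEvent> = session
            .buffer
            .lock()
            .unwrap()
            .iter()
            .filter(|e| e.seq > from)
            .cloned()
            .collect();
        for ev in snapshot {
            let json = serde_json::to_string(&ev).expect("SeqEvent serializes");
            let _ = socket.send(Message::Text(json.into())).await;
        }
    }

    // 3. Live loop (the real code runs this inside tokio::select! alongside
    //    socket reads and shutdown handling).
    while let Ok(ev) = rx.recv().await {
        let json = serde_json::to_string(&ev).expect("SeqEvent serializes");
        if socket.send(Message::Text(json.into())).await.is_err() {
            break;
        }
    }
}
```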

Explicit non-goals (deferred to C-tier work)

  • Cross-server session resume (multi-instance pub/sub → events.sock → gateway shard).
  • Disk-persistent buffer across server restarts.
  • Per-tab seq counters (Discord's model).
  • Client-side localStorage persistence of lastSeq.

Tests impact

cargo test -p hero_collab_ui: 8 unit (SeqEvent serde + 3 UserSession tests added; path-env test retained) + 3 ignored (session_resume scaffold) + 3 ignored (heartbeat scaffold from #16). Zero regressions.

cargo test -p hero_collab_server: 68 integration (unchanged — no server-side change in B1).
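
To illustrate what the new UserSession unit tests check, here is a hypothetical shape of the eviction test, built on the UserSession sketch above; the actual test names and assertions in the crate may differ.

```rust
use std::collections::VecDeque;
use std::sync::atomic::AtomicU64;
use std::sync::Mutex;

#[test]
fn ring_buffer_evicts_oldest_when_full() {
    let (tx, _rx) = tokio::sync::broadcast::channel(16);
    let session = UserSession {
        sender: tx,
        next_seq: AtomicU64::new(0),
        buffer: Mutex::new(VecDeque::new()),
    };

    // Push 10 more events than the buffer holds.
    for i in 0..(RESUME_BUFFER_SIZE + 10) {
        session.push(UserEvent::MessageCreated { channel_id: format!("c{i}") });
    }

    let buf = session.buffer.lock().unwrap();
    assert_eq!(buf.len(), RESUME_BUFFER_SIZE);
    // The first 10 events (seq 1..=10) were evicted FIFO; the oldest kept is seq 11.
    assert_eq!(buf.front().unwrap().seq, 11);
}
```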

Manual dogfood gate

Primary verification:

  1. Open two tabs for the same user. Close tab 1 (tab 2 keeps session alive).
  2. Have another user send a few messages.
  3. Reopen tab 1. Messages appear immediately; the DevTools Network panel shows that the WS URL includes ?resume_from=N. No catch-up spinner.

Single-tab blip case: reload the page → expect a fresh connect (no resume_from) → cold catch-up via onWsReconnected → all messages reconcile within one RPC round-trip.
