Session resume with sequence numbers + ring buffer (B1 — WS refactor follow-up) #18
Reference: lhumina_code/hero_collab#18
Follow-up to #13. Closes gap B1 from the post-refactor architectural assessment.
Problem: on a brief WS drop, the refactor's `onWsReconnected` catch-up refetches unread counts + mentions + the current channel's last 100 messages. But events in other channels during the drop only surface on the next 300s `pollNotifications` tick, and non-message events (reactions, pins, presence flips, `channel.added`) aren't in any catch-up RPC at all; they are simply lost until the next natural refresh.

Solution: per-user monotonic sequence numbers + an in-memory ring buffer (500 events per user). On reconnect, the client sends `?resume_from=N`; the server replays everything with `seq > N` from the buffer. A buffer miss (too old, or the session was destroyed) returns a `resume.failed` frame; the client falls through to the existing cold catch-up.

Session lifetime = "at least one connected tab", matching Slack's model. A full-laptop-reopen falls through to cold catch-up (acceptable).
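As a sketch of the mechanism: a hypothetical single-threaded reduction of the session state described above. The real implementation wraps this in `Arc`/`Mutex` plus a tokio broadcast sender, and `payload: String` stands in for the real `UserEvent`.

```rust
use std::collections::VecDeque;

const RESUME_BUFFER_SIZE: usize = 500;

#[derive(Clone, Debug, PartialEq)]
struct SeqEvent {
    seq: u64,
    payload: String, // stands in for the real UserEvent
}

struct UserSession {
    next_seq: u64,
    buffer: VecDeque<SeqEvent>,
}

impl UserSession {
    fn new() -> Self {
        Self { next_seq: 1, buffer: VecDeque::new() }
    }

    /// Assign the next monotonic seq, append to the ring buffer
    /// (evicting FIFO on overflow), and return the stamped event.
    fn publish(&mut self, payload: &str) -> SeqEvent {
        let ev = SeqEvent { seq: self.next_seq, payload: payload.to_string() };
        self.next_seq += 1;
        if self.buffer.len() == RESUME_BUFFER_SIZE {
            self.buffer.pop_front(); // FIFO eviction
        }
        self.buffer.push_back(ev.clone());
        ev
    }

    /// Replay everything with seq > from, or report why we can't.
    fn resume(&self, from: u64) -> Result<Vec<SeqEvent>, &'static str> {
        let oldest = match self.buffer.front() {
            None => return Err("no_buffer"),
            Some(ev) => ev.seq,
        };
        // The client's next expected event is from + 1; if that has
        // already been evicted, the gap can't be filled from the buffer.
        if from + 1 < oldest {
            return Err("too_old");
        }
        Ok(self.buffer.iter().filter(|e| e.seq > from).cloned().collect())
    }
}

fn main() {
    let mut s = UserSession::new();
    for i in 0..5 {
        s.publish(&format!("event-{i}"));
    }
    // Client dropped after seq 3: replay 4 and 5.
    let replayed = s.resume(3).unwrap();
    assert_eq!(replayed.iter().map(|e| e.seq).collect::<Vec<_>>(), vec![4, 5]);
    // A session that was torn down and recreated has an empty buffer.
    assert_eq!(UserSession::new().resume(0).unwrap_err(), "no_buffer");
    println!("replayed {} events", replayed.len()); // prints: replayed 2 events
}
```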
Docs:

- `plan/feature-ws-session-resume.md`
- `plan/impl-ws-session-resume.md`

Size: ~450 lines production + ~250 lines tests. 5 tasks. Biggest of the five follow-ups.
Branch: lands on `feat/ws-refactor`. Must run LAST of the follow-ups: it extends `handle_user_ws`'s mpsc-writer restructure from #A2 and the presence hooks from #B2.

Explicitly out of scope (deferred to C-tier work): `localStorage` persistence of `lastSeq`.

Implementation landed
Five commits on `feat/ws-refactor`:

- `2faaba1` feat(ui): add SeqEvent wrapper for per-user sequence numbers. Adds `SeqEvent { seq: u64, #[serde(flatten)] event: UserEvent }` to `events.rs`. Flatten preserves the existing client's `data.type` dispatch unchanged; `data.seq` is the new top-level field. A round-trip test asserts the wire shape.
- `3f57565` feat(ui): UserSession with seq counter + ring buffer (prep for resume). Replaces the `UserWsMap` value type from `Sender<UserEvent>` to `Arc<UserSession>`, where `UserSession { sender: broadcast::Sender<SeqEvent>, next_seq: AtomicU64, buffer: Mutex<VecDeque<SeqEvent>> }` and `RESUME_BUFFER_SIZE = 500`. `fanout_to_users` is rewritten to assign a seq, append to the buffer (evicting FIFO on overflow), and dispatch. 3 new unit tests (monotonic seq, eviction, unknown-user noop). The `handle_user_ws` session-acquire block is adapted to `Arc<UserSession>`.
- `dc7a17f` feat(ui): handle_user_ws replays session buffer on ?resume_from. `user_ws_handler` gains a `Query<ResumeQuery>` extractor. `handle_user_ws` gains a `resume_from: Option<u64>` param plus a replay block BEFORE `tokio::select!` (subscribe to the sender first so live events queue during replay). `too_old` detection via `from + 1 < oldest`; `no_buffer` detection on an empty buffer. On failure, sends a `{"type":"resume.failed","reason":...}` frame and continues to the live subscription (the client falls through to cold catch-up).
- `fa4219f` feat(ui): client tracks lastSeq + requests resume on reconnect. `state.lastSeq = 0` init. `connectWebSocket` appends `?resume_from=${state.lastSeq}` when `wsHasConnectedBefore && lastSeq > 0`. `handleWsEvent` updates `lastSeq` on every frame with a `seq` field (BEFORE the switch, so ping/pong/resume.failed are correctly untracked). New `case 'resume.failed':` resets lastSeq and calls `onWsReconnected()`.
- `7c636d2` test(ui): session-resume test scaffold: fixture TODO, unit tests + dogfood primary. 3 `#[ignore]`'d integration tests (replay with second tab alive, too_old failure, no_buffer failure). Same fixture-deferral strategy as A2.3 heartbeat; a shared `hero_collab_ui` integration harness is a future task.

Architectural decisions pinned
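One of the decisions pinned in this section is that the live subscription must be taken before the buffer is read for replay. A minimal sketch of why that ordering closes the race, using a std mpsc channel as a stand-in for the real tokio broadcast sender; the numbers are illustrative, not from the codebase:

```rust
use std::sync::mpsc;

fn main() {
    let buffer: Vec<u64> = vec![1, 2, 3]; // seqs already in the ring buffer
    let resume_from: u64 = 0;

    // 1. Subscribe BEFORE reading the buffer.
    let (tx, rx) = mpsc::channel::<u64>();

    // 2. A live event is dispatched while we are still replaying;
    //    because we already hold a receiver, it queues instead of being lost.
    tx.send(4).unwrap();

    // 3. Replay the buffer (everything with seq > resume_from).
    let mut delivered: Vec<u64> =
        buffer.iter().copied().filter(|&s| s > resume_from).collect();

    // 4. Drain queued live events; they land strictly after the replay,
    //    and client-side seq tracking dedups any edge-case overlap.
    while let Ok(seq) = rx.try_recv() {
        delivered.push(seq);
    }
    println!("{delivered:?}"); // prints: [1, 2, 3, 4]
    assert_eq!(delivered, vec![1, 2, 3, 4]);
}
```

Subscribing after the replay instead would open a window where an event is neither in the replayed slice nor queued on the receiver.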
- Per-user sequence space, not per-`session_id`. Per-user collapses to our existing `user_ws` map structure, and all tabs for one user agree on the seq ordering: a simpler state model.
- When the last `user_ws` map entry is removed (current post-refactor behavior from P6.2), the buffer + counter die. A fresh session on the next connect starts at seq 1. Full-laptop-reopen falls through to cold catch-up via `resume.failed`.
- Buffer size is pinned by the `RESUME_BUFFER_SIZE` constant.
- The `rx.subscribe()` ordering is critical. It must happen BEFORE reading the buffer for replay, so any live events dispatched during replay queue on `rx` and deliver AFTER replay completes. The relevant `session.sender.subscribe()` call is at line ~506 of `routes.rs`; the replay block is at ~541. Order preserved. Client-side seq tracking dedups any edge-case overlap.
- `#[serde(flatten)]` on SeqEvent preserves the pre-existing wire contract: `data.type` still dispatches correctly via the client's switch, and `data.seq` is purely additive.
- A `u64` seq is effectively infinite (585k years at 1M events/sec/user), so no wrap handling.
- `state.lastSeq` is NOT persisted to localStorage; it is scoped to the tab session. A full reload = fresh session + cold catch-up on the first message-list fetch.
- `resume.failed: no_buffer` → cold catch-up. This could be improved with a 30s "lingering session" timer on teardown (see spec §Single-tab blip case), but is accepted for MVP: single-tab users tolerate a catch-up round-trip on a blip.

Explicit non-goals (deferred to C-tier work)
- … (`events.sock` → gateway shard).
- `localStorage` persistence of `lastSeq`.

Tests impact
- `cargo test -p hero_collab_ui`: 8 unit (SeqEvent serde + 3 UserSession tests added; path-env test retained) + 3 ignored (session_resume scaffold) + 3 ignored (heartbeat scaffold from #16). Zero regressions.
- `cargo test -p hero_collab_server`: 68 integration (unchanged; no server-side change in B1).

Manual dogfood gate
Primary verification: … reconnects with `?resume_from=N`. No catch-up spinner.

Single-tab blip case: reload the page → expect a fresh connect (no resume_from) → cold catch-up via `onWsReconnected` → all messages reconcile within one RPC round-trip.