Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish #19
Reference: lhumina_code/hero_collab!19
Summary
Full rewrite of the `hero_collab` realtime transport from a browser-driven per-channel broadcast relay to a server-authoritative per-user WebSocket model with session resume, application-level heartbeat, first-class presence, and bounded rate limiting — i.e. the Slack / Discord / Matrix architectural shape.

Closes #13. Implements #14, #15, #16, #17, #18.
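A minimal sketch of the session-resume bookkeeping summarized above: a per-user sequence counter plus a 500-event ring buffer, matching the `UserSession` described under Architecture. The field names, eviction policy, and replay API here are assumptions for illustration, not code from the diff.

```rust
use std::collections::VecDeque;

// Assumed capacity, per the "500-event ring buffer" in the architecture notes.
const RING_CAPACITY: usize = 500;

// Hypothetical per-user session state: monotonically increasing seq counter
// plus a bounded buffer of recent (seq, serialized event) pairs.
struct UserSession {
    next_seq: u64,
    ring: VecDeque<(u64, String)>,
}

impl UserSession {
    fn new() -> Self {
        Self { next_seq: 1, ring: VecDeque::with_capacity(RING_CAPACITY) }
    }

    // Stamp an event with the next sequence number and retain it in the ring,
    // evicting the oldest entry once the buffer is full.
    fn push(&mut self, event: String) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        if self.ring.len() == RING_CAPACITY {
            self.ring.pop_front();
        }
        self.ring.push_back((seq, event));
        seq
    }

    // Warm resume: return the events after `resume_from` if they are all still
    // buffered; None means the client must fall back to a cold catch-up.
    fn replay_after(&self, resume_from: u64) -> Option<Vec<&str>> {
        let oldest = self.ring.front().map(|(s, _)| *s)?;
        if resume_from + 1 < oldest {
            return None; // gap: some events were already evicted
        }
        Some(
            self.ring
                .iter()
                .filter(|(s, _)| *s > resume_from)
                .map(|(_, e)| e.as_str())
                .collect(),
        )
    }
}
```

On reconnect, a tab that last saw seq N would request `replay_after(N)`; a `None` result maps onto the cold catch-up path the test plan exercises.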
Bugs this fixes (from dogfooding)

- `message.send` and broadcast: nobody else learned of the message live.

Architecture
Before: the browser publishes every event over its own WS; `hero_collab_ui` is a dumb per-channel broadcast relay; `hero_collab_server` never touches WebSockets. Fragile.

After: `hero_collab_server` owns fanout authoritatively. A new internal Unix socket, `events.sock`, carries length-prefixed JSON envelopes from server to `hero_collab_ui`. `hero_collab_ui` maintains a per-user `UserSession` (broadcast Sender + seq counter + 500-event ring buffer) and relays to a single browser WebSocket per tab at `/ws/user/{user_id}`. Every RPC handler dispatches via a unified `fanout::dispatch(state, audience, event)` helper after persistence commits.

Follow-ups (all bundled on this branch by design — they close gaps the refactor itself opened or accepted as MVP debt)
- `typing.relay` rate limit — closes the amplification vector introduced by the new fanout path.
- `since_id` filter on `message.list` — removes the reconnect-overfetch TODO.
- `beforeunload` beacon workarounds.

Scale and scope
New server module (`fanout.rs`); five new UI modules (`events.rs`, `event_subscriber.rs`, `auth.rs`, plus the `UserSession` struct, plus the route). The old per-channel route, `AppState.channels` / `channel_conn_counts`, `openBackgroundWs` / `syncBackgroundSubscriptions`, and all client-emitted outbound message/huddle events are fully removed.

Ecosystem surface
`openrpc.json` updated: `since_id` on `message.list`; `typing.relay`, `presence.set_status_text`, `presence.mark_connection` registered; `presence.update` marked DEPRECATED (kept for one release as an alias to `set_status_text`). The generated client (`openrpc.client.generated.rs`) auto-regenerated via proc-macro, with new typed input/output structs for all new methods. The CLI (`hero_collab` binary) is unchanged — no Presence/Typing subcommands exist today, and existing subcommands (Workspace / User / Channel / Message / Huddle / Canvas / etc.) keep working since the follow-ups were additive. `hero_skills` require no changes.

Deferred to future follow-ups
- `events.sock` → NATS/Redis adapter: only matters once we run multiple `hero_collab_server` replicas. The spec documents the migration path.
- Envelope versioning (`v: u8` on envelopes): only needed when we do a breaking wire change and want lockstep client/server deploys to no longer be required.
- Shared `hero_collab_ui` integration-test fixture: two test files ship `#[ignore]`'d (`heartbeat.rs`, `session_resume.rs`). Manual browser dogfood is the current primary gate; a shared fixture will amortize when the next reliability feature also needs it.

Test plan
Automated
- `cargo build --workspace` — clean (only pre-existing `hero_collab_app` unused-mut warnings, unrelated)
- `cargo test --workspace` — 132 passing, 0 failed, 6 ignored
  - `hero_collab` service_def: 1
  - `hero_collab_app` integration: 2
  - `hero_collab_server` unit: 56
  - `hero_collab_server` integration: 68 (was 42 before this branch; +26 for this work)
  - `hero_collab_ui` unit: 8
  - `hero_collab_ui` integration: 6 ignored (`tests/heartbeat.rs` + `tests/session_resume.rs` — fixture deferred)
  - `hero_collab_sdk` doctest: 1

Manual browser dogfood (primary gate for realtime behaviour)
Core realtime flows:
Heartbeat (A2 / #16):
Presence (B2 / #17):
Session resume (B1 / #18):
- `?resume_from=N` on the WS URL. No catch-up spinner.
- No `?resume_from` (in-memory state lost). Cold catch-up via `onWsReconnected` repopulates unread counts + mentions + current channel.

WS drop recovery:
SDK / CLI smoke
- `hero_collab health` works (unchanged CLI path).
- `hero_collab message list --channel-id X --limit 50` works (existing path; `since_id` is optional).

🤖 Generated with Claude Code
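The Architecture section says `events.sock` carries length-prefixed JSON envelopes from server to UI. A minimal framing sketch, assuming a 4-byte big-endian length prefix; the PR does not specify the prefix width or endianness, so treat both as placeholders:

```rust
use std::io::{self, Read, Write};

// Hypothetical framing for the server -> UI event stream: each envelope is a
// JSON blob preceded by a 4-byte big-endian length.
fn write_envelope<W: Write>(w: &mut W, json: &str) -> io::Result<()> {
    let bytes = json.as_bytes();
    w.write_all(&(bytes.len() as u32).to_be_bytes())?;
    w.write_all(bytes)
}

// Read one envelope: length prefix first, then exactly that many body bytes.
fn read_envelope<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    String::from_utf8(body).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

In a round-trip, an in-memory buffer stands in for the Unix socket: write an envelope into a `Vec<u8>`, then read it back through a `Cursor`.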
Centralizes dev-mode state in a new `crates/hero_collab_ui/src/auth.rs` module mirroring the server's `handlers/permissions.rs` pattern: `init_dev_mode` publishes once at boot, `is_dev_mode` reads in O(1), and the check fails closed if uninitialized. Replaces the two `std::env::var("COLLAB_AUTH_MODE")` reads in `routes.rs` (chat + canvas WS handlers) with `crate::auth::is_dev_mode()` so future WS routes can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire `fanout::dispatch` into five more `message.*` handlers (P7.3):

- `message.update` -> `UserEvent::MessageUpdated { message }`
- `message.delete` -> `UserEvent::MessageDeleted { message_id, channel_id }`
- `message.react` -> `UserEvent::MessageReacted { message_id, channel_id, reactions }`
- `message.unreact` -> `UserEvent::MessageReacted { ... }` (only when a row was actually removed)
- `message.toggle_react` -> `UserEvent::MessageReacted { ... }` on both the add and remove branches

Each dispatches to `Audience::ChannelMembers(channel_id)` after its persistence completes. The `MutexGuard` `db` is released (`drop(db)`) before every dispatch call because `fanout::dispatch` re-acquires `state.db` and `parking_lot::Mutex` is non-reentrant (a recursive lock would panic) — same pattern as `message.send` from P7.2 (cf247ac). For the three reaction handlers the dispatched `reactions` payload is fetched fresh from the DB AFTER the INSERT/DELETE (post-mutation state) via the existing `fetch_reactions_for_messages` helper, so subscribers see the authoritative post-change array rather than the mutation the client just submitted. `channel_id` is extracted from the message blob (update reuses the already-parsed `obj`; delete reuses the `ch_id` already read for activity logging; react/unreact/toggle_react query `json_extract(data, '$.channel_id')` from the messages row once before dropping the guard).

Tests: five new integration tests exercise every path against real `events.sock` envelopes. Each seeds the shared workspace + 2-users + channel + msg fixture, drains the seed's `message.created` envelope, then asserts the expected type and payload. Full server suite: 49 integration + 53 unit, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a DM channel is created, inline the `peer_ids` as members and dispatch a `ChannelAdded` envelope to the full member set so Bob's sidebar picks up a DM started by Alice without a refresh (Phase 7.5).

- Add `peer_ids: Option<Vec<u64>>` to the `ChannelCreate` input.
- Insert peer membership rows alongside the creator auto-add, inside the existing db lock.
- After `drop(db)`, dispatch `UserEvent::ChannelAdded { channel: obj }` to `Audience::Users(member_ids)`, where `member_ids` is read from the just-persisted `channel_members` rows (authoritative, handles the INSERT-OR-IGNORE dedupe case).
- Non-DM channels keep their prior discovery semantics — no dispatch.

Emit `ReadUpdated { user_id, channel_id, last_read_msg }` via the fanout seam after a successful `read.mark` write, with `Audience::User(user_id)` so only the user's own tabs receive it. This lets multi-tab clients synchronise `state.readCursors[channel_id]` without leaking personal read state to other channel members. The DB guard is dropped before the dispatch call (`parking_lot` is non-reentrant, and `dispatch` re-locks `state.db` internally). Adds one integration test (60 total) that asserts the envelope's recipients vector equals `[alice_id]` and that the event payload carries the expected `user_id`, `channel_id`, and `last_read_msg`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Unblocks `cargo test --workspace` by catching up four files that still referenced the pre-rename SDK client type (examples + SDK doctest), and removes the `typing.relay` openrpc entry's `summary` field, which was a single-entry style drift vs the 50+ existing methods.
Also adapts examples/integration tests to the current client API surface: `.health()` no longer exists, so calls route through `rpc_health(RpcHealthInput {})` and access the returned `Option<String>` fields via `as_deref()`. Restores the `SERVER_SERVICE` constant that a prior socket-path refactor (26e925f) renamed to `SERVICE_NAME` but left dangling in zinit commands, so `integration.rs` compiles again. Regenerates `openrpc.client.generated.rs` to reflect `openrpc.json` drift since df45bcc (the `peer_ids` addition in d404f71 and the `typing.relay` summary removal above). No runtime behaviour changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the background huddle reaper force-ends a ghost huddle (LiveKit reports the room empty and no late-joiner rows survived the ghost delete), peers in the channel previously had no event to prune the sidebar indicator — they had to wait for their next `huddle.list_active` poll or a page reload. This wires `fanout::dispatch` into the reaper's commit path so `HuddleEnded { room_id }` reaches `ChannelMembers(cid)` as soon as the transaction commits. The db guard is dropped before dispatch (`parking_lot` is non-reentrant and the `ChannelMembers` resolver re-locks `state.db`). `channel_id` is read from the `room_obj` JSON; legacy rows that lack it silently no-op, matching the P7.3/P7.4 channel_id-gate pattern. The existing `info!` log is preserved.

No unit test for this path: `force_end_huddle` takes `&AppState`, and building a minimal `AppState` in-tree requires synthesising ~14 fields (`StorageBackend` trait object, `LiveKitConfig`, etc.) — a larger yak than the payoff for a two-line dispatch that mirrors the already-tested `handlers::huddle::leave` branch. The `dispatch` → `channel_members` audience resolution is covered by `fanout::audience_tests`; the full reaper entry path (`reconcile_once` → `list_participants` → `force_end`) is exercised by Phase 10/11 dogfood since it needs a LiveKit stub that reports zero participants.
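Several commits on this branch repeat the same rule: commit persistence, `drop(db)`, then call `fanout::dispatch`, because `dispatch` re-locks `state.db` and the mutex is non-reentrant. A toy illustration of that shape, using `std::sync::Mutex` (which would likewise deadlock on a recursive lock) instead of `parking_lot`, with a stand-in `AppState`; every name and type here is hypothetical:

```rust
use std::sync::Mutex;

// Minimal stand-in for the app state; the real code holds a SQLite handle.
struct AppState {
    db: Mutex<Vec<String>>,
}

// dispatch re-acquires state.db internally (e.g. to resolve the audience),
// which is why the caller must not still hold the guard.
fn dispatch(state: &AppState, event: &str) -> usize {
    let db = state.db.lock().unwrap(); // would deadlock if the caller held the lock
    let audience = db.len(); // pretend "audience resolution" = row count
    println!("fanout {event} to {audience} member(s)");
    audience
}

fn handle_message_update(state: &AppState, msg: &str) -> usize {
    let mut db = state.db.lock().unwrap();
    db.push(msg.to_string()); // persistence commits first...
    drop(db);                 // ...then the guard is released...
    dispatch(state, "message.updated") // ...before dispatch re-locks state.db
}
```

Moving the `drop(db)` below the `dispatch` call is the exact bug the commit messages warn about: the second `lock()` on the same thread never returns.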
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

P8.1 + P8.2. Rewrite `connectWebSocket` to open a single per-user WS at `/ws/user/{user_id}` (replacing the previous per-channel `/ws/{channel_id}` foreground socket). The inbound message handler is extracted into a top-level `handleWsEvent` dispatcher that routes each `data.type` to a dedicated `on*` helper: `onMessageCreated` / `onMessageUpdated` / `onMessageDeleted` / `onMessageReacted` / `onMessagePinUpdated` (new) / `onTypingEvent` / `onHuddleEvent` / `onPresenceUpdate` / `onReadUpdated` / `onMentionCreated`, plus `channel.added` / `updated` / `removed` / `member_left` stubs for P8.4. `onMessageCreated` preserves the round-3 cross-channel render fix by routing off-screen events to unread-count bumps (skipping own sends) and rendering the sidebar instead of the message list. The mention regex is deliberately omitted here — P8.7 deletes it once the server push of `mention.created` is fully trusted. `onWsReconnected` is a stub; P8.3 wires it to `updateUnreadCounts()` + `pollNotifications()` + a `message.list` delta fetch. `scheduleReconnect` implements 1s→30s exponential backoff, replacing the old fixed 3s retry. Server-side auth closes (1008/1011/4xxx) still suppress retry. `bgWs`, outbound `state.ws.send`, and the client-side mention regex remain intact intentionally — dead-code cleanup is P8.5–P8.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under the per-user WS model, a transient disconnect stops ALL live delivery (unlike the old per-channel model, where a channel's `bgWs` could blip independently). Without catch-up, a 5s hiccup silently misses messages across every channel. `onWsReconnected` now fires three steps on every non-first connect:

1. `updateUnreadCounts()` — refresh badges across all channels.
2. `pollNotifications()` — re-run the mention poll-diff to surface any browser notifications missed while the WS was down.
3. Re-fetch the last 100 messages of the current channel and append any new ones via id-dedup.
Option B was chosen for step 3: the server's `message.list` RPC does not yet accept a `since_id` filter, so we re-fetch and dedup. TODO left inline — the follow-up task should add `since_id: Option<u64>` to the `MessageList` input and `WHERE id > since_id` to the query for efficiency.

changed title from "Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish" to "WIP: Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish"

Temporary `console.log` instrumentation on chat-app.js to diagnose the transient 2x unread-count doubling on channel (not DM) message receipt. Tagged `[DBG mc]` / `[DBG he]` / `[DBG uuc]` for easy filtering. Counters `window.__mcCount` / `__heCount` / `__uucCount` increment per handler entry so multi-delivery becomes visible.

To run the scenarios: Alice (browser 1) → viewing any channel, NOT the target. Bob (browser 2) → viewing any channel, NOT the target.

1. DevTools Console → filter: `[DBG`
2. Reset counters: `window.__mcCount=0; window.__heCount=0; window.__uucCount=0`
3. Scenario A (channel, expected to double): Alice sends ONE plain-text message (no @mention) to a channel where Bob is a member. Copy Bob's console output.
4. Scenario B (DM, expected correct): Alice sends ONE DM to Bob. Copy Bob's console output.
5. Scenario C (sender self-badge): Alice sends, stays viewing. Wait ~2s. Copy Alice's console.

To remove: revert this commit. The fix commit (4b37c1b) stands on its own. This commit is scratch — it should NOT reach the PR as a merged artifact; revert or drop it before marking the PR ready-for-review.

changed title from "WIP: Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish" to "Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish"
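The 1s→30s exponential backoff that `scheduleReconnect` implements can be sketched as a pure delay function. The client code is JavaScript, so this Rust mirror is illustrative only, and the doubling-with-cap policy (no jitter) is inferred from "1s→30s exponential backoff" rather than read from the diff:

```rust
use std::time::Duration;

// Hypothetical mirror of scheduleReconnect's backoff policy: 1s base, doubling
// per consecutive failed attempt, capped at 30s.
fn reconnect_delay(attempt: u32) -> Duration {
    let base = Duration::from_secs(1);
    let cap = Duration::from_secs(30);
    // Clamp the exponent so large attempt counts cannot overflow the shift;
    // 2^5 = 32 already exceeds the cap, so higher exponents are irrelevant.
    let factor = 1u32 << attempt.min(5);
    (base * factor).min(cap)
}
```

Attempts 0..=4 yield 1s, 2s, 4s, 8s, 16s; attempt 5 onward saturates at the 30s cap. Server-side auth closes (1008/1011/4xxx) would bypass this function entirely, since those suppress retry.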