Refactor realtime transport: one WS per user + server-side fanout (Slack/Mattermost model) #13
Context
The current realtime model is a dumb per-channel broadcast hub:
- `/ws/{channel_id}` with `state.channels: Mutex<HashMap<String, broadcast::Sender<String>>>` — crates/hero_collab_ui/src/routes.rs:418-478. Each channel id has its own `broadcast::Sender`; subscribers relay `message.created` to each other.
- `openBackgroundWs` / `state.bgWs` / `syncBackgroundSubscriptions` — crates/hero_collab_ui/static/js/chat-app.js:660-755. The client keeps one background WS per non-viewed channel so unreads and mentions update live.
- After a successful send, the sender's browser emits `message.created` on the WS (chat-app.js:1667). The server never fans out on its own — the RPC persists, the browser broadcasts.
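A minimal sketch of that hub shape, assuming tokio's `broadcast` channel and an async mutex (illustrative only; the real routes.rs wiring differs in detail):

```rust
use std::collections::HashMap;
use tokio::sync::{broadcast, Mutex};

/// Per-channel hub: one broadcast::Sender per channel id, created lazily.
struct AppState {
    channels: Mutex<HashMap<String, broadcast::Sender<String>>>,
}

impl AppState {
    /// Each /ws/{channel_id} upgrade subscribes to its channel's sender.
    async fn subscribe(&self, channel_id: &str) -> broadcast::Receiver<String> {
        self.channels
            .lock()
            .await
            .entry(channel_id.to_string())
            .or_insert_with(|| broadcast::channel(64).0)
            .subscribe()
    }

    /// A connected browser relays a frame (e.g. message.created) to its
    /// channel peers; the server never originates events in this model.
    async fn relay(&self, channel_id: &str, frame: String) {
        if let Some(tx) = self.channels.lock().await.get(channel_id) {
            let _ = tx.send(frame); // Err just means no one is listening
        }
    }
}
```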
How we got here
- `5f6166a` (despiegk, 2026-03-19) — introduced per-channel broadcast routing on the server.
- `8579db4` (mik-tf, 2026-04-12) — added chat-app.js with a single foreground WS.
- `2e8869d` (Sameh, 2026-04-18) — added the N-background-WS fanout on the client to make unreads and mentions update live for non-viewed channels.
Problems with the current model
- `message.created` is emitted from the sender's browser. If the sender's WS is closed (bad network, tab backgrounded), the row persists in the DB but no other client ever learns of it live — they see it only on the next `message.list` fetch.
- Every client has to filter its own events back out (`user_id === me` on the bg WS). Any gap in the filter chain produces phantom unreads on the sender's side (see related bug: unread badge showing on sender after their own send — chat-app.js:766 counts own messages against the read cursor).

Proposed direction (to be brainstormed)
Move to a per-user transport, like Slack / Mattermost:
- One WS per user at `/ws/user/{user_id}` (or similar).
- `message.send` (and other state-changing RPCs) do the fanout server-side: on success, look up channel membership and push a `message.created` event to each member's user connection. Skip the sender.
- Client-side relaying via `state.ws.send` goes away.
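A sketch of what that server-side fanout could look like; `UserHub`, `fanout_message_created`, and the `members` slice are hypothetical names for illustration, not code from the repo:

```rust
use std::collections::HashMap;
use tokio::sync::{broadcast, RwLock};

type UserId = i64;

/// Hypothetical per-user registry replacing the per-channel map:
/// keyed by user id, one Sender shared by all of that user's sockets.
struct UserHub {
    user_ws: RwLock<HashMap<UserId, broadcast::Sender<String>>>,
}

impl UserHub {
    /// Server-side fanout after a successful message.send: resolve channel
    /// membership (a DB query, elided here), push to every member but the
    /// sender.
    async fn fanout_message_created(&self, members: &[UserId], sender: UserId, event: &str) {
        let map = self.user_ws.read().await;
        for uid in members.iter().filter(|&&u| u != sender) {
            if let Some(tx) = map.get(uid) {
                // Offline members simply have no live Sender entry;
                // they catch up from the DB on their next fetch.
                let _ = tx.send(event.to_string());
            }
        }
    }
}
```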
Benefits

Open questions for brainstorm
- How much of the routes.rs broadcast infra is reusable vs. a rewrite? The tokio `broadcast::Sender` primitive likely survives, just rekeyed by user id. The HTTP upgrade path + auth gate are reusable. Per-channel membership checks in `ws_handler` (routes.rs:400-410) go away — replaced by a single "is this my user id" check.
- On the client, `state.bgWs`, `openBackgroundWs`, and `syncBackgroundSubscriptions` all collapse into a single connection + a switch on event type. `connectWebSocket` becomes connect-once at login, not per-channel-select.

Related bugs this would fix / touch
- Phantom unread badge on the sender after their own send (the `updateUnreadCounts` filter at chat-app.js:766 is a simpler, independent fix and doesn't need this refactor).

Scope
This issue is a placeholder for the refactor discussion. Next step is a brainstorm to settle the open questions above. No code changes yet.
Design complete — spec committed on feat/ws-refactor

Full design at `plan/feature-ws-refactor.md` (commit `62f3060`). Summary of the decisions made during brainstorming:

Scope + rollout
- Every chat state-change event rides the new transport (including `channel.added/updated/removed/member_left` and `message.pin_updated`). LiveKit's own WebRTC path untouched.
- Single-instance for now: one in-process `broadcast::Sender` on each binary, cross-process seam via one Unix socket. Multi-instance scaling is a separate future step.
Architecture: per-user `broadcast::Sender` + `events.sock` seam + `EventBus` wrapper
- `hero_collab_server` owns `events: EventBus` (a `broadcast::Sender<Envelope>` wrapper matching hero_router's `SseBroadcaster` shape). RPC handlers call `fanout::dispatch(state, audience, event)` after persistence; dispatch resolves the audience against the DB and pushes `Envelope { recipients: Vec<UserId>, event: UserEvent }`.
- `events.sock` is bound by `hero_collab_server` alongside `rpc.sock`. Length-prefixed JSON envelopes. Internal-only, never reachable via hero_router.
- `hero_collab_ui` runs an `event_subscriber` task that connects to `events.sock`, reads envelopes, and dispatches each event to recipients' per-user `broadcast::Sender<UserEvent>` entries in its `user_ws: RwLock<HashMap<UserId, Sender>>` map. The browser WS handler subscribes each tab to its user's Sender (`/ws/user/{user_id}`).
- Outbound-only from UI to client for every event except `typing.start`/`typing.stop`, which relay through a new `typing.relay` RPC on rpc.sock.
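A sketch of the `events.sock` framing under the stated "length-prefixed JSON envelopes" rule. The u32 big-endian prefix and the stubbed `UserEvent` payload are assumptions, not the committed wire format:

```rust
use serde::{Deserialize, Serialize};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;

type UserId = i64;

/// Envelope shape from the spec; the event is stubbed as raw JSON here.
#[derive(Serialize, Deserialize)]
struct Envelope {
    recipients: Vec<UserId>,
    event: serde_json::Value,
}

/// Write one length-prefixed JSON envelope (assumed: u32 big-endian header).
async fn write_envelope(sock: &mut UnixStream, env: &Envelope) -> std::io::Result<()> {
    let body = serde_json::to_vec(env).expect("envelope serializes");
    sock.write_all(&(body.len() as u32).to_be_bytes()).await?;
    sock.write_all(&body).await
}

/// Read one envelope: length header first, then exactly that many bytes.
async fn read_envelope(sock: &mut UnixStream) -> std::io::Result<Envelope> {
    let mut len = [0u8; 4];
    sock.read_exact(&mut len).await?;
    let mut body = vec![0u8; u32::from_be_bytes(len) as usize];
    sock.read_exact(&mut body).await?;
    serde_json::from_slice(&body)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
}
```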
Alternatives considered and rejected
- Calling `fanout::dispatch` directly in RPC handlers (same-process assumption) — `hero_collab_server` and `hero_collab_ui` can't share an in-process `broadcast::Sender`.
- A `_fanout` field piggybacked on RPC responses over rpc.sock — still needs a polling `rpc_call` for presence, a rate-limiter carve-out for typing, zero ecosystem precedent for envelope extension, and an awkward multi-instance migration path.
- Serving the WS from `hero_collab_server` — breaks the `_server`/`_ui` convention (internal-only vs internet-facing separation). Large blast radius.
- Merging the `_server` + `_ui` binaries.

Ecosystem precedents this design follows (verified, not invented)
- Multiple sockets bound by one `_server` binary, cleaned up in a shared loop — hero_aibroker_server/src/main.rs:175-189 already binds rpc.sock and rest.sock with a `for sock in [...] { create_dir_all + remove_file }` loop. hero_collab becomes the second instance.
- A `broadcast::Sender` wrapper struct — hero_router/src/server/sse.rs exposes `SseEvent` + `SseBroadcaster { tx: broadcast::Sender<String> }` with `new(capacity)`/`subscribe()`/`send(event)`. The new `EventBus` matches this shape exactly.
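For concreteness, an `EventBus` matching that `SseBroadcaster` shape might look like this (a sketch: the envelope fields come from the spec summary above, and the `UserEvent` variant is a placeholder):

```rust
use tokio::sync::broadcast;

/// Placeholder for the spec's UserEvent enum.
#[derive(Clone, Debug)]
enum UserEvent {
    MessageCreated { channel_id: String, message_id: i64 },
}

#[derive(Clone, Debug)]
struct Envelope {
    recipients: Vec<i64>,
    event: UserEvent,
}

/// Mirrors hero_router's SseBroadcaster shape:
/// new(capacity) / subscribe() / send(event).
struct EventBus {
    tx: broadcast::Sender<Envelope>,
}

impl EventBus {
    fn new(capacity: usize) -> Self {
        Self { tx: broadcast::channel(capacity).0 }
    }
    fn subscribe(&self) -> broadcast::Receiver<Envelope> {
        self.tx.subscribe()
    }
    fn send(&self, env: Envelope) {
        let _ = self.tx.send(env); // Err just means no subscriber yet
    }
}
```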
Key refinements caught in review
- Detect "first connection for this user" via `Entry::Vacant` under the write lock, not via a post-subscribe `receiver_count() == 1` check (see the sketch after this list).
- `typing.relay` must compare the claimed `user_id` with the authenticated path user_id so one user can't forge typing events as another.
- `rate_limit.rs` needs a new `exempt_methods` set with `typing.relay` in it; otherwise steady-state typing burns the 60/min global cap.
- Dispatch sites added during review: `channel.update`, `channel.member.remove` (two audiences: `ChannelRemoved` to the removed user + `ChannelMemberLeft` to the remaining members), `message.pin/unpin`, and `huddle_reaper`'s ghost-huddle termination. The reaper case is server-initiated fanout, demonstrating the design supports it cleanly (`0a17a52`).
- Reconnect catch-up on `onopen` after the first connect — calls `updateUnreadCounts()` + `pollNotifications()` + a delta-fetch of the current channel's messages. Otherwise a 5s WS blip silently drops events across all channels.
- `kill_other.socket` update in hero_collab/src/main.rs:215 to include events.sock, complementing the in-process `remove_file` cleanup.
- The `@mention` regex detection blocks at chat-app.js:711-736 and :2378-2410 (from `d4e8cef`) become redundant once the server dispatches `mention.created` directly.
- `pollNotifications` poll-diff kept as a disconnect-recovery safety net, cadence reduced 30s → 300s.
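A sketch of that first-connection check, assuming the `user_ws` map from the architecture section (`subscribe_user` and the channel capacity are illustrative):

```rust
use std::collections::{hash_map::Entry, HashMap};
use tokio::sync::{broadcast, RwLock};

type UserId = i64;

struct UserHub {
    user_ws: RwLock<HashMap<UserId, broadcast::Sender<String>>>,
}

/// Returns (receiver, first_connection). Deciding "first connection" via
/// Entry::Vacant while still holding the write lock is race-free: two tabs
/// connecting at once serialize on the lock, so exactly one sees Vacant.
/// A post-subscribe receiver_count() == 1 check could observe 2 for both.
async fn subscribe_user(hub: &UserHub, uid: UserId) -> (broadcast::Receiver<String>, bool) {
    let mut map = hub.user_ws.write().await;
    match map.entry(uid) {
        Entry::Vacant(slot) => {
            let (tx, rx) = broadcast::channel(64);
            slot.insert(tx);
            (rx, true) // first connection: e.g. flip presence to online
        }
        Entry::Occupied(slot) => (slot.get().subscribe(), false),
    }
}
```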
Bugs this refactor fixes
- Sender-offline message loss: if the sender's WS drops between `message.send` and the browser-side broadcast, nobody else learns of the message live.

What's next
Implementation plan (task-level breakdown, TDD steps, commit boundaries) to be written in a new session via superpowers:writing-plans against `plan/feature-ws-refactor.md`. Branch feat/ws-refactor is not yet pushed — push when ready for the plan-phase PR, or leave it local until the plan is also drafted.

Spec lives at: `plan/feature-ws-refactor.md` on feat/ws-refactor (1 ahead of origin/development, unpushed), commit `62f3060`.

Implementation complete; five follow-ups opened for the same branch
What landed
Branch feat/ws-refactor is now pushed to remote — 46 commits ahead of origin/development.

All 17 dispatch rows from the spec's §RPC handler dispatch sites table are wired:
- messages: `message.send` (+ per-mention `MentionCreated`), `update`, `delete`, `react/unreact/toggle_react`, `pin/unpin`
- channels: `channel.create`-DM (`ChannelAdded`), `member.add`, `member.remove` (dual dispatch: `ChannelRemoved` + `ChannelMemberLeft`), `update`
- huddles: `start`, `join`, `leave` (+ last-participant `HuddleEnded`), `huddle_reaper::force_end_huddle`
- misc: `read.mark`, `presence.update` (with documented two-producer race), `typing.relay` (server-authoritative `user_id`)

Verification
- `cargo test --workspace`: 126 tests, 0 failed, 0 ignored (server integration went 42 → 62 with +20 dispatch and reconnect tests; UI went 0 → 7; plus hero_collab service-def, SDK doctest, hero_collab_app).
- `cargo build --workspace`: clean (pre-existing hero_collab_app unused-mut warnings only).
- The `state.db` lock is explicitly released before any `fanout::dispatch` call — the non-reentrant parking_lot mutex would panic otherwise; grep-verified no exception across all 17 sites.
- New `/ws/user/{user_id}` route; the old `/ws/{channel_id}` is fully deleted alongside `AppState.channels`/`channel_conn_counts`.
- chat-app.js rewritten: one WS per tab, a `handleWsEvent` switch with per-case current-channel branching (preserves the round-3 cross-channel-render fix), `onopen`-after-reconnect catch-up; `bgWs` + outbound message sends + the mention regex are all deleted.
- `pollNotifications` cadence dropped 120s → 300s (push is primary now).
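The lock-release rule in practice — a sketch with stand-in types of the scoping pattern the grep verified (the real handlers and the `fanout::dispatch` signature live in hero_collab_server):

```rust
use parking_lot::Mutex;

struct Db; // stand-in for the real connection handle

struct State {
    db: Mutex<Db>,
}

/// Stand-in for fanout::dispatch: it resolves the audience against the DB,
/// so it takes the state.db lock again internally. parking_lot mutexes are
/// non-reentrant, so calling this while the handler still holds the lock
/// can never succeed.
fn dispatch(state: &State, _message_id: i64) {
    let _db = state.db.lock();
    // ... audience query + EventBus push ...
}

fn handle_message_send(state: &State) {
    // Persist inside a narrow scope so the guard drops at the closing brace.
    let message_id = {
        let _db = state.db.lock();
        42 // hypothetical INSERT returning the new row id
    }; // <- lock released HERE, before dispatch re-locks it
    dispatch(state, message_id);
}
```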
Pragmatic adjustments during execution (flagged, justified, no silent scope)
- `peer_ids` on `channel.create` (P7.5) — a server-side field so DM creation fans out to both members in one RPC. Scoped to `kind=="dm"` only; the `group_dm` asymmetry was reverted. Registered in openrpc.json. The UI still uses the 2-RPC add-member flow; `peer_ids` is available for a future client migration.
- hero_collab_server dev-dep on rusqlite (P7.12) — the presence integration test INSERTs `workspace_members` rows directly because `seed_channel_with_two_members` doesn't seed workspace membership. Functional; a helper refactor is future work.
- `2d47e28` — fixed pre-existing `CollabClient` → `HeroCollabClient` rename stragglers in hero_collab_examples + one SDK doctest so `cargo test --workspace` passes end-to-end. Unrelated to the refactor, but it blocked the full test matrix.
- Minor openrpc.json churn (a `summary` field added alongside `description` for `typing.relay`, later removed in `2d47e28` to match the file's 50+-method-entry convention).

Five follow-ups opened
After reviewing the transport against Slack / Discord / Matrix at the WebSocket layer, there are five reliability primitives still missing. All of them fit on this branch (same PR, same review unit) per the architectural cohesion argument — especially A3 (security-adjacent: the refactor introduced an amplification path, this closes it) and B2 (fixes a race we explicitly documented as accepted MVP debt).
- A3 — rate-limit `typing.relay`: closes the amplification vector the refactor introduced.
- A1 — `since_id` filter on `message.list`: removes the reconnect-overfetch TODO we left in the client.

Specs + impl plans for all five are committed in `2c77e94` (`plan/feature-*.md` + `plan/impl-*.md`). Execution order: A3 → A1 → B2 → A2 → B1 (later plans extend `handle_user_ws` restructures from earlier ones).

Estimated additional scope: ~1,000 lines production + ~750 lines tests across 15 tasks. Brings the branch from today's ~7,800 insertions to ~9,500. Still one coherent review unit.
What's next
- Execute the five follow-up plans via superpowers:subagent-driven-development (in progress).
- Then one PR against development; reference and close #13, #14, #15, #16, #17, #18 in the PR body.
- Do not push or merge from my end. The work stays on feat/ws-refactor until the owner authorizes the PR.

All five follow-ups landed; branch at 63 commits
Branch feat/ws-refactor is now pushed with the five reliability follow-ups wired. Full workspace test matrix: 132 passing, 0 failed, 6 ignored (the 6 are two `#[ignore]`'d integration-test scaffolds pending a shared hero_collab_ui fixture; manual dogfooding is the gate for both).

What landed per follow-up
- `ec91bda`, `f66e952`
- `bd19bf7`, `4187060`, `32f299c` — `WHERE id > ?` filter + dynamic SQL builder + SDK regen; fix commit: the client was not actually passing `since_id` despite the plan's claim — wired now
- `030ac25`, `4b8afd6`, `8415247`
- `505d7ef`, `ab29528`, `15bd2e7` — `presence.update` split into `set_status_text` (preserves bit) + `mark_connection` (server-authoritative); deprecated alias kept for one release; the client-side 60s heartbeat + beforeunload beacon are deleted (both were workarounds for the race we just fixed)
- `2faaba1`, `3f57565`, `dc7a17f`, `fa4219f`, `7c636d2` — `SeqEvent` wrapper, per-user `UserSession { sender, next_seq, buffer (500-entry FIFO) }`, `?resume_from=N` replay; the client tracks `lastSeq`, and `resume.failed` falls through to a cold catch-up
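A sketch of that resume machinery, assuming the field names from the summary above (the eviction policy and gap check are illustrative, not the committed implementation):

```rust
use std::collections::VecDeque;
use tokio::sync::broadcast;

const BUFFER_CAP: usize = 500; // the 500-entry FIFO from the spec

/// Event stamped with a per-user monotonically increasing sequence number.
#[derive(Clone)]
struct SeqEvent {
    seq: u64,
    payload: String,
}

/// Per-user session; surrounding locking elided.
struct UserSession {
    sender: broadcast::Sender<SeqEvent>,
    next_seq: u64,
    buffer: VecDeque<SeqEvent>, // last BUFFER_CAP events for replay
}

impl UserSession {
    /// Stamp, buffer, then fan out. Buffer first so a resume racing this
    /// send can still find the event.
    fn push(&mut self, payload: String) {
        let ev = SeqEvent { seq: self.next_seq, payload };
        self.next_seq += 1;
        if self.buffer.len() == BUFFER_CAP {
            self.buffer.pop_front();
        }
        self.buffer.push_back(ev.clone());
        let _ = self.sender.send(ev);
    }

    /// ?resume_from=N replay: return buffered events after N, or None if the
    /// events after N were already evicted — the client then does a cold
    /// catch-up (the resume.failed path).
    fn resume_from(&self, last_seq: u64) -> Option<Vec<SeqEvent>> {
        match self.buffer.front() {
            Some(oldest) if oldest.seq > last_seq + 1 => None, // gap: evicted
            _ => Some(self.buffer.iter().filter(|e| e.seq > last_seq).cloned().collect()),
        }
    }
}
```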
What got fixed late by the implementer that I missed

Three plan errors caught during execution (documenting for retrospective hygiene):
1. The plan said the `RateLimited` RPC error code was `-32029`. It's `-32005` (per rpc_error.rs). The implementer grepped and used the right value.
2. The plan said `raw_rpc` returned a full JSON-RPC envelope with `result`/`error` keys. It actually returns `Result<Value, OpenRpcError>`. The implementer matched on the error variant correctly.
3. The plan assumed the client was already passing `since_id` (per P8.3's intent). P8.3's executor had actually chosen the Option-B fallback and never wired it, so A1.2 as briefed would have left A1 inert on the client side. The implementer flagged the gap; I dispatched a fix commit (`32f299c`) to actually wire `since_id: lastId` into the RPC call.

All three were caught by the implementer reading source instead of trusting my plan. Plan docs updated in
`b6bb40d` so future executors don't repeat #1 and #2.

Architecture delta vs. pre-follow-up refactor
Closes the gaps from the Slack / Discord / Matrix comparison I did earlier. Remaining gaps (deferred to C-tier, documented in the spec): multi-instance pub/sub fan-out, a disk-persistent offline buffer, per-tab seq counters, and event-schema versioning. All are upgrade paths, not rewrites — `fanout::EventBus` + `UserSession` are the right shape to swap their backends.

Branch state
63 commits ahead of origin/development, pushed to remote.

What's left before PR
One PR against development; the body will reference and close #13, #14, #15, #16, #17, #18. Not opening the PR from my end; the owner authorizes.
Dogfood pass caught three bugs — all fixed on branch
Manual browser-level testing (Alice + Bob across two browsers, dev-seeded DB) surfaced three distinct regressions. All fixed and pushed.
Bug 1: sender sees stale unread badge on own current channel (~2s after send)
Root cause: `sendMessage` and `onMessageCreated`'s viewing-match branch both appended messages to `state.messages` + the DOM but never advanced the user's read cursor. The subsequent `updateUnreadCounts` RPC round-trip (~2s) read a stale cursor from the server, counted the just-seen message as "unread," and showed a badge. Affected channels AND DMs.

Fix: `4b37c1b` fix(ui): advance read cursor on send + receive-while-viewing. Added a shared `advanceReadCursor(channelId, messageId)` helper that updates `state.readCursors` client-side immediately and fires the `read.mark` RPC fire-and-forget. Called from both paths. Cursor-based unread counting is the right architecture; the invariant "every path that appends to the DOM must advance the cursor" is now satisfied.

Bug 2: receiver sees transient 2x unread on channel (but not DM) messages
Root cause: WebSocket churn during login. A runtime trace confirmed an identical `seq` delivered twice per event. `connectWebSocket`'s early-return guard only checked `readyState === OPEN`, not `CONNECTING`. When `selectChannel` called `connectWebSocket` while an earlier call's handshake was still in flight, the guard missed, the in-flight socket was closed, and a new one opened. Server-side, the old `handle_user_ws` hadn't finished teardown (TCP FIN still in flight), so briefly the user's session had TWO active `broadcast::Receiver`s. Events dispatched in that window were delivered via both receivers; client-side, the closing WebSocket's `onmessage` still fired for buffered frames, causing duplicate `handleWsEvent` execution.

Fix: `9f97db7` fix(ui): one WS per user. Three coordinated edits:
- `connectWebSocket`'s early-return now covers both the OPEN and CONNECTING states.
- `selectChannel` no longer calls `connectWebSocket` — that was legacy per-channel-WS behavior; under the per-user model the WS is session-scoped.
- `pickUser` now calls `connectWebSocket` once at login so the socket is established before the first `selectChannel` runs.

DMs weren't affected in testing only because by the time the DM tests ran, the WS state had stabilized on a single connection. The bug was a startup-window race.
Bug 3: DM picker's hover handler threw ReferenceError

Root cause (pre-existing, not introduced by the refactor): `setDmAcActive` was defined but missing from the `window.*` export block. Inline HTML `onmouseenter="setDmAcActive(${i})"` handlers execute in the global scope, so the function was unreachable. An audit of the neighboring DM-picker inline handlers (`pickDmUser`, `dmSearchKey`, `filterDmUsers`) confirmed those were already exported — only `setDmAcActive` was missing.

Fix: `ccff348` fix(ui): expose DM-picker inline-handler functions on window.

State of the PR
Revert "debug(ui): trace instrumentation..."); no[DBG ...]log statements remain.Post-refactor cleanup: upsert-dedup consolidation
Follow-up to commit `809c206` (fix(ui): dedup optimistic push against WS beat-to-the-punch). That commit caught a real bug — thread replies double-rendering because the server's WS fanout can arrive at the sender's own socket before the `message.send` RPC response resolves — but fixed it by inlining a three-line `some(id)`/`push` guard at the two optimistic-push sites. On review the pattern looked workaround-shaped: every future event type would need the same guard, and the "why" (two-path delivery by design) was documented only in the commit message.

Architectural re-check first. My initial reaction was "the server shouldn't echo to the sender." That would be wrong: per-user WS fanout is keyed by user, not by tab, so excluding the sender would starve their other tabs of the message. Dedup on the server id is the correct client-side behavior for our deliberate two-path delivery (RPC response for the active tab, WS fanout for the user's other tabs). Keeping the fanout to the sender; centralizing the dedup.
What landed (feat/ws-refactor, commits `d4320f5` → `8046439`):
- `upsertMainMessage(msg) → bool` / `upsertThreadReply(reply) → bool` helpers, with a block comment documenting the multi-tab rationale.
- Converted call sites: `sendMessage`, `sendThreadReply`, `onMessageCreated` (both the thread and main-feed branches), and the `onWsReconnected` catch-up delta-fetch.
- (`state.messages` only holds the currently-viewed channel, so checking an off-screen message's id there is tautologically false — the guard could never fire.)
- `plan/impl-upsert-dedup.md`.
chat-app.js: +86 / −62. No server changes, no wire changes, no observable behavior changes — only that any future fanout event can insert via the helpers and get dedup for free.