Refactor realtime transport: one WS per user + server-side fanout (Slack/Mattermost model) #13
Context
The current realtime model is a dumb per-channel broadcast hub:
- `/ws/{channel_id}` with `state.channels: Mutex<HashMap<String, broadcast::Sender<String>>>` — crates/hero_collab_ui/src/routes.rs:418-478. Each channel id has its own `broadcast::Sender`; subscribers relay `message.created` to each other.
- `openBackgroundWs` / `state.bgWs` / `syncBackgroundSubscriptions` — crates/hero_collab_ui/static/js/chat-app.js:660-755. The client keeps one background WS per non-viewed channel so unreads and mentions update live.
- After a successful send, the sender's browser emits `message.created` on the WS (chat-app.js:1667). The server never fans out on its own — the RPC persists, the browser broadcasts.
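A minimal sketch of that hub shape, assuming tokio's `broadcast` channel and an async mutex (illustrative only; the real routes.rs wiring differs in detail):

```rust
use std::collections::HashMap;
use tokio::sync::{broadcast, Mutex};

/// Per-channel hub: one broadcast::Sender per channel id, created lazily.
struct AppState {
    channels: Mutex<HashMap<String, broadcast::Sender<String>>>,
}

impl AppState {
    /// Each /ws/{channel_id} upgrade subscribes to its channel's sender.
    async fn subscribe(&self, channel_id: &str) -> broadcast::Receiver<String> {
        self.channels
            .lock()
            .await
            .entry(channel_id.to_string())
            .or_insert_with(|| broadcast::channel(64).0)
            .subscribe()
    }

    /// A connected browser relays a frame (e.g. message.created) to its
    /// channel peers; the server never originates events in this model.
    async fn relay(&self, channel_id: &str, frame: String) {
        if let Some(tx) = self.channels.lock().await.get(channel_id) {
            let _ = tx.send(frame); // Err just means no one is listening
        }
    }
}
```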
How we got here
- `5f6166a` (despiegk, 2026-03-19) — introduced per-channel broadcast routing on the server.
- `8579db4` (mik-tf, 2026-04-12) — added chat-app.js with a single foreground WS.
- `2e8869d` (Sameh, 2026-04-18) — added the N-background-WS fanout on the client to make unreads and mentions update live for non-viewed channels.
Problems with the current model
- `message.created` is emitted from the sender's browser. If the sender's WS is closed (bad network, tab backgrounded), the row persists in the DB but no other client ever learns of it live — they see it only on the next `message.list` fetch.
- Every client has to filter its own events back out (`user_id === me` on the bg WS). Any gap in the filter chain produces phantom unreads on the sender's side (see related bug: unread badge showing on sender after their own send — chat-app.js:766 counts own messages against the read cursor).

Proposed direction (to be brainstormed)
Move to a per-user transport, like Slack / Mattermost:
- One WS per user at `/ws/user/{user_id}` (or similar).
- `message.send` (and other state-changing RPCs) do the fanout server-side: on success, look up channel membership and push a `message.created` event to each member's user connection. Skip the sender.
- Client-side relaying via `state.ws.send` goes away.
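A sketch of what that server-side fanout could look like; `UserHub`, `fanout_message_created`, and the `members` slice are hypothetical names for illustration, not code from the repo:

```rust
use std::collections::HashMap;
use tokio::sync::{broadcast, RwLock};

type UserId = i64;

/// Hypothetical per-user registry replacing the per-channel map:
/// keyed by user id, one Sender shared by all of that user's sockets.
struct UserHub {
    user_ws: RwLock<HashMap<UserId, broadcast::Sender<String>>>,
}

impl UserHub {
    /// Server-side fanout after a successful message.send: resolve channel
    /// membership (a DB query, elided here), push to every member but the
    /// sender.
    async fn fanout_message_created(&self, members: &[UserId], sender: UserId, event: &str) {
        let map = self.user_ws.read().await;
        for uid in members.iter().filter(|&&u| u != sender) {
            if let Some(tx) = map.get(uid) {
                // Offline members simply have no live Sender entry;
                // they catch up from the DB on their next fetch.
                let _ = tx.send(event.to_string());
            }
        }
    }
}
```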
Benefits

Open questions for brainstorm
- How much of the routes.rs broadcast infra is reusable vs. a rewrite? The tokio `broadcast::Sender` primitive likely survives, just rekeyed by user id. The HTTP upgrade path + auth gate are reusable. Per-channel membership checks in `ws_handler` (routes.rs:400-410) go away — replaced by a single "is this my user id" check.
- On the client, `state.bgWs`, `openBackgroundWs`, and `syncBackgroundSubscriptions` all collapse into a single connection + a switch on event type. `connectWebSocket` becomes connect-once at login, not per-channel-select.

Related bugs this would fix / touch
- Phantom unread badge on the sender after their own send (the `updateUnreadCounts` filter at chat-app.js:766 is a simpler, independent fix and doesn't need this refactor).

Scope
This issue is a placeholder for the refactor discussion. Next step is a brainstorm to settle the open questions above. No code changes yet.
Design complete — spec committed on feat/ws-refactor

Full design at `plan/feature-ws-refactor.md` (commit `62f3060`). Summary of the decisions made during brainstorming:

Scope + rollout
- Every chat state-change event rides the new transport (including `channel.added/updated/removed/member_left` and `message.pin_updated`). LiveKit's own WebRTC path untouched.
- Single-instance for now: one in-process `broadcast::Sender` on each binary, cross-process seam via one Unix socket. Multi-instance scaling is a separate future step.
Architecture: per-user `broadcast::Sender` + `events.sock` seam + `EventBus` wrapper
- `hero_collab_server` owns `events: EventBus` (a `broadcast::Sender<Envelope>` wrapper matching hero_router's `SseBroadcaster` shape). RPC handlers call `fanout::dispatch(state, audience, event)` after persistence; dispatch resolves the audience against the DB and pushes `Envelope { recipients: Vec<UserId>, event: UserEvent }`.
- `events.sock` is bound by `hero_collab_server` alongside `rpc.sock`. Length-prefixed JSON envelopes. Internal-only, never reachable via hero_router.
- `hero_collab_ui` runs an `event_subscriber` task that connects to `events.sock`, reads envelopes, and dispatches each event to recipients' per-user `broadcast::Sender<UserEvent>` entries in its `user_ws: RwLock<HashMap<UserId, Sender>>` map. The browser WS handler subscribes each tab to its user's Sender (`/ws/user/{user_id}`).
- Outbound-only from UI to client for every event except `typing.start`/`typing.stop`, which relay through a new `typing.relay` RPC on rpc.sock.
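A sketch of the `events.sock` framing under the stated "length-prefixed JSON envelopes" rule. The u32 big-endian prefix and the stubbed `UserEvent` payload are assumptions, not the committed wire format:

```rust
use serde::{Deserialize, Serialize};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;

type UserId = i64;

/// Envelope shape from the spec; the event is stubbed as raw JSON here.
#[derive(Serialize, Deserialize)]
struct Envelope {
    recipients: Vec<UserId>,
    event: serde_json::Value,
}

/// Write one length-prefixed JSON envelope (assumed: u32 big-endian header).
async fn write_envelope(sock: &mut UnixStream, env: &Envelope) -> std::io::Result<()> {
    let body = serde_json::to_vec(env).expect("envelope serializes");
    sock.write_all(&(body.len() as u32).to_be_bytes()).await?;
    sock.write_all(&body).await
}

/// Read one envelope: length header first, then exactly that many bytes.
async fn read_envelope(sock: &mut UnixStream) -> std::io::Result<Envelope> {
    let mut len = [0u8; 4];
    sock.read_exact(&mut len).await?;
    let mut body = vec![0u8; u32::from_be_bytes(len) as usize];
    sock.read_exact(&mut body).await?;
    serde_json::from_slice(&body)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))
}
```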
Alternatives considered and rejected
- Calling `fanout::dispatch` directly in RPC handlers (same-process assumption) — `hero_collab_server` and `hero_collab_ui` can't share an in-process `broadcast::Sender`.
- A `_fanout` field piggybacked on RPC responses over rpc.sock — still needs a polling `rpc_call` for presence, a rate-limiter carve-out for typing, zero ecosystem precedent for envelope extension, and an awkward multi-instance migration path.
- Serving the WS from `hero_collab_server` — breaks the `_server`/`_ui` convention (internal-only vs internet-facing separation). Large blast radius.
- Merging the `_server` + `_ui` binaries.

Ecosystem precedents this design follows (verified, not invented)
- Multiple sockets bound by one `_server` binary, cleaned up in a shared loop — hero_aibroker_server/src/main.rs:175-189 already binds rpc.sock and rest.sock with a `for sock in [...] { create_dir_all + remove_file }` loop. hero_collab becomes the second instance.
- A `broadcast::Sender` wrapper struct — hero_router/src/server/sse.rs exposes `SseEvent` + `SseBroadcaster { tx: broadcast::Sender<String> }` with `new(capacity)`/`subscribe()`/`send(event)`. The new `EventBus` matches this shape exactly.
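For concreteness, an `EventBus` matching that `SseBroadcaster` shape might look like this (a sketch: the envelope fields come from the spec summary above, and the `UserEvent` variant is a placeholder):

```rust
use tokio::sync::broadcast;

/// Placeholder for the spec's UserEvent enum.
#[derive(Clone, Debug)]
enum UserEvent {
    MessageCreated { channel_id: String, message_id: i64 },
}

#[derive(Clone, Debug)]
struct Envelope {
    recipients: Vec<i64>,
    event: UserEvent,
}

/// Mirrors hero_router's SseBroadcaster shape:
/// new(capacity) / subscribe() / send(event).
struct EventBus {
    tx: broadcast::Sender<Envelope>,
}

impl EventBus {
    fn new(capacity: usize) -> Self {
        Self { tx: broadcast::channel(capacity).0 }
    }
    fn subscribe(&self) -> broadcast::Receiver<Envelope> {
        self.tx.subscribe()
    }
    fn send(&self, env: Envelope) {
        let _ = self.tx.send(env); // Err just means no subscriber yet
    }
}
```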
Key refinements caught in review
- Detect "first connection for this user" via `Entry::Vacant` under the write lock, not via a post-subscribe `receiver_count() == 1` check (see the sketch after this list).
- `typing.relay` must compare the claimed `user_id` with the authenticated path user_id so one user can't forge typing events as another.
- `rate_limit.rs` needs a new `exempt_methods` set with `typing.relay` in it; otherwise steady-state typing burns the 60/min global cap.
- Dispatch sites added during review: `channel.update`, `channel.member.remove` (two audiences: `ChannelRemoved` to the removed user + `ChannelMemberLeft` to the remaining members), `message.pin/unpin`, and `huddle_reaper`'s ghost-huddle termination. The reaper case is server-initiated fanout, demonstrating the design supports it cleanly (`0a17a52`).
- Reconnect catch-up on `onopen` after the first connect — calls `updateUnreadCounts()` + `pollNotifications()` + a delta-fetch of the current channel's messages. Otherwise a 5s WS blip silently drops events across all channels.
- `kill_other.socket` update in hero_collab/src/main.rs:215 to include events.sock, complementing the in-process `remove_file` cleanup.
- The `@mention` regex detection blocks at chat-app.js:711-736 and :2378-2410 (from `d4e8cef`) become redundant once the server dispatches `mention.created` directly.
- `pollNotifications` poll-diff kept as a disconnect-recovery safety net, cadence reduced 30s → 300s.
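A sketch of that first-connection check, assuming the `user_ws` map from the architecture section (`subscribe_user` and the channel capacity are illustrative):

```rust
use std::collections::{hash_map::Entry, HashMap};
use tokio::sync::{broadcast, RwLock};

type UserId = i64;

struct UserHub {
    user_ws: RwLock<HashMap<UserId, broadcast::Sender<String>>>,
}

/// Returns (receiver, first_connection). Deciding "first connection" via
/// Entry::Vacant while still holding the write lock is race-free: two tabs
/// connecting at once serialize on the lock, so exactly one sees Vacant.
/// A post-subscribe receiver_count() == 1 check could observe 2 for both.
async fn subscribe_user(hub: &UserHub, uid: UserId) -> (broadcast::Receiver<String>, bool) {
    let mut map = hub.user_ws.write().await;
    match map.entry(uid) {
        Entry::Vacant(slot) => {
            let (tx, rx) = broadcast::channel(64);
            slot.insert(tx);
            (rx, true) // first connection: e.g. flip presence to online
        }
        Entry::Occupied(slot) => (slot.get().subscribe(), false),
    }
}
```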
Bugs this refactor fixes
- Sender-offline message loss: if the sender's WS drops between `message.send` and the browser-side broadcast, nobody else learns of the message live.

What's next
Implementation plan (task-level breakdown, TDD steps, commit boundaries) to be written in a new session via superpowers:writing-plans against `plan/feature-ws-refactor.md`. Branch feat/ws-refactor is not yet pushed — push when ready for the plan-phase PR, or leave it local until the plan is also drafted.

Spec lives at: `plan/feature-ws-refactor.md` on feat/ws-refactor (1 ahead of origin/development, unpushed), commit `62f3060`.

Implementation complete; five follow-ups opened for the same branch
What landed
Branch feat/ws-refactor is now pushed to remote — 46 commits ahead of origin/development.

All 17 dispatch rows from the spec's §RPC handler dispatch sites table are wired:
- messages: `message.send` (+ per-mention `MentionCreated`), `update`, `delete`, `react/unreact/toggle_react`, `pin/unpin`
- channels: `channel.create`-DM (`ChannelAdded`), `member.add`, `member.remove` (dual dispatch: `ChannelRemoved` + `ChannelMemberLeft`), `update`
- huddles: `start`, `join`, `leave` (+ last-participant `HuddleEnded`), `huddle_reaper::force_end_huddle`
- misc: `read.mark`, `presence.update` (with documented two-producer race), `typing.relay` (server-authoritative `user_id`)

Verification
- `cargo test --workspace`: 126 tests, 0 failed, 0 ignored (server integration went 42 → 62 with +20 dispatch and reconnect tests; UI went 0 → 7; plus hero_collab service-def, SDK doctest, hero_collab_app).
- `cargo build --workspace`: clean (pre-existing hero_collab_app unused-mut warnings only).
- The `state.db` lock is explicitly released before any `fanout::dispatch` call — the non-reentrant parking_lot mutex would panic otherwise; grep-verified no exception across all 17 sites.
- New `/ws/user/{user_id}` route; the old `/ws/{channel_id}` is fully deleted alongside `AppState.channels`/`channel_conn_counts`.
- chat-app.js rewritten: one WS per tab, a `handleWsEvent` switch with per-case current-channel branching (preserves the round-3 cross-channel-render fix), `onopen`-after-reconnect catch-up; `bgWs` + outbound message sends + the mention regex are all deleted.
- `pollNotifications` cadence dropped 120s → 300s (push is primary now).
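The lock-release rule in practice — a sketch with stand-in types of the scoping pattern the grep verified (the real handlers and the `fanout::dispatch` signature live in hero_collab_server):

```rust
use parking_lot::Mutex;

struct Db; // stand-in for the real connection handle

struct State {
    db: Mutex<Db>,
}

/// Stand-in for fanout::dispatch: it resolves the audience against the DB,
/// so it takes the state.db lock again internally. parking_lot mutexes are
/// non-reentrant, so calling this while the handler still holds the lock
/// can never succeed.
fn dispatch(state: &State, _message_id: i64) {
    let _db = state.db.lock();
    // ... audience query + EventBus push ...
}

fn handle_message_send(state: &State) {
    // Persist inside a narrow scope so the guard drops at the closing brace.
    let message_id = {
        let _db = state.db.lock();
        42 // hypothetical INSERT returning the new row id
    }; // <- lock released HERE, before dispatch re-locks it
    dispatch(state, message_id);
}
```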
Pragmatic adjustments during execution (flagged, justified, no silent scope)
- `peer_ids` on `channel.create` (P7.5) — a server-side field so DM creation fans out to both members in one RPC. Scoped to `kind=="dm"` only; the `group_dm` asymmetry was reverted. Registered in openrpc.json. The UI still uses the 2-RPC add-member flow; `peer_ids` is available for a future client migration.
- hero_collab_server dev-dep on rusqlite (P7.12) — the presence integration test INSERTs `workspace_members` rows directly because `seed_channel_with_two_members` doesn't seed workspace membership. Functional; a helper refactor is future work.
- `2d47e28` — fixed pre-existing `CollabClient` → `HeroCollabClient` rename stragglers in hero_collab_examples + one SDK doctest so `cargo test --workspace` passes end-to-end. Unrelated to the refactor, but it blocked the full test matrix.
- Minor openrpc.json churn (a `summary` field added alongside `description` for `typing.relay`, later removed in `2d47e28` to match the file's 50+-method-entry convention).

Five follow-ups opened
After reviewing the transport against Slack / Discord / Matrix at the WebSocket layer, there are five reliability primitives still missing. All of them fit on this branch (same PR, same review unit) per the architectural cohesion argument — especially A3 (security-adjacent: the refactor introduced an amplification path, this closes it) and B2 (fixes a race we explicitly documented as accepted MVP debt).
- A3 — rate-limit `typing.relay`: closes the amplification vector the refactor introduced.
- A1 — `since_id` filter on `message.list`: removes the reconnect-overfetch TODO we left in the client.

Specs + impl plans for all five are committed in `2c77e94` (`plan/feature-*.md` + `plan/impl-*.md`). Execution order: A3 → A1 → B2 → A2 → B1 (later plans extend `handle_user_ws` restructures from earlier ones).

Estimated additional scope: ~1,000 lines production + ~750 lines tests across 15 tasks. Brings the branch from today's ~7,800 insertions to ~9,500. Still one coherent review unit.
What's next
- Execute the five follow-up plans via superpowers:subagent-driven-development (in progress).
- Then one PR against development; reference and close #13, #14, #15, #16, #17, #18 in the PR body.
- Do not push or merge from my end. The work stays on feat/ws-refactor until the owner authorizes the PR.

All five follow-ups landed; branch at 63 commits
Branch feat/ws-refactor is now pushed with the five reliability follow-ups wired. Full workspace test matrix: 132 passing, 0 failed, 6 ignored (the 6 are two `#[ignore]`'d integration-test scaffolds pending a shared hero_collab_ui fixture; manual dogfooding is the gate for both).

What landed per follow-up
- `ec91bda`, `f66e952`
- `bd19bf7`, `4187060`, `32f299c` — `WHERE id > ?` filter + dynamic SQL builder + SDK regen; fix commit: the client was not actually passing `since_id` despite the plan's claim — wired now
- `030ac25`, `4b8afd6`, `8415247`
- `505d7ef`, `ab29528`, `15bd2e7` — `presence.update` split into `set_status_text` (preserves bit) + `mark_connection` (server-authoritative); deprecated alias kept for one release; the client-side 60s heartbeat + beforeunload beacon are deleted (both were workarounds for the race we just fixed)
- `2faaba1`, `3f57565`, `dc7a17f`, `fa4219f`, `7c636d2` — `SeqEvent` wrapper, per-user `UserSession { sender, next_seq, buffer (500-entry FIFO) }`, `?resume_from=N` replay; the client tracks `lastSeq`, and `resume.failed` falls through to a cold catch-up
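A sketch of that resume machinery, assuming the field names from the summary above (the eviction policy and gap check are illustrative, not the committed implementation):

```rust
use std::collections::VecDeque;
use tokio::sync::broadcast;

const BUFFER_CAP: usize = 500; // the 500-entry FIFO from the spec

/// Event stamped with a per-user monotonically increasing sequence number.
#[derive(Clone)]
struct SeqEvent {
    seq: u64,
    payload: String,
}

/// Per-user session; surrounding locking elided.
struct UserSession {
    sender: broadcast::Sender<SeqEvent>,
    next_seq: u64,
    buffer: VecDeque<SeqEvent>, // last BUFFER_CAP events for replay
}

impl UserSession {
    /// Stamp, buffer, then fan out. Buffer first so a resume racing this
    /// send can still find the event.
    fn push(&mut self, payload: String) {
        let ev = SeqEvent { seq: self.next_seq, payload };
        self.next_seq += 1;
        if self.buffer.len() == BUFFER_CAP {
            self.buffer.pop_front();
        }
        self.buffer.push_back(ev.clone());
        let _ = self.sender.send(ev);
    }

    /// ?resume_from=N replay: return buffered events after N, or None if the
    /// events after N were already evicted — the client then does a cold
    /// catch-up (the resume.failed path).
    fn resume_from(&self, last_seq: u64) -> Option<Vec<SeqEvent>> {
        match self.buffer.front() {
            Some(oldest) if oldest.seq > last_seq + 1 => None, // gap: evicted
            _ => Some(self.buffer.iter().filter(|e| e.seq > last_seq).cloned().collect()),
        }
    }
}
```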
What got fixed late by the implementer that I missed

Three plan errors caught during execution (documenting for retrospective hygiene):
1. The plan said the `RateLimited` RPC error code was `-32029`. It's `-32005` (per rpc_error.rs). The implementer grepped and used the right value.
2. The plan said `raw_rpc` returned a full JSON-RPC envelope with `result`/`error` keys. It actually returns `Result<Value, OpenRpcError>`. The implementer matched on the error variant correctly.
3. The plan assumed the client was already passing `since_id` (per P8.3's intent). P8.3's executor had actually chosen the Option-B fallback and never wired it, so A1.2 as briefed would have left A1 inert on the client side. The implementer flagged the gap; I dispatched a fix commit (`32f299c`) to actually wire `since_id: lastId` into the RPC call.

All three were caught by the implementer reading source instead of trusting my plan. Plan docs updated in
`b6bb40d` so future executors don't repeat #1 and #2.

Architecture delta vs. pre-follow-up refactor
Closes the gaps from the Slack / Discord / Matrix comparison I did earlier. Remaining gaps (deferred to C-tier, documented in the spec): multi-instance pub/sub fan-out, a disk-persistent offline buffer, per-tab seq counters, and event-schema versioning. All are upgrade paths, not rewrites — `fanout::EventBus` + `UserSession` are the right shape to swap their backends.

Branch state
63 commits ahead of origin/development, pushed to remote.

What's left before PR
One PR against development; the body will reference and close #13, #14, #15, #16, #17, #18. Not opening the PR from my end; the owner authorizes.
Dogfood pass caught three bugs — all fixed on branch
Manual browser-level testing (Alice + Bob across two browsers, dev-seeded DB) surfaced three distinct regressions. All fixed and pushed.
Bug 1: sender sees stale unread badge on own current channel (~2s after send)
Root cause: `sendMessage` and `onMessageCreated`'s viewing-match branch both appended messages to `state.messages` + the DOM but never advanced the user's read cursor. The subsequent `updateUnreadCounts` RPC round-trip (~2s) read a stale cursor from the server, counted the just-seen message as "unread," and showed a badge. Affected channels AND DMs.

Fix: `4b37c1b` fix(ui): advance read cursor on send + receive-while-viewing. Added a shared `advanceReadCursor(channelId, messageId)` helper that updates `state.readCursors` client-side immediately and fires the `read.mark` RPC fire-and-forget. Called from both paths. Cursor-based unread counting is the right architecture; the invariant "every path that appends to the DOM must advance the cursor" is now satisfied.

Bug 2: receiver sees transient 2x unread on channel (but not DM) messages
Root cause: WebSocket churn during login. A runtime trace confirmed an identical `seq` delivered twice per event. `connectWebSocket`'s early-return guard only checked `readyState === OPEN`, not `CONNECTING`. When `selectChannel` called `connectWebSocket` while an earlier call's handshake was still in flight, the guard missed, the in-flight socket was closed, and a new one opened. Server-side, the old `handle_user_ws` hadn't finished teardown (TCP FIN still in flight), so briefly the user's session had TWO active `broadcast::Receiver`s. Events dispatched in that window were delivered via both receivers; client-side, the closing WebSocket's `onmessage` still fired for buffered frames, causing duplicate `handleWsEvent` execution.

Fix: `9f97db7` fix(ui): one WS per user. Three coordinated edits:
- `connectWebSocket`'s early-return now covers both the OPEN and CONNECTING states.
- `selectChannel` no longer calls `connectWebSocket` — that was legacy per-channel-WS behavior; under the per-user model the WS is session-scoped.
- `pickUser` now calls `connectWebSocket` once at login so the socket is established before the first `selectChannel` runs.

DMs weren't affected in testing only because by the time the DM tests ran, the WS state had stabilized on a single connection. The bug was a startup-window race.
Bug 3: DM picker's hover handler threw ReferenceError

Root cause (pre-existing, not introduced by the refactor): `setDmAcActive` was defined but missing from the `window.*` export block. Inline HTML `onmouseenter="setDmAcActive(${i})"` handlers execute in the global scope, so the function was unreachable. An audit of the neighboring DM-picker inline handlers (`pickDmUser`, `dmSearchKey`, `filterDmUsers`) confirmed those were already exported — only `setDmAcActive` was missing.

Fix: `ccff348` fix(ui): expose DM-picker inline-handler functions on window.

State of the PR
Revert "debug(ui): trace instrumentation..."); no[DBG ...]log statements remain.Post-refactor cleanup: upsert-dedup consolidation
Follow-up to commit `809c206` (fix(ui): dedup optimistic push against WS beat-to-the-punch). That commit caught a real bug — thread replies double-rendering because the server's WS fanout can arrive at the sender's own socket before the `message.send` RPC response resolves — but fixed it by inlining a three-line `some(id)`/`push` guard at the two optimistic-push sites. On review the pattern looked workaround-shaped: every future event type would need the same guard, and the "why" (two-path delivery by design) was documented only in the commit message.

Architectural re-check first. My initial reaction was "the server shouldn't echo to the sender." That would be wrong: per-user WS fanout is keyed by user, not by tab, so excluding the sender would starve their other tabs of the message. Dedup on the server id is the correct client-side behavior for our deliberate two-path delivery (RPC response for the active tab, WS fanout for the user's other tabs). Keeping the fanout to the sender; centralizing the dedup.
What landed (feat/ws-refactor, commits `d4320f5` → `8046439`):
- `upsertMainMessage(msg) → bool` / `upsertThreadReply(reply) → bool` helpers, with a block comment documenting the multi-tab rationale.
- Converted call sites: `sendMessage`, `sendThreadReply`, `onMessageCreated` (both the thread and main-feed branches), and the `onWsReconnected` catch-up delta-fetch.
- (`state.messages` only holds the currently-viewed channel, so checking an off-screen message's id there is tautologically false — the guard could never fire.)
- `plan/impl-upsert-dedup.md`.
chat-app.js: +86 / −62. No server changes, no wire changes, no observable behavior changes — only that any future fanout event can insert via the helpers and get dedup for free.