Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish #19

Merged
sameh-farouk merged 77 commits from feat/ws-refactor into development 2026-04-24 00:54:59 +00:00

Summary

Full rewrite of the hero_collab realtime transport from a browser-driven per-channel broadcast relay to a server-authoritative per-user WebSocket model with session resume, application-level heartbeat, first-class presence, and bounded rate limiting — i.e. the Slack / Discord / Matrix architectural shape.

Closes #13. Implements #14, #15, #16, #17, #18.

Bugs this fixes (from dogfooding)

  • DM messages not delivering live to the recipient (top of the original ticket).
  • New DM conversation not appearing on recipient's sidebar until refresh.
  • Sender-browser-dependent delivery — if the sender's WS closed between message.send and broadcast, nobody else learned of the message live.
  • The ~35 WebSockets per active user tab collapse to 1.
  • Live-update gaps on channel rename, member add/remove, pin toggle (tagalongs).

Architecture

Before: browser publishes every event over its own WS; hero_collab_ui is a dumb per-channel broadcast relay; hero_collab_server never touches WebSockets. Fragile.

After: hero_collab_server owns fanout authoritatively. A new internal Unix socket events.sock carries length-prefixed JSON envelopes from server to hero_collab_ui. hero_collab_ui maintains a per-user UserSession (broadcast Sender + seq counter + 500-event ring buffer) and relays to a single browser WebSocket per tab at /ws/user/{user_id}. Every RPC handler dispatches via a unified fanout::dispatch(state, audience, event) helper after persistence commits.
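
As a rough sketch of the moving parts (field names beyond those listed above are assumptions, not the repo's actual definitions):

```rust
use std::collections::VecDeque;
use tokio::sync::broadcast;

/// Envelope carried over events.sock from hero_collab_server to
/// hero_collab_ui. `recipients` is the server-resolved audience; it is
/// stripped before frames reach the browser.
#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct Envelope {
    recipients: Vec<u64>,
    event: serde_json::Value, // e.g. {"type": "message.created", ...}
}

/// Per-user session held by hero_collab_ui: one broadcast Sender feeds
/// every open tab, and the ring buffer backs session resume.
struct UserSession {
    tx: broadcast::Sender<String>,
    seq: u64,                      // per-user monotonically increasing
    ring: VecDeque<(u64, String)>, // last 500 (seq, frame) pairs
}
```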

Follow-ups (all bundled on this branch by design — they close gaps the refactor itself opened or accepted as MVP debt)

  • A3 (#14) per-caller typing.relay rate limit — closes amplification vector introduced by the new fanout path.
  • A1 (#15) since_id filter on message.list — removes reconnect-overfetch TODO.
  • A2 (#16) 25s application-level WebSocket heartbeat — dead-peer detection in ≤35s.
  • B2 (#17) first-class presence — eliminates the two-producer race we explicitly documented as accepted MVP debt; deletes client-side 60s heartbeat + beforeunload beacon workarounds.
  • B1 (#18) session resume with per-user sequence numbers + 500-event ring buffer (replay sketch below).
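
For B1, the replay decision reduces to a sequence comparison against the ring buffer. A minimal sketch, assuming the (seq, frame) ring shape above; the buffer-miss branch is what falls through to cold catch-up:

```rust
use std::collections::VecDeque;

/// Replay frames newer than `resume_from`, or None when the gap has
/// fallen out of the 500-event ring (caller then does cold catch-up).
fn replay_from(
    seq: u64,                       // session's current sequence number
    ring: &VecDeque<(u64, String)>, // (seq, frame), oldest first
    resume_from: u64,
) -> Option<Vec<String>> {
    if resume_from >= seq {
        return Some(Vec::new()); // nothing was missed
    }
    match ring.front() {
        // The oldest retained frame still covers resume_from + 1.
        Some((oldest, _)) if *oldest <= resume_from + 1 => Some(
            ring.iter()
                .filter(|(s, _)| *s > resume_from)
                .map(|(_, f)| f.clone())
                .collect(),
        ),
        // Buffer miss: events were evicted, so resume is not safe.
        _ => None,
    }
}
```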

Scale and scope

  • 63 commits since development
  • +12,379 insertions / −540 deletions across 42 files
  • Two new Unix socket files, one new server module (fanout.rs), and five new UI pieces (events.rs, event_subscriber.rs, auth.rs, the UserSession struct, and the per-user WS route). The old per-channel route, AppState.channels / channel_conn_counts, openBackgroundWs / syncBackgroundSubscriptions, and all client-emitted outbound message/huddle events are fully removed.

Ecosystem surface

  • openrpc.json updated: since_id on message.list; typing.relay, presence.set_status_text, presence.mark_connection registered; presence.update marked DEPRECATED (kept for one release as alias to set_status_text).
  • SDK (openrpc.client.generated.rs) auto-regenerated via proc-macro. New typed input/output structs for all new methods.
  • CLI (hero_collab binary) unchanged — no Presence/Typing subcommands exist today, and existing subcommands (Workspace / User / Channel / Message / Huddle / Canvas / etc.) keep working since follow-ups were additive.
  • Existing examples + hero_skills require no changes.

Deferred to future follow-ups

  • C1 cross-instance fan-out (events.sock → NATS/Redis adapter): only matters once we run multiple hero_collab_server replicas. Spec documents the migration path.
  • C2 persistent offline event buffer: relies on disk or Redis; current in-memory ring buffer covers the tab-blip case, not the full-laptop-reopen case (which falls through to cold catch-up correctly).
  • Event schema versioning (v: u8 on envelopes): only needed once we make a breaking wire change and want to stop requiring lockstep client/server deploys.
  • Full hero_collab_ui integration-test fixture: two test files ship #[ignore]'d (heartbeat.rs, session_resume.rs). Manual browser dogfood is the current primary gate; shared fixture will amortize when the next reliability feature also needs it.

Test plan

Automated

  • cargo build --workspace — clean (only pre-existing hero_collab_app unused-mut warnings, unrelated)
  • cargo test --workspace — 132 passing, 0 failed, 6 ignored
    • hero_collab service_def: 1
    • hero_collab_app integration: 2
    • hero_collab_server unit: 56
    • hero_collab_server integration: 68 (was 42 before this branch; +26 for this work)
    • hero_collab_ui unit: 8
    • hero_collab_ui integration: 6 ignored (tests/heartbeat.rs + tests/session_resume.rs — fixture deferred)
    • hero_collab_sdk doctest: 1

Manual browser dogfood (primary gate for realtime behaviour)

Core realtime flows:

  • Two browsers, two users. DM creation from A delivers the channel row + messages live to B without refresh.
  • Messages in existing DMs and channels deliver live bidirectionally.
  • Edit / delete / react / pin-toggle propagate live.
  • Typing indicators appear / disappear bidirectionally. Spam typing ~100 events/sec in DevTools: rate limit rejects after ~60 and others can still send messages.
  • Multi-tab sync: same user in two tabs. Send in tab A, appears in tab B without a network roundtrip.
  • Huddle start / join / leave updates sidebar for other channel members.
  • Read cursor updates propagate to other tabs of the same user.
  • Channel rename / member-add / member-remove propagate live.
  • Mention: toast + bell badge fire from server push (not client regex).

Heartbeat (A2 / #16):

  • DevTools → Network → Offline on one tab. Within ~35s the tab's connection indicator flips to "disconnected." Re-enable network; tab reconnects.

Presence (B2 / #17):

  • User A opens chat in browser 1. User B sees A's presence dot turn online.
  • User A closes all tabs. Within a few seconds, user B sees A's dot turn offline. No flicker.
  • User A sets status text "In a meeting." User B sees the text. User A's online dot stays on.

Session resume (B1 / #18):

  • Two tabs for user A. Close tab 1 (tab 2 keeps session alive). User B sends 3 messages. Reopen tab 1 — messages appear immediately; DevTools Network shows ?resume_from=N on the WS URL. No catch-up spinner.
  • Single tab for user A. Reload. New connection opens without ?resume_from (in-memory state lost). Cold catch-up via onWsReconnected repopulates unread counts + mentions + current channel.

WS drop recovery:

  • Kill hero_collab_server. UI shows disconnect. Restart server; UI reconnects via backoff. Messages sent after reconnect deliver.

SDK / CLI smoke

  • hero_collab health works (unchanged CLI path).
  • hero_collab message list --channel-id X --limit 50 works (existing path; since_id is optional).

🤖 Generated with Claude Code

Design spec for the realtime transport refactor tracked in issue #13.

Moves hero_collab from its current per-channel broadcast hub (N+1
WebSockets per client, sender-dependent delivery) to a single per-user
WebSocket with server-authoritative fanout. Audience resolution lives in
hero_collab_server (DB access); socket fanout lives in hero_collab_ui
(WebSocket owner). A new internal events.sock provides the cross-process
seam — second instance of the two-socket-per-binary pattern already
established by hero_aibroker_server. The EventBus wrapper struct matches
hero_router's SseBroadcaster shape for ecosystem consistency.

Scope D (all chat-realtime events except LiveKit WebRTC), big-bang
rollout, single-host. Deletes per-channel WS infrastructure entirely.
Handles server-initiated fanout cleanly (huddle_reaper case) — a
capability that envelope-piggyback designs cannot express.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add two helper functions to fanout.rs for resolving event audiences from the
DB: channel_members returns all user_ids for a given channel, and
presence_audience returns the DISTINCT union of workspace co-members for a
given user. Includes four hermetic unit tests (TDD — tests written first).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
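
A sketch of what the two helpers could look like, assuming a rusqlite connection and the table/column names the commit mentions:

```rust
fn channel_members(db: &rusqlite::Connection, channel_id: u64) -> rusqlite::Result<Vec<u64>> {
    let mut stmt = db.prepare("SELECT user_id FROM channel_members WHERE channel_id = ?1")?;
    let rows = stmt.query_map([channel_id], |r| r.get(0))?;
    rows.collect()
}

fn presence_audience(db: &rusqlite::Connection, user_id: u64) -> rusqlite::Result<Vec<u64>> {
    // DISTINCT union of co-members across every workspace the user is in.
    let mut stmt = db.prepare(
        "SELECT DISTINCT m2.user_id
           FROM workspace_members m1
           JOIN workspace_members m2 ON m2.workspace_id = m1.workspace_id
          WHERE m1.user_id = ?1",
    )?;
    let rows = stmt.query_map([user_id], |r| r.get(0))?;
    rows.collect()
}
```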
Wire the fanout bus into AppState so handlers can push Envelope values
via state.events. Capacity 1024 (vs the old per-channel 256) accounts
for per-user subscriptions seeing N× events across many channels.
Includes a compile-only test in fanout::appstate_tests that ensures
future refactors cannot silently drop the field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
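
The wrapper is essentially a broadcast sender around the Envelope shape sketched earlier; a hedged sketch (method names assumed, only the capacity choice comes from this commit):

```rust
use tokio::sync::broadcast;

#[derive(Clone)]
pub struct EventBus(broadcast::Sender<Envelope>);

impl EventBus {
    pub fn new() -> Self {
        // 1024 vs the old per-channel 256: a per-user subscription now
        // sees events from every channel that user belongs to.
        let (tx, _) = broadcast::channel(1024);
        Self(tx)
    }
    pub fn send(&self, env: Envelope) {
        // Err only means "no subscribers right now"; that's fine here.
        let _ = self.0.send(env);
    }
    pub fn subscribe(&self) -> broadcast::Receiver<Envelope> {
        self.0.subscribe()
    }
}
```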
Add dispatch_inner (pub(crate)), dispatch (pub), and resolve_audience
(private) to fanout.rs. dispatch_inner resolves the Audience enum via
the DB, drops silently on empty audience, logs and drops on error, then
pushes an Envelope onto the EventBus. Two tokio tests cover the
happy-path (ChannelMembers → envelope) and the silent-noop (unknown
channel → no send).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
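
Condensed control flow of dispatch_inner per the commit (AppState, Audience, and error shapes assumed):

```rust
pub(crate) fn dispatch_inner(state: &AppState, audience: Audience, event: serde_json::Value) {
    let recipients = {
        let db = state.db.lock(); // parking_lot::Mutex guard
        match resolve_audience(&db, &audience) {
            Ok(ids) => ids,
            Err(e) => {
                tracing::warn!("fanout: audience resolution failed: {e}");
                return; // log and drop
            }
        }
    }; // guard released before any further work
    if recipients.is_empty() {
        return; // silent no-op, e.g. unknown channel
    }
    state.events.send(Envelope { recipients, event });
}
```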
Renames `socket_path` → `rpc_sock` (mechanical) and binds a second
events.sock UDS in the same cleanup loop (hero_aibroker two-socket
idiom). On each accept, spawns a dedicated events_conn_writer task
that owns its own broadcast::Receiver and writes length-prefixed
JSON envelopes (4-byte BE length + body). Per-task receivers keep a
lagging subscriber from stalling others; 100ms sleep on accept
error prevents busy-looping. Integration test verifies events.sock
is bound and connectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
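
The framing itself is small; a sketch of the writer side under those rules (helper name assumed):

```rust
use tokio::io::AsyncWriteExt;

/// One frame = 4-byte big-endian length prefix + JSON body. Each
/// events_conn_writer task owns its own broadcast::Receiver, so a
/// lagging connection drops its own backlog instead of stalling peers.
async fn write_frame<W: AsyncWriteExt + Unpin>(w: &mut W, body: &[u8]) -> std::io::Result<()> {
    w.write_all(&(body.len() as u32).to_be_bytes()).await?;
    w.write_all(body).await?;
    w.flush().await
}
```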
Wires the typing-relay path (Task P3.3): browser → hero_collab_ui →
typing.relay handler → fanout::dispatch(Audience::ChannelMembers). The
event's user_id is always the authenticated caller_id — never a
client-supplied field — so Bob cannot spoof typing as Alice. Missing
caller_id returns Unauthenticated; bad channel_id/kind returns
Validation. Added an end-to-end integration test that connects to
events.sock before invoking the RPC and asserts the envelope carries
the server-authoritative user_id plus both channel members as
recipients.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
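
The anti-spoofing rule is easiest to see in code; a hedged handler sketch (context and error types assumed):

```rust
fn typing_relay(
    state: &AppState,
    caller_id: Option<u64>, // from the authenticated connection, not params
    channel_id: u64,
    kind: &str, // "start" | "stop"
) -> Result<(), RpcError> {
    // user_id is always the authenticated caller; any client-supplied
    // user_id field is ignored, so Bob cannot type as Alice.
    let user_id = caller_id.ok_or(RpcError::Unauthenticated)?;
    let event = serde_json::json!({
        "type": "typing",
        "channel_id": channel_id,
        "kind": kind,
        "user_id": user_id,
    });
    fanout::dispatch(state, Audience::ChannelMembers(channel_id), event);
    Ok(())
}
```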
Tracks each task from Phase 1 (server fanout) through Phase 11 (branch
finalization) with exact file paths, test code, and commit messages.
Referenced by superpowers:subagent-driven-development during execution.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Centralizes dev-mode state in a new `crates/hero_collab_ui/src/auth.rs`
module mirroring the server's `handlers/permissions.rs` pattern:
`init_dev_mode` publishes once at boot, `is_dev_mode` reads O(1),
fail-closed if uninitialized. Replaces the two `std::env::var
("COLLAB_AUTH_MODE")` reads in `routes.rs` (chat + canvas WS handlers)
with `crate::auth::is_dev_mode()` so future WS routes can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
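
A minimal sketch of that publish-once / read-O(1) / fail-closed shape, assuming a plain OnceLock and that "dev" is the sentinel value of COLLAB_AUTH_MODE:

```rust
use std::sync::OnceLock;

static DEV_MODE: OnceLock<bool> = OnceLock::new();

/// Called once at boot; later calls are no-ops.
pub fn init_dev_mode() {
    let dev = std::env::var("COLLAB_AUTH_MODE")
        .map(|v| v == "dev") // sentinel value assumed for this sketch
        .unwrap_or(false);
    let _ = DEV_MODE.set(dev);
}

/// O(1) read; fail-closed (non-dev) if init_dev_mode was never called.
pub fn is_dev_mode() -> bool {
    *DEV_MODE.get().unwrap_or(&false)
}
```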
Wire fanout::dispatch into message.send (P7.2 — first Phase 7 handler
wiring). Emits MessageCreated to all channel members, and one
MentionCreated per mention target (self-mentions skipped).

Ordering is strict: dispatch runs only after tx.commit() succeeds, so
subscribers never see events for rolled-back rows. The MutexGuard `db`
is explicitly dropped before dispatch because fanout::dispatch
re-acquires state.db internally and parking_lot::Mutex is non-reentrant
(a recursive lock would panic). Mention resolution uses a fresh inner
scope so its guard drops before the dispatch loop runs.

Note: `tx.commit()` consumes the Transaction by value, so only `db` is
dropped explicitly (an extra `drop(tx)` would be a use-after-move).

Tests: two new integration tests exercise both paths against real
events.sock envelopes. Full server suite: 44 integration + 53 unit,
all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
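
The ordering constraint compressed into a few lines (names and DB layer assumed; the real handler is larger):

```rust
fn message_send(state: &AppState, channel_id: u64, event: serde_json::Value) -> anyhow::Result<()> {
    let mut db = state.db.lock(); // parking_lot::MutexGuard
    let tx = db.transaction()?;
    // ... INSERT message row, resolve mentions ...
    tx.commit()?; // consumes tx by value; an extra drop(tx) would not compile
    drop(db);     // release BEFORE dispatch: dispatch re-locks state.db and
                  // parking_lot::Mutex is non-reentrant
    fanout::dispatch(state, Audience::ChannelMembers(channel_id), event);
    Ok(())
}
```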
Wire fanout::dispatch into five more message.* handlers (P7.3):

- message.update     -> UserEvent::MessageUpdated  { message }
- message.delete     -> UserEvent::MessageDeleted  { message_id, channel_id }
- message.react      -> UserEvent::MessageReacted  { message_id, channel_id, reactions }
- message.unreact    -> UserEvent::MessageReacted  { ... } (only when a row was actually removed)
- message.toggle_react -> UserEvent::MessageReacted { ... } on both add + remove branches

Each dispatches to Audience::ChannelMembers(channel_id) after its persistence
completes. The MutexGuard `db` is released (drop(db)) before every dispatch
call because fanout::dispatch re-acquires state.db and parking_lot::Mutex is
non-reentrant (recursive lock would panic) — same pattern as message.send
from P7.2 (cf247ac).

For the three reaction handlers the dispatched `reactions` payload is
fetched fresh from the DB AFTER the INSERT/DELETE (post-mutation state) via
the existing fetch_reactions_for_messages helper, so subscribers see the
authoritative post-change array rather than the mutation the client just
submitted.

The channel_id is extracted from the message blob (update re-uses the already-
parsed `obj`; delete re-uses the `ch_id` already read for activity logging;
react/unreact/toggle_react query json_extract(data,'$.channel_id') from the
messages row once before dropping the guard).

Tests: five new integration tests exercise every path against real
events.sock envelopes. Each seeds the shared workspace+2-users+channel+msg
fixture, drains the seed's message.created envelope, then asserts the
expected type and payload. Full server suite: 49 integration + 53 unit,
all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire fanout::dispatch into the pin and unpin handlers so clients
subscribed to a channel see pin-state flips as message.pin_updated
events. Matches the post-persistence pattern established for
message.react/unreact in 80199cb: channel_id is sourced from the
stored message row (json_extract on data), and the db MutexGuard
is dropped before calling dispatch — parking_lot is non-reentrant
and dispatch re-acquires state.db internally.

Adds two integration tests (51 total, was 49):
- message_pin_dispatches_message_pin_updated   (pinned=true)
- message_unpin_dispatches_message_pin_updated (pinned=false)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a DM channel is created, inline the `peer_ids` as members and
dispatch a `ChannelAdded` envelope to the full member set so Bob's
sidebar picks up a DM started by Alice without a refresh (Phase 7.5).

- Add `peer_ids: Option<Vec<u64>>` to `ChannelCreate` input.
- Insert peer membership rows alongside the creator auto-add, inside
  the existing db lock.
- After drop(db), dispatch `UserEvent::ChannelAdded { channel: obj }`
  to `Audience::Users(member_ids)` where `member_ids` is read from the
  just-persisted `channel_members` rows (authoritative, handles the
  INSERT-OR-IGNORE dedupe case).
- Non-DM channels keep their prior discovery semantics — no dispatch.
When a user is newly added to a channel via `channel.member.add`, emit
a `ChannelAdded` user-event to the added user so their sidebar picks
up the channel live via the per-user WS (P7.6).

- Gate dispatch on `INSERT OR IGNORE` actually inserting (skip on
  already-member dedupe to avoid spurious re-emit).
- Audience is `Audience::User(user_id)` — only the new member; existing
  members already have the channel in their sidebar.
- Fetch the full channel payload from the `channels.data` column (same
  source as `channel.get`) before dropping the db lock; dispatch runs
  after drop per the P7.2/P7.5 non-reentrant-lock precedent.

Integration: adds `channel_member_add_dispatches_channel_added_to_new_member`;
suite 52 → 53.
P7.7: wire two fanout dispatches into `channel.member.remove` so member
removal propagates live to both the removed user (their sidebar drops
the row) and the remaining channel members (their members list refreshes).

- `ChannelRemoved` -> Audience::User(removed_user_id): targets the
  removed user directly; does not query channel_members.
- `ChannelMemberLeft` -> Audience::ChannelMembers(channel_id): runs
  AFTER the DELETE so the audience resolution excludes the removed
  user. Remaining members receive; the removed user does not.

Both dispatches are gated on `affected > 0` — if the user wasn't a
member, no envelopes fire. The db guard is dropped before dispatch
since parking_lot is non-reentrant and `dispatch` re-acquires the
lock for audience resolution.

Integration coverage: `channel_member_remove_dispatches_channel_removed_and_member_left`
seeds 3 members (alice, bob, carol), opens the events stream AFTER
all `channel.member.add` calls to avoid capturing their unrelated
`ChannelAdded` envelopes, then asserts both `channel.removed` (recipients
= [bob]) and `channel.member_left` (recipients = [alice, carol], event.user_id
= bob) arrive with correct fields. Suite: 53 -> 54.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire fanout::dispatch into the `channel.update` handler so every member
of an updated channel receives a `channel.updated` envelope with the
POST-update payload — their sidebar / header swaps in the new
name/description live without a follow-up `channel.get`.

Follows the P7.6 `member_add` pattern: fetch the updated payload while
the db guard is held, then drop the guard before dispatch (parking_lot
is non-reentrant and dispatch re-acquires `state.db` internally to
resolve `Audience::ChannelMembers`).

Integration suite: 54 -> 55.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P7.9: wire fanout::dispatch into huddle.start and huddle.join so every
channel member gets a WS push when a huddle is started or joined. Both
paths release the parking_lot db guard before dispatch (non-reentrant)
and broadcast the full room blob as Audience::ChannelMembers. join
deliberately excludes the per-caller LiveKit token from the fanout
payload — tokens are signed per-identity bearer creds for the SFU and
must never leak to other members.

Two integration tests added (spawn_huddle fixture populates LIVEKIT_*
env vars; JWT signing is local, no SFU contact) covering start + join
envelope recipients, room shape, and the token-leak guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P7.10: wire fanout::dispatch into the huddle.leave handler. ParticipantLeft
ALWAYS fires on a successful happy-path leave so peers can prune the leaver
from their huddle HUD immediately. HuddleEnded fires as well, but only
when that leave tripped the auto-end branch (the last participant left).
The parking_lot db guard is released before every dispatch call (mutex is
not reentrant; dispatch re-acquires state.db to resolve the channel
audience).

The already-ended no-op branch (non-participant calling leave on a stale
huddle) is intentionally silent — no dispatch. The originating end-leave
already dispatched both events, and the reaper's force-end path (P7.13) is
disjoint.

Two integration tests reuse the P7.9 ServerFixture::spawn_huddle helper:
one asserts Alice-leaves-while-Bob-stays emits ParticipantLeft only; the
other asserts last-participant leave emits both ParticipantLeft AND
HuddleEnded with both channel members as recipients. Start/join envelopes
are drained before the leave RPC so assertions land on post-leave traffic
in isolation. Suite: 57 → 59.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Emit `ReadUpdated { user_id, channel_id, last_read_msg }` via the fanout
seam after a successful `read.mark` write, with `Audience::User(user_id)`
so only the user's own tabs receive it. This lets multi-tab clients
synchronise `state.readCursors[channel_id]` without leaking personal
read state to other channel members.

DB guard is dropped before the dispatch call (parking_lot is
non-reentrant, and `dispatch` re-locks `state.db` internally).

Adds one integration test (60 total) that asserts the envelope's
recipients vector equals `[alice_id]` and that the event payload carries
the expected `user_id`, `channel_id`, and `last_read_msg`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unblocks cargo test --workspace by catching up four files that still
referenced the pre-rename SDK client type (examples + SDK doctest),
and removes the typing.relay openrpc entry's 'summary' field which
was a single-entry style drift vs the 50+ existing methods.

Also adapts examples/integration tests to the current client API
surface: .health() no longer exists, so calls route through
rpc_health(RpcHealthInput {}) and access the returned Option<String>
fields via as_deref(). Restores the SERVER_SERVICE constant that a
prior socket-path refactor (26e925f) renamed to SERVICE_NAME but
left dangling in zinit commands, so integration.rs compiles again.

Regenerates openrpc.client.generated.rs to reflect openrpc.json
drift since df45bcc (the peer_ids addition in d404f71 and the
typing.relay summary removal above).

No runtime behaviour changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires `fanout::dispatch` into the `presence.update` handler. After the DB
UPDATE succeeds, resolves the audience via `fanout::presence_audience`
(union of every `workspace_members` row across every workspace the target
user belongs to) inside the same db-lock scope, drops the guard, then
dispatches a `PresenceUpdate` envelope so all co-workspace members (and
the user's own other tabs) stay in sync live.

Unblocks the dead-code warning on `presence_audience` added in `f6da3b1`.

Adds an integration test + a test-local `ensure_workspace_members`
helper that opens the server's SQLite file as a second reader (WAL mode)
to seed the two rows; no public RPC today mirrors into
`workspace_members`, so we reach into the DB rather than paper over the
missing surface. Brings the integration suite to 61 tests.

Documents the known two-producer presence race (WS-lifecycle vs
RPC-driven) in a comment above the dispatch — accepted for MVP per
plan/feature-ws-refactor.md §Error handling.
When the background huddle reaper force-ends a ghost huddle (LiveKit
reports the room empty and no late-joiner rows survived the ghost
delete), peers in the channel previously had no event to prune the
sidebar indicator — they had to wait for their next `huddle.list_active`
poll or a page reload. This wires `fanout::dispatch` into the reaper's
commit path so `HuddleEnded { room_id }` reaches `ChannelMembers(cid)`
as soon as the transaction commits.

The db guard is dropped before dispatch (parking_lot is non-reentrant
and the ChannelMembers resolver re-locks `state.db`). `channel_id` is
read from the room_obj JSON; legacy rows that lack it silently no-op,
matching the P7.3/P7.4 channel_id-gate pattern. The existing info! log
is preserved.

No unit test for this path: `force_end_huddle` takes `&AppState` and
building a minimal AppState in-tree requires synthesising ~14 fields
(StorageBackend trait object, LiveKitConfig, etc.) — a larger yak than
the payoff for a two-line dispatch that mirrors the already-tested
`handlers::huddle::leave` branch. The `dispatch` → `channel_members`
audience resolution is covered by `fanout::audience_tests`; the full
reaper entry path (reconcile_once → list_participants → force_end) is
exercised by Phase 10/11 dogfood since it needs a LiveKit stub that
reports zero participants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2.4 broadened pub on ServiceFlags + build_service_definition so a
separate integration test file could re-import the binary via
#[path = ...]. Moving the test into an inline #[cfg(test)] mod tests
inside main.rs gives it direct access to private items, so the pub
qualifiers are no longer needed.

Test behaviour is unchanged — asserts both rpc.sock and events.sock
are present in the server action's kill_other.socket vec.
P8.1 + P8.2. Rewrite connectWebSocket to open a single per-user WS at
/ws/user/{user_id} (replacing the previous per-channel /ws/{channel_id}
foreground socket). The inbound message handler is extracted into a
top-level handleWsEvent dispatcher that routes each data.type to a
dedicated on* helper: onMessageCreated / onMessageUpdated /
onMessageDeleted / onMessageReacted / onMessagePinUpdated (new) /
onTypingEvent / onHuddleEvent / onPresenceUpdate / onReadUpdated /
onMentionCreated, plus channel.added/updated/removed/member_left stubs
for P8.4.

onMessageCreated preserves the round-3 cross-channel render fix by
routing off-screen events to unread-count bumps (skipping own sends)
and rendering the sidebar instead of the message list. Mention
regex deliberately omitted here — P8.7 deletes it once the server
push of mention.created is fully trusted.

onWsReconnected is a stub; P8.3 wires it to updateUnreadCounts() +
pollNotifications() + message.list delta fetch. scheduleReconnect
implements 1s→30s exponential backoff, replacing the old fixed 3s
retry. Server-side auth closes (1008/1011/4xxx) still suppress
retry.

bgWs, outbound state.ws.send, and the client-side mention regex
remain intact intentionally — dead-code cleanup is P8.5–P8.7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under the per-user WS model, a transient disconnect stops ALL live
delivery (unlike the old per-channel model where a channel's bgWs
could blip independently). Without catch-up, a 5s hiccup silently
misses messages across every channel.

onWsReconnected now fires three steps on every non-first-connect:
 1. updateUnreadCounts() — refresh badges across all channels
 2. pollNotifications() — re-run mention poll-diff to surface any
    browser notifications missed while the WS was down
 3. Re-fetch last 100 messages of the current channel and append any
    new ones via id-dedup.

Option B was chosen for step 3: the server's message.list RPC does
not yet accept a since_id filter, so we re-fetch and dedup. TODO
left inline — follow-up task should add since_id: Option<u64> to
MessageList input + WHERE id > since_id in the query for efficiency.
Adds specs + implementation plans for the Slack-grade WS reliability
gaps identified in the ws-refactor architectural comparison:

- A3 per-caller typing.relay rate limit (closes amplification vector
  introduced by the refactor; bucket default 60/min)
- A1 since_id filter on message.list (reconnect catch-up delta fetch,
  removes the P8.3 TODO)
- A2 application-level WebSocket heartbeat (25s ping + 10s pong-grace,
  dead-peer detection in <=35s; loop shape sketched after this commit)
- B2 first-class presence (server-authoritative online/offline, fixes
  the two-producer race accepted as MVP debt in the refactor)
- B1 session resume with per-user sequence numbers + ring buffer
  (tab-blip zero event loss; falls through to cold catch-up on
  buffer miss)

Execution order: A3 -> A1 -> B2 -> A2 -> B1 (later plans depend on
handle_user_ws restructures from earlier).

All lands on feat/ws-refactor per scope decision: the refactor's
"complete Slack-grade transport" story benefits from shipping these
together. Total added scope: ~1,000 lines production + ~750 lines
tests across ~15 tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
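
For A2, the 25s + 10s arithmetic gives the <=35s bound directly. A sketch of the loop shape, with the transport abstracted behind a hypothetical trait (the real logic lives in handle_user_ws):

```rust
use tokio::time::{interval, timeout, Duration};

/// Hypothetical transport hooks for the sketch.
trait Heartbeat {
    async fn ping(&mut self);
    async fn pong_received(&mut self); // resolves when a pong arrives
}

async fn heartbeat_loop<H: Heartbeat>(hb: &mut H) {
    let mut tick = interval(Duration::from_secs(25)); // ping cadence
    loop {
        tick.tick().await;
        hb.ping().await;
        // 10s pong grace: worst-case dead-peer detection is 25 + 10 = 35s.
        if timeout(Duration::from_secs(10), hb.pong_received()).await.is_err() {
            return; // peer is dead; caller closes the connection
        }
    }
}
```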
Covers the A3.1 per-caller typing rate limit: fire 100 rapid typing.relay
calls and assert ~60 succeed (default typing bucket = 60/min, starts full)
and the remainder reject with RpcError::RateLimited (-32005). Any other
error code is treated as a regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
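
The observed ~60-accept / ~40-reject split follows from a full-at-start token bucket. A sketch of that shape (the server's actual RateLimiter may differ):

```rust
use std::time::Instant;

/// Per-caller typing bucket: 60/min refill, burst 60, starts full. A burst
/// of 100 calls therefore accepts ~60 and rejects the rest.
struct TypingBucket {
    tokens: f64,
    last: Instant,
}

impl TypingBucket {
    const RATE_PER_SEC: f64 = 60.0 / 60.0; // 60 tokens per minute
    const BURST: f64 = 60.0;

    fn new() -> Self {
        Self { tokens: Self::BURST, last: Instant::now() }
    }

    fn try_take(&mut self) -> bool {
        let now = Instant::now();
        let refill = now.duration_since(self.last).as_secs_f64() * Self::RATE_PER_SEC;
        self.tokens = (self.tokens + refill).min(Self::BURST);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false // caller maps this to RpcError::RateLimited (-32005)
        }
    }
}
```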
Closes #15. Adds optional `since_id: Option<u64>` to `message.list` (strict
`id > ?`), composing with the existing `thread_id IS NULL` / `before` /
`limit` filters. Once the WS refactor's reconnect catch-up lands, this
turns the current 100-message refetch + dedup into a literal delta fetch.

The `list()` handler's two hard-coded SQL branches (with-before vs.
without) are replaced by a small dynamic builder so adding a third
optional filter doesn't double the branch count again. `ORDER BY
created_at DESC` and the `thread_id IS NULL` top-level filter are
preserved exactly.

openrpc.json registers the new param; the SDK mirror
(`openrpc.client.generated.rs`) is regenerated from the proc-macro
output (verified via `cargo expand -p hero_collab_sdk --lib`) so
`MessageListInput.since_id: Option<i64>` is visible to grep.

Note: the task brief referenced a `MessageList` typed input struct in
`handlers/inputs.rs`, but `message.list` still uses raw `params: Value`
extraction (same idiom as the existing `channel_id` / `limit` /
`before` fields). The new `since_id` extraction follows that idiom
rather than introducing a half-migration to the typed-input system.

Tested via the new `message_list_filters_by_since_id` integration test
(three shapes: no filter → all 5; `since_id = msg3` → exactly msg4 +
msg5 with msg3 excluded; `since_id` past the newest → empty). Full
suite: 64 integration tests passing (was 63).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
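
A sketch of the dynamic-builder idea (column layout assumed to match the json_extract idiom seen elsewhere in this log; bind values are pushed in the same order the clauses are appended):

```rust
fn build_list_sql(before: Option<u64>, since_id: Option<u64>) -> String {
    // Base: top-level messages of one channel (thread replies excluded).
    let mut sql = String::from(
        "SELECT data FROM messages \
         WHERE json_extract(data,'$.channel_id') = ? \
           AND json_extract(data,'$.thread_id') IS NULL",
    );
    if before.is_some() {
        sql.push_str(" AND id < ?");
    }
    if since_id.is_some() {
        sql.push_str(" AND id > ?"); // strict: id > since_id
    }
    // Preserved exactly from the pre-refactor branches:
    sql.push_str(" ORDER BY created_at DESC LIMIT ?");
    sql
}
```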
Implementer on A3.2 found these plan errors:
- RateLimited JSON-RPC code is -32005 (per rpc_error.rs), not -32029
  (I assumed a generic JSON-RPC reserved range; real code is specific).
- raw_rpc returns Result<Value, OpenRpcError>, not a JSON-RPC envelope
  with .result / .error fields.

Plan now matches reality. No code change — A3.2 test already landed
correctly by reading source directly.
Split `handlers::presence::update` into three functions to close the two-
producer race documented in the ws-refactor spec (accepted MVP debt):

- `set_status_text` — client-facing status text edits. Writes
  `users.data.status_text` + `last_seen`. Explicitly preserves the
  existing `users.data.status` bit and dispatches PresenceUpdate with
  whatever value is already stored.
- `mark_connection` — server-authoritative WS-lifecycle path. Accepts a
  boolean `online` flag (never a client-controlled status string), writes
  `users.data.status` + `last_seen`, and dispatches PresenceUpdate with
  the new status. hero_collab_ui's `handle_user_ws` will call this on
  first-tab-online and last-tab-offline.
- `update` — deprecated alias. Logs a `tracing::warn!` deprecation and
  routes to `set_status_text`. Any client-supplied `status` field is
  silently ignored — closes the race where a stale client could flip
  the online bit concurrently with the WS lifecycle.

Adds four integration tests (`presence_mark_connection_flips_status_…`,
`presence_set_status_text_preserves_status_bit`,
`presence_update_routes_to_set_status_text_with_deprecation`,
`presence_set_status_text_rejects_overlong_text`) and migrates the P7.12
test to `presence.mark_connection` so its "online bit was set" assertion
still holds after the split.

Integration suite: 68 passing. Closes #17.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
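
The essence of the split is that only one producer may ever touch the online bit. A hedged sketch of the two write paths (SQL and JSON layout assumed):

```rust
/// WS-lifecycle path only: the single producer of the online bit.
fn mark_connection(db: &rusqlite::Connection, user_id: u64, online: bool) -> rusqlite::Result<()> {
    let status = if online { "online" } else { "offline" };
    db.execute(
        "UPDATE users SET data = json_set(data, '$.status', ?1,
                                                '$.last_seen', strftime('%s','now'))
         WHERE id = ?2",
        rusqlite::params![status, user_id],
    )?;
    Ok(())
}

/// Client-facing path: may edit status_text, never the status bit.
fn set_status_text(db: &rusqlite::Connection, user_id: u64, text: &str) -> rusqlite::Result<()> {
    db.execute(
        "UPDATE users SET data = json_set(data, '$.status_text', ?1,
                                                '$.last_seen', strftime('%s','now'))
         WHERE id = ?2",
        rusqlite::params![text, user_id],
    )?;
    Ok(())
}
```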
sameh-farouk changed title from Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish to WIP: Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish 2026-04-23 21:01:51 +00:00
Fixes sender (and by extension receiver-while-viewing) seeing an unread
badge on their own currently-viewed channel. Root cause: sendMessage
and onMessageCreated's viewing-match branch both appended messages to
state.messages + DOM but never advanced the user's read cursor. After
any subsequent updateUnreadCounts run (login's pollNotifications, the
300s interval, a mention.created arrival, etc.) the user's stale
cursor caused the server to report those just-seen messages as unread.

The fix: a shared advanceReadCursor(channelId, messageId) helper
updates state.readCursors client-side immediately (so racing refreshes
see the fresh cursor) and fires read.mark fire-and-forget. Called from
both the send-path and the receive-while-viewing path.

Cursor-based unread counting is the right architecture for this
codebase (portable across tabs, survives reload). The correctness
invariant is 'any path that appends a message to DOM must also advance
the cursor' — this patch satisfies that.
Temporary console.log instrumentation on chat-app.js to diagnose the
transient 2x unread-count doubling on channel (not DM) message receipt.
Tagged [DBG mc] / [DBG he] / [DBG uuc] for easy filtering. Counters
window.__mcCount / __heCount / __uucCount increment per handler entry
so multi-delivery becomes visible.

To run the scenarios:

  Alice (browser 1)  → viewing any channel, NOT the target
  Bob   (browser 2)  → viewing any channel, NOT the target
  1. DevTools Console → filter: [DBG
  2. Reset counters:
       window.__mcCount=0; window.__heCount=0; window.__uucCount=0
  3. Scenario A (channel, expected to double):
       Alice sends ONE plain-text message (no @mention) to a channel
       where Bob is a member. Copy Bob's console output.
  4. Scenario B (DM, expected correct):
       Alice sends ONE DM to Bob. Copy Bob's console output.
  5. Scenario C (sender self-badge):
       Alice sends, stays viewing. Wait ~2s. Copy Alice's console.

To remove: revert this commit. The fix commit (4b37c1b) stands on its
own. This commit is scratch — should NOT reach the PR as a merged
artifact; revert or drop before marking the PR ready-for-review.
Fixes Bug #1 (transient 2x channel unread count on message arrival).

Root cause confirmed via runtime trace on feat/ws-refactor:
- Under the new per-user WS model, the WS is session-scoped — one
  connection per tab for the whole login. selectChannel() inherited
  a connectWebSocket() call from the old per-channel-WS model where
  switching channels meant opening a new WS.
- Two rapid selectChannel calls during navigation (initial route
  resolution + default-channel selection) both invoked
  connectWebSocket().
- The early-return guard only checked readyState === OPEN, not
  CONNECTING, so the second call closed the in-flight socket and
  opened a new one. Server-side, the old handle_user_ws hadn't
  finished teardown yet; briefly both had broadcast::Receivers on the
  user's session. An event dispatched in that window was delivered
  via both. Client-side, the closing WS's onmessage still fired for
  buffered frames — handleWsEvent ran twice, onMessageCreated bumped
  twice, unread count doubled.
- Transient because updateUnreadCounts's subsequent refetch replaced
  the state with the server-authoritative count. Refresh corrected
  because a fresh page load has no stale socket race.

Three-edit fix:
1. connectWebSocket early-return now also covers CONNECTING, not
   just OPEN — a second call during handshake is a no-op instead of
   closing + reopening.
2. selectChannel no longer calls connectWebSocket — the legacy
   per-channel pattern is gone; the WS is established once at login.
3. pickUser now calls connectWebSocket after setting currentUser,
   so the session's WS is up BEFORE the first selectChannel runs.

DMs were NOT affected in user testing because the DM test happened
after the WS state had stabilized on a single connection; the bug
was a startup-window race, not a steady-state problem.
DOM inline event attributes (onmouseenter, onmousedown, etc.) execute
in the global scope. Functions referenced there must be reachable from
window. setDmAcActive was defined but not exported, so the DM picker's
hover state threw ReferenceError on every onmouseenter.

Pre-existing bug, independent of the WS refactor; surfaced during
dogfood testing of feat/ws-refactor. Audits neighboring DM-picker
inline handlers and adds any missing window.* exports in the same
pass so this doesn't recur.
This reverts commit 428b1a4648.
Bundles six cleanup items surfaced by the final ws-refactor branch
audit. Applied together because all are tiny edits and they travel
through the same dogfood matrix.

A.1 fix(ui): typing.start/typing.stop payloads now include channel_id

  The client was sending typing events without channel_id; handle_inbound
  in hero_collab_ui requires it and silently dropped each event. Typing
  indicators were fully non-functional on the branch. Three call sites
  (sendMessage post-send typing.stop, notifyTyping typing.start,
  notifyTyping typing.stop) fixed to send channel_id: state.currentChannel.id.
  Caught by static cross-check of handle_inbound's required-field list
  against the client payload shape.

A.2 refactor(ui): delete dead state.readCursors writes + field

  state.readCursors was written by advanceReadCursor and onReadUpdated
  but never read. Multi-tab read-cursor 'sync' was persistence-only
  with no client-side consumer. Deleting removes dead state and clarifies
  that the server-side cursor is the source of truth; if we later want
  to use the cached cursor for rendering or to short-circuit an
  updateUnreadCounts refresh, re-adding it is trivial.

A.3 refactor(ui): remove orphaned state._heartbeatInterval cleanup

  B2's deletion of the 60s client-side presence heartbeat left behind
  a dangling clearInterval call in destroy(). The setInterval that
  matched it was removed; the clear guards against undefined and is
  a no-op. One-line delete.

A.4 docs(ui): correct stale comment on pollNotifications cadence

  Comment referenced the pre-refactor channel-scoped WS model and the
  30s/120s poll cadence. Both are wrong now — WS is per-user, cadence
  is 300s. Rewritten to match reality.

A.5 refactor(server): delete dead RateLimiter::exempt_methods mechanism

  exempt_methods + with_exempt builder became dead after A3's typing
  bucket took over the only concrete use case. Dedicated typing.relay
  branch in check_method returns Ok(()) BEFORE the exempt check, so
  even if operators re-added methods to the set the code would never
  consult it. Removed field, builder, the unreachable exempt branch,
  and the two tests that only exercised them. typing.relay routing
  is unchanged.

A.6 docs(ui): correct stale rate-limit comment at user_ws handler

  Comment said typing.relay 'bypasses global rate-limit' — true
  pre-A3, not true post-A3 (it's rate-limited by the dedicated typing
  bucket, 60/min per caller). Fixed the wording.

No change to WS transport, fanout, event dispatch, session resume,
or presence semantics. All existing tests pass; typing_bucket_*
tests still cover rate-limit behaviour. Typing indicator regression
(A.1) verified restored by manual dogfood.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-this-branch, thread replies were invisible to other users in real
time (sendThreadReply did not emit to the channel broadcast — only the
sender's thread panel saw them). The ws-refactor's server-authoritative
fanout now correctly pushes MessageCreated for every persisted message
to all channel members — including replies with thread_id set.

Client-side onMessageCreated was written for the pre-refactor assumption
that only top-level messages arrive via WS, so it unconditionally pushed
data.message into state.messages and appended to the main DOM. Result:
thread replies appeared duplicated in both the thread panel AND the
parent channel's main feed until the user reloaded — at which point
message.list's WHERE thread_id IS NULL filter cleaned up.

Fix: early-branch in onMessageCreated when data.message.thread_id is
set. If the thread panel is open for that parent, append to it (with
dedup against state.threadReplies so the sender's optimistic push is
respected). Always bump state.threadCounts[parentId] so the reply-count
badge on the parent message reflects reality without a reload. Own-send
filter on the count bump: the sender's optimistic sendThreadReply
already accounts for their own reply.

No server-side change. No wire-format change. Dogfood verified.
sendMessage + sendThreadReply push the RPC response to state.messages /
state.threadReplies after await returns. In parallel the server fanout
dispatches MessageCreated over the WS, and onMessageCreated's dedup-
then-push path ALSO writes to the same array. The WS delivery rides an
already-open socket while the RPC needs a full round-trip — in local
dev the WS can win the race, so the WS handler pushes first (dedup
sees an empty array), then the optimistic push in the send function
runs unconditionally and duplicates.

User saw this in the thread panel as 'two replies, one with content
and one empty' on sender side (refresh corrected because thread.replies
RPC is the source of truth).

Fix: mirror the dedup check already present in onMessageCreated, but
now on the optimistic-push side too. Scroll-to-bottom + advanceReadCursor
stay unconditional — they're idempotent and reflect the 'I just sent'
user intent regardless of which side did the push.

Applies to both sendMessage and sendThreadReply since the same race
exists for both — main-feed doubling hadn't surfaced in testing but
would with the same conditions.
Centralize the insert-if-new pattern for state.messages and state.threadReplies.
No call sites wired yet — following commits migrate sendMessage, sendThreadReply,
onMessageCreated, and onWsReconnected catch-up onto these helpers.
sendMessage and sendThreadReply both await the RPC response then render.
Previously each inlined a "has this id already?" guard to dedup against
the server fanout arriving first; now they delegate to upsertMainMessage /
upsertThreadReply. Same semantics, one gate.
onMessageCreated (both thread and main branches) and onWsReconnected's
catch-up delta-fetch now use upsertMainMessage / upsertThreadReply.
All five upsert sites identified in the plan now go through the same gate.
Code quality review of 3330dbc flagged the relocated dedup in the
off-screen unread branch as a tautological no-op: state.messages only
holds the currently-viewed channel's messages, so checking for an
off-screen message's id there is always false. The guard has never
fired. Remove it along with the misleading comment. The viewing branch
is still protected by upsertMainMessage's own dedup — that's where
double-render actually needs preventing.
Companion to commits d4320f5..8f54197. Documents the five call sites
consolidated behind upsertMainMessage / upsertThreadReply, the multi-tab
reasoning that rules out server-side sender exclusion, and the
self-review pass that caught the inert off-screen guard.
sameh-farouk changed title from WIP: Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish to Refactor realtime transport: single WS per user + server-authoritative fanout + reliability polish 2026-04-24 00:44:18 +00:00
Pulls in 3eab733 (fix(cli): --stop now unregisters from hero_proc) so
our branch is current with the base before review lands. No textual
conflict — 3eab733 modifies self_stop() while our WS-refactor changes
touch build_service_definition() and add a test module in the same file.
sameh-farouk merged commit b654410e84 into development 2026-04-24 00:54:59 +00:00