# Page-load RPC burst self-throttles in proxy mode (2N unread fan-out + flat 60/min budget) #24

Closed
opened 2026-04-26 17:19:53 +00:00 by sameh-farouk · 0 comments

## Two coupled performance + ergonomics issues in the page-load RPC burst

Running collab in proxy mode revealed a self-throttling pattern that's invisible in dev mode (because dev mode's unauthenticated calls bypass rate limiting entirely). Two issues, independently real, that compound:

## Half 1 — `updateUnreadCounts` fires `2N` RPCs per page load (B1)

`chat-app.js::updateUnreadCounts`:

```js
await Promise.all(state.channels.map(async (ch) => {
    const cursor = await rpc('read.get', { channel_id: ch.id, ... });
    if (cursor && cursor.last_read_msg) {
        const msgs = await rpc('message.list', { channel_id: ch.id, limit: 100 });
        const unread = (msgs || []).filter(m => m.id > cursor.last_read_msg).length;
        if (unread > 0) counts[ch.id] = unread;
    } else {
        const msgs = await rpc('message.list', { channel_id: ch.id, limit: 100 });
        if (msgs && msgs.length > 0) counts[ch.id] = msgs.length;
    }
}));
```

For a workspace with N channels, this fires:

- N parallel `read.get` calls
- N parallel `message.list` calls (one per channel, even when a cursor exists)

So **2N RPCs every time `updateUnreadCounts` runs** — which happens on page load, on workspace switch, and on every reconnect catch-up. Worse, `message.list` returns up to 100 messages per channel that are then filtered client-side, when all the server actually needs to return is a single integer per channel.

Server-side cost: 2N round-trips, each acquiring `state.db.lock()` (a `parking_lot` Mutex) and serializing on the database. For a user with 30 channels, that's 60 sequential lock acquisitions in a ~2s window.

## Half 2 — Default rate limits too tight for normal interactive use (C2)

`main.rs:389`: `rate_limiter: rate_limit::RateLimiter::new(60, 10)` — **60 RPC/min total per caller**, with `message.send` getting an extra inner bucket of 10/min.

Combined with Half 1: a 30-channel page load fires ~60 RPCs from `updateUnreadCounts` alone. Plus `user.me`, `workspace.list`, `channel.list`, `mention.list`, `huddle.list`, `read.mark` on `selectChannel`, `channel.member.list`, etc. — easily 70+ RPCs in a 2-second burst. Bucket exhausted. Subsequent calls return `Rate limited — retry in 167ms`.

This was invisible in dev mode (no `caller_id` → rate limiting skipped) — flipping to proxy mode revealed a previously-dormant constraint that fights normal interactive use.

For comparison, Slack's API tiers: interactive reads (`conversations.history`, `channels.list`) are Tier 4 = 100+/min PER METHOD. Collab's 60/min was total-across-everything, ~10× tighter than industry norms.

## Why coupled

If we only fix Half 1, the next read-heavy feature with a per-channel pattern hits the same wall again. If we only fix Half 2, the wasteful `2N` pattern stays masked by a higher budget — but the server still does 2N round-trips per page load, multiplied by every active user.

Both are needed: B1 eliminates the fan-out, C2 makes the remaining traffic correctly tiered.

## Verification

```bash
# Pre-fix, in browser network tab on workspace with 30 channels:
#   60+ RPCs in ~2s, including parallel read.get + message.list per channel
#   Console: "Rate limited — retry in 167 ms" warnings on subsequent navigation

# Post-fix:
#   1 read.unread_counts call instead of 60 (read.get + message.list)
#   No rate-limit retry warnings even after 5 hard-refreshes in a row
```

## Fix shape

### B1 — Batched `read.unread_counts(workspace_id, user_id)` endpoint

New handler in `handlers/read.rs`. Single SQL query:

```sql
SELECT cm.channel_id, COUNT(m.id) AS unread
  FROM channel_members cm
  JOIN channels c ON c.id = cm.channel_id
  LEFT JOIN read_cursors rc
    ON rc.channel_id = cm.channel_id AND rc.user_id = cm.user_id
  LEFT JOIN messages m
    ON m.channel_id = cm.channel_id
   AND m.id > COALESCE(CAST(json_extract(rc.data, '$.last_read_msg') AS INTEGER), 0)
   AND json_extract(m.data, '$.deleted_at') IS NULL
   AND (m.thread_id IS NULL OR m.thread_id = 0)
 WHERE cm.user_id = ?1
   AND c.workspace_id = ?2
   AND json_extract(c.data, '$.archived_at') IS NULL
 GROUP BY cm.channel_id
```

Returns `{channel_id: count, ...}`. Hits the existing `idx_messages_channel(channel_id, created_at)` index and the `read_cursors` PK. Filters via `json_extract` (mirroring the existing `message::list` and `channel::list_workspace` patterns, since `deleted_at` / `archived_at` live inside the `data` JSON, not as columns).
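
As a rough illustration only: assuming the connection is rusqlite behind the `parking_lot` Mutex mentioned above, the handler might take this shape (`AppState` and its `db` field are illustrative stand-ins, not the project's real types):

```rust
// Hypothetical sketch of handlers/read.rs::unread_counts. AppState is an
// illustrative stand-in; only state.db.lock() is confirmed by this issue.
use std::collections::HashMap;

use parking_lot::Mutex;
use rusqlite::{params, Connection};

pub struct AppState {
    pub db: Mutex<Connection>,
}

pub fn unread_counts(
    state: &AppState,
    user_id: i64,
    workspace_id: i64,
) -> rusqlite::Result<HashMap<i64, i64>> {
    // One lock acquisition for the whole workspace, instead of 2N.
    let db = state.db.lock();
    let mut stmt = db.prepare(
        "SELECT cm.channel_id, COUNT(m.id) AS unread
           FROM channel_members cm
           JOIN channels c ON c.id = cm.channel_id
           LEFT JOIN read_cursors rc
             ON rc.channel_id = cm.channel_id AND rc.user_id = cm.user_id
           LEFT JOIN messages m
             ON m.channel_id = cm.channel_id
            AND m.id > COALESCE(CAST(json_extract(rc.data, '$.last_read_msg') AS INTEGER), 0)
            AND json_extract(m.data, '$.deleted_at') IS NULL
            AND (m.thread_id IS NULL OR m.thread_id = 0)
          WHERE cm.user_id = ?1
            AND c.workspace_id = ?2
            AND json_extract(c.data, '$.archived_at') IS NULL
          GROUP BY cm.channel_id",
    )?;
    // Map rows straight into {channel_id: unread}.
    stmt.query_map(params![user_id, workspace_id], |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
    })?
    .collect()
}
```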

Client: replace the `Promise.all(N)` loop with a single `await rpc('read.unread_counts', {workspace_id, user_id})`. Coerce the returned string keys to numbers so the badge-render lookup keeps working. On failure, keep the last good badges (a degraded display beats a blank one).

### C2 — Tiered rate limits per method class

New `MethodClass` enum in `rate_limit.rs`: `Read` / `Write` / `MessageSend` / `Typing`. New `classify_method(method)` function — a heuristic on the trailing verb after the last `.` (e.g. `*.list` / `*.get` → Read; default → Write, a conservative under-classification), plus two specific overrides (`typing.relay`, `message.send`).
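
A minimal sketch of that classifier — the read-verb whitelist beyond `list`/`get` is not pinned down above, so treat it as illustrative:

```rust
// rate_limit.rs sketch: method classes and the trailing-verb heuristic.
// The read-verb whitelist is illustrative; the real one may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MethodClass {
    Read,
    Write,
    MessageSend,
    Typing,
}

pub fn classify_method(method: &str) -> MethodClass {
    // Specific overrides first.
    match method {
        "typing.relay" => return MethodClass::Typing,
        "message.send" => return MethodClass::MessageSend,
        _ => {}
    }
    // Trailing verb after the last '.'. Unknown verbs fall through to
    // Write: over-throttling a read is safer than under-throttling a write.
    match method.rsplit('.').next() {
        Some("list") | Some("get") => MethodClass::Read,
        _ => MethodClass::Write,
    }
}
```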

`PerCallerLimits` gains a `read` bucket. `RateLimiter` gains a `read_per_min` field and a `new_tiered(read, global, send, typing)` constructor. The backward-compatible `new(global, send)` is preserved (it sets `read = global`) so the existing 11 unit tests pass unchanged.

`check_method` dispatches via `classify_method` to charge the right bucket(s) (a sketch follows the list):

- `Read` → charges `read` only (its own bucket; doesn't touch `global`)
- `Typing` → charges `typing` only (existing behavior)
- `MessageSend` → charges `global` + `send` atomically (existing behavior)
- `Write` → charges `global` (existing behavior for non-message methods)
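
Continuing the sketch, one way the tiered dispatch could look. `Bucket` here is an assumed stand-in (a fixed one-minute window counter); the real per-caller bucket and its refill behaviour may well differ:

```rust
// Continuation of the rate_limit.rs sketch; Bucket is an assumed stand-in
// for the existing per-caller counter, not the project's real type.
use std::time::{Duration, Instant};

struct Bucket {
    window_start: Instant,
    used: u32,
}

impl Bucket {
    fn new() -> Self {
        Self { window_start: Instant::now(), used: 0 }
    }
    // Reset the counter once the one-minute window has elapsed.
    fn roll(&mut self) {
        if self.window_start.elapsed() >= Duration::from_secs(60) {
            self.window_start = Instant::now();
            self.used = 0;
        }
    }
    fn has_room(&mut self, per_min: u32) -> bool {
        self.roll();
        self.used < per_min
    }
    fn charge(&mut self, per_min: u32) -> bool {
        if self.has_room(per_min) {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

pub struct PerCallerLimits {
    read: Bucket,
    global: Bucket,
    send: Bucket,
    typing: Bucket,
}

impl PerCallerLimits {
    pub fn new() -> Self {
        Self { read: Bucket::new(), global: Bucket::new(), send: Bucket::new(), typing: Bucket::new() }
    }
}

pub struct RateLimiter {
    read_per_min: u32,
    global_per_min: u32,
    send_per_min: u32,
    typing_per_min: u32,
}

impl RateLimiter {
    pub fn new_tiered(read: u32, global: u32, send: u32, typing: u32) -> Self {
        Self { read_per_min: read, global_per_min: global, send_per_min: send, typing_per_min: typing }
    }

    // Backward-compat constructor: reads share the global budget, so the
    // 11 existing tests keep passing. The typing default here is assumed.
    pub fn new(global: u32, send: u32) -> Self {
        Self::new_tiered(global, global, send, 60)
    }

    pub fn check_method(&self, l: &mut PerCallerLimits, method: &str) -> bool {
        match classify_method(method) {
            MethodClass::Read => l.read.charge(self.read_per_min),
            MethodClass::Typing => l.typing.charge(self.typing_per_min),
            MethodClass::Write => l.global.charge(self.global_per_min),
            MethodClass::MessageSend => {
                // Check both buckets before charging either, so a rejected
                // send never burns global budget (the "atomic" charge above).
                if l.global.has_room(self.global_per_min) && l.send.has_room(self.send_per_min) {
                    l.global.used += 1;
                    l.send.used += 1;
                    true
                } else {
                    false
                }
            }
        }
    }
}
```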

Production budget: `RateLimiter::new_tiered(600, 60, 60, 60)`. Reads get 10× the old budget; write/send/typing stay unchanged.

## Behaviour after fix

| Scenario | Today | After fix |
|---|---|---|
| Page load with 30 channels | 60 RPCs from `updateUnreadCounts` alone (2N) | 1 RPC for `unread_counts` (O(1)) |
| 5 hard-refreshes in a minute, proxy mode | bucket exhausted, "Rate limited" warnings flood console | clean, all RPCs succeed |
| Read-heavy navigation (channel switching) | charges global bucket → competes with writes | charges read bucket → independent of writes |
| Burst of 100 `message.send` | charges global+send | unchanged — both buckets enforce their limits |
| Typing while heavy reads in flight | could deplete global bucket | typing has its own bucket, unaffected |
| Dev mode (unauthenticated) | rate limiting bypassed | unchanged — bypass only applies when no `caller_id` |

## PR

Fix in `feat/batched-unread-tiered-rate-limits`. ~230 LOC added across:

- `handlers/read.rs`: +52 (new `unread_counts` handler with single-query SQL)
- `rpc.rs`: +1 (dispatch wiring)
- `rate_limit.rs`: +159 (`MethodClass`, `classify_method`, read bucket, `new_tiered` constructor, `charge_read`/`charge_typing` helpers)
- `main.rs`: +13 (switch from `new(60, 10)` to `new_tiered(600, 60, 60, 60)`)
- `chat-app.js`: +20/−30 (single batched call replaces the fan-out loop)

No schema change. All 11 existing `rate_limit` unit tests still pass.

## Follow-ups (separate, low-priority)

- Add `replies` and `ping` to the Read verb whitelist in `classify_method` — a minor over-throttle today (e.g. `thread.replies` and `system.ping` are reads but classified as Write under the conservative default).
- WebSocket-pushed unread updates (`UserEvent::UnreadUpdate` emitted on `message.create` / `read.mark`) would eliminate polling entirely. The scope is much bigger; this PR is the right minimal first step.