Page-load RPC burst self-throttles in proxy mode (2N unread fan-out + flat 60/min budget) #24
Two coupled performance + ergonomics issues in the page-load RPC burst
Running collab in proxy mode revealed a self-throttling pattern that's invisible in dev mode (because dev mode's unauthenticated calls bypass rate limiting entirely). Two issues, independently real, that compound:
Half 1 — `updateUnreadCounts` fires 2N RPCs per page load (B1)

`chat-app.js::updateUnreadCounts`: for a workspace with N channels, this fires:

- N `read.get` calls
- N `message.list` calls (one per channel, even if a cursor exists)

So 2N RPCs every time `updateUnreadCounts` runs — which is on page load, on workspace switch, and on every reconnect catch-up. On top of that, `message.list` returns up to 100 messages per channel that are then filtered client-side, when all the server actually needs to return is a single integer per channel.

Server-side cost: 2N round-trips, each acquiring `state.db.lock()` (a parking_lot Mutex) and serializing on the database. For a user with 30 channels, that's 60 sequential lock acquisitions in a ~2s window.
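A toy illustration of that serialization, under the assumptions above (the type and function names here are invented for the sketch, not the real collab code):

```rust
// Toy model: each unread probe is its own RPC, and every RPC handler takes the
// single DB mutex, so a page load with N channels means 2N strictly sequential
// critical sections on one lock.
use parking_lot::Mutex;

struct Db;

impl Db {
    fn read_cursor(&self, _channel_id: u64) { /* read.get: one round-trip */ }
    fn recent_messages(&self, _channel_id: u64) { /* message.list: up to 100 rows */ }
}

fn page_load_unread_burst(db: &Mutex<Db>, channel_ids: &[u64]) {
    for &ch in channel_ids {
        db.lock().read_cursor(ch);      // lock acquisition 1 for this channel
        db.lock().recent_messages(ch);  // lock acquisition 2 for this channel
    }
    // 30 channels => 60 sequential lock acquisitions in a single burst.
}
```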
Half 2 — Default rate limits too tight for normal interactive use (C2)
`main.rs:389`: `rate_limiter: rate_limit::RateLimiter::new(60, 10)` — 60 RPC/min total per caller, with `message.send` getting an extra inner bucket of 10/min.

Combined with Half 1: a 30-channel page load fires ~60 RPCs from `updateUnreadCounts` alone. Plus `user.me`, `workspace.list`, `channel.list`, `mention.list`, `huddle.list`, `read.mark` on `selectChannel`, `channel.member.list`, etc. — easily 70+ RPCs in a 2-second burst. The bucket is exhausted, and subsequent calls return `Rate limited — retry in 167ms`.
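To make the budget math concrete, a toy model of the flat per-caller bucket (just the arithmetic, not the real `rate_limit.rs`):

```rust
// Flat 60/min budget shared by every method class: a 30-channel page load
// (~60 unread RPCs plus ~10 bootstrap RPCs) exhausts it mid-burst.
struct FlatBucket {
    limit: u32,
    used: u32,
}

impl FlatBucket {
    fn try_charge(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = FlatBucket { limit: 60, used: 0 };
    let allowed = (0..70).filter(|_| bucket.try_charge()).count();
    println!("{allowed}/70 calls allowed before throttling"); // prints "60/70 ..."
}
```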
This was invisible in dev mode (no `caller_id` → rate limiting skipped) — flipping to proxy mode revealed a previously dormant constraint that fights normal interactive use.

For comparison, Slack's API tiers put interactive reads (`conversations.history`, `channels.list`) at Tier 4 = 100+/min PER METHOD. Collab's 60/min was total across everything, roughly 10× tighter than industry norms.

Why coupled
If we only fix Half 1, the next read-heavy feature with a per-channel pattern hits the same wall again. If we only fix Half 2, the wasteful 2N pattern stays masked by a higher budget — but the server still does 2N round-trips per page load, multiplied by every active user. Both are needed: B1 eliminates the fan-out, C2 makes the remaining traffic correctly tiered.
Verification
Fix shape
B1 — Batched `read.unread_counts(workspace_id, user_id)` endpoint

New handler in `handlers/read.rs`. A single SQL query returns `{channel_id: count, ...}`. It hits the existing `idx_messages_channel(channel_id, created_at)` index and the `read_cursors` PK, and filters via `json_extract` (mirroring the existing `message::list` and `channel::list_workspace` patterns, since `deleted_at`/`archived_at` live inside the `data` JSON, not as columns).
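A minimal sketch of what that handler could look like, assuming rusqlite-style access and a last-read timestamp in the cursor (table and column names are inferred from the description above, not copied from the actual `handlers/read.rs`):

```rust
use std::collections::HashMap;

use rusqlite::Connection;

/// Batched unread counts for every channel in a workspace, in one query.
/// Assumes soft-delete/archive markers live inside the `data` JSON columns
/// and that `read_cursors` is keyed by (channel_id, user_id).
fn unread_counts(
    db: &Connection,
    workspace_id: i64,
    user_id: i64,
) -> rusqlite::Result<HashMap<i64, i64>> {
    let sql = r#"
        SELECT m.channel_id, COUNT(*) AS unread
        FROM messages m
        JOIN channels c ON c.id = m.channel_id
        LEFT JOIN read_cursors rc
               ON rc.channel_id = m.channel_id AND rc.user_id = ?2
        WHERE c.workspace_id = ?1
          AND json_extract(m.data, '$.deleted_at') IS NULL
          AND json_extract(c.data, '$.archived_at') IS NULL
          AND (rc.last_read_at IS NULL OR m.created_at > rc.last_read_at)
        GROUP BY m.channel_id
    "#;
    let mut stmt = db.prepare(sql)?;
    let rows = stmt.query_map((workspace_id, user_id), |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
    })?;
    rows.collect()
}
```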
Client: replace the `Promise.all(N)` loop with one `await rpc('read.unread_counts', {workspace_id, user_id})`. Coerce string keys to numbers so the badge-render lookup keeps working. The failure path preserves the last good badges (degraded display > blank).

C2 — Tiered rate limits per method class
New `MethodClass` enum in `rate_limit.rs`: `Read`/`Write`/`MessageSend`/`Typing`. New `classify_method(method)` function — a heuristic on the trailing verb after the last `.` (e.g. `*.list`/`*.get` → Read; default → Write — conservative under-classification), plus two specific overrides (`typing.relay`, `message.send`). `PerCallerLimits` gains a `read` bucket. `RateLimiter` gains a `read_per_min` field and a `new_tiered(read, global, send, typing)` constructor. The backward-compat `new(global, send)` is preserved (it equates `read = global`) so the existing 11 unit tests pass unchanged.

`check_method` dispatches via `classify_method` to charge the right bucket(s) (see the sketch after this list):

- `Read` → charges `read` only (own bucket; doesn't touch global)
- `Typing` → charges `typing` only (existing behavior)
- `Write` → charges `global` (existing behavior for non-message methods)
- `MessageSend` → charges `global` + `send` atomically (existing behavior)
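A sketch of the classification and dispatch, treating each bucket as a simple per-minute counter (simplified relative to the real `rate_limit.rs`; field and helper names are illustrative):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum MethodClass {
    Read,
    Write,
    MessageSend,
    Typing,
}

fn classify_method(method: &str) -> MethodClass {
    // Specific overrides first.
    match method {
        "message.send" => return MethodClass::MessageSend,
        "typing.relay" => return MethodClass::Typing,
        _ => {}
    }
    // Heuristic on the verb after the last '.'; unknown verbs stay Write,
    // i.e. conservative under-classification (over-throttle, never under-throttle).
    match method.rsplit('.').next().unwrap_or(method) {
        "list" | "get" => MethodClass::Read,
        _ => MethodClass::Write,
    }
}

struct Bucket {
    limit: u32,
    used: u32,
}

impl Bucket {
    fn has_room(&self) -> bool {
        self.used < self.limit
    }
    fn try_charge(&mut self) -> bool {
        let ok = self.has_room();
        if ok {
            self.used += 1;
        }
        ok
    }
}

struct PerCallerLimits {
    read: Bucket,
    global: Bucket,
    send: Bucket,
    typing: Bucket,
}

fn check_method(limits: &mut PerCallerLimits, method: &str) -> bool {
    match classify_method(method) {
        MethodClass::Read => limits.read.try_charge(),
        MethodClass::Typing => limits.typing.try_charge(),
        MethodClass::Write => limits.global.try_charge(),
        MethodClass::MessageSend => {
            // Charge global + send together; commit only if both have room.
            let ok = limits.global.has_room() && limits.send.has_room();
            if ok {
                limits.global.used += 1;
                limits.send.used += 1;
            }
            ok
        }
    }
}
```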
Production budget: `RateLimiter::new_tiered(600, 60, 60, 60)`. Reads get 10× the old budget; write/send/typing stay unchanged.

Behaviour after fix
PR
Fix in `feat/batched-unread-tiered-rate-limits`. ~230 LOC added across:

- `handlers/read.rs`: +52 (new `unread_counts` handler with single-query SQL)
- `rpc.rs`: +1 (dispatch wiring)
- `rate_limit.rs`: +159 (`MethodClass`, `classify_method`, read bucket, `new_tiered` constructor, `charge_read`/`charge_typing` helpers)
- `main.rs`: +13 (switch from `new(60, 10)` to `new_tiered(600, 60, 60, 60)`)
- `chat-app.js`: +20/−30 (single batched call replaces the fan-out loop)

No schema change. All 11 existing rate_limit unit tests still pass.
Follow-ups (separate, low-priority)
- Add `replies` and `ping` to the Read verb whitelist in `classify_method` — minor over-throttle today (e.g. `thread.replies` and `system.ping` are reads but get classified as Write under the conservative default).
- Push-based unread updates (`UserEvent::UnreadUpdate` emitted on `message.create` / `read.mark`) would eliminate the polling entirely. The scope is much bigger; this PR is the right minimal first step.