Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_proc#124
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The defect
crates/hero_proc_server/src/db/factory.rsexposes the DB viaconn: Arc<std::sync::Mutex<rusqlite::Connection>>. 57 call sites across that file dolet conn = self.conn.lock().unwrap();then a synchronous SQLite operation — all from inside async RPC handlers. None of the hot-path callers wrap intokio::task::spawn_blocking.This is the canonical Tokio anti-pattern. From tokio's own docs:
Under load, every async worker thread that hits a DB op blocks the OS thread holding it. Tokio can't reschedule unrelated tasks onto that thread until the lock releases. Once all N workers are blocked, the entire runtime stalls — RPC accept, timer ticks, supervisor poll, everything.
Empirical evidence this is real, not theoretical
service.statuscapped concurrent parallelism at ~9× under load. The fix landed ase833dc9 feat(rpc): add service.status_all bulk RPC— a workaround that bundles N status calls into one mutex acquisition. The underlying serialization point was never removed.hero_proc_serverwedges infutex_do_waitafter several hours uptime, all RPCs hang. Worker-pool saturation from blocking-in-async is exactly the kind of symptom this pattern can produce. Without the next stack trace I can't confirm causation, but it's one of the leading candidates.What "right" looks like — real-world consensus
Production Rust services that need SQLite under async universally use one of:
deadpool-sqlite(connection pool, fully async API)tokio-rusqlite(single-thread executor wrapper)sqlxwith SQLite backendr2d2-sqlite(sync pool) + every call wrapped inspawn_blockingHero_proc's current shape —
Arc<std::sync::Mutex<Connection>>accessed directly from async tasks — appears in tutorials labeled "what NOT to do." It works in low-traffic dev environments and degrades sharply at production load. That degradation is exactly what #121 measured.Proposed fix — Option C: connection pool
Migrate to
deadpool-sqlite(orr2d2-sqlite+spawn_blockingif a sync-pool style is preferred). The shape per call site:Mechanical change × 57 call sites, plus:
factory.rs::HeroProcDb::newsecrets/SPECS.mdto drop the old "spawn_blocking for long ops" prescription — replaced by the pool guaranteeReads can fan out across multiple connections (WAL allows N concurrent readers + 1 writer). Writes serialize behind a single writer. The #121 latency cap goes away; the #122 worker-starvation candidate goes away.
Alternative — Option B (cheaper but partial)
Wrap each of the 57 call sites in
spawn_blocking. Removes worker-starvation but keeps the single-mutex serialization point — #121's cap stays. Use only if pool migration is judged too invasive.Scope
factory.rs— pool init + 57 call site signaturesrpc/*.rs,supervisor/*.rs, etc.) — propagate theawaitandpool.get()Cargo.toml— adddeadpool-sqlitedependencysecrets/SPECS.md— drop the spawn_blocking convention note (no longer applies)Estimated: ~1-2 days for one developer including verifying
hero_proc_testbaseline stays green.RFC: db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites — keep as-is, wrap, or move to pool?to db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites is a real defect, not a trade-off (see #121 evidence; #122 candidate)Migration landed in two squash commits on
origin/development:a58bdf0— main migration (Arc<std::sync::Mutex<Connection>>→deadpool_sqlite::Pool, sub-API sync→async across ~410 caller sites, conditional UPDATE primitive for cancel-race fix, complete_wipe row-count fix, env::set_var soundness fix, Result-propagation in running_pids family). 4 rounds of external architect review (final: APPROVE).ef6d564— follow-ups closing #128 + #129 (adoptdb.transaction()in 3 multi-step handlers + adddelete_manybatch primitives + restorespawn_blockingaround git CLI in secret RPCs). 1 round of review (APPROVE) + cleanup commit for 2 real findings.Lib regression: 232/0 baseline → 242/0. Integration
--basicbaseline preserved (62 PASS / 7 FAIL identical, all pre-existing env-only). Stage 6 on a running release daemon: WAL+synchronous=NORMAL pragmas applied via post_create hook, 50× parallelservice.status_allruns in ~120–300ms (vs the ~9× serial cap measured in #121),wipe_allreturns truthful row counts, daemon log clean.Closes #128, #129 (separate filings tracking specific atomicity + batch-delete follow-ups already closed by
ef6d564).Remaining follow-up: #130 (dead
db/rpc.rsdispatcher) — deferred as separate hygiene concern, not made worse by this migration.Empirical confirmation of fix for #121 / #122 hypothesis #1 needs the next several-hours-uptime soak test under load. Tracking in #122 separately.
Post-merge perf verification on kristof5 (2026-05-25) — moving this here from #122 comment 36896 where it was off-topic for that issue (a wedge-debug thread).
Apples-to-apples benchmark: pre-#124 binary (
b29a3dc) vs post-#124 binary (3dafd6c), same DB snapshot, 100 supervisedtop -b -d 5services.Phase 1 / 2 — pure-read fan-out (no concurrent writes)
service.status_allservice.statusservice.statusOn this shape the pool's per-call overhead (~50-200 μs for acquire +
interactspawn-blocking handoff) exceeds the mutex's contention cost, because per-call DB work is sub-ms and there's no concurrent write pressure. #121's 9× cap did NOT reproduce in this workload on either binary (both achieved ~14× speedup), suggesting the cap lives in the next shape, not in pure reads.Phase 3 — mixed read-write (the workload that matters)
run.submit(write-heavy, multi-step txn)system.wipe_all+ 100× concurrentservice.statusreadswipe_allreportedrows_deletedSELECT changes()post-execute_batch)The
wipe_all+ concurrent-reads case is the smoking gun for where #121's cap actually lived. Pre-#124, every reader blocks for the FULL ~3.6 s duration of the wipe transaction becausestd::sync::Mutex<Connection>serializes all access. Post-#124, SQLite WAL lets readers proceed against the pre-wipe snapshot while the writer holds its own pool connection.The supervisor's normal operation has continuous background writes (executor state transitions, scheduler ticks, stats persist) — so the mutex serialization tax was being paid on every cockpit fan-out in production, even when individual writes were tiny.
Net verdict
transaction()helper consolidating 4 sub-API mutex acquisitions into one connection acquire + interact + commit)wipe_allitself + truthful row countPlus the 6 correctness fixes shipped alongside (truthful
complete_wipecount,env::set_varsoundness, cancel-vs-exit-status TOCTOU, multi-step handler atomicity, batchdelete_many,running_pidsResult propagation).Methodology lesson noted for future benchmarks: pure-read or pure-write shapes miss the dominant production load pattern. Always include mixed read-write under realistic concurrent write pressure. A "9× cap" almost always means reads waiting behind a write lock, not reads waiting behind reads.
hero_proc.dbhas no retention or VACUUM scheduling — grows unboundedly on long-lived daemons #131