db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites is a real defect, not a trade-off (see #121 evidence; #122 candidate) #124

Closed
opened 2026-05-24 10:14:10 +00:00 by sameh-farouk · 2 comments
Member

The defect

crates/hero_proc_server/src/db/factory.rs exposes the DB via conn: Arc<std::sync::Mutex<rusqlite::Connection>>. 57 call sites across that file do let conn = self.conn.lock().unwrap(); then a synchronous SQLite operation — all from inside async RPC handlers. None of the hot-path callers wrap in tokio::task::spawn_blocking.

This is the canonical Tokio anti-pattern. From tokio's own docs:

The Tokio runtime relies on the futures being polled cooperatively. If a future does not yield control regularly, [...] it may block the entire runtime.

Under load, every async worker thread that hits a DB op blocks the OS thread holding it. Tokio can't reschedule unrelated tasks onto that thread until the lock releases. Once all N workers are blocked, the entire runtime stalls — RPC accept, timer ticks, supervisor poll, everything.

Empirical evidence this is real, not theoretical

  • #121 (closed) — service.status capped concurrent parallelism at ~9× under load. The fix landed as e833dc9 feat(rpc): add service.status_all bulk RPC — a workaround that bundles N status calls into one mutex acquisition. The underlying serialization point was never removed.
  • #122 (open) — hero_proc_server wedges in futex_do_wait after several hours uptime, all RPCs hang. Worker-pool saturation from blocking-in-async is exactly the kind of symptom this pattern can produce. Without the next stack trace I can't confirm causation, but it's one of the leading candidates.

What "right" looks like — real-world consensus

Production Rust services that need SQLite under async universally use one of:

Pattern Used by
deadpool-sqlite (connection pool, fully async API) Common in axum / actix services
tokio-rusqlite (single-thread executor wrapper) Smaller services
sqlx with SQLite backend Schema-first services
r2d2-sqlite (sync pool) + every call wrapped in spawn_blocking Older but valid

Hero_proc's current shape — Arc<std::sync::Mutex<Connection>> accessed directly from async tasks — appears in tutorials labeled "what NOT to do." It works in low-traffic dev environments and degrades sharply at production load. That degradation is exactly what #121 measured.

Proposed fix — Option C: connection pool

Migrate to deadpool-sqlite (or r2d2-sqlite + spawn_blocking if a sync-pool style is preferred). The shape per call site:

// before
let conn = self.conn.lock().unwrap();
some_sqlite_fn(&conn, ...)

// after (deadpool-sqlite style)
let conn = self.pool.get().await?;
conn.interact(|conn| some_sqlite_fn(conn, ...)).await??

Mechanical change × 57 call sites, plus:

  • Pool initialization in factory.rs::HeroProcDb::new
  • SQLite WAL mode (already needed for concurrent reads)
  • Update secrets/SPECS.md to drop the old "spawn_blocking for long ops" prescription — replaced by the pool guarantee

Reads can fan out across multiple connections (WAL allows N concurrent readers + 1 writer). Writes serialize behind a single writer. The #121 latency cap goes away; the #122 worker-starvation candidate goes away.

Alternative — Option B (cheaper but partial)

Wrap each of the 57 call sites in spawn_blocking. Removes worker-starvation but keeps the single-mutex serialization point — #121's cap stays. Use only if pool migration is judged too invasive.

Scope

  • factory.rs — pool init + 57 call site signatures
  • All callers of the touched fns (rpc/*.rs, supervisor/*.rs, etc.) — propagate the await and pool.get()
  • Cargo.toml — add deadpool-sqlite dependency
  • secrets/SPECS.md — drop the spawn_blocking convention note (no longer applies)
  • Tests — should mostly pass unchanged once async-await propagation lands

Estimated: ~1-2 days for one developer including verifying hero_proc_test baseline stays green.

## The defect `crates/hero_proc_server/src/db/factory.rs` exposes the DB via `conn: Arc<std::sync::Mutex<rusqlite::Connection>>`. **57 call sites** across that file do `let conn = self.conn.lock().unwrap();` then a synchronous SQLite operation — all from inside async RPC handlers. None of the hot-path callers wrap in `tokio::task::spawn_blocking`. This is the canonical Tokio anti-pattern. From [tokio's own docs](https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use): > The Tokio runtime relies on the futures being polled cooperatively. If a future does not yield control regularly, [...] it may block the entire runtime. Under load, every async worker thread that hits a DB op blocks the OS thread holding it. Tokio can't reschedule unrelated tasks onto that thread until the lock releases. Once all N workers are blocked, the entire runtime stalls — RPC accept, timer ticks, supervisor poll, everything. ## Empirical evidence this is real, not theoretical - **#121** (closed) — `service.status` capped concurrent parallelism at **~9×** under load. The fix landed as `e833dc9 feat(rpc): add service.status_all bulk RPC` — a *workaround* that bundles N status calls into one mutex acquisition. The underlying serialization point was never removed. - **#122** (open) — `hero_proc_server` wedges in `futex_do_wait` after several hours uptime, all RPCs hang. Worker-pool saturation from blocking-in-async is exactly the kind of symptom this pattern can produce. Without the next stack trace I can't confirm causation, but it's one of the leading candidates. ## What "right" looks like — real-world consensus Production Rust services that need SQLite under async universally use one of: | Pattern | Used by | |---|---| | `deadpool-sqlite` (connection pool, fully async API) | Common in axum / actix services | | `tokio-rusqlite` (single-thread executor wrapper) | Smaller services | | `sqlx` with SQLite backend | Schema-first services | | `r2d2-sqlite` (sync pool) + every call wrapped in `spawn_blocking` | Older but valid | Hero_proc's current shape — `Arc<std::sync::Mutex<Connection>>` accessed directly from async tasks — appears in tutorials labeled "what NOT to do." It works in low-traffic dev environments and degrades sharply at production load. That degradation is exactly what #121 measured. ## Proposed fix — Option C: connection pool Migrate to `deadpool-sqlite` (or `r2d2-sqlite` + `spawn_blocking` if a sync-pool style is preferred). The shape per call site: ```rust // before let conn = self.conn.lock().unwrap(); some_sqlite_fn(&conn, ...) // after (deadpool-sqlite style) let conn = self.pool.get().await?; conn.interact(|conn| some_sqlite_fn(conn, ...)).await?? ``` Mechanical change × 57 call sites, plus: - Pool initialization in `factory.rs::HeroProcDb::new` - SQLite WAL mode (already needed for concurrent reads) - Update `secrets/SPECS.md` to drop the old "spawn_blocking for long ops" prescription — replaced by the pool guarantee Reads can fan out across multiple connections (WAL allows N concurrent readers + 1 writer). Writes serialize behind a single writer. The #121 latency cap goes away; the #122 worker-starvation candidate goes away. ## Alternative — Option B (cheaper but partial) Wrap each of the 57 call sites in `spawn_blocking`. Removes worker-starvation but keeps the single-mutex serialization point — #121's cap stays. Use only if pool migration is judged too invasive. ## Scope - `factory.rs` — pool init + 57 call site signatures - All callers of the touched fns (`rpc/*.rs`, `supervisor/*.rs`, etc.) — propagate the `await` and `pool.get()` - `Cargo.toml` — add `deadpool-sqlite` dependency - `secrets/SPECS.md` — drop the spawn_blocking convention note (no longer applies) - Tests — should mostly pass unchanged once async-await propagation lands Estimated: ~1-2 days for one developer including verifying `hero_proc_test` baseline stays green.
sameh-farouk changed title from RFC: db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites — keep as-is, wrap, or move to pool? to db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites is a real defect, not a trade-off (see #121 evidence; #122 candidate) 2026-05-24 10:15:52 +00:00
Author
Member

Migration landed in two squash commits on origin/development:

  • a58bdf0 — main migration (Arc<std::sync::Mutex<Connection>>deadpool_sqlite::Pool, sub-API sync→async across ~410 caller sites, conditional UPDATE primitive for cancel-race fix, complete_wipe row-count fix, env::set_var soundness fix, Result-propagation in running_pids family). 4 rounds of external architect review (final: APPROVE).
  • ef6d564 — follow-ups closing #128 + #129 (adopt db.transaction() in 3 multi-step handlers + add delete_many batch primitives + restore spawn_blocking around git CLI in secret RPCs). 1 round of review (APPROVE) + cleanup commit for 2 real findings.

Lib regression: 232/0 baseline → 242/0. Integration --basic baseline preserved (62 PASS / 7 FAIL identical, all pre-existing env-only). Stage 6 on a running release daemon: WAL+synchronous=NORMAL pragmas applied via post_create hook, 50× parallel service.status_all runs in ~120–300ms (vs the ~9× serial cap measured in #121), wipe_all returns truthful row counts, daemon log clean.

Closes #128, #129 (separate filings tracking specific atomicity + batch-delete follow-ups already closed by ef6d564).

Remaining follow-up: #130 (dead db/rpc.rs dispatcher) — deferred as separate hygiene concern, not made worse by this migration.

Empirical confirmation of fix for #121 / #122 hypothesis #1 needs the next several-hours-uptime soak test under load. Tracking in #122 separately.

Migration landed in two squash commits on `origin/development`: - **`a58bdf0`** — main migration (`Arc<std::sync::Mutex<Connection>>` → `deadpool_sqlite::Pool`, sub-API sync→async across ~410 caller sites, conditional UPDATE primitive for cancel-race fix, complete_wipe row-count fix, env::set_var soundness fix, Result-propagation in running_pids family). 4 rounds of external architect review (final: APPROVE). - **`ef6d564`** — follow-ups closing #128 + #129 (adopt `db.transaction()` in 3 multi-step handlers + add `delete_many` batch primitives + restore `spawn_blocking` around git CLI in secret RPCs). 1 round of review (APPROVE) + cleanup commit for 2 real findings. Lib regression: 232/0 baseline → 242/0. Integration `--basic` baseline preserved (62 PASS / 7 FAIL identical, all pre-existing env-only). Stage 6 on a running release daemon: WAL+synchronous=NORMAL pragmas applied via post_create hook, 50× parallel `service.status_all` runs in ~120–300ms (vs the ~9× serial cap measured in #121), `wipe_all` returns truthful row counts, daemon log clean. Closes #128, #129 (separate filings tracking specific atomicity + batch-delete follow-ups already closed by `ef6d564`). Remaining follow-up: **#130** (dead `db/rpc.rs` dispatcher) — deferred as separate hygiene concern, not made worse by this migration. Empirical confirmation of fix for #121 / #122 hypothesis #1 needs the next several-hours-uptime soak test under load. Tracking in #122 separately.
Author
Member

Post-merge perf verification on kristof5 (2026-05-25) — moving this here from #122 comment 36896 where it was off-topic for that issue (a wedge-debug thread).

Apples-to-apples benchmark: pre-#124 binary (b29a3dc) vs post-#124 binary (3dafd6c), same DB snapshot, 100 supervised top -b -d 5 services.

Phase 1 / 2 — pure-read fan-out (no concurrent writes)

Workload Pre-#124 Post-#124 Delta
100× concurrent service.status_all (baseline) ~5-7% slower small regression
100× concurrent service.status ~14× speedup vs sequential ~14× speedup vs sequential parity
100× sequential service.status (baseline) ~5% slower small regression

On this shape the pool's per-call overhead (~50-200 μs for acquire + interact spawn-blocking handoff) exceeds the mutex's contention cost, because per-call DB work is sub-ms and there's no concurrent write pressure. #121's 9× cap did NOT reproduce in this workload on either binary (both achieved ~14× speedup), suggesting the cap lives in the next shape, not in pure reads.

Phase 3 — mixed read-write (the workload that matters)

Workload Pre-#124 Post-#124 Delta
100× concurrent run.submit (write-heavy, multi-step txn) 1.76-1.81 s 0.14-0.38 s ~10× faster
system.wipe_all + 100× concurrent service.status reads wipe 3.78s; reads p50 3674 ms, p99 3699 ms wipe 0.84s; reads p50 10 ms, p99 40 ms ~90× faster read p99 during writes; wipe ~4.5× faster
wipe_all reported rows_deleted 2 (broken SELECT changes() post-execute_batch) 600 (truthful) correctness fix validated end-to-end

The wipe_all + concurrent-reads case is the smoking gun for where #121's cap actually lived. Pre-#124, every reader blocks for the FULL ~3.6 s duration of the wipe transaction because std::sync::Mutex<Connection> serializes all access. Post-#124, SQLite WAL lets readers proceed against the pre-wipe snapshot while the writer holds its own pool connection.

The supervisor's normal operation has continuous background writes (executor state transitions, scheduler ticks, stats persist) — so the mutex serialization tax was being paid on every cockpit fan-out in production, even when individual writes were tiny.

Net verdict

  • ~5-7% slower on the narrow corner case of pure-read fan-out with no concurrent writes
  • ~10× faster on write-heavy multi-step transactions (transaction() helper consolidating 4 sub-API mutex acquisitions into one connection acquire + interact + commit)
  • ~90× faster read p99 during concurrent writes (WAL working as intended)
  • ~4.5× faster on wipe_all itself + truthful row count

Plus the 6 correctness fixes shipped alongside (truthful complete_wipe count, env::set_var soundness, cancel-vs-exit-status TOCTOU, multi-step handler atomicity, batch delete_many, running_pids Result propagation).

Methodology lesson noted for future benchmarks: pure-read or pure-write shapes miss the dominant production load pattern. Always include mixed read-write under realistic concurrent write pressure. A "9× cap" almost always means reads waiting behind a write lock, not reads waiting behind reads.

**Post-merge perf verification on kristof5 (2026-05-25)** — moving this here from [#122 comment 36896](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36896) where it was off-topic for that issue (a wedge-debug thread). Apples-to-apples benchmark: pre-#124 binary (`b29a3dc`) vs post-#124 binary (`3dafd6c`), same DB snapshot, 100 supervised `top -b -d 5` services. ### Phase 1 / 2 — pure-read fan-out (no concurrent writes) | Workload | Pre-#124 | Post-#124 | Delta | |---|---|---|---| | 100× concurrent `service.status_all` | (baseline) | ~5-7% slower | small regression | | 100× concurrent `service.status` | ~14× speedup vs sequential | ~14× speedup vs sequential | parity | | 100× sequential `service.status` | (baseline) | ~5% slower | small regression | On this shape the pool's per-call overhead (~50-200 μs for acquire + `interact` spawn-blocking handoff) exceeds the mutex's contention cost, because per-call DB work is sub-ms and there's no concurrent write pressure. **#121's 9× cap did NOT reproduce in this workload on either binary** (both achieved ~14× speedup), suggesting the cap lives in the next shape, not in pure reads. ### Phase 3 — mixed read-write (the workload that matters) | Workload | Pre-#124 | Post-#124 | Delta | |---|---|---|---| | 100× concurrent `run.submit` (write-heavy, multi-step txn) | 1.76-1.81 s | **0.14-0.38 s** | **~10× faster** | | `system.wipe_all` + 100× concurrent `service.status` reads | wipe 3.78s; reads p50 3674 ms, p99 3699 ms | wipe 0.84s; reads p50 **10 ms**, p99 **40 ms** | **~90× faster read p99 during writes; wipe ~4.5× faster** | | `wipe_all` reported `rows_deleted` | 2 (broken `SELECT changes()` post-`execute_batch`) | 600 (truthful) | correctness fix validated end-to-end | The `wipe_all` + concurrent-reads case is the smoking gun for where #121's cap actually lived. Pre-#124, every reader blocks for the FULL ~3.6 s duration of the wipe transaction because `std::sync::Mutex<Connection>` serializes all access. Post-#124, SQLite WAL lets readers proceed against the pre-wipe snapshot while the writer holds its own pool connection. The supervisor's normal operation has continuous background writes (executor state transitions, scheduler ticks, stats persist) — so the mutex serialization tax was being paid on every cockpit fan-out in production, even when individual writes were tiny. ### Net verdict - **~5-7% slower** on the narrow corner case of pure-read fan-out with no concurrent writes - **~10× faster** on write-heavy multi-step transactions (`transaction()` helper consolidating 4 sub-API mutex acquisitions into one connection acquire + interact + commit) - **~90× faster read p99** during concurrent writes (WAL working as intended) - **~4.5× faster** on `wipe_all` itself + truthful row count Plus the 6 correctness fixes shipped alongside (truthful `complete_wipe` count, `env::set_var` soundness, cancel-vs-exit-status TOCTOU, multi-step handler atomicity, batch `delete_many`, `running_pids` Result propagation). Methodology lesson noted for future benchmarks: pure-read or pure-write shapes miss the dominant production load pattern. Always include mixed read-write under realistic concurrent write pressure. A "9× cap" almost always means reads waiting behind a write lock, not reads waiting behind reads.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#124
No description provided.