lhumina_code/hero_proc

Fork 0

+ 57 hot-path lock-in-async sites is a real defect, not a trade-off (see #121 evidence; #122 candidate) #124

New issue

Closed

opened 2026-05-24 10:14:10 +00:00 by sameh-farouk · 2 comments

sameh-farouk commented

2026-05-24 10:14:10 +00:00

Member

The defect

crates/hero_proc_server/src/db/factory.rs exposes the DB via conn: Arc<std::sync::Mutex<rusqlite::Connection>>. 57 call sites across that file do let conn = self.conn.lock().unwrap(); then a synchronous SQLite operation — all from inside async RPC handlers. None of the hot-path callers wrap in tokio::task::spawn_blocking.

This is the canonical Tokio anti-pattern. From tokio's own docs:

The Tokio runtime relies on the futures being polled cooperatively. If a future does not yield control regularly, [...] it may block the entire runtime.

Under load, every async worker thread that hits a DB op blocks the OS thread holding it. Tokio can't reschedule unrelated tasks onto that thread until the lock releases. Once all N workers are blocked, the entire runtime stalls — RPC accept, timer ticks, supervisor poll, everything.

Empirical evidence this is real, not theoretical

#121 (closed) — service.status capped concurrent parallelism at ~9× under load. The fix landed as e833dc9 feat(rpc): add service.status_all bulk RPC — a workaround that bundles N status calls into one mutex acquisition. The underlying serialization point was never removed.
#122 (open) — hero_proc_server wedges in futex_do_wait after several hours uptime, all RPCs hang. Worker-pool saturation from blocking-in-async is exactly the kind of symptom this pattern can produce. Without the next stack trace I can't confirm causation, but it's one of the leading candidates.

What "right" looks like — real-world consensus

Production Rust services that need SQLite under async universally use one of:

Pattern	Used by
`deadpool-sqlite` (connection pool, fully async API)	Common in axum / actix services
`tokio-rusqlite` (single-thread executor wrapper)	Smaller services
`sqlx` with SQLite backend	Schema-first services
`r2d2-sqlite` (sync pool) + every call wrapped in `spawn_blocking`	Older but valid

Hero_proc's current shape — Arc<std::sync::Mutex<Connection>> accessed directly from async tasks — appears in tutorials labeled "what NOT to do." It works in low-traffic dev environments and degrades sharply at production load. That degradation is exactly what #121 measured.

Proposed fix — Option C: connection pool

Migrate to deadpool-sqlite (or r2d2-sqlite + spawn_blocking if a sync-pool style is preferred). The shape per call site:

// before
let conn = self.conn.lock().unwrap();
some_sqlite_fn(&conn, ...)

// after (deadpool-sqlite style)
let conn = self.pool.get().await?;
conn.interact(|conn| some_sqlite_fn(conn, ...)).await??

Mechanical change × 57 call sites, plus:

Pool initialization in factory.rs::HeroProcDb::new
SQLite WAL mode (already needed for concurrent reads)
Update secrets/SPECS.md to drop the old "spawn_blocking for long ops" prescription — replaced by the pool guarantee

Reads can fan out across multiple connections (WAL allows N concurrent readers + 1 writer). Writes serialize behind a single writer. The #121 latency cap goes away; the #122 worker-starvation candidate goes away.

Alternative — Option B (cheaper but partial)

Wrap each of the 57 call sites in spawn_blocking. Removes worker-starvation but keeps the single-mutex serialization point — #121's cap stays. Use only if pool migration is judged too invasive.

Scope

factory.rs — pool init + 57 call site signatures
All callers of the touched fns (rpc/*.rs, supervisor/*.rs, etc.) — propagate the await and pool.get()
Cargo.toml — add deadpool-sqlite dependency
secrets/SPECS.md — drop the spawn_blocking convention note (no longer applies)
Tests — should mostly pass unchanged once async-await propagation lands

Estimated: ~1-2 days for one developer including verifying hero_proc_test baseline stays green.

## The defect `crates/hero_proc_server/src/db/factory.rs` exposes the DB via `conn: Arc<std::sync::Mutex<rusqlite::Connection>>`. **57 call sites** across that file do `let conn = self.conn.lock().unwrap();` then a synchronous SQLite operation — all from inside async RPC handlers. None of the hot-path callers wrap in `tokio::task::spawn_blocking`. This is the canonical Tokio anti-pattern. From [tokio's own docs](https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use): > The Tokio runtime relies on the futures being polled cooperatively. If a future does not yield control regularly, [...] it may block the entire runtime. Under load, every async worker thread that hits a DB op blocks the OS thread holding it. Tokio can't reschedule unrelated tasks onto that thread until the lock releases. Once all N workers are blocked, the entire runtime stalls — RPC accept, timer ticks, supervisor poll, everything. ## Empirical evidence this is real, not theoretical - **#121** (closed) — `service.status` capped concurrent parallelism at **~9×** under load. The fix landed as `e833dc9 feat(rpc): add service.status_all bulk RPC` — a *workaround* that bundles N status calls into one mutex acquisition. The underlying serialization point was never removed. - **#122** (open) — `hero_proc_server` wedges in `futex_do_wait` after several hours uptime, all RPCs hang. Worker-pool saturation from blocking-in-async is exactly the kind of symptom this pattern can produce. Without the next stack trace I can't confirm causation, but it's one of the leading candidates. ## What "right" looks like — real-world consensus Production Rust services that need SQLite under async universally use one of: | Pattern | Used by | |---|---| | `deadpool-sqlite` (connection pool, fully async API) | Common in axum / actix services | | `tokio-rusqlite` (single-thread executor wrapper) | Smaller services | | `sqlx` with SQLite backend | Schema-first services | | `r2d2-sqlite` (sync pool) + every call wrapped in `spawn_blocking` | Older but valid | Hero_proc's current shape — `Arc<std::sync::Mutex<Connection>>` accessed directly from async tasks — appears in tutorials labeled "what NOT to do." It works in low-traffic dev environments and degrades sharply at production load. That degradation is exactly what #121 measured. ## Proposed fix — Option C: connection pool Migrate to `deadpool-sqlite` (or `r2d2-sqlite` + `spawn_blocking` if a sync-pool style is preferred). The shape per call site: ```rust // before let conn = self.conn.lock().unwrap(); some_sqlite_fn(&conn, ...) // after (deadpool-sqlite style) let conn = self.pool.get().await?; conn.interact(|conn| some_sqlite_fn(conn, ...)).await?? ``` Mechanical change × 57 call sites, plus: - Pool initialization in `factory.rs::HeroProcDb::new` - SQLite WAL mode (already needed for concurrent reads) - Update `secrets/SPECS.md` to drop the old "spawn_blocking for long ops" prescription — replaced by the pool guarantee Reads can fan out across multiple connections (WAL allows N concurrent readers + 1 writer). Writes serialize behind a single writer. The #121 latency cap goes away; the #122 worker-starvation candidate goes away. ## Alternative — Option B (cheaper but partial) Wrap each of the 57 call sites in `spawn_blocking`. Removes worker-starvation but keeps the single-mutex serialization point — #121's cap stays. Use only if pool migration is judged too invasive. ## Scope - `factory.rs` — pool init + 57 call site signatures - All callers of the touched fns (`rpc/*.rs`, `supervisor/*.rs`, etc.) — propagate the `await` and `pool.get()` - `Cargo.toml` — add `deadpool-sqlite` dependency - `secrets/SPECS.md` — drop the spawn_blocking convention note (no longer applies) - Tests — should mostly pass unchanged once async-await propagation lands Estimated: ~1-2 days for one developer including verifying `hero_proc_test` baseline stays green.

sameh-farouk changed title from ~~RFC: db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites — keep as-is, wrap, or move to pool?~~ to db/factory.rs Arc<std::sync::Mutex<Connection>> + 57 hot-path lock-in-async sites is a real defect, not a trade-off (see #121 evidence; #122 candidate)

2026-05-24 10:15:52 +00:00

sameh-farouk referenced this issue

2026-05-24 10:55:47 +00:00

hero_proc_server wedges in futex_do_wait after several hours, drops all RPCs (2nd recurrence) #122

sameh-farouk referenced this issue from a commit

2026-05-25 03:12:17 +00:00

feat(db): #124 migrate HeroProcDb from Arc<Mutex<Connection>> to deadpool-sqlite pool

sameh-farouk referenced this issue

2026-05-25 03:42:06 +00:00

Multi-step RPC handlers (run.submit, create_with_jobs, service_define) are non-atomic — adopt db.transaction() helper #128

sameh-farouk referenced this issue

2026-05-25 03:42:50 +00:00

Add JobsApi/RunsApi/ActionsApi::delete_many for batch cleanup paths (N round-trips today) #129

sameh-farouk referenced this issue

2026-05-25 03:43:22 +00:00

Remove dead duplicate dispatcher: crates/hero_proc_server/src/db/rpc.rs (no callers) #130

sameh-farouk referenced this issue from a commit

2026-05-25 04:48:03 +00:00

fix(db): #128 #129 follow-ups — adopt transaction() in submit/create_with_jobs/define + delete_many + spawn_blocking for git CLI

sameh-farouk referenced this issue

2026-05-25 04:48:38 +00:00

Multi-step RPC handlers (run.submit, create_with_jobs, service_define) are non-atomic — adopt db.transaction() helper #128

sameh-farouk referenced this issue

2026-05-25 04:48:39 +00:00

Add JobsApi/RunsApi/ActionsApi::delete_many for batch cleanup paths (N round-trips today) #129

sameh-farouk commented

2026-05-25 10:30:06 +00:00

Author

Member

Migration landed in two squash commits on origin/development:

a58bdf0 — main migration (Arc<std::sync::Mutex<Connection>> → deadpool_sqlite::Pool, sub-API sync→async across ~410 caller sites, conditional UPDATE primitive for cancel-race fix, complete_wipe row-count fix, env::set_var soundness fix, Result-propagation in running_pids family). 4 rounds of external architect review (final: APPROVE).
ef6d564 — follow-ups closing #128 + #129 (adopt db.transaction() in 3 multi-step handlers + add delete_many batch primitives + restore spawn_blocking around git CLI in secret RPCs). 1 round of review (APPROVE) + cleanup commit for 2 real findings.

Lib regression: 232/0 baseline → 242/0. Integration --basic baseline preserved (62 PASS / 7 FAIL identical, all pre-existing env-only). Stage 6 on a running release daemon: WAL+synchronous=NORMAL pragmas applied via post_create hook, 50× parallel service.status_all runs in ~120–300ms (vs the ~9× serial cap measured in #121), wipe_all returns truthful row counts, daemon log clean.

Closes #128, #129 (separate filings tracking specific atomicity + batch-delete follow-ups already closed by ef6d564).

Remaining follow-up: #130 (dead db/rpc.rs dispatcher) — deferred as separate hygiene concern, not made worse by this migration.

Empirical confirmation of fix for #121 / #122 hypothesis #1 needs the next several-hours-uptime soak test under load. Tracking in #122 separately.

Migration landed in two squash commits on `origin/development`: - **`a58bdf0`** — main migration (`Arc<std::sync::Mutex<Connection>>` → `deadpool_sqlite::Pool`, sub-API sync→async across ~410 caller sites, conditional UPDATE primitive for cancel-race fix, complete_wipe row-count fix, env::set_var soundness fix, Result-propagation in running_pids family). 4 rounds of external architect review (final: APPROVE). - **`ef6d564`** — follow-ups closing #128 + #129 (adopt `db.transaction()` in 3 multi-step handlers + add `delete_many` batch primitives + restore `spawn_blocking` around git CLI in secret RPCs). 1 round of review (APPROVE) + cleanup commit for 2 real findings. Lib regression: 232/0 baseline → 242/0. Integration `--basic` baseline preserved (62 PASS / 7 FAIL identical, all pre-existing env-only). Stage 6 on a running release daemon: WAL+synchronous=NORMAL pragmas applied via post_create hook, 50× parallel `service.status_all` runs in ~120–300ms (vs the ~9× serial cap measured in #121), `wipe_all` returns truthful row counts, daemon log clean. Closes #128, #129 (separate filings tracking specific atomicity + batch-delete follow-ups already closed by `ef6d564`). Remaining follow-up: **#130** (dead `db/rpc.rs` dispatcher) — deferred as separate hygiene concern, not made worse by this migration. Empirical confirmation of fix for #121 / #122 hypothesis #1 needs the next several-hours-uptime soak test under load. Tracking in #122 separately.

sameh-farouk closed this issue

2026-05-25 10:30:10 +00:00

sameh-farouk referenced this issue

2026-05-25 10:30:45 +00:00

hero_proc_server wedges in futex_do_wait after several hours, drops all RPCs (2nd recurrence) #122

sameh-farouk referenced this issue

2026-05-25 10:31:17 +00:00

service.status RPC latency under concurrent load limits effective parallelism to ~9x #121

sameh-farouk referenced this issue

2026-05-25 13:38:06 +00:00

hero_proc_server wedges in futex_do_wait after several hours, drops all RPCs (2nd recurrence) #122

sameh-farouk commented

2026-05-26 00:38:10 +00:00

Author

Member

Post-merge perf verification on kristof5 (2026-05-25) — moving this here from #122 comment 36896 where it was off-topic for that issue (a wedge-debug thread).

Apples-to-apples benchmark: pre-#124 binary (b29a3dc) vs post-#124 binary (3dafd6c), same DB snapshot, 100 supervised top -b -d 5 services.

Phase 1 / 2 — pure-read fan-out (no concurrent writes)

Workload	Pre-#124	Post-#124	Delta
100× concurrent `service.status_all`	(baseline)	~5-7% slower	small regression
100× concurrent `service.status`	~14× speedup vs sequential	~14× speedup vs sequential	parity
100× sequential `service.status`	(baseline)	~5% slower	small regression

On this shape the pool's per-call overhead (~50-200 μs for acquire + interact spawn-blocking handoff) exceeds the mutex's contention cost, because per-call DB work is sub-ms and there's no concurrent write pressure. #121's 9× cap did NOT reproduce in this workload on either binary (both achieved ~14× speedup), suggesting the cap lives in the next shape, not in pure reads.

Phase 3 — mixed read-write (the workload that matters)

Workload	Pre-#124	Post-#124	Delta
100× concurrent `run.submit` (write-heavy, multi-step txn)	1.76-1.81 s	0.14-0.38 s	~10× faster
`system.wipe_all` + 100× concurrent `service.status` reads	wipe 3.78s; reads p50 3674 ms, p99 3699 ms	wipe 0.84s; reads p50 10 ms, p99 40 ms	~90× faster read p99 during writes; wipe ~4.5× faster
`wipe_all` reported `rows_deleted`	2 (broken `SELECT changes()` post-`execute_batch`)	600 (truthful)	correctness fix validated end-to-end

The wipe_all + concurrent-reads case is the smoking gun for where #121's cap actually lived. Pre-#124, every reader blocks for the FULL ~3.6 s duration of the wipe transaction because std::sync::Mutex<Connection> serializes all access. Post-#124, SQLite WAL lets readers proceed against the pre-wipe snapshot while the writer holds its own pool connection.

The supervisor's normal operation has continuous background writes (executor state transitions, scheduler ticks, stats persist) — so the mutex serialization tax was being paid on every cockpit fan-out in production, even when individual writes were tiny.

Net verdict

~5-7% slower on the narrow corner case of pure-read fan-out with no concurrent writes
~10× faster on write-heavy multi-step transactions (transaction() helper consolidating 4 sub-API mutex acquisitions into one connection acquire + interact + commit)
~90× faster read p99 during concurrent writes (WAL working as intended)
~4.5× faster on wipe_all itself + truthful row count

Plus the 6 correctness fixes shipped alongside (truthful complete_wipe count, env::set_var soundness, cancel-vs-exit-status TOCTOU, multi-step handler atomicity, batch delete_many, running_pids Result propagation).

Methodology lesson noted for future benchmarks: pure-read or pure-write shapes miss the dominant production load pattern. Always include mixed read-write under realistic concurrent write pressure. A "9× cap" almost always means reads waiting behind a write lock, not reads waiting behind reads.

**Post-merge perf verification on kristof5 (2026-05-25)** — moving this here from [#122 comment 36896](https://forge.ourworld.tf/lhumina_code/hero_proc/issues/122#issuecomment-36896) where it was off-topic for that issue (a wedge-debug thread). Apples-to-apples benchmark: pre-#124 binary (`b29a3dc`) vs post-#124 binary (`3dafd6c`), same DB snapshot, 100 supervised `top -b -d 5` services. ### Phase 1 / 2 — pure-read fan-out (no concurrent writes) | Workload | Pre-#124 | Post-#124 | Delta | |---|---|---|---| | 100× concurrent `service.status_all` | (baseline) | ~5-7% slower | small regression | | 100× concurrent `service.status` | ~14× speedup vs sequential | ~14× speedup vs sequential | parity | | 100× sequential `service.status` | (baseline) | ~5% slower | small regression | On this shape the pool's per-call overhead (~50-200 μs for acquire + `interact` spawn-blocking handoff) exceeds the mutex's contention cost, because per-call DB work is sub-ms and there's no concurrent write pressure. **#121's 9× cap did NOT reproduce in this workload on either binary** (both achieved ~14× speedup), suggesting the cap lives in the next shape, not in pure reads. ### Phase 3 — mixed read-write (the workload that matters) | Workload | Pre-#124 | Post-#124 | Delta | |---|---|---|---| | 100× concurrent `run.submit` (write-heavy, multi-step txn) | 1.76-1.81 s | **0.14-0.38 s** | **~10× faster** | | `system.wipe_all` + 100× concurrent `service.status` reads | wipe 3.78s; reads p50 3674 ms, p99 3699 ms | wipe 0.84s; reads p50 **10 ms**, p99 **40 ms** | **~90× faster read p99 during writes; wipe ~4.5× faster** | | `wipe_all` reported `rows_deleted` | 2 (broken `SELECT changes()` post-`execute_batch`) | 600 (truthful) | correctness fix validated end-to-end | The `wipe_all` + concurrent-reads case is the smoking gun for where #121's cap actually lived. Pre-#124, every reader blocks for the FULL ~3.6 s duration of the wipe transaction because `std::sync::Mutex<Connection>` serializes all access. Post-#124, SQLite WAL lets readers proceed against the pre-wipe snapshot while the writer holds its own pool connection. The supervisor's normal operation has continuous background writes (executor state transitions, scheduler ticks, stats persist) — so the mutex serialization tax was being paid on every cockpit fan-out in production, even when individual writes were tiny. ### Net verdict - **~5-7% slower** on the narrow corner case of pure-read fan-out with no concurrent writes - **~10× faster** on write-heavy multi-step transactions (`transaction()` helper consolidating 4 sub-API mutex acquisitions into one connection acquire + interact + commit) - **~90× faster read p99** during concurrent writes (WAL working as intended) - **~4.5× faster** on `wipe_all` itself + truthful row count Plus the 6 correctness fixes shipped alongside (truthful `complete_wipe` count, `env::set_var` soundness, cancel-vs-exit-status TOCTOU, multi-step handler atomicity, batch `delete_many`, `running_pids` Result propagation). Methodology lesson noted for future benchmarks: pure-read or pure-write shapes miss the dominant production load pattern. Always include mixed read-write under realistic concurrent write pressure. A "9× cap" almost always means reads waiting behind a write lock, not reads waiting behind reads.

sameh-farouk referenced this issue

2026-05-26 01:15:50 +00:00

Supervisor hero_proc.db has no retention or VACUUM scheduling — grows unboundedly on long-lived daemons #131