Multi-step RPC handlers (run.submit, create_with_jobs, service_define) are non-atomic — adopt db.transaction() helper #128

Closed
opened 2026-05-25 03:42:06 +00:00 by sameh-farouk · 1 comment
Member

What

After #124 landed (a58bdf0), several RPC handlers that issue multiple sequential db.*.*().await calls are non-atomic. The race class existed before #124 (the old Arc<Mutex<Connection>> was acquired per sub-API call, not per handler), but the pool migration widened the race window from "scheduler-overhead nanoseconds between sync sub-API calls" to "any task can run between .await points". The probability of a real-world interleaving is now orders of magnitude higher.

Concrete handlers affected

  • crates/hero_proc_server/src/rpc/run.rs::handle_submit (lines ~5826+): inserts a Run, then loops over actions issuing db.actions.get().await, db.jobs.set().await, db.runs.add_job().await. A concurrent service.stop or another run.submit can interleave at any .await boundary. Has a hand-rolled rollback_submit async helper but the rollback itself can race.
  • crates/hero_proc_server/src/rpc/run.rs::handle_create_with_jobs: same shape.
  • crates/hero_proc_server/src/service/define.rs::service_define_inner: writes 1-3 actions then 1 service. A concurrent reader can observe a partially-defined service (actions exist, service doesn't yet, or vice versa).

User-visible failure mode

run.submit returns success while the run's child jobs failed to insert → UI shows "submitted run with no jobs". Worse: orphaned Jobs whose Run was rolled back but the rollback closure raced with another writer.

Suggested fix

#124 shipped HeroProcDb::transaction(|tx| ...) specifically as the fix vehicle for this race class. Each handler's body should be moved inside a transaction closure so the entire multi-step write is one atomic SQLite transaction on one pooled connection. The closure body becomes sync rusqlite (no .await inside interact) — error responses are built in the async caller from the closure's Result.

Scope: ~500-1000 LOC of restructuring across the 3 handlers + regression tests for racing concurrent submits.

References

  • Found during external architect review of #124 (re-review v4); reviewer marked non-blocking with transaction() helper as the fix vehicle.
  • Related (handler-level race class, different symptom): #106 (service add-job overwrites existing actions).
## What After #124 landed (`a58bdf0`), several RPC handlers that issue multiple sequential `db.*.*().await` calls are non-atomic. The race class existed before #124 (the old `Arc<Mutex<Connection>>` was acquired per sub-API call, not per handler), but the pool migration widened the race window from "scheduler-overhead nanoseconds between sync sub-API calls" to "any task can run between `.await` points". The probability of a real-world interleaving is now orders of magnitude higher. ## Concrete handlers affected - `crates/hero_proc_server/src/rpc/run.rs::handle_submit` (lines ~5826+): inserts a Run, then loops over actions issuing `db.actions.get().await`, `db.jobs.set().await`, `db.runs.add_job().await`. A concurrent `service.stop` or another `run.submit` can interleave at any `.await` boundary. Has a hand-rolled `rollback_submit` async helper but the rollback itself can race. - `crates/hero_proc_server/src/rpc/run.rs::handle_create_with_jobs`: same shape. - `crates/hero_proc_server/src/service/define.rs::service_define_inner`: writes 1-3 actions then 1 service. A concurrent reader can observe a partially-defined service (actions exist, service doesn't yet, or vice versa). ## User-visible failure mode `run.submit` returns success while the run's child jobs failed to insert → UI shows "submitted run with no jobs". Worse: orphaned Jobs whose Run was rolled back but the rollback closure raced with another writer. ## Suggested fix #124 shipped `HeroProcDb::transaction(|tx| ...)` specifically as the fix vehicle for this race class. Each handler's body should be moved inside a transaction closure so the entire multi-step write is one atomic SQLite transaction on one pooled connection. The closure body becomes sync rusqlite (no `.await` inside `interact`) — error responses are built in the async caller from the closure's `Result`. Scope: ~500-1000 LOC of restructuring across the 3 handlers + regression tests for racing concurrent submits. ## References - Found during external architect review of #124 (re-review v4); reviewer marked non-blocking with `transaction()` helper as the fix vehicle. - Related (handler-level race class, different symptom): #106 (service add-job overwrites existing actions).
Author
Member

Closed by ef6d564 on origin/development.

Summary in commit body: adopted db.transaction() in run.submit / run.create_with_jobs / service_define_inner (#128); added JobsApi/RunsApi/ActionsApi::delete_many and swapped 2 known loop sites (#129); also restored spawn_blocking around git CLI in do_pull_secrets/do_push_secrets (worker-stall regression from #124).

Verification: lib regression 242/0; Stage 6 release daemon clean (run.submit + wipe_all + 50× parallel status_all + 20-job churn all green); integration baseline preserved (62 PASS / 7 FAIL, all pre-existing env-only).

External architect review: APPROVE, 5 non-blocking weaknesses (2 fixed in b2fd17c cleanup before squash: dead actions_model alias, now_ms clock-skew window).

Closed by `ef6d564` on `origin/development`. Summary in commit body: adopted `db.transaction()` in `run.submit` / `run.create_with_jobs` / `service_define_inner` (#128); added `JobsApi/RunsApi/ActionsApi::delete_many` and swapped 2 known loop sites (#129); also restored `spawn_blocking` around git CLI in `do_pull_secrets`/`do_push_secrets` (worker-stall regression from #124). Verification: lib regression 242/0; Stage 6 release daemon clean (run.submit + wipe_all + 50× parallel status_all + 20-job churn all green); integration baseline preserved (62 PASS / 7 FAIL, all pre-existing env-only). External architect review: APPROVE, 5 non-blocking weaknesses (2 fixed in `b2fd17c` cleanup before squash: dead actions_model alias, now_ms clock-skew window).
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#128
No description provided.