fix(embedder_server): retry daemon probe with backoff and use spawn_blocking on startup #24

Merged
salmaelsoly merged 1 commit from development_embedder_startup_race into development 2026-04-27 12:09:12 +00:00

Summary

Fixes #23. hero_embedder_server now reliably comes up on a cold service_embedder start --reset without a manual hero_proc job retry, even when hero_embedderd takes several seconds to mmap its ONNX models.

Two coordinated changes in crates/hero_embedder_server/src/main.rs:

  1. tokio::task::spawn_blocking around discover_embedderd. discover_embedderd builds a reqwest::blocking::Client and probes /health synchronously. Calling it directly from the #[tokio::main] async runtime panics on drop of its internal blocking runtime: Cannot drop a runtime in a context where blocking is not allowed. Pushing the call onto a blocking thread fixes the panic and is the standard pattern when reusing the existing sync client API.
  2. Retry-with-backoff in discover_embedderd. The probe now retries is_reachable() for STARTUP_PROBE_BUDGET = 30s with exponential backoff (200ms → 2s cap). This covers a cold ONNX model load (~5–8s on this hardware) with significant headroom. After the budget expires, the existing actionable error message is still emitted, so genuine misconfiguration still surfaces.
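
For readers who want the shape of change 2 without opening the diff, here is a minimal std-only sketch of a budgeted retry loop. It is not the code from main.rs: the helper name wait_until_reachable, its signature, and the doubling factor are assumptions for illustration (the PR only states exponential backoff from 200ms up to a 2s cap within a 30s budget), and the closure stands in for is_reachable().

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `probe` until it returns true or `budget` elapses.
/// The delay starts at `initial` and doubles up to `cap` (doubling is an assumption).
/// Returns the number of attempts on success, or an error message on timeout.
fn wait_until_reachable<F>(
    mut probe: F,
    budget: Duration,
    initial: Duration,
    cap: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + budget;
    let mut delay = initial;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        if probe() {
            return Ok(attempts);
        }
        let now = Instant::now();
        if now >= deadline {
            return Err(format!(
                "daemon not reachable after {attempts} attempts within {budget:?}"
            ));
        }
        // Never sleep past the deadline, so the budget is a hard ceiling.
        thread::sleep(delay.min(deadline - now));
        delay = (delay * 2).min(cap);
    }
}

fn main() {
    // Stand-in for is_reachable(): fails twice, then succeeds.
    let mut calls = 0;
    let result = wait_until_reachable(
        || {
            calls += 1;
            calls >= 3
        },
        Duration::from_secs(30),
        Duration::from_millis(200),
        Duration::from_secs(2),
    );
    println!("{result:?}");
}
```

Because the loop sleeps with std::thread::sleep, it is blocking code, which is consistent with change 1: per the summary, the server runs discover_embedderd inside tokio::task::spawn_blocking so that the blocking client and this kind of wait stay off the async worker threads.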

Verified end-to-end

service_embedder start --reset from a fully stopped state on a machine with bge-small, bge-base, bge-reranker-base (~2 GB total) installed:

$ hero_proc job list hero_embedder
ID   ACTION                  PHASE    PID       SERVICE
218  hero_embedder_ui        running  1223409   hero_embedder
217  hero_embedder_server    running  1223410   hero_embedder
216  hero_embedderd          running  1223475   hero_embedder

$ ls ~/hero/var/sockets/hero_embedder/
rpc.sock  ui.sock

$ curl -sS -X POST -H 'content-type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"info","params":{}}' \
    http://127.0.0.1:9151/hero_embedder/ui/rpc
{"jsonrpc":"2.0","id":1,"result":{"corpus_count":0,"models_loaded":"remote(embedderd)","namespace_count":1,"reranker_available":true,"total_doc_count":0}}

All three jobs running on first try, no manual job retry, rpc.sock present, dashboard panels populate.

Test plan

  • [x] cargo check --workspace --bins clean
  • [x] Cold service_embedder start --reset brings all three jobs to running without manual intervention
  • [x] ~/hero/var/sockets/hero_embedder/rpc.sock appears within ~10 s of the start command
  • [x] POST /rpc info returns proper JSON-RPC result, not Socket 'rpc.sock' not found
  • [x] Dashboard panels populate (no Backend unavailable alerts when backend is healthy)
  • [ ] Negative path: with no daemon at all, the server still emits the existing actionable error after the 30s budget. Verified by reading the code path only; not separately exercised on this machine, since the normal path is the load-bearing case.

Notes

  • 30 s budget chosen with significant headroom over the observed ~5–8 s daemon model-load time on this hardware. Tunable with STARTUP_PROBE_BUDGET if a different deployment needs a different ceiling.
  • Builds on the merged #22 (UI fail-soft templates). Together: the dashboard handles backend-down gracefully, and the backend now stays up by itself.
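
To make the headroom claim concrete when tuning STARTUP_PROBE_BUDGET, the sketch below lists the sleeps that fit into a budget. It assumes the delay doubles each round (the PR only says exponential, 200ms up to a 2s cap) and ignores the time each probe itself takes; backoff_schedule is a hypothetical helper, not a function from the crate.

```rust
use std::time::Duration;

/// Hypothetical helper: the delays slept between probes, assuming doubling
/// from `initial` up to `cap`, until the cumulative sleep reaches `budget`.
/// Probe latency is ignored, so real runs fit slightly fewer rounds.
fn backoff_schedule(budget: Duration, initial: Duration, cap: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut elapsed = Duration::ZERO;
    let mut delay = initial;
    while elapsed < budget {
        // The last sleep is truncated so the total never exceeds the budget.
        let step = delay.min(budget - elapsed);
        delays.push(step);
        elapsed += step;
        delay = (delay * 2).min(cap);
    }
    delays
}

fn main() {
    let schedule = backoff_schedule(
        Duration::from_secs(30),
        Duration::from_millis(200),
        Duration::from_secs(2),
    );
    // With these inputs: 18 sleeps, starting 200ms, 400ms, 800ms, 1.6s, 2s.
    println!("{} sleeps, first five: {:?}", schedule.len(), &schedule[..5]);
}
```

Under these assumptions the cap is reached after about 3 s of cumulative sleep, so a daemon that needs ~5–8 s to load is picked up within roughly 2 s of becoming healthy, and the remaining budget is spent probing every 2 s.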
fix(embedder_server): retry daemon probe with backoff and use spawn_blocking on startup
All checks were successful
Test / test (pull_request) Successful in 3m25s
83e4f752fd
salmaelsoly merged commit b1381c4435 into development 2026-04-27 12:09:12 +00:00
salmaelsoly deleted branch development_embedder_startup_race 2026-04-27 12:09:17 +00:00
Reference
lhumina_code/hero_embedder!24