fix(embedder_server): retry daemon probe with backoff and use spawn_blocking on startup #24
Reference
lhumina_code/hero_embedder!24
Summary
Fixes #23.
`hero_embedder_server` now reliably comes up on a cold `service_embedder start --reset` without a manual `hero_proc job retry`, even when `hero_embedderd` takes several seconds to mmap its ONNX models.

Two coordinated changes in `crates/hero_embedder_server/src/main.rs`:
- `tokio::task::spawn_blocking` around `discover_embedderd`. `discover_embedderd` builds a `reqwest::blocking::Client` and probes `/health` synchronously. Calling it directly from the `#[tokio::main]` async runtime panics on drop of its internal blocking runtime: `Cannot drop a runtime in a context where blocking is not allowed`. Pushing the call onto a blocking thread fixes the panic and is the standard pattern when reusing the existing sync client API.
- `discover_embedderd`. The probe now retries `is_reachable()` for `STARTUP_PROBE_BUDGET = 30s` with exponential backoff (200ms → 2s cap). This covers a cold ONNX model load (~5–8s on this hardware) with significant headroom. After the budget expires, the existing actionable error message is preserved, so genuine misconfiguration still surfaces.

Verified end-to-end
`service_embedder start --reset` from a fully stopped state on a machine with `bge-small`, `bge-base`, `bge-reranker-base` (~2 GB total) installed:
All three jobs `running` on first try, no manual `job retry`, `rpc.sock` present, dashboard panels populate.

Test plan
- `cargo check --workspace --bins` clean
- `service_embedder start --reset` brings all three jobs to `running` without manual intervention
- `~/hero/var/sockets/hero_embedder/rpc.sock` appears within ~10 s of the start command
- `POST /rpc info` returns proper JSON-RPC `result`, not `Socket 'rpc.sock' not found`
- […] `Backend unavailable` alerts when backend is healthy)

Notes
- […] `STARTUP_PROBE_BUDGET` if a different deployment needs a different ceiling.
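
For reviewers who want to see the shape of the retry logic, here is a minimal sketch of a budgeted probe with capped exponential backoff, using only the standard library. It is not the code in `main.rs`: only `STARTUP_PROBE_BUDGET` and the 30s / 200ms / 2s values come from this description. The names `INITIAL_DELAY`, `MAX_DELAY`, `next_delay`, `probe_with_backoff` and `probe_at_startup` are hypothetical, the doubling factor is an assumption (the description says only "exponential"), and the `probe` closure stands in for `is_reachable()`.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Values from the PR description. Only STARTUP_PROBE_BUDGET is a name
// used in the PR; the other two names are hypothetical.
const STARTUP_PROBE_BUDGET: Duration = Duration::from_secs(30);
const INITIAL_DELAY: Duration = Duration::from_millis(200);
const MAX_DELAY: Duration = Duration::from_secs(2);

/// Doubles the delay and caps it at `max`.
/// (Assumption: a factor of 2; the PR does not state the factor.)
fn next_delay(current: Duration, max: Duration) -> Duration {
    current.saturating_mul(2).min(max)
}

/// Calls `probe` until it returns true or `budget` has elapsed.
/// Returns the number of attempts on success, or an error message
/// once the budget is exhausted.
fn probe_with_backoff<F>(
    mut probe: F,
    budget: Duration,
    initial: Duration,
    max: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + budget;
    let mut delay = initial;
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        if probe() {
            return Ok(attempts);
        }
        let now = Instant::now();
        if now >= deadline {
            return Err(format!(
                "daemon not reachable after {} attempts within {:?}",
                attempts, budget
            ));
        }
        // Sleep for the current delay, but never past the deadline.
        thread::sleep(delay.min(deadline - now));
        delay = next_delay(delay, max);
    }
}

/// Convenience wrapper that applies the values quoted in the PR.
fn probe_at_startup<F: FnMut() -> bool>(probe: F) -> Result<u32, String> {
    probe_with_backoff(probe, STARTUP_PROBE_BUDGET, INITIAL_DELAY, MAX_DELAY)
}
```

Because this loop sleeps on the calling thread and the real probe uses a blocking HTTP client, it should not run directly on an async worker. That matches the first change above: from a `#[tokio::main]` function the whole call would be handed to `tokio::task::spawn_blocking` and awaited.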