embedder_server: startup race vs hero_embedderd model load + tokio panic on async-context drop of blocking client #23

Closed
opened 2026-04-27 11:29:01 +00:00 by salmaelsoly · 0 comments
Member

Summary

hero_embedder_server racks up startup retries and is left in a permanent failed state when hero_embedderd takes longer than ~3 seconds to start serving its /health endpoint. On a fresh boot the daemon needs to mmap ~2 GB of ONNX models (bge-small, bge-base, bge-reranker-base) before it starts listening on 127.0.0.1:8092; until then the server's startup probe (is_reachable(), 3 s request timeout) fails. hero_proc respawns the server several times in quick succession, exhausts the retry budget, and then stops — at which point a manual hero_proc job retry hero_embedder hero_embedder_server is required to bring the system up.

The server fails on every attempt with:

Error: hero_embedderd is required but not reachable

Caused by:
    HERO_EMBEDDERD_URL='http://127.0.0.1:8092' is set but the daemon is not reachable.
    Start hero_embedderd or unset the variable to fall back to the loopback default.

while in parallel the daemon is healthy a few seconds later:

[hero_embedderd | running] Listening on http://127.0.0.1:8092

Reproduction

  1. Cold-stop the embedder: hero_proc service stop hero_embedder.
  2. Cold-start it: service_embedder start --reset (or hero_embedder --start).
  3. Wait a few seconds and run hero_proc job list hero_embedder. Observed:
ID   ACTION                  PHASE     PID         SERVICE
214  hero_embedder_ui        running   ...         hero_embedder
213  hero_embedder_server    failed    0           hero_embedder
212  hero_embedderd          running   ...         hero_embedder
  4. The rpc.sock is missing, the dashboard sees hero_router 404s for /rpc, and (post #20) every panel renders the "Backend unavailable…" alert.
  5. Manual recovery: hero_proc job retry hero_embedder hero_embedder_server succeeds and the server stays up afterwards (because the daemon is by now ready).

Root cause

crates/hero_embedder_server/src/main.rs::discover_embedderd calls EmbedderdClient::new(url)?.is_reachable() exactly once at startup. EmbedderdClient's connect_timeout is 5 s and the is_reachable request itself uses a 3 s timeout, so the function returns Err("daemon not reachable") after at most ~5 s if the daemon hasn't bound the port yet.

There is no retry-with-backoff at the server level, and no hero_proc-level dependency declaration that would gate hero_embedder_server startup on hero_embedderd's /health returning 200. So whoever loses the race loses for good (until manual retry).
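The race is easy to reproduce in miniature: bind the port only after a delay (standing in for the model load) and compare a one-shot probe against a retry loop. The sketch below is std-only and illustrative — the timings and the `probe` helper are arbitrary stand-ins, not the real client:

```rust
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for is_reachable(): a bare TCP connect with a timeout.
fn probe(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

fn main() {
    // Pick a free local port, then release it for the "daemon" to claim.
    let addr = {
        let l = TcpListener::bind("127.0.0.1:0").unwrap();
        l.local_addr().unwrap()
    };

    // Simulated slow daemon: binds only after a "model load" delay.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(300));
        let listener = TcpListener::bind(addr).unwrap();
        thread::sleep(Duration::from_secs(2)); // hold the port open for the probes
        drop(listener);
    });

    // One-shot probe, as discover_embedderd does today: loses the race.
    let single = probe(&addr, Duration::from_millis(100));
    assert!(!single);

    // Retry with capped exponential backoff: wins well inside a small budget.
    let deadline = Instant::now() + Duration::from_secs(5);
    let mut delay = Duration::from_millis(50);
    let mut reached = false;
    while Instant::now() < deadline {
        if probe(&addr, Duration::from_millis(100)) {
            reached = true;
            break;
        }
        thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_millis(500));
    }
    assert!(reached);
    println!("single-shot probe: {single}, retry loop: {reached}");
}
```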

Suggested fix direction

In crates/hero_embedder_server/src/main.rs, give discover_embedderd an explicit retry budget. Sketch:

use std::time::{Duration, Instant};

use anyhow::Result;

const STARTUP_PROBE_BUDGET: Duration = Duration::from_secs(30);
const STARTUP_PROBE_INITIAL: Duration = Duration::from_millis(200);
const STARTUP_PROBE_MAX: Duration = Duration::from_secs(2);

fn discover_embedderd() -> Result<(EmbedderdClient, String)> {
    let (client, url) = build_candidate()?; // env-or-loopback as today
    let deadline = Instant::now() + STARTUP_PROBE_BUDGET;
    let mut delay = STARTUP_PROBE_INITIAL;
    loop {
        if client.is_reachable() {
            return Ok((client, url));
        }
        if Instant::now() >= deadline {
            anyhow::bail!(
                "hero_embedderd at '{url}' did not respond to /health within {:?}",
                STARTUP_PROBE_BUDGET
            );
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(STARTUP_PROBE_MAX);
    }
}

The 30 s budget covers a cold model load on this hardware (~5–8 s) with significant headroom. After the budget expires, we still emit the existing actionable error message so operators can diagnose a genuine misconfiguration.
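For reference, the sleep schedule implied by those constants can be enumerated explicitly. `backoff_schedule` below is a hypothetical helper (not part of the codebase) that lists the delays fitting inside the budget — roughly 17 sleeps, i.e. about 18 probes, in the worst case:

```rust
use std::time::Duration;

// Hypothetical helper mirroring the sketch's backoff rule:
// double each delay, capped at `max`, until the budget is spent.
fn backoff_schedule(initial: Duration, max: Duration, budget: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut delay = initial;
    let mut elapsed = Duration::ZERO;
    while elapsed + delay <= budget {
        delays.push(delay);
        elapsed += delay;
        delay = (delay * 2).min(max);
    }
    delays
}

fn main() {
    let s = backoff_schedule(
        Duration::from_millis(200),
        Duration::from_secs(2),
        Duration::from_secs(30),
    );
    // 200 ms, 400 ms, 800 ms, 1.6 s, then 2 s steps until the budget is spent.
    assert_eq!(s[0], Duration::from_millis(200));
    assert_eq!(*s.last().unwrap(), Duration::from_secs(2));
    assert_eq!(s.len(), 17);
    println!("{} sleeps fit inside the 30 s budget", s.len());
}
```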

Equivalent alternative: declare a hero_proc dependency in the action registration so hero_embedder_server only starts after hero_embedderd's /health returns 200. That fix lives in hero_skills and is more invasive across the stack; the in-server retry above is self-contained and works regardless of the surrounding orchestrator.

Acceptance criteria

  • After a cold service_embedder start --reset, all three jobs (hero_embedderd, hero_embedder_server, hero_embedder_ui) reach running without manual job retry.
  • ~/hero/var/sockets/hero_embedder/rpc.sock exists within ~10 s of the start command.
  • If hero_embedderd is genuinely missing or misconfigured, the server still emits the existing error pointing at the URL after the 30 s budget — no silent infinite hang.
  • No regression in the happy path (daemon already up): discover_embedderd returns immediately on first probe, no extra latency.
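The happy-path and no-silent-hang criteria become unit-testable if the probe is injected as a closure. A hypothetical generic form of the retry loop (names illustrative, not the real code):

```rust
use std::time::{Duration, Instant};

// Hypothetical generic retry loop: the probe is a closure, so tests need
// no live daemon. Mirrors the budget/backoff logic of the sketch above.
fn probe_with_retry<F: FnMut() -> bool>(
    mut probe: F,
    budget: Duration,
    initial: Duration,
    max: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + budget;
    let mut delay = initial;
    loop {
        if probe() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err("budget exhausted".into());
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(max);
    }
}

fn main() {
    // Happy path: daemon already up — exactly one probe, no sleeping.
    let mut calls = 0;
    let started = Instant::now();
    let ok = probe_with_retry(
        || { calls += 1; true },
        Duration::from_secs(30),
        Duration::from_millis(200),
        Duration::from_secs(2),
    );
    assert!(ok.is_ok());
    assert_eq!(calls, 1);
    assert!(started.elapsed() < Duration::from_millis(50));

    // Misconfiguration path: probe never succeeds, budget expires with an error.
    let err = probe_with_retry(
        || false,
        Duration::from_millis(10),
        Duration::from_millis(5),
        Duration::from_millis(5),
    );
    assert!(err.is_err());
}
```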

Notes

  • Closely related: the server also panics with Cannot drop a runtime in a context where blocking is not allowed when discover_embedderd is called directly from the #[tokio::main] async runtime (because reqwest::blocking::Client::builder() spawns/drops a runtime). That's a separate defect tracked in the same PR via tokio::task::spawn_blocking. Neither fix obsoletes the other — the panic fix lets the function run at all; this race fix lets it succeed under realistic timing.
  • Models loaded on this machine: bge-small (FP32 + INT8), bge-base (FP32 + INT8), bge-reranker-base. Roughly 2 GB total mmapped. A faster / smaller model set would reduce the race window but would not eliminate it; the retry is the correct robustness fix.
Reference
lhumina_code/hero_embedder#23