[nu-demo] hero_embedder_server starts before hero_embedderd finishes loading models — needs dependency ordering #168

Closed
opened 2026-04-24 14:48:16 +00:00 by mik-tf · 1 comment
Owner

Symptom

On service restart, hero_embedder_server fails on first attempt with:

HERO_EMBEDDERD_URL='http://127.0.0.1:8092' is set but the daemon is not reachable.
Start hero_embedderd or unset the variable to fall back to the loopback default.
Error: hero_embedderd is required but not reachable

hero_embedderd takes ~15s to load the 4 ONNX models (bge-small, bge-base, bge-reranker-base, etc.). hero_embedder_server checks HERO_EMBEDDERD_URL connectivity at startup and refuses to run if the daemon isn't ready.

The action's retry_policy specifies max_attempts=5, backoff=true, delay_ms=2000, which should eventually succeed once the daemon is up (~15s after start). In practice, though, all 5 attempts fall within the 15s warmup window and fail, after which hero_proc marks the job failed and stops retrying.
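The timing above can be sanity-checked with back-of-the-envelope arithmetic. This is a sketch, not hero_proc's actual scheduler: it assumes the backoff stays near the base delay_ms for the first few attempts, since the exact backoff curve isn't documented here.

```rust
// Rough model of the retry schedule: attempt 1 fires at t=0,
// and each subsequent attempt waits ~delay_ms.
// Assumption: backoff stays near the base delay for early attempts.
fn last_attempt_ms(max_attempts: u64, delay_ms: u64) -> u64 {
    (max_attempts - 1) * delay_ms
}

fn main() {
    let warmup_ms = 15_000; // ~15s to load the 4 ONNX models
    let last = last_attempt_ms(5, 2_000);
    println!("last attempt at t={}ms, daemon ready at t={}ms", last, warmup_ms);
    // Every attempt lands inside the warmup window, so every one fails
    // and hero_proc gives up before the daemon is reachable.
    assert!(last < warmup_ms);
}
```

Under this model the fifth and final attempt fires around t=8s, well before the daemon is ready, which matches the observed all-attempts-fail behaviour.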

Manually retrying via hero_proc job retry hero_embedder hero_embedder_server at any point AFTER daemon is healthy succeeds.

Root cause

hero_proc service definitions don't express "start hero_embedder_server AFTER hero_embedderd is healthy." The service add --after <service> option exists for service-level ordering, but not for action-level ordering WITHIN a service.

When a service has multiple actions, they all start concurrently (or in a semi-random order), and there's no way to gate action B on action A's health check passing.

Fixes (ordered by effort)

1. Bump the retry policy

Cheapest: in service_embedder.nu, set hero_embedder_server's retry_policy to max_attempts=20, stability_period_ms=60000 so it keeps retrying past the daemon's 15s warmup. Works but wastes CPU on failed-connect attempts.

2. Add an action-level dependency mechanism

Extend the ActionSpec schema with an optional depends_on_action: Option<Vec<String>> field. hero_proc supervisor waits until each listed action's health check passes before starting this one. Minor hero_proc code change; cleanest solution.
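A minimal sketch of what that gate could look like on the supervisor side. All names here are hypothetical; hero_proc's real ActionSpec and health-tracking model will differ.

```rust
use std::collections::HashSet;

// Hypothetical slice of the ActionSpec schema with the proposed field.
struct ActionSpec {
    name: String,
    depends_on_action: Option<Vec<String>>,
}

// An action may start only once every listed dependency has a passing
// health check; an absent field means no gating (backward compatible).
fn ready_to_start(spec: &ActionSpec, healthy: &HashSet<&str>) -> bool {
    match &spec.depends_on_action {
        None => true,
        Some(deps) => deps.iter().all(|d| healthy.contains(d.as_str())),
    }
}

fn main() {
    let server = ActionSpec {
        name: "hero_embedder_server".into(),
        depends_on_action: Some(vec!["hero_embedderd".into()]),
    };
    let mut healthy = HashSet::new();
    assert!(!ready_to_start(&server, &healthy)); // daemon still warming up
    healthy.insert("hero_embedderd");
    assert!(ready_to_start(&server, &healthy)); // gate opens after health check passes
    println!("{} gated correctly", server.name);
}
```

The supervisor would poll (or subscribe to) health-state changes and call something like this before spawning each pending action.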

3. Run the server binary as a child process of hero_embedderd instead of a sibling

Refactor hero_embedder_server to be a thread within hero_embedderd (or use a Unix socket that only gets bound after the daemon declares itself ready). Not backward compatible but architecturally cleaner — sibling issue #145 already suggests converting to an async RPC model.
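The "bind only after ready" variant can be sketched with a plain TCP listener (a Unix socket behaves the same way). Everything below is illustrative, not hero_embedderd's actual code; the port and timings are made up.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

// Stand-in for the ~15s ONNX model warmup.
fn load_models() {
    thread::sleep(Duration::from_millis(100));
}

// Returns (refused_before, accepted_after): clients are refused while the
// daemon warms up and accepted once it binds. The bind itself is the
// readiness signal, so no separate health probe or retry loop is needed.
fn bound_only_after_warmup(addr: &str) -> (bool, bool) {
    let refused_before = TcpStream::connect(addr).is_err();
    load_models();
    let _listener = TcpListener::bind(addr).expect("bind");
    let accepted_after = TcpStream::connect(addr).is_ok();
    (refused_before, accepted_after)
}

fn main() {
    let (before, after) = bound_only_after_warmup("127.0.0.1:18092");
    assert!(before && after);
    println!("clients refused during warmup, accepted after bind");
}
```

The ordering problem disappears because "reachable" and "ready" become the same event.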

4. Supervisor-level startup probe

hero_proc could have a "wait-for-port / wait-for-socket" pre-start hook per action. hero_embedder_server.pre_start = "wait_tcp 127.0.0.1 8092 30s". Small addition; covers dozens of similar ordering issues across the Hero stack.
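A wait_tcp pre-start hook could be as small as a connect poll with a deadline. Sketch only: the hook name and the pre_start syntax come from the proposal above, not from any existing hero_proc feature, and the ports here are arbitrary.

```rust
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

// Poll addr until it accepts a TCP connection or the deadline passes.
// This is what a `wait_tcp HOST PORT TIMEOUT` hook would reduce to.
fn wait_tcp(addr: &str, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if TcpStream::connect(addr).is_ok() {
            return true;
        }
        thread::sleep(Duration::from_millis(100));
    }
    false
}

fn main() {
    // Nothing listening: the probe gives up at the deadline.
    assert!(!wait_tcp("127.0.0.1:18093", Duration::from_millis(300)));

    // With a listener up, the probe returns almost immediately.
    let _l = TcpListener::bind("127.0.0.1:18093").expect("bind");
    assert!(wait_tcp("127.0.0.1:18093", Duration::from_secs(1)));
    println!("wait_tcp gate works");
}
```

Run as a pre-start hook with a generous timeout, this turns the failed-connect retry churn into a single blocking wait.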

Demo workaround (applied on herodemo 2026-04-24)

After the service starts and hero_embedderd comes online, manually retry the failed server job:

hero_proc job retry hero_embedder hero_embedder_server

This always works because the daemon is up by then.

Related

  • home#145 — hero_embedder blocking reqwest + async refactor (sibling architectural issue)
  • home#133 — service_livekit no redis preflight (same class of missing-dependency-order issues)
  • home#160 — consolidated demo state

Signed-off-by: mik-tf

Author
Owner

Fixed in current service_embedder.nu via fix-option #1 from the issue body ("Bump the retry policy"), implemented via start_timeout_ms rather than max_attempts — different mechanism, same outcome.

Verification (service_embedder.nu):

  • L218: update retry_policy {|t| $t.retry_policy | merge {start_timeout_ms: 180000}} for hero_embedderd (180s window).
  • L263: update retry_policy {|t| $t.retry_policy | merge {start_timeout_ms: 120000}} for hero_embedder_server (120s window).

With the daemon warming up the 4 ONNX models in ~15s, a 120-second start_timeout window gives the server's existing retry budget (max_attempts=5, backoff=true, delay_ms=2000) ample room to span the warmup and succeed on a later attempt without hero_proc giving up. The original symptom (server marked failed because all 5 retries fell within the 15s window) is no longer reachable.

Functional confirmation: the herodemo bring-up sessions in 2026-04-25 / 2026-04-26 saw hero_embedder_server reach running state on every restart without the manual hero_proc job retry hero_embedder_server workaround the issue body describes.

Architectural follow-up (NOT this fix): the issue's option #2 — adding a true depends_on_action: Option<Vec<String>> to ActionSpec so the supervisor gates B on A's health check — remains the cleanest long-term answer, especially as more services adopt the daemon-plus-server pattern (already happening in hero_office/onlyoffice). That work would land in hero_proc / hero_rpc, not in this module, and will be tracked separately when the time comes.

Meta-tracker: home#193.

Signed-off-by: mik-tf

Reference: lhumina_code/home#168