_admin and _ui health checks probe localhost:80 but daemons bind UDS only, causing restart-loop #21

Closed
opened 2026-05-23 14:14:39 +00:00 by mik-tf · 1 comment
Owner

The action specs for `hero_assistance_admin` and `hero_assistance_ui` in `crates/hero_assistance/src/main.rs` (lines 295-307 for `_ui`, lines 338-350 for `_admin`) configure hero_proc health checks with `http_url: "http://localhost/health"`, but neither daemon binds TCP by default. Both bind UDS only (`admin.sock` and `app.sock` respectively). Every health probe attempt fails because nothing on the host serves `http://localhost:80/health`, and after the retry budget elapses (start_period 5s plus 3 retries against a 5s timeout) hero_proc kills the daemon and restarts it. Observed cadence is roughly 30 to 35 seconds per restart cycle, confirmed today against the current `development` HEAD (49ea76a7). The `_server` action uses `openrpc_socket: Some(server_sock)` instead and stays alive correctly.

This blocks #18 acceptance independently of lhumina_code/hero_router#109: even once hero_router routing is fixed, the operator Admin and customer UI panes are only reachable during the brief alive windows between restarts.

Likely fix paths: switch both health checks to a hero_proc UDS-aware probe against `admin.sock` and `app.sock`, mirroring the pattern `_server` already uses, or have `_admin` and `_ui` bind a localhost loopback TCP port by default so the existing probe has something to hit.
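To make the mismatch concrete, here is a minimal Rust sketch (not hero_proc's actual probe code) contrasting a TCP reachability check with a UDS connect check. The socket path, the helper names `tcp_reachable` and `uds_reachable`, and the 1s timeout are all assumptions made for illustration.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::time::Duration;

/// TCP reachability check, standing in for the first step of an HTTP probe.
/// (Hypothetical helper, not part of hero_proc.)
fn tcp_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Connect-only check against a Unix domain socket.
/// (Hypothetical helper, not part of hero_proc.)
fn uds_reachable(path: &Path) -> bool {
    UnixStream::connect(path).is_ok()
}

fn main() -> io::Result<()> {
    // Assumed socket location for the demo; the real daemons choose their own paths.
    let sock = std::env::temp_dir().join(format!("demo_admin_{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&sock);

    // A "daemon" that binds UDS only, like _admin and _ui in the report.
    let _listener = UnixListener::bind(&sock)?;

    // The probe target implied by http://localhost/health (port 80).
    // Whether this connects depends on what else runs on the host.
    let tcp_addr: SocketAddr = "127.0.0.1:80".parse().expect("valid socket address");
    println!(
        "TCP 127.0.0.1:80 reachable: {}",
        tcp_reachable(tcp_addr, Duration::from_secs(1))
    );

    // The UDS check reaches the daemon's socket regardless of any TCP listener.
    println!("UDS {} reachable: {}", sock.display(), uds_reachable(&sock));

    std::fs::remove_file(&sock)?;
    Ok(())
}
```

On a host with nothing listening on port 80, the TCP check fails even though the daemon is healthy, which is the condition the issue describes.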

Author
Owner

Closed via squash-merge `ee2be7d3` on `development` (PR #22). Both `_ui` (lines 295-307) and `_admin` (lines 338-350) HealthCheck blocks in `crates/hero_assistance/src/main.rs` now use `openrpc_socket: Some(<their UDS path>)`, mirroring `_server`'s working pattern. `HealthDef::OpenRpcSocket` is a connect-only probe per `hero_proc_server/src/types/config_ext.rs:30`, so the daemons do not need to expose `/rpc` or `/openrpc.json` for the probe itself; this matters because `hero_assistance_ui` only exposes `/rpc` (not `/openrpc.json`).

New unit test `phase24c_build_service_definition_health_checks_use_uds_connect_probe` pins the contract across all three actions.
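As a rough illustration of what pinning this contract could look like, the sketch below uses simplified stand-in types. `HealthCheck`, `Action`, `build_actions`, and `actions_violating_uds_contract` are hypothetical names, not the real hero_proc or hero_assistance definitions, and the mapping of `rpc.sock` to `_server` is inferred from the verification notes rather than taken from the code.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, simplified stand-in for a health check spec.
/// The real hero_proc type is not shown in this issue.
#[derive(Debug, Clone)]
struct HealthCheck {
    http_url: Option<String>,
    openrpc_socket: Option<PathBuf>,
}

/// Hypothetical stand-in for one supervised action.
#[derive(Debug, Clone)]
struct Action {
    name: &'static str,
    health: HealthCheck,
}

/// Assumed action-to-socket mapping, for illustration only.
const ACTION_SOCKETS: [(&str, &str); 3] = [
    ("hero_assistance_server", "rpc.sock"),
    ("hero_assistance_ui", "app.sock"),
    ("hero_assistance_admin", "admin.sock"),
];

/// Builds the three actions, each with a UDS connect probe and no HTTP URL.
fn build_actions(socket_dir: &Path) -> Vec<Action> {
    ACTION_SOCKETS
        .iter()
        .map(|&(name, sock)| Action {
            name,
            health: HealthCheck {
                http_url: None,
                openrpc_socket: Some(socket_dir.join(sock)),
            },
        })
        .collect()
}

/// Returns the names of actions that probe over HTTP or lack a socket probe.
fn actions_violating_uds_contract(actions: &[Action]) -> Vec<&'static str> {
    actions
        .iter()
        .filter(|a| a.health.http_url.is_some() || a.health.openrpc_socket.is_none())
        .map(|a| a.name)
        .collect()
}

fn main() {
    // The socket directory is an arbitrary placeholder.
    let actions = build_actions(Path::new("/tmp/hero_assistance"));
    assert!(actions_violating_uds_contract(&actions).is_empty());

    // Reintroducing the old HTTP probe on one action trips the check.
    let mut broken = actions.clone();
    broken[1].health = HealthCheck {
        http_url: Some("http://localhost/health".to_string()),
        openrpc_socket: None,
    };
    println!("violations: {:?}", actions_violating_uds_contract(&broken));
}
```

Checking all actions in one place means a later edit that reverts a single block to `http_url` fails the test by name instead of showing up as a restart loop at runtime.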

Live verification on the rebuilt and reinstalled binaries: `hero_assistance --start` brought up all three daemons. After 5.5 minutes under hero_proc supervision, the job list still showed the `running` phase for `hero_assistance_server` (PID 3679580), `hero_assistance_ui` (PID 3679536), and `hero_assistance_admin` (PID 3679504), with the same PIDs and no restart cycle; `ps -o etime` confirmed ~6 minutes of uptime per process. `curl --unix-socket` against `rpc.sock`, `app.sock`, and `admin.sock` all returned HTTP 200 with the expected `{"service":"hero_assistance","status":"ok","version":"0.5.0"}` health JSON.
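For scripted re-checks, the `curl --unix-socket` step can be approximated in Rust with only the standard library. This is a minimal sketch under stated assumptions: the helper name `http_get_over_uds` is hypothetical, the `/health` route is assumed from the original probe URL, and the parser handles only plain non-chunked responses on a connection the server closes.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Sends a minimal HTTP/1.1 GET over a Unix domain socket and returns
/// (status code, body). Does not decode chunked transfer encoding.
/// (Hypothetical helper written for this example.)
fn http_get_over_uds(socket: &Path, route: &str) -> io::Result<(u16, String)> {
    let mut stream = UnixStream::connect(socket)?;
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;

    // "Connection: close" asks the server to close after responding,
    // so reading until EOF yields the whole response.
    write!(
        stream,
        "GET {route} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
    )?;

    let mut raw = String::new();
    stream.read_to_string(&mut raw)?;

    let (head, body) = raw
        .split_once("\r\n\r\n")
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "no header/body separator"))?;

    // The status line looks like "HTTP/1.1 200 OK"; the code is the second token.
    let status = head
        .split_whitespace()
        .nth(1)
        .and_then(|code| code.parse::<u16>().ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "no status code"))?;

    Ok((status, body.to_string()))
}

fn main() {
    // Socket path comes from the first CLI argument; the default is a placeholder.
    let socket = std::env::args().nth(1).unwrap_or_else(|| "admin.sock".to_string());
    match http_get_over_uds(Path::new(&socket), "/health") {
        Ok((status, body)) => println!("{status} {body}"),
        Err(err) => eprintln!("probe failed: {err}"),
    }
}
```

This is a full HTTP request, so it is stricter than the connect-only supervision probe: it confirms the daemon answers on the socket, not just that the socket accepts connections.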

Pre-merge gate: `cargo fmt --check`, `cargo clippy --release --workspace --all-targets -- -D warnings`, and `cargo build --workspace --release` all clean. Workspace tests: 255 pass / 2 fail / 14 ignored. That is +1 pass from the new pin test versus the 254/1/14 baseline; the 2 failures are documented pre-existing flakes: `phase24b_ui_add_access_fails_when_hero_proc_unreachable` and the transient `phase10_multi_project_merged_stream_tags_by_project_id`.

Unblocks row 2 of #18 acceptance.

Reference
lhumina_code/hero_assistance#21