[infra] hero_proc service status: PID-alive → handler-responsive probes (catch half-broken services) #83

Closed
opened 2026-04-30 20:02:34 +00:00 by mik-tf · 1 comment
Owner

Symptom

Today on herodemo, two services were reported "running" by hero_proc service list but were actually broken from a user's perspective:

Service service list says Reality
hero_os running, processes: hero_os_server, hero_os_ui hero_os_server alive, hero_os_ui dead, ui.sock missing — UI shows "Socket 'ui.sock' not found"
hero_foundry running, processes: hero_foundry_server, hero_foundry_ui both PIDs alive, sockets exist, rpc.sock accepts TCP but HTTP handler returns empty reply for every user-facing path (/api/files/..., /health, /.well-known/...) — only internal heartbeat probes were getting through. Photos UI showed alt-text only. Restart fixed it.

In both cases nothing failed loudly — the supervisor reported green, and we only discovered the breakage because a user spotted broken UI in the browser.

Root cause

service status checks PID liveness. It does not verify that the service's sockets actually respond to a real handler request. So:

  • A child process can crash without taking the whole service down → supervisor still green.
  • A service's HTTP handler/dispatch task can panic while leaving the listener bound → supervisor still green.

In each case service list shows ● running. Reality is degraded.

Proposal

Upgrade hero_proc health probes from "PID alive" to "handler responsive":

  1. Each registered service exposes a small health_probe declared in its TOML — typically GET /health on each socket the service should be serving (rpc.sock, ui.sock, or both).
  2. hero_proc runs the probe periodically (e.g. every 30 s) AND on service status.
  3. If a probe fails N times in a row → mark the service degraded and trigger a restart (configurable: warn-only vs auto-recover).
  4. service list shows three states: ● running (probe ok), ◐ degraded (PID alive, probe failing), ○ stopped.

Acceptance

  • service status hero_os returns degraded if either rpc.sock or ui.sock is not responding to GET /health.
  • service list shows degraded as a distinct state.
  • After a child process dies, the next probe cycle detects it and supervisor restarts the service automatically (or alerts, depending on config).
  • Probe failures emit a structured log line so we can grep for them and surface them in monitoring.

Cross-references

  • Discovered live during today's session on herodemo (2026-04-30) — two independent half-broken services in one day.
  • Related but distinct: #78 (dangling rpc.sock dentry on supervisor restart) — that's about file/inode mismatch; this is about handler responsiveness.
  • Indirectly related: #80 (log_batcher leak) — once #80 ships, hero_proc restarts will be rarer, but probes are still needed because services can degrade independently.

Signed-off-by: mik-tf

## Symptom Today on herodemo, two services were reported "running" by `hero_proc service list` but were actually broken from a user's perspective: | Service | `service list` says | Reality | |---------|---------------------|---------| | hero_os | running, processes: hero_os_server, hero_os_ui | hero_os_server alive, **hero_os_ui dead, ui.sock missing** — UI shows "Socket 'ui.sock' not found" | | hero_foundry | running, processes: hero_foundry_server, hero_foundry_ui | both PIDs alive, sockets exist, **rpc.sock accepts TCP but HTTP handler returns empty reply for every user-facing path** (`/api/files/...`, `/health`, `/.well-known/...`) — only internal heartbeat probes were getting through. Photos UI showed alt-text only. Restart fixed it. | In both cases nothing failed loudly — the supervisor reported green, and we only discovered the breakage because a user spotted broken UI in the browser. ## Root cause `service status` checks PID liveness. It does not verify that the service's sockets actually respond to a real handler request. So: - A child process can crash without taking the whole service down → supervisor still green. - A service's HTTP handler/dispatch task can panic while leaving the listener bound → supervisor still green. In each case `service list` shows ● running. Reality is degraded. ## Proposal Upgrade `hero_proc` health probes from "PID alive" to "handler responsive": 1. Each registered service exposes a small `health_probe` declared in its TOML — typically `GET /health` on each socket the service should be serving (rpc.sock, ui.sock, or both). 2. `hero_proc` runs the probe periodically (e.g. every 30 s) AND on `service status`. 3. If a probe fails N times in a row → mark the service degraded and trigger a restart (configurable: warn-only vs auto-recover). 4. `service list` shows three states: ● running (probe ok), ◐ degraded (PID alive, probe failing), ○ stopped. ## Acceptance - [ ] `service status hero_os` returns degraded if either rpc.sock or ui.sock is not responding to GET /health. - [ ] `service list` shows degraded as a distinct state. - [ ] After a child process dies, the next probe cycle detects it and supervisor restarts the service automatically (or alerts, depending on config). - [ ] Probe failures emit a structured log line so we can grep for them and surface them in monitoring. ## Cross-references - Discovered live during today's session on herodemo (2026-04-30) — two independent half-broken services in one day. - Related but distinct: https://forge.ourworld.tf/lhumina_code/hero_proc/issues/78 (dangling rpc.sock dentry on supervisor restart) — that's about file/inode mismatch; this is about handler responsiveness. - Indirectly related: https://forge.ourworld.tf/lhumina_code/hero_proc/issues/80 (log_batcher leak) — once #80 ships, hero_proc restarts will be rarer, but probes are still needed because services can degrade independently. Signed-off-by: mik-tf
mik-tf self-assigned this 2026-04-30 20:02:34 +00:00
Owner

Implemented — handler-responsive probes

Lands as a service-level probe that runs in parallel with the existing per-action health_checks (which keep working unchanged).

Data model

ServiceSpec (in crates/hero_proc_lib/src/db/service/model.rs) gains an optional probe: Option<ServiceProbe> field. ServiceProbe carries the kind (Tcp / Http / OpenRpcSocket / OpenRpcHttp), interval, timeout, and consecutive_failures_to_red. The kind variants reuse the same probe primitives the per-action checks already use, so behavior is consistent.

JSON shape (round-trips through the existing services.spec_json column — no schema migration):

{
  "service": {
    "name": "hero_foundry",
    "probe": {
      "kind": "open_rpc_socket",
      "socket": "/path/to/rpc.sock",
      "interval_ms": 5000,
      "timeout_ms": 1000,
      "consecutive_failures_to_red": 3
    }
  }
}

Runtime

crates/hero_proc_server/src/supervisor/service_state.rs evaluates every declared probe every 5s. State is held in an in-memory store shared with the RPC layer. After consecutive_failures_to_red failures, service.status returns state="failed" with health_reason="probe failed N times: <last error>" — even if the underlying child PIDs are alive.

This is the "is the handler responsive?" check called for in this issue. It catches the half-broken-listener pattern: PID alive, listener accepts TCP, but the handler dispatch is poisoned and returns empty replies. The probe issues a real OpenRPC ping (or the configured variant) and treats no-response as failure.

Wiring

  • Supervisor::new() creates a ServiceStateStore. The supervisor's main poll tick calls service_state::run_one_round() every 5s.
  • WebState (in web.rs) carries the same store handle into the RPC layer.
  • rpc::service::handle_status and handle_status_full apply a health_overlay on top of the existing job-phase-derived state. Probe red → state="failed" → operator sees the truth.

Verification

cargo check --workspace --all-targets clean. Closing as implemented. Auto-respawn on red is intentionally not wired here — operator triggers it via service.reset_failed (delivered under #86 P4).

## Implemented — handler-responsive probes Lands as a service-level probe that runs in parallel with the existing per-action `health_checks` (which keep working unchanged). ### Data model `ServiceSpec` (in `crates/hero_proc_lib/src/db/service/model.rs`) gains an optional `probe: Option<ServiceProbe>` field. `ServiceProbe` carries the kind (`Tcp` / `Http` / `OpenRpcSocket` / `OpenRpcHttp`), interval, timeout, and `consecutive_failures_to_red`. The kind variants reuse the same probe primitives the per-action checks already use, so behavior is consistent. JSON shape (round-trips through the existing `services.spec_json` column — no schema migration): ```json { "service": { "name": "hero_foundry", "probe": { "kind": "open_rpc_socket", "socket": "/path/to/rpc.sock", "interval_ms": 5000, "timeout_ms": 1000, "consecutive_failures_to_red": 3 } } } ``` ### Runtime `crates/hero_proc_server/src/supervisor/service_state.rs` evaluates every declared probe every 5s. State is held in an in-memory store shared with the RPC layer. After `consecutive_failures_to_red` failures, `service.status` returns `state="failed"` with `health_reason="probe failed N times: <last error>"` — even if the underlying child PIDs are alive. This is the "is the handler responsive?" check called for in this issue. It catches the half-broken-listener pattern: PID alive, listener accepts TCP, but the handler dispatch is poisoned and returns empty replies. The probe issues a real OpenRPC ping (or the configured variant) and treats no-response as failure. ### Wiring - `Supervisor::new()` creates a `ServiceStateStore`. The supervisor's main poll tick calls `service_state::run_one_round()` every 5s. - `WebState` (in `web.rs`) carries the same store handle into the RPC layer. - `rpc::service::handle_status` and `handle_status_full` apply a `health_overlay` on top of the existing job-phase-derived state. Probe red → state="failed" → operator sees the truth. ### Verification `cargo check --workspace --all-targets` clean. Closing as implemented. Auto-respawn on red is intentionally **not** wired here — operator triggers it via `service.reset_failed` (delivered under #86 P4).
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_proc#83
No description provided.