lhumina_code/hero_proc

Fork 0

[infra] hero_proc service status: PID-alive → handler-responsive probes (catch half-broken services) #83

New issue

Closed

opened 2026-04-30 20:02:34 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-04-30 20:02:34 +00:00

Owner

Symptom

Today on herodemo, two services were reported "running" by hero_proc service list but were actually broken from a user's perspective:

Service	`service list` says	Reality
hero_os	running, processes: hero_os_server, hero_os_ui	hero_os_server alive, hero_os_ui dead, ui.sock missing — UI shows "Socket 'ui.sock' not found"
hero_foundry	running, processes: hero_foundry_server, hero_foundry_ui	both PIDs alive, sockets exist, rpc.sock accepts TCP but HTTP handler returns empty reply for every user-facing path (`/api/files/...`, `/health`, `/.well-known/...`) — only internal heartbeat probes were getting through. Photos UI showed alt-text only. Restart fixed it.

In both cases nothing failed loudly — the supervisor reported green, and we only discovered the breakage because a user spotted broken UI in the browser.

Root cause

service status checks PID liveness. It does not verify that the service's sockets actually respond to a real handler request. So:

A child process can crash without taking the whole service down → supervisor still green.
A service's HTTP handler/dispatch task can panic while leaving the listener bound → supervisor still green.

In each case service list shows ● running. Reality is degraded.

Proposal

Upgrade hero_proc health probes from "PID alive" to "handler responsive":

Each registered service exposes a small health_probe declared in its TOML — typically GET /health on each socket the service should be serving (rpc.sock, ui.sock, or both).
hero_proc runs the probe periodically (e.g. every 30 s) AND on service status.
If a probe fails N times in a row → mark the service degraded and trigger a restart (configurable: warn-only vs auto-recover).
service list shows three states: ● running (probe ok), ◐ degraded (PID alive, probe failing), ○ stopped.

Acceptance

service status hero_os returns degraded if either rpc.sock or ui.sock is not responding to GET /health.
service list shows degraded as a distinct state.
After a child process dies, the next probe cycle detects it and supervisor restarts the service automatically (or alerts, depending on config).
Probe failures emit a structured log line so we can grep for them and surface them in monitoring.

Cross-references

Discovered live during today's session on herodemo (2026-04-30) — two independent half-broken services in one day.
Related but distinct: #78 (dangling rpc.sock dentry on supervisor restart) — that's about file/inode mismatch; this is about handler responsiveness.
Indirectly related: #80 (log_batcher leak) — once #80 ships, hero_proc restarts will be rarer, but probes are still needed because services can degrade independently.

Signed-off-by: mik-tf

## Symptom Today on herodemo, two services were reported "running" by `hero_proc service list` but were actually broken from a user's perspective: | Service | `service list` says | Reality | |---------|---------------------|---------| | hero_os | running, processes: hero_os_server, hero_os_ui | hero_os_server alive, **hero_os_ui dead, ui.sock missing** — UI shows "Socket 'ui.sock' not found" | | hero_foundry | running, processes: hero_foundry_server, hero_foundry_ui | both PIDs alive, sockets exist, **rpc.sock accepts TCP but HTTP handler returns empty reply for every user-facing path** (`/api/files/...`, `/health`, `/.well-known/...`) — only internal heartbeat probes were getting through. Photos UI showed alt-text only. Restart fixed it. | In both cases nothing failed loudly — the supervisor reported green, and we only discovered the breakage because a user spotted broken UI in the browser. ## Root cause `service status` checks PID liveness. It does not verify that the service's sockets actually respond to a real handler request. So: - A child process can crash without taking the whole service down → supervisor still green. - A service's HTTP handler/dispatch task can panic while leaving the listener bound → supervisor still green. In each case `service list` shows ● running. Reality is degraded. ## Proposal Upgrade `hero_proc` health probes from "PID alive" to "handler responsive": 1. Each registered service exposes a small `health_probe` declared in its TOML — typically `GET /health` on each socket the service should be serving (rpc.sock, ui.sock, or both). 2. `hero_proc` runs the probe periodically (e.g. every 30 s) AND on `service status`. 3. If a probe fails N times in a row → mark the service degraded and trigger a restart (configurable: warn-only vs auto-recover). 4. `service list` shows three states: ● running (probe ok), ◐ degraded (PID alive, probe failing), ○ stopped. ## Acceptance - [ ] `service status hero_os` returns degraded if either rpc.sock or ui.sock is not responding to GET /health. - [ ] `service list` shows degraded as a distinct state. - [ ] After a child process dies, the next probe cycle detects it and supervisor restarts the service automatically (or alerts, depending on config). - [ ] Probe failures emit a structured log line so we can grep for them and surface them in monitoring. ## Cross-references - Discovered live during today's session on herodemo (2026-04-30) — two independent half-broken services in one day. - Related but distinct: https://forge.ourworld.tf/lhumina_code/hero_proc/issues/78 (dangling rpc.sock dentry on supervisor restart) — that's about file/inode mismatch; this is about handler responsiveness. - Indirectly related: https://forge.ourworld.tf/lhumina_code/hero_proc/issues/80 (log_batcher leak) — once #80 ships, hero_proc restarts will be rarer, but probes are still needed because services can degrade independently. Signed-off-by: mik-tf

mik-tf self-assigned this

2026-04-30 20:02:34 +00:00

mik-tf referenced this issue from lhumina_code/home

2026-04-30 20:02:43 +00:00

[infra] Live smoke loop against herodemo — automated detection of broken services before users see it #201

mik-tf referenced this issue from lhumina_code/home

2026-04-30 20:02:50 +00:00

[infra] Half-broken running service pattern — listener alive, handler dead (foundry / OServer dispatch) #202

mik-tf referenced this issue from lhumina_code/home

2026-04-30 20:12:05 +00:00

[arch] OServer panic isolation — make broken-handler-with-live-listener impossible, not just recoverable #204

mik-tf referenced this issue

2026-04-30 20:12:10 +00:00

[arch] Service readiness contract — services declare ready, supervisor doesn't guess from PID #84

mik-tf referenced this issue

2026-04-30 23:01:22 +00:00

[meta] hero_proc reliability roadmap — every observed failure mode + tracking + order of attack #86

mik-tf referenced this issue from lhumina_code/hero_demo

2026-05-01 02:05:19 +00:00

[vision] Hero OS as an ambient AI desktop — sovereign, voice-native, multi-context #52

despiegk referenced this issue

2026-05-01 04:05:50 +00:00

[meta] hero_proc reliability roadmap — every observed failure mode + tracking + order of attack #86

despiegk referenced this issue

2026-05-01 04:06:51 +00:00

Dangling rpc.sock dentry on service restart — kernel listener alive, file vanished #78

despiegk commented

2026-05-01 04:06:51 +00:00

Owner

Implemented — handler-responsive probes

Lands as a service-level probe that runs in parallel with the existing per-action health_checks (which keep working unchanged).

Data model

ServiceSpec (in crates/hero_proc_lib/src/db/service/model.rs) gains an optional probe: Option<ServiceProbe> field. ServiceProbe carries the kind (Tcp / Http / OpenRpcSocket / OpenRpcHttp), interval, timeout, and consecutive_failures_to_red. The kind variants reuse the same probe primitives the per-action checks already use, so behavior is consistent.

JSON shape (round-trips through the existing services.spec_json column — no schema migration):

{
  "service": {
    "name": "hero_foundry",
    "probe": {
      "kind": "open_rpc_socket",
      "socket": "/path/to/rpc.sock",
      "interval_ms": 5000,
      "timeout_ms": 1000,
      "consecutive_failures_to_red": 3
    }
  }
}

Runtime

crates/hero_proc_server/src/supervisor/service_state.rs evaluates every declared probe every 5s. State is held in an in-memory store shared with the RPC layer. After consecutive_failures_to_red failures, service.status returns state="failed" with health_reason="probe failed N times: <last error>" — even if the underlying child PIDs are alive.

This is the "is the handler responsive?" check called for in this issue. It catches the half-broken-listener pattern: PID alive, listener accepts TCP, but the handler dispatch is poisoned and returns empty replies. The probe issues a real OpenRPC ping (or the configured variant) and treats no-response as failure.

Wiring

Supervisor::new() creates a ServiceStateStore. The supervisor's main poll tick calls service_state::run_one_round() every 5s.
WebState (in web.rs) carries the same store handle into the RPC layer.
rpc::service::handle_status and handle_status_full apply a health_overlay on top of the existing job-phase-derived state. Probe red → state="failed" → operator sees the truth.

Verification

cargo check --workspace --all-targets clean. Closing as implemented. Auto-respawn on red is intentionally not wired here — operator triggers it via service.reset_failed (delivered under #86 P4).

## Implemented — handler-responsive probes Lands as a service-level probe that runs in parallel with the existing per-action `health_checks` (which keep working unchanged). ### Data model `ServiceSpec` (in `crates/hero_proc_lib/src/db/service/model.rs`) gains an optional `probe: Option<ServiceProbe>` field. `ServiceProbe` carries the kind (`Tcp` / `Http` / `OpenRpcSocket` / `OpenRpcHttp`), interval, timeout, and `consecutive_failures_to_red`. The kind variants reuse the same probe primitives the per-action checks already use, so behavior is consistent. JSON shape (round-trips through the existing `services.spec_json` column — no schema migration): ```json { "service": { "name": "hero_foundry", "probe": { "kind": "open_rpc_socket", "socket": "/path/to/rpc.sock", "interval_ms": 5000, "timeout_ms": 1000, "consecutive_failures_to_red": 3 } } } ``` ### Runtime `crates/hero_proc_server/src/supervisor/service_state.rs` evaluates every declared probe every 5s. State is held in an in-memory store shared with the RPC layer. After `consecutive_failures_to_red` failures, `service.status` returns `state="failed"` with `health_reason="probe failed N times: <last error>"` — even if the underlying child PIDs are alive. This is the "is the handler responsive?" check called for in this issue. It catches the half-broken-listener pattern: PID alive, listener accepts TCP, but the handler dispatch is poisoned and returns empty replies. The probe issues a real OpenRPC ping (or the configured variant) and treats no-response as failure. ### Wiring - `Supervisor::new()` creates a `ServiceStateStore`. The supervisor's main poll tick calls `service_state::run_one_round()` every 5s. - `WebState` (in `web.rs`) carries the same store handle into the RPC layer. - `rpc::service::handle_status` and `handle_status_full` apply a `health_overlay` on top of the existing job-phase-derived state. Probe red → state="failed" → operator sees the truth. ### Verification `cargo check --workspace --all-targets` clean. Closing as implemented. Auto-respawn on red is intentionally **not** wired here — operator triggers it via `service.reset_failed` (delivered under #86 P4).

despiegk closed this issue

2026-05-01 04:06:56 +00:00

despiegk referenced this issue

2026-05-01 05:14:49 +00:00

spec(db): verify ServiceSpec probe/sockets/require_ready round-trip through spec_json — add regression test + doc contract #90

mik-tf referenced this issue

2026-05-01 14:20:47 +00:00

[bug][P1] Log-store runaway: 58 GB in 24h from one chatty service wedged hero_proc + filled /data on herodemo #87

mik-tf referenced this issue from lhumina_code/hero_demo

2026-05-05 04:28:11 +00:00

[infra][P1] CI-built static-musl binaries + --from-ci install path — make deploys minutes, not hours #54

mik-tf referenced this issue from lhumina_code/hero_demo