[infra] hero_proc service status: PID-alive → handler-responsive probes (catch half-broken services) #83
Symptom
Today on herodemo, two services were reported "running" by `hero_proc service list` but were actually broken from a user's perspective: `service list` said running while real handler requests (`/api/files/...`, `/health`, `/.well-known/...`) went unanswered — only internal heartbeat probes were getting through. The Photos UI showed alt-text only; a restart fixed it.

In both cases nothing failed loudly — the supervisor reported green, and we only discovered the breakage because a user spotted broken UI in the browser.
Root cause
`service status` checks PID liveness. It does not verify that the service's sockets actually respond to a real handler request. So in each case `service list` shows ● running while reality is degraded.

Proposal
Upgrade `hero_proc` health probes from "PID alive" to "handler responsive":

- Each service declares a `health_probe` in its TOML — typically `GET /health` on each socket the service should be serving (`rpc.sock`, `ui.sock`, or both).
- `hero_proc` runs the probe periodically (e.g. every 30 s) AND on `service status`.
- `service list` shows three states: ● running (probe ok), ◐ degraded (PID alive, probe failing), ○ stopped — see the sketch after this list.
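A minimal sketch of the proposed three-state mapping, assuming a boolean result from the periodic `GET /health` probe; `ServiceState` and `classify` are illustrative names, not existing `hero_proc` code:

```rust
/// Illustrative three-state model: ● running, ◐ degraded, ○ stopped.
#[derive(Debug, PartialEq)]
enum ServiceState {
    Running,  // PID alive and probe ok
    Degraded, // PID alive but probe failing
    Stopped,  // no live PID
}

/// Combine PID liveness with the handler probe result.
fn classify(pid_alive: bool, probe_ok: bool) -> ServiceState {
    match (pid_alive, probe_ok) {
        (false, _) => ServiceState::Stopped,
        (true, true) => ServiceState::Running,
        (true, false) => ServiceState::Degraded,
    }
}
```

Under this model, today's incident would have surfaced as `classify(true, false)` → degraded instead of running.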
Acceptance

- `service status hero_os` returns degraded if either `rpc.sock` or `ui.sock` is not responding to `GET /health`.
- `service list` shows degraded as a distinct state.

Cross-references
Signed-off-by: mik-tf
Implemented — handler-responsive probes
Lands as a service-level probe that runs in parallel with the existing per-action `health_checks` (which keep working unchanged).

Data model
`ServiceSpec` (in `crates/hero_proc_lib/src/db/service/model.rs`) gains an optional `probe: Option<ServiceProbe>` field. `ServiceProbe` carries the kind (Tcp/Http/OpenRpcSocket/OpenRpcHttp), interval, timeout, and `consecutive_failures_to_red`. The kind variants reuse the same probe primitives the per-action checks already use, so behavior is consistent.

The JSON shape round-trips through the existing `services.spec_json` column — no schema migration. A hedged sketch of the fields follows.
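A minimal serde sketch of the fields described above, assuming interval and timeout are stored as plain seconds; the authoritative definitions live in `model.rs` and may differ in detail:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub enum ProbeKind {
    Tcp,
    Http,
    OpenRpcSocket,
    OpenRpcHttp,
}

#[derive(Serialize, Deserialize)]
pub struct ServiceProbe {
    pub kind: ProbeKind,
    pub interval_secs: u64, // assumption: seconds
    pub timeout_secs: u64,  // assumption: seconds
    pub consecutive_failures_to_red: u32,
}

#[derive(Serialize, Deserialize)]
pub struct ServiceSpec {
    // ...existing fields unchanged...
    // Absent in old spec_json rows -> deserializes to None, hence no migration.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub probe: Option<ServiceProbe>,
}
```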
Runtime

`crates/hero_proc_server/src/supervisor/service_state.rs` evaluates every declared probe every 5 s. State is held in an in-memory store shared with the RPC layer. After `consecutive_failures_to_red` failures, `service.status` returns `state="failed"` with `health_reason="probe failed N times: <last error>"` — even if the underlying child PIDs are alive.

This is the "is the handler responsive?" check called for in this issue. It catches the half-broken-listener pattern: PID alive, listener accepts TCP, but the handler dispatch is poisoned and returns empty replies. The probe issues a real OpenRPC ping (or the configured variant) and treats no-response as failure. The failure-streak bookkeeping is sketched below.
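The consecutive-failure bookkeeping could look like the following sketch; the names are illustrative, not the actual `service_state.rs` internals:

```rust
/// Per-service probe bookkeeping (sketch); the real store is shared
/// with the RPC layer.
#[derive(Default)]
struct ProbeState {
    consecutive_failures: u32,
    last_error: Option<String>,
}

impl ProbeState {
    /// Record one probe round: success resets the streak, failure extends it.
    fn record(&mut self, result: Result<(), String>) {
        match result {
            Ok(()) => {
                self.consecutive_failures = 0;
                self.last_error = None;
            }
            Err(e) => {
                self.consecutive_failures += 1;
                self.last_error = Some(e);
            }
        }
    }

    /// Red once the streak reaches consecutive_failures_to_red.
    fn is_red(&self, threshold: u32) -> bool {
        self.consecutive_failures >= threshold
    }
}
```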
Wiring
`Supervisor::new()` creates a `ServiceStateStore`. The supervisor's main poll tick calls `service_state::run_one_round()` every 5 s. `WebState` (in `web.rs`) carries the same store handle into the RPC layer. `rpc::service::handle_status` and `handle_status_full` apply a `health_overlay` on top of the existing job-phase-derived state, sketched below. Probe red → `state="failed"` → operator sees the truth.
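The overlay step reduces to something like this self-contained sketch; the signature is hypothetical, since the real `health_overlay` works against the shared store handle:

```rust
/// Overlay a red probe on top of the job-phase-derived state: if the
/// failure streak has reached the threshold, force "failed" and attach
/// a reason string; otherwise pass the phase-derived state through.
fn apply_health_overlay(
    phase_state: &str,
    consecutive_failures: u32,
    threshold: u32,
    last_error: Option<&str>,
) -> (String, Option<String>) {
    if threshold > 0 && consecutive_failures >= threshold {
        let reason = format!(
            "probe failed {} times: {}",
            consecutive_failures,
            last_error.unwrap_or("no response")
        );
        ("failed".to_string(), Some(reason))
    } else {
        (phase_state.to_string(), None)
    }
}
```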
Verification

`cargo check --workspace --all-targets` is clean. Closing as implemented. Auto-respawn on red is intentionally not wired here — the operator triggers it via `service.reset_failed` (delivered under #86 P4).