[arch] Service readiness contract — services declare ready, supervisor doesn't guess from PID #84
Premise
`hero_proc` decides "service is running" from PID liveness plus (sometimes) socket-file existence. Both are weak proxies for the thing we actually care about: the service is ready to handle requests.

Health probes (#83) bolt a periodic "is the handler responding" check on top of this. That's the coping layer. This issue is the structural fix: services should explicitly declare readiness, the same way systemd's `sd_notify(READY=1)` and Kubernetes' `readinessProbe` work in mature ecosystems.

What's wrong with the current shape
PID alive ≠ service ready
A child process can be spawned but still inside its initialization (loading models, opening sqlite, binding sockets, joining a cluster). PID-alive returns "running" the moment the binary starts executing. Anything that touches the service before it is actually ready (a sibling that races startup, a smoke test that fires too early) sees flaky failures.
Socket file exists ≠ socket accepting connections
After `hero_proc service restart`, the supervisor checks the socket file with `stat()` and reports green. But the new instance may not yet have called `bind()`, or may have bound but not yet `accept()`-ed. The file might also be a stale dentry (#78): present but not connected to a live listener. A quick connect check, sketched below, distinguishes some of these cases.
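To make the gap concrete, a minimal Rust sketch (the function name is hypothetical, not part of hero_proc):

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;

/// stat() only proves a dentry exists. Connecting proves there is a bound,
/// listening socket behind it: connect() fails with ENOENT if bind() has not
/// happened yet, and with ECONNREFUSED on a stale dentry with no listener.
/// Note it can still succeed before accept() runs, thanks to the listen
/// backlog, so even this stronger check is an outside guess.
fn socket_accepting(path: &Path) -> bool {
    UnixStream::connect(path).is_ok()
}
```

Outside checks can only narrow the guess; the contract proposed below removes it.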
Two child processes, supervisor only watches one

A service like `hero_os` has two declared children: `hero_os_server` and `hero_os_ui`. Today's supervision treats the service as a single unit: if one child silently dies, `service list` still shows ● running. The user only finds out when they hit the URL. (Today's session: pid 4993 alive, ui.sock missing.)

Probes detect, contracts prevent
Probes (issue #83) are reactive: every 30s, ask "are you alive?" Worst-case detection latency is a full probe interval. Contracts are proactive: the service itself raises a flag the moment it's ready (or unready), and the supervisor consumes that signal. Detection is effectively instantaneous.
Proposal
A readiness contract for every Hero service. Concretely:
1. Each service declares its socket(s) in its TOML
2. Each service binary signals "ready" explicitly
Mechanism options (pick one, document the choice):
- The service `creat()`s `~/hero/var/sockets/<svc>/.ready` after every declared listener has called `accept()` and is serving. Simplest, no IPC dependency. Supervisor watches with inotify.
- The service `GET`s its own `/health` after binding, then writes a ready flag. Reuses health-probe infra.

Recommendation: the on-disk `.ready` file. Atomic `creat()`, easy to debug, no extra protocol.

3. `service list` reflects three states from the contract, not from PID guesses

- running: `.ready` present, last probe ok.
- starting: `.ready` not yet written, within `ready_timeout`.
- failed: `ready_timeout` elapsed without `.ready`.

4. `service start` blocks until ready (configurable)

`hero_proc service start hero_foundry` returns only when the service has signaled ready (or `ready_timeout` elapsed). No more "started, but not really" race. CLI flag `--no-wait` for batch starts.

5. Each child process in a multi-process service signals separately

`hero_os` has `hero_os_server` and `hero_os_ui`. Each writes its own `.ready` file (`.ready.server`, `.ready.ui`). Supervisor requires all of them. This catches today's "one child silently dead" pattern at startup, not 8 hours later.

6. Liveness vs readiness are separate
A service can transition ready → unready at runtime (e.g., its DB went away). A separate `.healthy` file, which the service touches/removes when it knows, covers that case. Health probes (#83) are still useful as defense-in-depth, but the primary signal is the service's own self-report. A sketch of the supervisor-side state derivation follows this list.
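To make items 2–4 concrete, a minimal sketch of how a supervisor could derive the three startup states from the marker file and `ready_timeout` (all names are illustrative, not hero_proc's actual API):

```rust
use std::path::Path;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum State {
    Starting,
    Running,
    Failed,
}

/// Derive the reported state from the contract alone: presence of `.ready`
/// plus time elapsed since spawn. No PID liveness, no socket stat() guessing.
fn derive_state(sock_dir: &Path, spawned_at: Instant, ready_timeout: Duration) -> State {
    if sock_dir.join(".ready").exists() {
        State::Running // a health probe (#83) could still downgrade this later
    } else if spawned_at.elapsed() < ready_timeout {
        State::Starting
    } else {
        State::Failed
    }
}
```

For a multi-child service (item 5), the same check runs once per `.ready.<child>` marker, and the service counts as running only when every child's marker is present.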
Why this is "long-term," not coping

After this lands, `service start --wait` is reliable. This is what mature service supervisors do; Hero is reinventing this primitive, so let's do it deliberately.
Acceptance
- TOML schema supports `sockets` and `ready_timeout` (sketched below).
- SDK helper for signaling readiness (`.ready` files via `creat()`).
- `hero_proc` watches readiness signals (inotify on the sockets dir).
- `service list` shows starting/running/failed/unhealthy as distinct states sourced from the contract.
- `service start <name>` blocks until ready by default; `--no-wait` available for batch.
- Documented in `docs/dev/SERVICE_READINESS.md`.
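For the first item, a sketch of what the TOML additions and a matching Rust shape could look like (field names are hypothetical; serde with the derive feature is assumed):

```rust
use serde::Deserialize;
use std::time::Duration;

/// Hypothetical per-service TOML fragment this struct would parse:
///
///   sockets = ["server.sock", "ui.sock"]
///   ready_timeout_secs = 30
#[derive(Deserialize)]
struct ReadinessSection {
    /// Declared listeners; each must be serving before `.ready` is written.
    sockets: Vec<String>,
    /// Seconds before a still-unready service is reported as failed.
    ready_timeout_secs: u64,
}

impl ReadinessSection {
    fn ready_timeout(&self) -> Duration {
        Duration::from_secs(self.ready_timeout_secs)
    }
}
```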
Cross-references

`sd_notify(3)`, Kubernetes `readinessProbe`, `s6-rc` notifications.

Signed-off-by: mik-tf
Implemented: `.ready` file readiness contract

The mechanism picked: a `.ready` marker file in the service's socket directory. It is the closest analog to `sd_notify(READY=1)` without dragging systemd into hero. The file lives at `$HERO_SOCKET_DIR/<service>/.ready` and contains the writing PID as plain ASCII (informational; the supervisor only checks for presence).

SDK helper
New module `crates/hero_proc_sdk/src/ready.rs` exposes (sketched below):

- `declare_ready(service_name)`: call once your listener is bound and your handler answers a self-probe.
- `clear_ready(service_name)`: idempotent removal on graceful shutdown.
- `is_ready(service_name)` / `ready_pid(service_name)`: for tests and operator tooling.

Pre-spawn cleanup of the marker is handled by the supervisor (see #78), so a crashed instance does not leave a stale marker for the next one.
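For orientation, a plausible shape of that surface (a sketch under the `$HERO_SOCKET_DIR` layout described above; the real module's signatures and error handling may differ):

```rust
use std::fs::{self, OpenOptions};
use std::io::{ErrorKind, Write};
use std::path::PathBuf;

// Sketch only: resolve $HERO_SOCKET_DIR/<service>/.ready.
fn marker(service: &str) -> PathBuf {
    PathBuf::from(std::env::var("HERO_SOCKET_DIR").expect("HERO_SOCKET_DIR not set"))
        .join(service)
        .join(".ready")
}

pub fn declare_ready(service: &str) -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(marker(service))?;
    // Content is the writing PID as plain ASCII; the supervisor only checks presence.
    write!(f, "{}", std::process::id())
}

pub fn clear_ready(service: &str) -> std::io::Result<()> {
    match fs::remove_file(marker(service)) {
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(()), // idempotent
        other => other,
    }
}

pub fn is_ready(service: &str) -> bool {
    marker(service).exists()
}

pub fn ready_pid(service: &str) -> Option<u32> {
    fs::read_to_string(marker(service)).ok()?.trim().parse().ok()
}
```

Plain create-and-truncate (rather than `O_EXCL`) keeps `declare_ready` idempotent within a run; the supervisor's pre-spawn cleanup is what guards against stale markers.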
Supervisor side
`ServiceSpec` gains `require_ready: bool` (default false to preserve behavior for un-migrated services). When `require_ready=true`, `service.status` returns `state="starting"` until `service_state::run_one_round()` observes the marker file. This means a service that binds its socket but panics in handler init never reports "running", closing the gap where `service list` would show green for a service that is still loading models, opening sqlite, joining a cluster, etc.
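A compressed sketch of that gating (only `require_ready` is from the actual change; the other names are illustrative):

```rust
/// Sketch: the one field added to the spec. The real struct in
/// crates/hero_proc_lib/src/db/service/model.rs carries many more fields.
struct ServiceSpec {
    require_ready: bool, // default false for un-migrated services
}

/// Even with the PID alive, a service that opted into the contract stays
/// "starting" until the supervisor has observed its marker file.
fn reported_state(spec: &ServiceSpec, pid_alive: bool, ready_seen: bool) -> &'static str {
    match (pid_alive, spec.require_ready, ready_seen) {
        (false, _, _) => "stopped",
        (true, true, false) => "starting",
        (true, _, _) => "running",
    }
}
```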
Why `.ready` over alternatives

`.ready` file: zero coupling. The service writes one file; the supervisor stats it. Pre-spawn cleanup (already in place for sockets per #78) covers the crash case.

Wiring
- `ServiceSpec` (`crates/hero_proc_lib/src/db/service/model.rs`)
- Ready helpers (`crates/hero_proc_sdk/src/ready.rs`)
- `ready_seen` flag refreshed every 5s (`supervisor/service_state.rs`)
- `rpc/service.rs` downgrades `running` → `starting` while ready is unseen
- Two ready-file tests pass (round-trip sketched below)
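The kind of round-trip those tests plausibly cover, against the helper sketch above (assumes the `tempfile` dev-dependency; under Rust 2024, `set_var` additionally needs an `unsafe` block):

```rust
#[test]
fn ready_marker_round_trip() {
    // Point the helpers at a throwaway socket dir.
    let dir = tempfile::tempdir().unwrap();
    std::env::set_var("HERO_SOCKET_DIR", dir.path());
    std::fs::create_dir_all(dir.path().join("svc")).unwrap();

    assert!(!is_ready("svc"));
    declare_ready("svc").unwrap();
    assert_eq!(ready_pid("svc"), Some(std::process::id()));
    clear_ready("svc").unwrap();
    assert!(!is_ready("svc"));
}
```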
Out of scope
- Migrating services to actually call `declare_ready()`: that's per-service work tracked in their own repos. Default is `require_ready=false` so nothing breaks.
- Replacing the 5s poll with inotify (the `notify` crate) later if needed.

Verification
`cargo check --workspace --all-targets` clean. SDK tests green. Closing as implemented.