[arch] Service readiness contract — services declare ready, supervisor doesn't guess from PID #84

Closed
opened 2026-04-30 20:12:10 +00:00 by mik-tf · 1 comment
Owner

Premise

hero_proc decides "service is running" from PID liveness + (sometimes) socket-file existence. Both are weak proxies for the thing we actually care about: the service is ready to handle requests.

Health probes (#83) bolt a periodic "is the handler responding" check on top of this. That's the coping layer. This issue is the structural fix: services should explicitly declare readiness, the same way systemd's sd_notify(READY=1) and Kubernetes' readinessProbe work in mature ecosystems.

What's wrong with the current shape

PID alive ≠ service ready

A child process can be spawned but still be inside its initialization (loading models, opening sqlite, binding sockets, joining a cluster). PID-alive reports "running" the moment the binary starts executing. Anything that depends on a service that is not yet ready (a sibling that races startup, a smoke test that fires too early) sees flaky failures.

Socket file exists ≠ socket accepting connections

After hero_proc service restart, the supervisor checks the socket file with stat() and reports green. But the new instance may not yet have called bind() — or may have bound but not yet accept()-ed. The file might also be a stale dentry (#78) — present but not connected to a live listener.

Two child processes, supervisor only watches one

A service like hero_os has two declared children: hero_os_server and hero_os_ui. Today's supervision treats the service as a single unit — if one child silently dies, service list still shows ● running. The user only finds out when they hit the URL. (Today's session, pid 4993 alive, ui.sock missing.)

Probes detect, contracts prevent

Probes (issue #83) are reactive: every 30s, ask "are you alive?" Detection latency is bounded by the probe interval, so a failure can go unnoticed for up to 30s. Contracts are proactive: the service itself raises a flag the moment it's ready (or unready), and the supervisor consumes that signal. Detection is effectively instantaneous.

Proposal

A readiness contract for every Hero service. Concretely:

1. Each service declares its socket(s) in its TOML

```toml
[service]
name = "hero_foundry"
sockets = ["rpc.sock", "ui.sock"]
ready_timeout = "30s"  # how long startup may take before supervisor declares failure
```
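
For illustration, a minimal sketch of how the supervisor could deserialize this block; the crate choices (serde, toml) and field names mirror the example above and are assumptions, not the real ServiceSpec:

```rust
// Sketch only: deserializing the [service] block with serde + toml.
// Field names mirror the example above, not the real ServiceSpec.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ServiceSection {
    name: String,
    // Socket files the service promises to serve, relative to its socket dir.
    sockets: Vec<String>,
    // How long startup may take before the supervisor declares failure, e.g. "30s".
    ready_timeout: String,
}

#[derive(Debug, Deserialize)]
struct ServiceToml {
    service: ServiceSection,
}

fn parse_spec(text: &str) -> Result<ServiceToml, toml::de::Error> {
    toml::from_str(text)
}
```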

2. Each service binary signals "ready" explicitly

Mechanism options (pick one, document the choice):

  • Socket-on-disk: service creat()s ~/hero/var/sockets/<svc>/.ready after every declared listener has called accept() and is serving. Simplest, no IPC dependency. Supervisor watches with inotify.
  • fd-3 protocol: hero_proc passes an inherited fd; service writes "READY" + close. Stronger isolation, more wiring.
  • HTTP self-probe: service GETs its own /health after binding, then writes a ready flag. Reuses health-probe infra.

Recommendation: socket-on-disk file (.ready) — atomic creat(), easy to debug, no extra protocol.
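
As a sketch of that recommendation (the path layout and helper name are illustrative, and a known base socket directory is assumed):

```rust
// Sketch of the socket-on-disk option: write the marker only after every
// declared listener is bound and serving. create_new gives an exclusive,
// atomic creat()-style creation.
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

fn declare_ready(socket_dir: &Path, service: &str) -> std::io::Result<()> {
    let marker = socket_dir.join(service).join(".ready");
    let mut f = OpenOptions::new()
        .write(true)
        .create_new(true) // fails if a stale marker survived pre-spawn cleanup
        .open(&marker)?;
    // Content is informational; the supervisor only checks for presence.
    writeln!(f, "{}", std::process::id())?;
    f.sync_all()
}
```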

3. service list reflects three states from the contract, not from PID guesses

  • ● running — .ready present, last probe ok
  • ◐ starting — PID alive, .ready not yet written, within ready_timeout
  • ◯ stopped / failed — PID gone OR ready_timeout elapsed without .ready
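
Illustratively, the supervisor could derive the displayed state like this (type and field names are placeholders, not hero_proc's):

```rust
// Placeholder types: derive the displayed state from the contract signals.
use std::time::{Duration, Instant};

enum DisplayState {
    Running,  // ● .ready present, last probe ok
    Starting, // ◐ PID alive, no .ready yet, still within ready_timeout
    Failed,   // ◯ PID gone, or ready_timeout elapsed without .ready
}

fn derive_state(
    pid_alive: bool,
    ready_marker_present: bool,
    last_probe_ok: bool,
    started_at: Instant,
    ready_timeout: Duration,
) -> DisplayState {
    if !pid_alive {
        DisplayState::Failed
    } else if ready_marker_present && last_probe_ok {
        DisplayState::Running
    } else if !ready_marker_present && started_at.elapsed() <= ready_timeout {
        DisplayState::Starting
    } else {
        DisplayState::Failed
    }
}
```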

4. service start blocks until ready (configurable)

hero_proc service start hero_foundry returns only when the service has signaled ready (or ready_timeout elapsed). No more "started, but not really" race. CLI flag --no-wait for batch starts.
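
A possible shape for the blocking start, polling for brevity where the proposal suggests inotify (names are illustrative):

```rust
// Sketch: block until the .ready marker appears or ready_timeout elapses.
use std::path::Path;
use std::time::{Duration, Instant};

fn wait_until_ready(marker: &Path, ready_timeout: Duration) -> bool {
    let deadline = Instant::now() + ready_timeout;
    while Instant::now() < deadline {
        if marker.exists() {
            return true; // service signaled readiness
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    false // ready_timeout elapsed: the start is reported as failed
}
```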

5. Each child process in a multi-process service signals separately

hero_os has hero_os_server and hero_os_ui. Each writes its own .ready file (.ready.server, .ready.ui). Supervisor requires all of them. This catches today's "one child silently dead" pattern at startup, not 8 hours later.
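
An aggregate check could be as small as requiring every declared child's marker (the suffix scheme follows the example above and is not final):

```rust
// Illustrative aggregate readiness: every declared child must have written
// its own marker (e.g. .ready.server, .ready.ui).
use std::path::Path;

fn all_children_ready(service_dir: &Path, children: &[&str]) -> bool {
    children
        .iter()
        .all(|child| service_dir.join(format!(".ready.{child}")).exists())
}
```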

6. Liveness vs readiness are separate

A service can transition ready → unready at runtime (e.g., its DB went away). For that, the service maintains a separate .healthy file, creating it while healthy and removing it the moment it knows it is not. Health probes (#83) are still useful as defense-in-depth, but the primary signal is the service's own self-report.
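
A minimal sketch of that runtime signal, assuming the same socket directory layout (helper names are illustrative):

```rust
// Sketch: the service maintains .healthy itself; removing it marks a
// ready → unready transition at runtime.
use std::io::ErrorKind;
use std::path::Path;

fn set_healthy(service_dir: &Path) -> std::io::Result<()> {
    std::fs::write(service_dir.join(".healthy"), b"")
}

fn clear_healthy(service_dir: &Path) -> std::io::Result<()> {
    match std::fs::remove_file(service_dir.join(".healthy")) {
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(()), // already clear
        other => other,
    }
}
```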

Why this is "long-term," not coping

After this lands:

  • Race conditions between service startup and dependents disappear — service start --wait is reliable.
  • "PID alive but actually broken" becomes detectable in milliseconds, not minutes.
  • The probe layer (#83) becomes a sanity check ("does the readiness contract still match reality"), not the primary defense.
  • Multi-child services correctly report aggregate state.

This is what mature service supervisors do. Hero is reinventing this primitive; let's do it deliberately.

Acceptance

  • Service TOML schema extended with sockets and ready_timeout.
  • One canonical readiness mechanism chosen, documented, and shipped (recommend .ready files via creat()).
  • hero_proc watches readiness signals (inotify on the sockets dir).
  • service list shows starting/running/failed/unhealthy as distinct states sourced from the contract.
  • service start <name> blocks until ready by default; --no-wait available for batch.
  • Migration: hero_foundry, hero_osis, hero_os, hero_books migrated as proof-of-shape.
  • Documented in docs/dev/SERVICE_READINESS.md.

Cross-references

  • Structural superseder of #83 (probes detect, this contract prevents).
  • Removes the upstream cause of #78 (dangling socket dentry).
  • Pattern reference: systemd sd_notify(3), Kubernetes readinessProbe, s6-rc notifications.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 20:12:10 +00:00
Owner

Implemented — .ready file readiness contract

The mechanism picked: a .ready marker file in the service's socket directory. Closest analog to sd_notify(READY=1) without dragging systemd into hero. The file lives at $HERO_SOCKET_DIR/<service>/.ready and contains the writing PID as plain ASCII (informational — the supervisor only checks for presence).

SDK helper

New module crates/hero_proc_sdk/src/ready.rs exposes:

  • declare_ready(service_name) — call once your listener is bound and your handler answers a self-probe.
  • clear_ready(service_name) — idempotent removal on graceful shutdown.
  • is_ready(service_name) / ready_pid(service_name) — for tests and operator tooling.

Pre-spawn cleanup of the marker is handled by the supervisor (see #78), so a crashed instance does not leave a stale marker for the next one.
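
A hedged usage sketch from a service's point of view; the helper names come from the comment above, but their exact signatures (and that they return a Result) are assumed:

```rust
// Assumed signatures: declare_ready/clear_ready take the service name and
// return a Result. The startup and serve functions are placeholders.
use hero_proc_sdk::ready::{clear_ready, declare_ready};

fn main() {
    // Bind listeners, open the DB, run a self-probe against the handler...
    bind_and_self_probe();

    // Only now tell the supervisor we are ready to serve.
    declare_ready("hero_foundry").expect("could not write .ready marker");

    serve_until_shutdown();

    // Idempotent: safe even if the marker is already gone.
    clear_ready("hero_foundry").expect("could not clear .ready marker");
}

fn bind_and_self_probe() { /* placeholder for the service's own startup */ }
fn serve_until_shutdown() { /* placeholder for the service's serve loop */ }
```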

Supervisor side

ServiceSpec gains require_ready: bool (default false to preserve behavior for un-migrated services).

When require_ready=true, service.status returns state="starting" until service_state::run_one_round() observes the marker file. This means a service that binds its socket but panics in handler init never reports "running" — closing the gap where service list would show green for a service that is still loading models, opening sqlite, joining a cluster, etc.
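
Illustratively, that overlay amounts to something like the following (names are not the actual hero_proc types):

```rust
// Sketch of the status overlay: when the spec requires readiness and the
// marker has not been observed yet, downgrade "running" to "starting".
fn overlay_state(raw_state: &str, require_ready: bool, ready_seen: bool) -> &str {
    if require_ready && !ready_seen && raw_state == "running" {
        "starting"
    } else {
        raw_state
    }
}
```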

Why .ready over alternatives

  • sd_notify: drags in a systemd-only API and an extra fd protocol. Not portable to launchd/runit users.
  • fd-3 protocol: requires the supervisor to pass a numbered fd through every layer — fragile.
  • explicit RPC: makes the service know about hero_proc; cyclic dependency.
  • .ready file: zero coupling. Service writes one file. Supervisor stats it. Pre-spawn cleanup (already in place for sockets per #78) covers the crash case.

Wiring

  • Field in ServiceSpec (crates/hero_proc_lib/src/db/service/model.rs)
  • SDK helper module (crates/hero_proc_sdk/src/ready.rs)
  • Evaluator updates ready_seen flag every 5s (supervisor/service_state.rs)
  • Status overlay in rpc/service.rs downgrades running → starting while ready is unseen

Two ready-file tests pass

```
test ready::tests::declare_creates_marker ... ok
test ready::tests::clear_is_idempotent ... ok
```

Out of scope

  • Existing services have not been migrated to call declare_ready() — that's per-service work tracked in their own repos. Default is require_ready=false so nothing breaks.
  • inotify watching is not used; the 5s polled stat is enough at our scale and avoids a Linux-only dependency. It's easy to switch to the notify crate later if needed.

Verification

cargo check --workspace --all-targets is clean. SDK tests green. Closing as implemented.
