[arch] Service readiness contract — services declare ready, supervisor doesn't guess from PID #84

Closed
opened 2026-04-30 20:12:10 +00:00 by mik-tf · 1 comment
Owner

Premise

hero_proc decides "service is running" from PID liveness + (sometimes) socket-file existence. Both are weak proxies for the thing we actually care about: the service is ready to handle requests.

Health probes (#83) bolt a periodic "is the handler responding" check on top of this. That's the coping layer. This issue is the structural fix: services should explicitly declare readiness, the same way systemd's sd_notify(READY=1) and Kubernetes' readinessProbe work in mature ecosystems.

What's wrong with the current shape

PID alive ≠ service ready

A child process can be spawned but still be inside its initialization (loading models, opening sqlite, binding sockets, joining a cluster). PID-alive reports "running" the moment the binary starts executing. Anything that depends on a service that is not yet ready (a sibling that races startup, a smoke test that fires too early) sees flaky failures.

Socket file exists ≠ socket accepting connections

After hero_proc service restart, the supervisor checks the socket file with stat() and reports green. But the new instance may not yet have called bind() — or may have bound but not yet accept()-ed. The file might also be a stale dentry (#78) — present but not connected to a live listener.

Two child processes, supervisor only watches one

A service like hero_os has two declared children: hero_os_server and hero_os_ui. Today's supervision treats the service as a single unit — if one child silently dies, service list still shows ● running. The user only finds out when they hit the URL. (Today's session, pid 4993 alive, ui.sock missing.)

Probes detect, contracts prevent

Probes (issue #83) are reactive: every 30s, ask "are you alive?" Detection latency is bounded by the probe interval, so a failure can go unnoticed for up to 30s. Contracts are proactive: the service itself raises a flag the moment it's ready (or unready), and the supervisor consumes that signal. Detection is effectively instantaneous.

Proposal

A readiness contract for every Hero service. Concretely:

1. Each service declares its socket(s) in its TOML

```toml
[service]
name = "hero_foundry"
sockets = ["rpc.sock", "ui.sock"]
ready_timeout = "30s"  # how long startup may take before supervisor declares failure
```
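
For illustration, a minimal sketch of how the supervisor could deserialize this block; the crate choices (serde, toml) and field names mirror the example above and are assumptions, not the real ServiceSpec:

```rust
// Sketch only: deserializing the [service] block with serde + toml.
// Field names mirror the example above, not the real ServiceSpec.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ServiceSection {
    name: String,
    // Socket files the service promises to serve, relative to its socket dir.
    sockets: Vec<String>,
    // How long startup may take before the supervisor declares failure, e.g. "30s".
    ready_timeout: String,
}

#[derive(Debug, Deserialize)]
struct ServiceToml {
    service: ServiceSection,
}

fn parse_spec(text: &str) -> Result<ServiceToml, toml::de::Error> {
    toml::from_str(text)
}
```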

2. Each service binary signals "ready" explicitly

Mechanism options (pick one, document the choice):

  • Socket-on-disk: service creat()s ~/hero/var/sockets/<svc>/.ready after every declared listener has called accept() and is serving. Simplest, no IPC dependency. Supervisor watches with inotify.
  • fd-3 protocol: hero_proc passes an inherited fd; service writes "READY" + close. Stronger isolation, more wiring.
  • HTTP self-probe: service GETs its own /health after binding, then writes a ready flag. Reuses health-probe infra.

Recommendation: socket-on-disk file (.ready) — atomic creat(), easy to debug, no extra protocol.
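
As a sketch of that recommendation (the path layout and helper name are illustrative, and a known base socket directory is assumed):

```rust
// Sketch of the socket-on-disk option: write the marker only after every
// declared listener is bound and serving. create_new gives an exclusive,
// atomic creat()-style creation.
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

fn declare_ready(socket_dir: &Path, service: &str) -> std::io::Result<()> {
    let marker = socket_dir.join(service).join(".ready");
    let mut f = OpenOptions::new()
        .write(true)
        .create_new(true) // fails if a stale marker survived pre-spawn cleanup
        .open(&marker)?;
    // Content is informational; the supervisor only checks for presence.
    writeln!(f, "{}", std::process::id())?;
    f.sync_all()
}
```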

3. service list reflects three states from the contract, not from PID guesses

  • ● running — .ready present, last probe ok
  • ◐ starting — PID alive, .ready not yet written, within ready_timeout
  • ◯ stopped / failed — PID gone OR ready_timeout elapsed without .ready
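
Illustratively, the supervisor could derive the displayed state like this (type and field names are placeholders, not hero_proc's):

```rust
// Placeholder types: derive the displayed state from the contract signals.
use std::time::{Duration, Instant};

enum DisplayState {
    Running,  // ● .ready present, last probe ok
    Starting, // ◐ PID alive, no .ready yet, still within ready_timeout
    Failed,   // ◯ PID gone, or ready_timeout elapsed without .ready
}

fn derive_state(
    pid_alive: bool,
    ready_marker_present: bool,
    last_probe_ok: bool,
    started_at: Instant,
    ready_timeout: Duration,
) -> DisplayState {
    if !pid_alive {
        DisplayState::Failed
    } else if ready_marker_present && last_probe_ok {
        DisplayState::Running
    } else if !ready_marker_present && started_at.elapsed() <= ready_timeout {
        DisplayState::Starting
    } else {
        DisplayState::Failed
    }
}
```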

4. service start blocks until ready (configurable)

hero_proc service start hero_foundry returns only when the service has signaled ready (or ready_timeout elapsed). No more "started, but not really" race. CLI flag --no-wait for batch starts.
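
A possible shape for the blocking start, polling for brevity where the proposal suggests inotify (names are illustrative):

```rust
// Sketch: block until the .ready marker appears or ready_timeout elapses.
use std::path::Path;
use std::time::{Duration, Instant};

fn wait_until_ready(marker: &Path, ready_timeout: Duration) -> bool {
    let deadline = Instant::now() + ready_timeout;
    while Instant::now() < deadline {
        if marker.exists() {
            return true; // service signaled readiness
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    false // ready_timeout elapsed: the start is reported as failed
}
```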

5. Each child process in a multi-process service signals separately

hero_os has hero_os_server and hero_os_ui. Each writes its own .ready file (.ready.server, .ready.ui). Supervisor requires all of them. This catches today's "one child silently dead" pattern at startup, not 8 hours later.
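
An aggregate check could be as small as requiring every declared child's marker (the suffix scheme follows the example above and is not final):

```rust
// Illustrative aggregate readiness: every declared child must have written
// its own marker (e.g. .ready.server, .ready.ui).
use std::path::Path;

fn all_children_ready(service_dir: &Path, children: &[&str]) -> bool {
    children
        .iter()
        .all(|child| service_dir.join(format!(".ready.{child}")).exists())
}
```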

6. Liveness vs readiness are separate

A service can transition ready → unready at runtime (e.g., its DB went away). For that, the service maintains a separate .healthy file, creating it while healthy and removing it the moment it knows it is not. Health probes (#83) are still useful as defense-in-depth, but the primary signal is the service's own self-report.
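
A minimal sketch of that runtime signal, assuming the same socket directory layout (helper names are illustrative):

```rust
// Sketch: the service maintains .healthy itself; removing it marks a
// ready → unready transition at runtime.
use std::io::ErrorKind;
use std::path::Path;

fn set_healthy(service_dir: &Path) -> std::io::Result<()> {
    std::fs::write(service_dir.join(".healthy"), b"")
}

fn clear_healthy(service_dir: &Path) -> std::io::Result<()> {
    match std::fs::remove_file(service_dir.join(".healthy")) {
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(()), // already clear
        other => other,
    }
}
```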

Why this is "long-term," not coping

After this lands:

  • Race conditions between service startup and dependents disappear — service start --wait is reliable.
  • "PID alive but actually broken" becomes detectable in milliseconds, not minutes.
  • The probe layer (#83) becomes a sanity check ("does the readiness contract still match reality"), not the primary defense.
  • Multi-child services correctly report aggregate state.

This is what mature service supervisors do. Hero is reinventing this primitive; let's do it deliberately.

Acceptance

  • Service TOML schema extended with sockets and ready_timeout.
  • One canonical readiness mechanism chosen, documented, and shipped (recommend .ready files via creat()).
  • hero_proc watches readiness signals (inotify on the sockets dir).
  • service list shows starting/running/failed/unhealthy as distinct states sourced from the contract.
  • service start <name> blocks until ready by default; --no-wait available for batch.
  • Migration: hero_foundry, hero_osis, hero_os, hero_books migrated as proof-of-shape.
  • Documented in docs/dev/SERVICE_READINESS.md.

Cross-references

  • Structural superseder of #83 (probes detect, this contract prevents).
  • Removes the upstream cause of #78 (dangling socket dentry).
  • Pattern reference: systemd sd_notify(3), Kubernetes readinessProbe, s6-rc notifications.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 20:12:10 +00:00
Owner

Implemented — .ready file readiness contract

The mechanism picked: a .ready marker file in the service's socket directory. Closest analog to sd_notify(READY=1) without dragging systemd into hero. The file lives at $HERO_SOCKET_DIR/<service>/.ready and contains the writing PID as plain ASCII (informational — the supervisor only checks for presence).

SDK helper

New module crates/hero_proc_sdk/src/ready.rs exposes:

  • declare_ready(service_name) — call once your listener is bound and your handler answers a self-probe.
  • clear_ready(service_name) — idempotent removal on graceful shutdown.
  • is_ready(service_name) / ready_pid(service_name) — for tests and operator tooling.

Pre-spawn cleanup of the marker is handled by the supervisor (see #78), so a crashed instance does not leave a stale marker for the next one.
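
A hedged usage sketch from a service's point of view; the helper names come from the comment above, but their exact signatures (and that they return a Result) are assumed:

```rust
// Assumed signatures: declare_ready/clear_ready take the service name and
// return a Result. The startup and serve functions are placeholders.
use hero_proc_sdk::ready::{clear_ready, declare_ready};

fn main() {
    // Bind listeners, open the DB, run a self-probe against the handler...
    bind_and_self_probe();

    // Only now tell the supervisor we are ready to serve.
    declare_ready("hero_foundry").expect("could not write .ready marker");

    serve_until_shutdown();

    // Idempotent: safe even if the marker is already gone.
    clear_ready("hero_foundry").expect("could not clear .ready marker");
}

fn bind_and_self_probe() { /* placeholder for the service's own startup */ }
fn serve_until_shutdown() { /* placeholder for the service's serve loop */ }
```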

Supervisor side

ServiceSpec gains require_ready: bool (default false to preserve behavior for un-migrated services).

When require_ready=true, service.status returns state="starting" until service_state::run_one_round() observes the marker file. This means a service that binds its socket but panics in handler init never reports "running" — closing the gap where service list would show green for a service that is still loading models, opening sqlite, joining a cluster, etc.
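
Illustratively, that overlay amounts to something like the following (names are not the actual hero_proc types):

```rust
// Sketch of the status overlay: when the spec requires readiness and the
// marker has not been observed yet, downgrade "running" to "starting".
fn overlay_state(raw_state: &str, require_ready: bool, ready_seen: bool) -> &str {
    if require_ready && !ready_seen && raw_state == "running" {
        "starting"
    } else {
        raw_state
    }
}
```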

Why .ready over alternatives

  • sd_notify: drags in a systemd-only API and an extra fd protocol. Not portable to launchd/runit users.
  • fd-3 protocol: requires the supervisor to pass a numbered fd through every layer — fragile.
  • explicit RPC: makes the service know about hero_proc; cyclic dependency.
  • .ready file: zero coupling. Service writes one file. Supervisor stats it. Pre-spawn cleanup (already in place for sockets per #78) covers the crash case.

Wiring

  • Field in ServiceSpec (crates/hero_proc_lib/src/db/service/model.rs)
  • SDK helper module (crates/hero_proc_sdk/src/ready.rs)
  • Evaluator updates ready_seen flag every 5s (supervisor/service_state.rs)
  • Status overlay in rpc/service.rs downgrades running → starting while ready is unseen

Two ready-file tests pass

```
test ready::tests::declare_creates_marker ... ok
test ready::tests::clear_is_idempotent ... ok
```

Out of scope

  • Existing services have not been migrated to call declare_ready() — that's per-service work tracked in their own repos. Default is require_ready=false so nothing breaks.
  • inotify watching is not used; the 5s polled stat is enough at our scale and avoids a Linux-only dependency. It's easy to switch to the notify crate later if needed.

Verification

cargo check --workspace --all-targets is clean. SDK tests green. Closing as implemented.
