[meta] hero_proc reliability roadmap — every observed failure mode + tracking + order of attack #86

Closed
opened 2026-04-30 23:01:22 +00:00 by mik-tf · 1 comment

What this issue is

Single index for every observed reliability bug in hero_proc, the failure-mode patterns, and the issues that track each fix. Intended for an agent to systematically work through. Content is observation-driven — every entry below has a concrete reproduction, a date observed, and (where filed) the issue tracking the durable fix.

The recurring theme: hero_proc's status reports drift from reality. It reports services as running when they're broken, reports them as failed when they're working, fails to detect when one of its children dies, and accumulates state inside its own process that becomes a single point of failure for the whole demo.


A. Observed failure modes (concrete evidence)

A1. service list reports green when handler is dead

Evidence (2026-04-30 herodemo): hero_foundry was reported as ● running by hero_proc service list. The PIDs were alive, both rpc.sock and ui.sock existed. But:

  • curl --unix-socket .../hero_foundry/rpc.sock /api/files/... → empty reply, connection accepted then immediately closed
  • /api/files/... requests via the gateway returned 502
  • Internal heartbeat probes (/health, /rpc) on the same socket DID succeed

So the listener was alive for some paths but not others. Restart fixed it.

Same shape on hero_os earlier in the day: service list said running with both child action names, but hero_os_ui was actually dead and ui.sock was missing. The supervisor had no idea one of the two child processes had died.

Tracking: #83 (handler-responsive probes) + lhumina_code/home#202 (half-broken listener root cause)
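
For illustration, a minimal sketch of the kind of handler-responsive probe #83 asks for, assuming tokio and a hand-rolled HTTP/1.1 request (the helper name and probe path are hypothetical, not the shipped code). The point is that an accepted-then-immediately-closed connection — exactly the hero_foundry symptom above — registers as a failure, because the probe requires a real status line from a real handler path:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;
use tokio::time::{timeout, Duration};

/// Hypothetical probe: "healthy" only if a real handler path answers with an
/// HTTP status line, not merely if the socket accepts the connection.
async fn probe_handler(sock: &str, path: &str) -> bool {
    let attempt = async {
        let mut stream = UnixStream::connect(sock).await.ok()?;
        let req = format!(
            "GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
        );
        stream.write_all(req.as_bytes()).await.ok()?;
        let mut buf = vec![0u8; 512];
        let n = stream.read(&mut buf).await.ok()?;
        // Empty reply (connection accepted then closed) counts as a failure
        // even though connect() succeeded.
        if n == 0 {
            return None;
        }
        // A stricter probe would also check the status code, not just the line.
        let head = String::from_utf8_lossy(&buf[..n]);
        head.starts_with("HTTP/1.1").then_some(())
    };
    timeout(Duration::from_secs(2), attempt).await.ok().flatten().is_some()
}
```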

A2. service list reports red when reality is green

Evidence (2026-04-30 herodemo, post-service_proc start --reset): Right after restarting hero_proc, service list showed every service except hero_proc_ui as ✗ failed with PID 0. But:

  • All processes were alive (verified by ps)
  • All services responded on their sockets
  • All gateway probes returned the expected status codes
  • Per-context routing returned distinct data
  • The user-visible demo was fully functional in the browser

The supervisor's bookkeeping marked the previous run_id=180 as status=error during shutdown, and that propagated to "service status = failed" for every service even after autostart respawned them all successfully.

Tracking: not yet filed as a discrete issue — folds into #83 (the supervisor's status accounting needs to be sourced from a real probe, not from the most-recent run_id status).

A3. Dangling rpc.sock dentry after supervisor restart

Evidence: socket file present on disk but kernel listener gone, or vice versa. Cause: hero_proc creates the socket file at startup, child binds to it, on supervisor restart the cleanup ordering can leave the file pointing at a dead inode.

Tracking: #78
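
A sketch of one possible pre-bind cleanup, assuming the usual Unix behaviour that connecting to a socket file with no listener behind it fails with ECONNREFUSED (function name is hypothetical; this is not the #78 implementation):

```rust
use std::io::ErrorKind;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Hypothetical pre-bind check: if the socket file exists on disk but nothing
/// listens behind it (stale dentry from a previous run), unlink it before
/// binding. If something IS listening, refuse to clobber it.
fn bind_cleanly(sock: &Path) -> std::io::Result<UnixListener> {
    if sock.exists() {
        match UnixStream::connect(sock) {
            // A live listener answered: another instance owns this socket.
            Ok(_) => {
                return Err(std::io::Error::new(
                    ErrorKind::AddrInUse,
                    "socket already has a live listener",
                ))
            }
            // ConnectionRefused => file points at a dead inode; safe to unlink.
            Err(e) if e.kind() == ErrorKind::ConnectionRefused => {
                std::fs::remove_file(sock)?;
            }
            Err(e) => return Err(e),
        }
    }
    UnixListener::bind(sock)
}
```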

A4. Children orphan to PID 1 and survive supervisor restart

Evidence (2026-04-30 herodemo, observed during today's restart): hero_proc_server PID had PPid: 1 (init) — it was already orphaned from its own startup chain. When service_proc start --reset killed and respawned the supervisor, the children were correctly killed and replaced (via the autostart-on-restart mechanism). But:

  • During the kill→respawn window, sockets briefly disappear
  • Each respawn issues a fresh bind() which creates the dangling-dentry race (A3)
  • The "I'm a fresh hero_proc, what do I supervise?" boot path is the most fragile path in the codebase

This is a family of bugs (A1, A2, A3 all stem from the supervisor not having a clean restart contract).

Tracking: #84 (service readiness contract — services declare ready, supervisor doesn't guess from PID)

A5. Memory growth under log volume → OOM

Evidence (2026-04-30 herodemo): hero_proc_server at 7.9 GB resident after ~5h26m uptime. Clearing it required a restart cascade: hero_proc was operationally restarted → child services restarted → some of them landed in the half-broken state described in A1.

Status: shipped 2026-04-30 — bounded mpsc channel + drop-on-full + visibility counter. Caps in-channel memory at ~10 MB. #80 (closed)

The architectural follow-up — why does the channel fill at all? — is #85 (measure producer rate, size SQLite, or move logs out).
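
For reference, a minimal sketch of the bounded-channel + drop-on-full + visibility-counter shape that #80 describes; the names, capacity, and payload type here are assumptions, not the shipped code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

static DROPPED_LOG_LINES: AtomicU64 = AtomicU64::new(0);

/// Hypothetical producer-side helper: never block or buffer unboundedly on a
/// slow log writer; drop the line and count the drop instead.
fn enqueue_log(tx: &mpsc::Sender<String>, line: String) {
    match tx.try_send(line) {
        Ok(()) => {}
        // Channel full: drop rather than letting resident memory grow.
        Err(mpsc::error::TrySendError::Full(_)) => {
            DROPPED_LOG_LINES.fetch_add(1, Ordering::Relaxed);
        }
        // Writer task gone (supervisor shutdown path): nothing to do.
        Err(mpsc::error::TrySendError::Closed(_)) => {}
    }
}

fn make_log_channel() -> (mpsc::Sender<String>, mpsc::Receiver<String>) {
    // Capacity bounds in-channel memory; 4096 lines is an assumption here,
    // not the shipped value.
    mpsc::channel(4096)
}
```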

A6. sysinfo /proc fd retention (kept-files-open default)

Evidence: hero_proc_server fd table grew without bound because sysinfo's default keeps /proc/<pid>/stat files open across refresh cycles.

Status: shipped — set_open_files_limit(0) + ProcessRefreshKind::nothing().with_memory().with_cpu(). #81 (closed)
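
A minimal sketch wrapping the configuration named above (the function name is hypothetical, and the exact refresh entry point varies across sysinfo versions, so the refresh loop itself is elided):

```rust
use sysinfo::{ProcessRefreshKind, System};

/// Sketch wrapping the calls named in the fix: keep sysinfo from retaining
/// /proc fds across refreshes and from gathering fields the supervisor never
/// reads. The version-dependent refresh_processes_* call is left out.
fn process_collector() -> (System, ProcessRefreshKind) {
    // Do not keep /proc/<pid>/stat handles open between refresh cycles.
    sysinfo::set_open_files_limit(0);
    // Start from "refresh nothing" and opt in to memory + CPU only.
    let kind = ProcessRefreshKind::nothing().with_memory().with_cpu();
    (System::new(), kind)
}
```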

A7. Half-broken listener pattern (OServer-wide, not hero_proc-specific)

Evidence (2026-04-30 herodemo): hero_foundry rpc.sock listener alive, accepting connections, internal heartbeats getting through, but every user-facing path returned an empty reply. The hyper per-connection task was probably panicking while holding shared state, leaving the listener dispatching into a poisoned state.rs.

This affects every OServer-pattern service (hero_foundry, hero_osis, hero_books, hero_agent, ...) — the supervisor doesn't know which one is in this state at any given moment.

Tracking: lhumina_code/home#202 (specific case study), lhumina_code/home#204 (architectural — make panic isolation an actual property of OServer)
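
One possible shape for the panic-isolation property #204 asks for, sketched with futures' catch_unwind. Whether this is the right layer depends on where OServer's shared state actually gets poisoned, so treat it as an illustration rather than the proposed design:

```rust
use futures::future::FutureExt;
use std::panic::AssertUnwindSafe;

/// Hypothetical wrapper: run one request handler, convert a panic into a
/// logged error so the next request on the same listener still works.
async fn run_isolated<F, T>(handler: F) -> Result<T, String>
where
    F: std::future::Future<Output = T>,
{
    match AssertUnwindSafe(handler).catch_unwind().await {
        Ok(resp) => Ok(resp),
        Err(panic) => {
            let msg = panic
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| panic.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".into());
            // Surface the panic as a structured log line instead of letting
            // it vanish into a stderr buffer dropped on shutdown.
            eprintln!("handler panicked: {msg}");
            Err(msg)
        }
    }
}
```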

A8. Restart cascade collateral damage

Evidence (2026-04-30, today's session): when we restarted hero_proc to apply the bounded-channel fix, the child respawn left service list in a broken-reporting state (A2). When we restarted hero_proc earlier in the day to clear the 7.9 GB leak (A5), hero_os was knocked into the missing-ui.sock state (A1) and hero_foundry was knocked into the half-broken-listener state (A7). Every restart of hero_proc is currently a roll of the dice for which child gets damaged.

Tracking: not its own issue — composite of A1+A3+A7 fixes.


B. Architectural / structural issues filed (the proper fixes)

| # | Title | Repo | What it fixes |
|---|---|---|---|
| 78 | Dangling rpc.sock dentry on supervisor restart | hero_proc | A3 |
| 83 | PID-alive → handler-responsive probes | hero_proc | A1, A2 |
| 84 | Service readiness contract — services declare ready | hero_proc | A4, parts of A1/A2/A3 |
| 85 | Log pipeline performance — root cause behind #80 | hero_proc | A5 (architectural; #80 was the safety-net) |
| 202 | Half-broken running service pattern | home | A7 (case study + repro) |
| 204 | OServer panic isolation | home | A7 (architectural) |

C. What we want hero_proc to do that it doesn't

  1. Supervise itself — at minimum, an external watchdog (systemd, runit, monit, or a self-restart script with health probe). Today: hero_proc_server is daemonized with PPID 1 and has no parent watching it.
  2. Real readiness contract — services should write a .ready file (or sd_notify, or fd-3 protocol) when their listener is bound AND their handler responds to a self-probe. Supervisor watches the file. (#84; a service-side sketch follows after this list)
  3. Real liveness probe — periodically hit each declared socket with a real handler request. PID-alive is a bad proxy. (#83)
  4. Idempotent, incremental restart — service_proc start --reset should never produce an A2-style red status board. The restart sequence should: probe each child → only kill and respawn the genuinely-failed ones → leave healthy ones alone.
  5. Drop coupling between log volume and supervisor liveness — even with #80 shipped, all logs from all supervised processes still land in hero_proc's SQLite. A single chatty service can saturate the writer. (#85)
  6. Fail loudly, not silently — a panic in any handler task should surface as a structured log line that hero_proc service logs catches, not vanish into a stderr buffer that gets dropped on shutdown. (#204)
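
A minimal sketch of the service-side half of the readiness contract in item 2, assuming the .ready-file variant (the helper name, probe shape, and file contents are illustrative, not the #84 spec):

```rust
use std::path::Path;

/// Hypothetical service-side readiness: write the .ready marker only after the
/// listener is bound and a self-probe against a real handler path succeeds.
/// `self_probe` is expected to issue a real request (e.g. GET /health over
/// rpc.sock) against the already-running handler loop.
fn declare_ready(sock_dir: &Path, self_probe: impl Fn() -> bool) -> std::io::Result<()> {
    if !self_probe() {
        // Listener may be bound, but the handler is not answering yet: do not
        // let the supervisor count us as ready.
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "handler did not answer self-probe",
        ));
    }
    // The supervisor watches for this file next to the service's sockets.
    std::fs::write(sock_dir.join(".ready"), b"ok")
}
```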

D. Suggested order of attack

Priority ordering by ROI for an agent or implementer:

  1. #83 (handler-responsive probes) — the broadest single change. Fixes A1, A2, exposes A3 and A4 reliably so they can be fixed next. Without this we keep playing whack-a-mole.
  2. #84 (readiness contract) — once #83 is in, #84 is the next layer: have services explicitly declare ready instead of guessing. Cleanly fixes A4 and reduces A3.
  3. #78 (dangling dentry) — small, focused, but blocked by #84 (clean readiness signal makes the dentry race detectable).
  4. #85 (log pipeline) — measure first, then decide: reduce volume, optimise SQLite, or move logs out. The bounded-channel safety net (#80) holds the line until this lands.
  5. #202 (half-broken listener case study) + #204 (OServer panic isolation) — these are home-repo issues but every OServer service is affected. Should ship together.
  6. External watchdog for hero_proc itself — not yet filed as its own issue; folds into ops work on hero_demo.

E. Acceptance for "hero_proc is reliable"

  • After a fresh service_proc start --reset against a healthy demo, service list reports each running service as ● running within 30 seconds, not ✗ failed (A2 closed).
  • After any service crash (PID dies), service list flips that service to ✗ within one probe cycle, and the supervisor either restarts it or alerts (A1, #83).
  • After hero_proc itself crashes, an external watchdog respawns it within 30s. (D-6)
  • No service can OOM hero_proc by emitting log volume. (#80 already shipped; #85 is the architectural answer.)
  • A panicked handler in any OServer service does not poison shared state — it dies, logs, the next request works. (#204)
  • All filed issues above either closed with shipped fixes or explicitly deferred with a tracking comment.

F. Cross-references

  • Closed today: #80, #81
  • Open architectural issues: 78, 83, 84, 85 (this repo); 202, 204 (home repo)
  • Demo-side coping that depends on hero_proc reliability: lhumina_code/home#201 (smoke loop — external probe that catches what hero_proc misses)

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 23:01:22 +00:00

Plan + implementation status (logs work explicitly out of scope per request)

Detailed plan in repo at specs/issue_86_plan.md. Five sub-issues in dependency order; #85 (logs) deferred per instructions, watchdog dropped per instructions.

Status

| Priority | Issue | Status |
|---|---|---|
| P0 | service.status from current run, not stale terminal phase | already shipped via #57 — verified by tests at `crates/hero_proc_server/src/rpc/service.rs:723-801` |
| P1 | #83 handler-responsive probes | implemented — see comment on #83 |
| P2 | #84 readiness contract via `.ready` file | implemented — see comment on #84 |
| P3 | #78 dentry race fix | implemented — see comment on #78 |
| P4 | `service.reset_failed` (idempotent incremental restart) | implemented — new RPC + CLI subcommand |

All five layers compile cleanly via cargo check --workspace --all-targets and full cargo build --workspace. Existing test suites for hero_proc_lib::db::service::model, hero_proc_server::rpc::service, and hero_proc_sdk::ready pass.

Architectural shape of what was added

A new module crates/hero_proc_server/src/supervisor/service_state.rs owns the per-service liveness/readiness/dentry state. The supervisor's main loop runs one evaluator round every 5s (throttled well below the 500ms tick rate because probes do real network IO). The store is shared with the RPC layer so service.status and service.status_full apply a health overlay on top of the existing job-phase-derived state:

  • if probe failures hit threshold, state → failed
  • if a declared socket is missing on disk, state → failed
  • if require_ready is set and .ready not seen, state → starting

Per-action health_checks continue to work in parallel — the new probe is service-level and does not replace them.
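
As a rough illustration of the overlay rules above (not the actual service_state.rs types), the decision reduces to something like:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Health {
    Running,
    Starting,
    Failed,
}

/// Facts gathered by one evaluator round for a single service.
struct Observed {
    consecutive_probe_failures: u32,
    missing_declared_socket: bool,
    require_ready: bool,
    ready_file_seen: bool,
}

/// Hypothetical overlay: evaluator facts override the job-phase-derived state
/// that service.status would otherwise report.
fn overlay(base: Health, o: &Observed, failure_threshold: u32) -> Health {
    if o.missing_declared_socket || o.consecutive_probe_failures >= failure_threshold {
        return Health::Failed;
    }
    if o.require_ready && !o.ready_file_seen {
        return Health::Starting;
    }
    base
}
```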

The idempotent restart variant lives behind service.reset_failed (RPC) and hero_proc service reset-failed (CLI). --force falls back to legacy "restart everything" behavior. Without --force, healthy services are explicitly listed as left_alone in the response.
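
The core of that behaviour can be sketched as a pure function (names and signatures hypothetical; the real RPC works against the supervisor's service store):

```rust
struct ResetOutcome {
    restarted: Vec<String>,
    left_alone: Vec<String>,
}

/// Hypothetical core of an idempotent reset: probe first, restart only what is
/// actually failed, and report untouched services explicitly.
fn reset_failed(
    services: &[String],
    is_healthy: impl Fn(&str) -> bool,
    restart: impl Fn(&str),
    force: bool,
) -> ResetOutcome {
    let mut out = ResetOutcome { restarted: Vec::new(), left_alone: Vec::new() };
    for svc in services {
        if force || !is_healthy(svc) {
            restart(svc);
            out.restarted.push(svc.clone());
        } else {
            // Healthy services are reported, not touched — the A2/A8 collateral
            // damage comes from restarting children that were fine.
            out.left_alone.push(svc.clone());
        }
    }
    out
}
```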

What did NOT change

  • No new SQLite schema. New ServiceSpec fields (probe, sockets, require_ready) round-trip through the existing spec_json column — services that never set them stay byte-compatible. (A sketch of that shape follows after this list.)
  • No new socket conventions. .ready lives next to existing sockets in $HERO_SOCKET_DIR/<service>/.
  • No service auto-restart on probe failure yet — this commit makes the supervisor report the truth honestly. The next layer (auto-respawn on probe red) is intentionally separate so the operator can drive the policy.
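
A sketch of how such fields can stay byte-compatible through an existing JSON column, assuming serde; the field names come from the list above, but the struct and attributes shown are an assumption, not the actual ServiceSpec definition:

```rust
use serde::{Deserialize, Serialize};

fn is_false(v: &bool) -> bool {
    !*v
}

/// Sketch of backward-compatible spec fields: absent in old spec_json blobs,
/// omitted again on re-serialization, so untouched services stay byte-stable.
#[derive(Serialize, Deserialize)]
struct ServiceSpecSketch {
    name: String,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    probe: Option<String>,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    sockets: Vec<String>,
    #[serde(default, skip_serializing_if = "is_false")]
    require_ready: bool,
}
```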

Sub-issue references with concrete pointers in the comments below.
