[meta] hero_proc reliability roadmap — every observed failure mode + tracking + order of attack #86
What this issue is
Single index for every observed reliability bug in hero_proc, the failure-mode patterns, and the issues that track each fix. Intended for an agent to systematically work through. Content is observation-driven — every entry below has a concrete reproduction, a date observed, and (where filed) the issue tracking the durable fix.
The recurring theme: hero_proc's status reports drift from reality. It reports services as running when they're broken, reports them as failed when they're working, fails to detect when one of its children dies, and accumulates state inside its own process that becomes a single point of failure for the whole demo.
A. Observed failure modes (concrete evidence)
A1. `service list` reports green when handler is dead

Evidence (2026-04-30 herodemo): `hero_foundry` was reported as `● running` by `hero_proc service list`. The PIDs were alive, both rpc.sock and ui.sock existed. But:
- `curl --unix-socket .../hero_foundry/rpc.sock /api/files/...` → empty reply, connection accepted then immediately closed
- `/api/files/...` requests via the gateway returned 502
- other paths (`/health`, `/rpc`) on the same socket DID succeed

So the listener was alive for some paths but not others. Restart fixed it.

Same shape on `hero_os` earlier in the day: `service list` said `running` with both child action names, but `hero_os_ui` was actually dead and ui.sock was missing. The supervisor had no idea one of the two child processes had died.

Tracking: #83 (handler-responsive probes) + lhumina_code/home#202 (half-broken listener root cause)
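The probe #83 asks for has to exercise the handler, not just the socket. A minimal sketch of what a handler-responsive probe could look like, assuming tokio and a plain HTTP/1.1 request over the service's unix socket (function name, timeout, and probe path are illustrative, not hero_proc's actual implementation):

```rust
// Handler-responsive probe sketch: connect to rpc.sock, send a real request,
// and only report green if a parseable HTTP response comes back.
use std::time::Duration;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;
use tokio::time::timeout;

async fn probe_unix_http(sock: &str, path: &str) -> bool {
    let fut = async {
        let mut stream = UnixStream::connect(sock).await.ok()?;
        let req = format!(
            "GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
        );
        stream.write_all(req.as_bytes()).await.ok()?;
        let mut buf = vec![0u8; 512];
        let n = stream.read(&mut buf).await.ok()?;
        // The A1 symptom (connection accepted, then closed with no bytes) shows
        // up here as a zero-length read or a non-HTTP reply — exactly what
        // PID- and socket-existence checks cannot see.
        let head = String::from_utf8_lossy(&buf[..n]);
        Some(n > 0 && head.starts_with("HTTP/1.1"))
    };
    matches!(timeout(Duration::from_secs(2), fut).await, Ok(Some(true)))
}
```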
A2. `service list` reports red when reality is green

Evidence (2026-04-30 herodemo, post-`service_proc start --reset`): right after restarting hero_proc, `service list` showed every service except `hero_proc_ui` as `✗ failed` with PID 0. But: the respawned processes were all actually alive (visible in `ps`).

The supervisor's bookkeeping marked the previous `run_id=180` as `status=error` during shutdown, and that propagated to "service status = failed" for every service even after autostart respawned them all successfully.

Tracking: not yet filed as a discrete issue — folds into #83 (the supervisor's status accounting needs to be sourced from a real probe, not from the most-recent `run_id` status).

A3. Dangling rpc.sock dentry after supervisor restart
Evidence: socket file present on disk but the kernel listener gone, or vice versa. Cause: hero_proc creates the socket file at startup and the child binds to it; on supervisor restart, the cleanup ordering can leave the file pointing at a dead inode.
Tracking: #78
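One common shape for the #78 fix, sketched under the assumption that the binding side makes the call: check whether the existing dentry still has a live listener behind it, and only unlink it when nothing answers (the function name is illustrative):

```rust
// Guard against the dangling-dentry race: never blindly unlink rpc.sock,
// never blindly bind over it.
use std::io;
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

fn bind_cleaning_stale(path: &Path) -> io::Result<UnixListener> {
    if path.exists() {
        match UnixStream::connect(path) {
            // Someone is still listening: refuse to steal the address.
            Ok(_) => {
                return Err(io::Error::new(io::ErrorKind::AddrInUse, "live listener present"))
            }
            // Connection refused / similar: the dentry is dangling, remove it.
            Err(_) => std::fs::remove_file(path)?,
        }
    }
    UnixListener::bind(path)
}
```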
A4. Children orphan to PID 1 and survive supervisor restart
Evidence (2026-04-30 herodemo, observed during today's restart):
`hero_proc_server`'s PID had `PPid: 1` (init) — it was already orphaned from its own startup chain. When `service_proc start --reset` killed and respawned the supervisor, the children were correctly killed and replaced (via the autostart-on-restart mechanism). But: the new children `bind()` over socket paths the previous run may have left behind, which creates the dangling-dentry race (A3).

This is a family of bugs (A1, A2, A3 all stem from the supervisor not having a clean restart contract).
Tracking: #84 (service readiness contract — services declare ready, supervisor doesn't guess from PID)
A5. Memory growth under log volume → OOM
Evidence (2026-04-30 herodemo): hero_proc_server at 7.9 GB resident after ~5h26m uptime. Recovery cascaded: hero_proc was operationally restarted → child services restarted → some of them landed in the half-broken state from A1.
Status: shipped 2026-04-30 — bounded mpsc channel + drop-on-full + visibility counter. Caps in-channel memory at ~10 MB. #80 (closed)
The architectural follow-up — why does the channel fill at all? — is #85 (measure producer rate, size SQLite, or move logs out).
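For reference, the general shape of the shipped mitigation — a bounded channel with drop-on-full and a counter so dropped lines stay visible — looks roughly like this (capacity, names, and payload type are illustrative, not the actual #80 code):

```rust
// Bounded log path: producers never block and never buffer unboundedly;
// when the writer falls behind, lines are dropped and counted.
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

static DROPPED_LOG_LINES: AtomicU64 = AtomicU64::new(0);

fn log_channel(capacity: usize) -> (mpsc::Sender<String>, mpsc::Receiver<String>) {
    // Capacity bounds in-channel memory instead of letting it grow to gigabytes.
    mpsc::channel(capacity)
}

fn push_log(tx: &mpsc::Sender<String>, line: String) {
    // try_send returns immediately; a full channel becomes a counted drop,
    // not resident memory.
    if tx.try_send(line).is_err() {
        DROPPED_LOG_LINES.fetch_add(1, Ordering::Relaxed);
    }
}
```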
A6. sysinfo /proc fd retention (kept-files-open default)
Evidence: hero_proc_server's fd table grew without bound because sysinfo's default keeps `/proc/<pid>/stat` files open across refresh cycles.

Status: shipped — `set_open_files_limit(0)` + `ProcessRefreshKind::nothing().with_memory().with_cpu()`. #81 (closed)

A7. Half-broken listener pattern (OServer-wide, not hero_proc-specific)
Evidence (2026-04-30 herodemo): hero_foundry's rpc.sock listener was alive and accepting connections, internal heartbeats were getting through, but every user-facing path returned an empty reply. The hyper per-connection task was probably panicking while holding shared state, leaving the listener dispatching to a poisoned `state.rs`.

This affects every OServer-pattern service (hero_foundry, hero_osis, hero_books, hero_agent, ...) — the supervisor doesn't know which one is in this state at any given moment.
Tracking: lhumina_code/home#202 (specific case study), lhumina_code/home#204 (architectural — make panic isolation an actual property of OServer)
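A sketch of the panic-isolation direction #204 argues for, assuming the `futures` crate's `catch_unwind` combinator; `handle` is a placeholder for OServer's real per-request handler, not its actual API:

```rust
// Confine a panicking request to that request: the connection task reports an
// error instead of unwinding while it holds shared state.
use futures::FutureExt;
use std::panic::AssertUnwindSafe;

async fn handle(_req: u32) -> Result<String, String> {
    Ok("ok".to_string())
}

async fn handle_isolated(req: u32) -> Result<String, String> {
    match AssertUnwindSafe(handle(req)).catch_unwind().await {
        Ok(result) => result,
        // The panic payload is swallowed here; the process keeps serving.
        // Shared state stays usable only if its locks don't poison on panic
        // (e.g. parking_lot-style mutexes), which is part of the same issue.
        Err(_panic_payload) => Err("handler panicked".to_string()),
    }
}
```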
A8. Restart cascade collateral damage
Evidence (2026-04-30, today's session): when we restarted hero_proc to apply the bounded-channel fix, the child respawn left `service list` in a broken-reporting state (A2). When we restarted hero_proc earlier in the day to clear the 7.9 GB leak (A5), hero_os was knocked into the missing-ui.sock state (A1) and hero_foundry was knocked into the half-broken-listener state (A7). Every restart of hero_proc is currently a roll of the dice for which child gets damaged.

Tracking: not its own issue — composite of the A1+A3+A7 fixes.
B. Architectural / structural issues filed (the proper fixes)
C. What we want hero_proc to do that it doesn't
- Services signal readiness themselves: write a `.ready` file (or sd_notify, or an fd-3 protocol) when their listener is bound AND their handler responds to a self-probe. The supervisor watches the file. (#84) A service-side sketch follows below.
- `service_proc start --reset` should never produce an A2-style red status board. The restart sequence should: probe each child → only kill and respawn the genuinely-failed ones → leave healthy ones alone.
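A minimal sketch of the service-side half of that readiness contract, assuming the marker lives next to the service's sockets; the helper name and the write-then-rename choice are illustrative, not the #84 implementation:

```rust
// Call this only after the listener is bound AND a self-probe against it
// succeeded — that ordering is the whole point of the contract.
use std::io;
use std::path::Path;

fn announce_ready(service_dir: &Path) -> io::Result<()> {
    // Write-then-rename so the supervisor never observes a half-written marker.
    let tmp = service_dir.join(".ready.tmp");
    std::fs::write(&tmp, std::process::id().to_string())?;
    std::fs::rename(tmp, service_dir.join(".ready"))
}
```

The supervisor side then only has to watch for the file's appearance instead of guessing readiness from PIDs.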
D. Suggested order of attack

Priority ordering by ROI for an agent or implementer:
E. Acceptance for "hero_proc is reliable"
- After `service_proc start --reset` against a healthy demo, `service list` reports each running service as `● running` within 30 seconds, not `✗ failed` (A2 closed).
- When a running service's handler stops responding, `service list` flips that service to ✗ within one probe cycle, and the supervisor either restarts it or alerts (A1, #83).

F. Cross-references
Signed-off-by: mik-tf
Plan + implementation status (logs work explicitly out of scope per request)
Detailed plan in repo at `specs/issue_86_plan.md`. Five sub-issues in dependency order; #85 (logs) deferred per instructions, watchdog dropped per instructions.

Status
`crates/hero_proc_server/src/rpc/service.rs:723-801` · `.ready` file

All five layers compile cleanly via `cargo check --workspace --all-targets` and a full `cargo build --workspace`. Existing test suites for `hero_proc_lib::db::service::model`, `hero_proc_server::rpc::service`, and `hero_proc_sdk::ready` pass.
A new module `crates/hero_proc_server/src/supervisor/service_state.rs` owns the per-service liveness/readiness/dentry state. The supervisor's main loop runs one evaluator round every 5 s (throttled below the 500 ms tick because probes do real network IO). The store is shared with the RPC layer, so `service.status` and `service.status_full` apply a health overlay on top of the existing job-phase-derived state (sketched below):
- probe fails, state → `failed`
- socket dentry is dangling, state → `failed`
- `require_ready` is set and `.ready` not seen, state → `starting`

Per-action `health_checks` continue to work in parallel — the new probe is service-level and does not replace them.
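A compressed sketch of how such an overlay can compose with the job-phase state; the type names and exact conditions are illustrative stand-ins for the real `service_state.rs`:

```rust
// The overlay only ever downgrades a "running" verdict; it never upgrades a
// state the job phase already reports as failed or starting.
#[derive(Clone, Copy, PartialEq)]
enum ServiceState { Starting, Running, Failed }

struct Health {
    probe_ok: bool,      // handler-responsive probe result
    dentry_ok: bool,     // socket file matches a live listener
    ready_seen: bool,    // .ready marker observed
    require_ready: bool, // service opted into the readiness contract
}

fn overlay(base: ServiceState, h: &Health) -> ServiceState {
    if base != ServiceState::Running {
        return base;
    }
    if !h.probe_ok || !h.dentry_ok {
        ServiceState::Failed
    } else if h.require_ready && !h.ready_seen {
        ServiceState::Starting
    } else {
        ServiceState::Running
    }
}
```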
The idempotent restart variant lives behind `service.reset_failed` (RPC) and `hero_proc service reset-failed` (CLI). `--force` falls back to the legacy "restart everything" behavior. Without `--force`, healthy services are explicitly listed as `left_alone` in the response.
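The decision logic behind that, sketched with placeholder probe/restart calls; the `left_alone` field mirrors the response described above, everything else is illustrative:

```rust
// Probe first, restart only what is genuinely unhealthy, report the rest.
struct ResetReport {
    restarted: Vec<String>,
    left_alone: Vec<String>,
}

async fn probe(_name: &str) -> bool { true }   // stand-in for the real probe
async fn restart(_name: &str) {}               // stand-in for the real restart

async fn reset_failed(services: &[String], force: bool) -> ResetReport {
    let mut report = ResetReport { restarted: Vec::new(), left_alone: Vec::new() };
    for name in services {
        let healthy = probe(name).await;
        if force || !healthy {
            // --force reproduces the legacy restart-everything behavior.
            restart(name).await;
            report.restarted.push(name.clone());
        } else {
            report.left_alone.push(name.clone());
        }
    }
    report
}
```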
What did NOT change

`ServiceSpec` fields (`probe`, `sockets`, `require_ready`) round-trip through the existing `spec_json` column — services that never set them stay byte-compatible. `.ready` lives next to existing sockets in `$HERO_SOCKET_DIR/<service>/`.

Sub-issue references with concrete pointers in the comments below.