[arch] OServer panic isolation — make broken-handler-with-live-listener impossible, not just recoverable #204
Premise
The "half-broken running service" pattern (#202) — listener bound, handler dead — is not specific to hero_foundry. It's a property of how every OServer-pattern service shares state across hyper connection tasks.
That issue (#202) is about reproducing and patching a specific instance. This issue is the architectural review — making panic-isolation actually a property of the OServer pattern, not an aspiration.
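The half-broken state can be reproduced in miniature with nothing but the standard library: one task panics while holding a shared `Mutex`, and every later access sees `PoisonError` while the process stays up. A standalone sketch (no hyper involved; the counter stands in for `ServerState`):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state, as every connection task would hold via Arc<ServerState>.
    let state = Arc::new(Mutex::new(0_u32));

    // One "handler" panics while holding the lock.
    let s = Arc::clone(&state);
    let _ = thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("handler bug"); // unwinds while the guard is held → poison
    })
    .join();

    // The listener would still accept, but the shared state is now broken:
    // every subsequent lock() returns PoisonError, forever.
    assert!(state.lock().is_err());
    println!("poisoned: {}", state.lock().is_err());
}
```

The spawned thread's panic message goes to stderr; the final line prints `poisoned: true`. Nothing restarts, nothing alarms: the process keeps serving while that state is dead.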
The vulnerability shape
Today's hero_foundry shape:
Hyper's per-connection task is panic-isolated at the task level — one panicked connection task does not kill the listener. Good. But every task captures `Arc<ServerState>` and accesses interior mutable state (Mutexes, channels, FS handles, in-memory caches) via that Arc. The moment any task:

- panics while holding a `Mutex<X>` → the `Mutex` is poisoned; every subsequent task's `lock()` returns `PoisonError`
- panics while other tasks `.await` a response from it → those tasks block forever
- panics while holding one side of an `RwLock` pair → the two halves are stuck forever, but new connections accept fine

The listener task (`accept` loop) keeps running because it doesn't touch the poisoned state. Internal heartbeats from hero_proc succeed because they hit shallow paths that also don't touch it. User-facing requests die silently because they do.

This is exactly what we observed on hero_foundry today. The `eprintln` on line 176 didn't appear in hero_proc service logs — either it never fired (a silent panic in a deeper future), or it fired and got swallowed by stderr buffering.

The architectural ask
OServer (the shared HTTP server pattern used by every `hero_*_server` binary) should make panic-isolation an actual property:

1. No mutable state shared across connection tasks
Either:

- share only immutable data across tasks (`Arc<ImmutableConfig>`), or
- route every mutation through message passing (`mpsc::Sender<Command>`), so a panicked consumer drops cleanly and a fresh consumer can be respawned.

The "Arc + Mutex on hot path" pattern that's pervasive today should be flagged in code review and migrated.
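A sketch of the message-passing option, using a hypothetical `Command` type: handler tasks hold only an `mpsc::Sender`, a single owner task holds the state, and a panicked handler therefore cannot poison anything:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical command type; handlers send these instead of locking state.
enum Command {
    Incr,
    Get(mpsc::Sender<u64>),
}

fn spawn_state_owner() -> mpsc::Sender<Command> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut counter: u64 = 0; // owned by one task, never shared
        for cmd in rx {
            match cmd {
                Command::Incr => counter += 1,
                Command::Get(reply) => {
                    let _ = reply.send(counter);
                }
            }
        }
        // All senders dropped: the owner exits cleanly.
        // A supervisor could respawn a fresh owner here.
    });
    tx
}

fn main() {
    let tx = spawn_state_owner();
    tx.send(Command::Incr).unwrap();
    tx.send(Command::Incr).unwrap();

    let (rtx, rrx) = mpsc::channel();
    tx.send(Command::Get(rtx)).unwrap();
    let n = rrx.recv().unwrap();
    assert_eq!(n, 2);
    println!("counter = {n}");
}
```

If a handler panics after `send`, only its own task dies; the owner keeps draining the queue, and there is no lock to leave poisoned. In the real server this would be a tokio mpsc rather than the std one, but the ownership shape is identical.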
2. Catch panics at the connection-task boundary
Wrap `serve_connection` so a panicked task is logged with full context (path, method, headers, stack trace if possible) and counted in a metric. Repeated panics on the same path = an explicit alarm, not silent drift.

3. Replace `eprintln!` with a structured logger that hero_proc captures

hero_proc service logs should see every connection error and every panic, with timestamp and identifier. Today some of these go to a stderr buffer that's never flushed when the service is killed mid-request.

4. Explicit handler timeouts
Any handler that takes >N seconds is killed and the connection task ends with a 504. A handler future stuck on a poisoned lock should not be allowed to wait forever — that's what makes "service running, half the paths timeout silently" a permanent state instead of a self-correcting one.
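A thread-based sketch of the deadline idea; in the async stack this would be `tokio::time::timeout` around the handler future, but the shape is the same (`run_with_deadline` and the status codes are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical runner: give the handler its deadline, else answer 504.
fn run_with_deadline<F>(handler: F, deadline: Duration) -> u16
where
    F: FnOnce() -> u16 + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(handler());
    });
    match rx.recv_timeout(deadline) {
        Ok(status) => status,
        // Handler stuck (e.g. waiting on a poisoned lock): fail loudly.
        Err(_) => 504,
    }
}

fn main() {
    // A fast handler completes normally.
    assert_eq!(run_with_deadline(|| 200, Duration::from_millis(100)), 200);

    // A stuck handler is cut off with a 504 instead of hanging forever.
    let status = run_with_deadline(
        || {
            thread::sleep(Duration::from_secs(5));
            200
        },
        Duration::from_millis(50),
    );
    assert_eq!(status, 504);
    println!("stuck handler -> {status}");
}
```

This converts "half the paths hang silently" into a visible stream of 504s, which the panic metric from ask 2 can alarm on.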
Why this is "long-term," not coping
A panic-catch + restart layer (#202 acceptance criterion) makes the service recover loudly. That's necessary but not sufficient. The structural fix is: make the failure mode impossible, not faster to recover from.
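That recovery layer could be sketched as a `catch_unwind` wrapper at the connection-task boundary, with an atomic counter standing in for the metric and `eprintln!` standing in for the structured logger (`serve_one` and `PANIC_COUNT` are invented names, not OServer API):

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicU64, Ordering};

static PANIC_COUNT: AtomicU64 = AtomicU64::new(0); // per-path map in a real server

// Hypothetical wrapper: run one connection's handler and convert a
// panic into a logged, counted 500 instead of a silently dead task.
fn serve_one(path: &str, handler: impl FnOnce() -> u16) -> u16 {
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(status) => status,
        Err(cause) => {
            PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
            let msg = cause
                .downcast_ref::<&str>()
                .copied()
                .unwrap_or("non-string panic payload");
            eprintln!("panic path={path} msg={msg}");
            500
        }
    }
}

fn main() {
    assert_eq!(serve_one("/ok", || 200), 200);
    assert_eq!(serve_one("/boom", || panic!("handler bug")), 500);
    assert_eq!(PANIC_COUNT.load(Ordering::Relaxed), 1);
    println!("panics = {}", PANIC_COUNT.load(Ordering::Relaxed));
}
```

Useful, but note what it does not do: if the panicking handler held a `Mutex` guard, the lock is still poisoned after the 500. That is exactly why the structural asks above matter beyond this layer.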
Once shipped:
Acceptance
- `catch_unwind` + structured panic log + metric.

Cross-references
Signed-off-by: mik-tf