[arch] OServer panic isolation — make broken-handler-with-live-listener impossible, not just recoverable #204

Open
opened 2026-04-30 20:12:05 +00:00 by mik-tf · 0 comments

Premise

The "half-broken running service" pattern (#202) — listener bound, handler dead — is not specific to hero_foundry. It's a property of how every OServer-pattern service shares state across hyper connection tasks.

That issue (#202) is about reproducing and patching a specific instance. This issue is the architectural review — making panic-isolation actually a property of the OServer pattern, not an aspiration.

The vulnerability shape

Today's hero_foundry shape:

```rust
async fn serve_connection<I: ...>(io: I, state: Arc<ServerState>) {
    let service = service_fn(move |req| {
        let state = state.clone();          // shared mutable state
        handle_request(req, state)
    });
    if let Err(e) = http1::Builder::new()
        .serve_connection(TokioIo::new(io), service)
        .await
    {
        eprintln!("[Server] Connection error: {}", e);
    }
}
```

Hyper's per-connection task is panic-isolated at the task level — one panicked connection task does not kill the listener. Good. But every task captures Arc<ServerState> and accesses interior mutable state (Mutexes, channels, FS handles, in-memory caches) via that Arc. The moment any task:

  • Panics holding a Mutex<X> → the Mutex is poisoned. Every subsequent lock(), in any task, returns Err(PoisonError).
  • Drops a sender for a channel that other tasks await → those tasks block forever.
  • Holds an FS lock that's leaked → next FS access deadlocks.
  • Has a deadlock between two futures both holding RwLock halves → the two are stuck forever, but new connections accept fine.

The listener task (accept loop) keeps running because it doesn't touch the poisoned state. Internal heartbeats from hero_proc succeed because they hit shallow paths that also don't touch it. User-facing requests die silently because they do.
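The first bullet can be reproduced in miniature with std threads standing in for connection tasks — a sketch, not OServer code; the counter state is illustrative:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Simulate one connection task that panics while holding the shared lock.
fn run_panicking_task(state: Arc<Mutex<u32>>) {
    let handle = thread::spawn(move || {
        let _guard = state.lock().unwrap();
        panic!("handler bug"); // guard dropped during unwind → mutex poisoned
    });
    let _ = handle.join(); // the task is dead; a listener would keep accepting
}

fn main() {
    let state = Arc::new(Mutex::new(0u32));
    assert!(state.lock().is_ok()); // healthy before the panic
    run_panicking_task(Arc::clone(&state));
    // Every later "request" that touches the shared state now fails,
    // while an accept loop that never touches it would run on happily.
    assert!(state.lock().is_err());
    println!("poisoned: {}", state.lock().is_err());
}
```

The panic kills only its own thread — exactly the task-level isolation hyper gives us — yet the damage outlives it in the shared state.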

This is exactly what we observed on hero_foundry today. The eprintln on line 176 didn't appear in hero_proc service logs — either it never fired (silent panic in deeper future), or it fired and got swallowed by stderr buffering.

The architectural ask

OServer (the shared HTTP server pattern used by every hero_*_server binary) should make panic-isolation an actual property:

1. No mutable state shared across connection tasks

Either:

  • All shared state is immutable (Arc<ImmutableConfig>).
  • State changes go through an actor/channel (mpsc::Sender<Command>), so a panicked consumer drops cleanly and a fresh consumer can be respawned.
  • Each connection gets its own state snapshot at accept time.

The "Arc + Mutex on hot path" pattern that's pervasive today should be flagged in code review and migrated.
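The actor/channel option above can be sketched with std primitives (the `Command` enum and counter state are hypothetical stand-ins, not OServer's real types). The key property: the mutable state is owned by exactly one task, so there is no lock to poison, and a dead actor can be respawned:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical command type for a single-owner state actor.
enum Command {
    Increment,
    Get(mpsc::Sender<u64>),
}

fn spawn_actor() -> mpsc::Sender<Command> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut count: u64 = 0; // owned here, never shared by reference
        for cmd in rx {
            match cmd {
                Command::Increment => count += 1,
                Command::Get(reply) => {
                    let _ = reply.send(count);
                }
            }
        }
        // All senders dropped → loop ends → actor exits cleanly.
    });
    tx
}

fn main() {
    let actor = spawn_actor();
    actor.send(Command::Increment).unwrap();
    actor.send(Command::Increment).unwrap();
    let (reply_tx, reply_rx) = mpsc::channel();
    actor.send(Command::Get(reply_tx)).unwrap();
    let n = reply_rx.recv().unwrap();
    assert_eq!(n, 2);
    println!("count = {}", n);
}
```

Connection tasks hold only a cloned `Sender<Command>`; if the actor dies, their sends fail loudly instead of blocking on a poisoned `Mutex`.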

2. Catch panics at the connection-task boundary

Wrap serve_connection so a panicked task is logged with full context (path, method, headers, stack trace if possible) and counted in a metric. Repeated panics on the same path = an explicit alarm, not silent drift.
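A synchronous sketch of that boundary, using `std::panic::catch_unwind` (the async version would wrap the handler future, e.g. with futures' `FutureExt::catch_unwind`; the handler and paths here are made up):

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical handler: one path has a bug, one is healthy.
fn handle(path: &str) -> String {
    if path == "/boom" {
        panic!("handler bug on {path}");
    }
    format!("200 OK {path}")
}

// Boundary wrapper: a panic never escapes past the connection task;
// it becomes a logged 500 (in OServer: structured log + metric).
fn serve(path: &str) -> String {
    match panic::catch_unwind(AssertUnwindSafe(|| handle(path))) {
        Ok(resp) => resp,
        Err(_) => {
            eprintln!("[panic] path={path}"); // would carry method, headers, trace
            "500 Internal Server Error".to_string()
        }
    }
}

fn main() {
    assert_eq!(serve("/ok"), "200 OK /ok");
    assert_eq!(serve("/boom"), "500 Internal Server Error");
    assert_eq!(serve("/ok"), "200 OK /ok"); // the next request still works
    println!("panic contained");
}
```

Note this only contains the panic; it does not undo state damage — which is why point 1 has to come first.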

3. Replace eprintln! with a structured logger that hero_proc captures

hero_proc service logs should see every connection error and every panic, with timestamp and identifier. Today some of these go to a stderr buffer that's never flushed when the service is killed mid-request.
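A minimal shape for such a logger, std-only (field names are illustrative; a real OServer logger would presumably use a logging crate and hero_proc's expected format). The point is one self-describing line per event, explicitly flushed as a belt-and-braces measure so nothing sits in a buffer when the process dies mid-request:

```rust
use std::io::{self, Write};
use std::time::{SystemTime, UNIX_EPOCH};

// Format one structured event line. Field names are illustrative.
fn format_event(ts: u64, level: &str, service: &str, msg: &str) -> String {
    format!("ts={ts} level={level} service={service} msg={msg:?}")
}

// Emit to stderr (which hero_proc captures) and flush immediately.
fn log_event(level: &str, service: &str, msg: &str) {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    let mut err = io::stderr().lock();
    let _ = writeln!(err, "{}", format_event(ts, level, service, msg));
    let _ = err.flush(); // never leave an error sitting in a buffer
}

fn main() {
    log_event("error", "hero_foundry", "connection error: broken pipe");
    log_event("panic", "hero_foundry", "handler panicked on /api/upload");
    println!("logged");
}
```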

4. Explicit handler timeouts

Any handler that takes >N seconds is killed and the connection task ends with a 504. A handler future stuck on a poisoned lock should not be allowed to wait forever — that's what makes "service running, half the paths timeout silently" a permanent state instead of a self-correcting one.
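The budget can be sketched synchronously with `recv_timeout` standing in for an async timeout wrapper (in the real async code this would be `tokio::time::timeout` around the handler future; the helper below is hypothetical):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run a handler with a time budget; if it doesn't answer in time,
// the caller gives up and answers 504 instead of waiting forever.
fn with_timeout<F>(budget: Duration, handler: F) -> Result<String, &'static str>
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(handler());
    });
    rx.recv_timeout(budget).map_err(|_| "504 Gateway Timeout")
}

fn main() {
    let fast = with_timeout(Duration::from_millis(200), || "200 OK".to_string());
    assert_eq!(fast.unwrap(), "200 OK");

    // A handler stuck (e.g. on a poisoned lock) trips the budget
    // instead of making the path silently hang forever.
    let stuck = with_timeout(Duration::from_millis(50), || {
        thread::sleep(Duration::from_secs(10));
        "never".to_string()
    });
    assert_eq!(stuck, Err("504 Gateway Timeout"));
    println!("timeout enforced");
}
```

One caveat the async version shares: the timeout abandons the stuck work, it does not reclaim whatever resource the handler is blocked on — again, point 1 is what removes the blocked-forever state itself.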

Why this is "long-term," not coping

A panic-catch + restart layer (#202 acceptance criterion) makes the service recover loudly. That's necessary but not sufficient. The structural fix is: make the failure mode impossible, not faster to recover from.

Once shipped:

  • A panicked handler does not poison shared state. It dies, logs, the next request works.
  • "Half-broken running service" is no longer a steady state — it's either fully running or fully dead, and the supervisor can act on the latter.
  • Health probes (lhumina_code/hero_proc#83) become a defense-in-depth check, not the primary defense.

Acceptance

  • OServer dispatcher rewritten so no mutable state is shared by reference across connection tasks. Migration plan for existing handlers.
  • Connection-level catch_unwind + structured panic log + metric.
  • Per-handler timeout (configurable, default 30s).
  • Stderr → structured log piped through hero_proc, end-to-end test that panics are captured.
  • Migration of the 3–5 most heavily-used hero_*_server binaries onto the new pattern as proof-of-shape (foundry, osis, books, agent at minimum).

Cross-references

  • Architectural fix for #202 (specific case study).
  • Reduces the surface area defended by lhumina_code/hero_proc#83 — health probes still useful, but as defense-in-depth.
  • Discovered live 2026-04-30 — hero_foundry on herodemo, restart fixed the symptom but the underlying pattern remains untouched on every other OServer-based service.

Signed-off-by: mik-tf

mik-tf self-assigned this 2026-04-30 20:12:05 +00:00