[infra] Half-broken running service pattern — listener alive, handler dead (foundry / OServer dispatch) #202

Open
opened 2026-04-30 20:02:50 +00:00 by mik-tf · 0 comments

Symptom

Today on herodemo, hero_foundry_server (PID 21181) reached a state where:

  • rpc.sock was bound and listed by ss -lx (kernel listener alive)
  • The PID owned the listener fd (/proc/21181/fd/9 → socket:[50179088])
  • Internal heartbeat probes (GET /health, POST /rpc, GET /.well-known/heroservice.json) reached the handler and were logged
  • Every other path returned "Empty reply from server" — connection accepted, immediately closed, no response, nothing logged

Direct probe via curl --unix-socket .../rpc.sock http://localhost/api/files/geomind/Photos/beach_retreat.jpg → exit 52 (empty reply). The same path via the gateway → 502. A restart fixed it: the identical curl then returned 200 plus the 34 KB JPEG.

[Server] Listening on unix:/home/driver/hero/var/sockets/hero_foundry/rpc.sock  ← from startup, hours earlier
[HTTP] GET /health           ← these keep succeeding
[HTTP] POST /rpc             ← these keep succeeding
[HTTP] GET /.well-known/...  ← these keep succeeding
                              ← /api/files/* requests never reached the handler at all

Hypothesis

crates/hero_foundry_server/src/http/server.rs::serve_connection spawns one task per connection via tokio::spawn. Each task does:

http1::Builder::new()
    .serve_connection(TokioIo::new(io), service)
    .await

If a per-connection task panics (or the underlying service_fn returns an error that hyper considers fatal), the listener task itself stays alive. Either it keeps accepting connections it can no longer serve usefully, or the service_fn closure has captured shared state that the panic left poisoned (e.g., a poisoned Mutex on state, a dropped channel sender).
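
One mechanism deserves a callout: tokio::spawn swallows panics unless someone awaits the JoinHandle, so a panicking connection task dies silently. A minimal sketch of making it loud, assuming hyper 1.x with hyper-util's TokioIo as in the snippet above (io and service are whatever serve_connection already has in scope):

// Sketch only, not the current code: spawn the connection, then watch
// its JoinHandle from a second task so a panic is logged, not swallowed.
let handle = tokio::spawn(async move {
    if let Err(err) = http1::Builder::new()
        .serve_connection(TokioIo::new(io), service)
        .await
    {
        // Per-connection hyper error: log and move on.
        eprintln!("[HTTP] connection error: {err}");
    }
});
tokio::spawn(async move {
    if let Err(join_err) = handle.await {
        if join_err.is_panic() {
            // Route this to a hero_proc-visible log in the real fix.
            eprintln!("[HTTP] connection task panicked: {join_err}");
        }
    }
});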

That some paths still worked while others didn't suggests the breakage follows the dispatch: the simple paths (/health, /rpc → JSON-RPC) hit code that doesn't touch the broken state, while /api/files/* and /webdav/* go through some shared resource (e.g., webdav handler state, state.get_context_storage_path(), an FS handle pool) that has been poisoned.
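
If the poisoned-Mutex theory is right, the shape reproduces outside hyper entirely. A self-contained sketch (std::sync::Mutex standing in for whatever state.rs actually holds; the route names are illustrative):

use std::sync::Mutex;

fn main() {
    // Hypothetical stand-in for a shared resource in state.rs.
    let shared = Mutex::new("fs handle pool");

    // A handler panics while holding the lock: the Mutex is now poisoned.
    let _ = std::panic::catch_unwind(|| {
        let _guard = shared.lock().unwrap();
        panic!("bug in /webdav handler");
    });

    // "/health"-style path: never touches the lock, keeps working.
    println!("GET /health -> 200");

    // "/api/files/*"-style path: every lock() now returns Err, forever.
    match shared.lock() {
        Ok(pool) => println!("GET /api/files/... -> 200 via {pool}"),
        Err(_) => println!("GET /api/files/... -> dead, lock poisoned"),
    }
}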

Why this matters

This is a "half-broken running service" pattern. The supervisor reports green. Most probes report green. Users see broken features. Without an explicit fix, every restart we do for unrelated reasons (memory leaks, deploys) is a roll of the dice for whether the service comes back fully healthy.

Today we hit the same shape on hero_os (one child dead, parent process alive, supervisor reports green) — different mechanism, same observable symptom. There's a class of bugs here, not a one-off.

Acceptance

  • Reproduce the half-broken state in a controlled test (panic injected into a webdav handler, fd table exhausted, FS-handle pool poisoned, etc.) — confirm which mechanism leaves which paths working/broken.
  • Add panic catch + log + (optional) self-restart in serve_connection so a panicked task is at minimum loud (see the JoinHandle sketch under Hypothesis).
  • Audit state.rs for shared resources protected by Mutex or RwLock — a panic while holding one of these is the classic trigger (a recovery sketch follows this list).
  • Add a connection-error log line that shows up in hero_proc service logs (not just eprintln! to a buffer that gets dropped) so we can post-mortem these states.
  • Once root cause is known, harden the same pattern across all OServer-pattern services (every hero_*_server binary).
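
For the state.rs audit item, one hedged hardening option (helper name hypothetical; only valid when the guarded data is still consistent after a panic, otherwise prefer logging loudly and restarting):

use std::sync::{Mutex, MutexGuard, PoisonError};

// Recover from a poisoned lock instead of unwrapping. Whether this is
// safe depends on what invariant the lock protects.
fn lock_recover<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(PoisonError::into_inner)
}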

Cross-references

  • Live observation 2026-04-30 — hero_foundry on herodemo, ~3.5 hrs uptime.
  • Sibling: lhumina_code/hero_proc#83 — even with this fixed, probes are needed.
  • Sibling: #201 — even with this fixed, the loop catches new failure modes.
  • Related: lhumina_code/hero_proc#78 (dangling socket dentry on supervisor restart) — different mechanism (file vanishes), same family (listener state vs reality drift).

Signed-off-by: mik-tf
