[infra] Half-broken running service pattern — listener alive, handler dead (foundry / OServer dispatch) #202
Symptom
Today on herodemo, `hero_foundry_server` (PID 21181) reached a state where:

- `rpc.sock` was bound and listed by `ss -lx` (kernel listener alive; `/proc/21181/fd/9` → `socket:[50179088]`)
- `GET /health`, `POST /rpc`, and `GET /.well-known/heroservice.json` reached the handler and were logged
- A direct probe via `curl --unix-socket .../rpc.sock http://localhost/api/files/geomind/Photos/beach_retreat.jpg` → exit 52 (empty reply); the same path via the gateway → 502

A restart fixed it: afterwards, the identical curl returned 200 + a 34 KB JPEG.

Hypothesis
`crates/hero_foundry_server/src/http/server.rs::serve_connection` spawns one task per connection via `tokio::spawn`; each spawned task drives a single hyper connection to completion, roughly as in the sketch below.
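For reference, here is a minimal sketch of that accept-and-spawn pattern under hyper 1.x and `hyper-util`; the real `serve_connection` differs in detail, and `handle_request` is a hypothetical stand-in for the actual dispatch:

```rust
use http_body_util::Full;
use hyper::body::{Bytes, Incoming};
use hyper::{server::conn::http1, service::service_fn, Request, Response};
use hyper_util::rt::TokioIo;
use tokio::net::UnixListener;

// Hypothetical stand-in for the real dispatch (health, rpc, files, webdav).
async fn handle_request(
    _req: Request<Incoming>,
) -> Result<Response<Full<Bytes>>, std::convert::Infallible> {
    Ok(Response::new(Full::new(Bytes::from_static(b"ok"))))
}

async fn accept_loop(listener: UnixListener) {
    loop {
        let Ok((stream, _addr)) = listener.accept().await else { continue };
        let io = TokioIo::new(stream);
        // Detached task: nothing ever awaits this JoinHandle. If the task
        // panics, the panic is swallowed and the listener keeps accepting,
        // which is exactly the "listener alive, handler dead" shape.
        tokio::spawn(async move {
            if let Err(e) = http1::Builder::new()
                .serve_connection(io, service_fn(handle_request))
                .await
            {
                eprintln!("connection error: {e}");
            }
        });
    }
}
```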
If the per-connection task panics (or if the underlying `service_fn` returns an error that hyper considers fatal), the listener task itself stays alive. Either it stops accepting new useful connections, or it accepts them while the `service_fn` closure has captured state that is now poisoned (e.g., a poisoned `Mutex` on `state`, a dropped channel sender, etc.).

The fact that some paths still worked while others didn't suggests the dispatch is only partially broken: the simple paths (`/health`, `/rpc` → JSON-RPC) hit code that doesn't touch the broken state, while `/api/files/*` and `/webdav/*` touch some shared resource (e.g., WebDAV handler state, `state.get_context_storage_path()`, an FS handle pool) that has been poisoned. A minimal demonstration of the poisoning mechanism follows.
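To make the mechanism concrete: with `std::sync::Mutex` (note `tokio::sync::Mutex` does not poison), one panic while the lock is held poisons it for every later caller, so any path that does `.lock().unwrap()` starts panicking on every request while paths that never touch that lock stay healthy. A self-contained demonstration, not hero_foundry code:

```rust
use std::sync::{Arc, Mutex};

fn main() {
    let state = Arc::new(Mutex::new(0u32));

    // Simulate a handler that panics while holding the lock.
    let s = Arc::clone(&state);
    std::thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("bug in one code path");
    })
    .join()
    .ok();

    // From now on, every lock() returns Err(PoisonError). A code path
    // doing `.lock().unwrap()` panics on every request; code paths that
    // never touch this Mutex keep working, matching the partial breakage
    // observed above.
    match state.lock() {
        Ok(v) => println!("ok: {v}"),
        Err(poisoned) => println!("poisoned, inner value: {}", *poisoned.into_inner()),
    }
}
```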
Why this matters
This is a "half-broken running service" pattern. The supervisor reports green. Most probes report green. Users see broken features. Without an explicit fix, every restart we do for unrelated reasons (memory leaks, deploys) is a roll of the dice for whether the service comes back fully healthy.
Today we hit the same shape on `hero_os` (one child dead, parent process alive, supervisor reports green): different mechanism, same observable symptom. There's a class of bugs here, not a one-off.

Acceptance
- Instrument `serve_connection` so that a panicked per-connection task is at minimum loud (see the sketch after this list).
- Audit `state.rs` for shared resources protected by `Mutex` or `RwLock`; a panic while holding one of these is the classic trigger.
- Route panics and fatal connection errors to the `hero_proc` service logs (not just `eprintln!` to a buffer that gets dropped) so we can post-mortem these states.
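A sketch of the first item, assuming the accept-loop shape from the hypothesis above: keep the `JoinHandle` from `tokio::spawn` and have a watcher task report panics via `JoinError::is_panic`, instead of detaching the connection task. `handle_request` is again the hypothetical service from the earlier sketch.

```rust
use hyper::{server::conn::http1, service::service_fn};
use hyper_util::rt::TokioIo;
use tokio::net::UnixListener;

// Same hypothetical accept loop as above, but the JoinHandle is observed
// instead of dropped, so a panic in the connection task is reported loudly.
async fn accept_loop_loud(listener: UnixListener) {
    loop {
        let Ok((stream, _addr)) = listener.accept().await else { continue };
        let io = TokioIo::new(stream);
        let conn = tokio::spawn(async move {
            if let Err(e) = http1::Builder::new()
                .serve_connection(io, service_fn(handle_request))
                .await
            {
                eprintln!("connection error: {e}");
            }
        });
        // Watcher task: JoinError::is_panic() distinguishes a panicked
        // task from ordinary cancellation.
        tokio::spawn(async move {
            if let Err(err) = conn.await {
                if err.is_panic() {
                    // TODO: route this into hero_proc service logs
                    // (acceptance item 3), not a buffer that gets dropped.
                    eprintln!("per-connection task PANICKED: {err}");
                }
            }
        });
    }
}
```

This only makes the failure visible; if the panic happened while holding a `std::sync::Mutex`, the shared state still needs recovery (e.g., `Mutex::clear_poison` on Rust 1.77+, or restoring the value via `PoisonError::into_inner`).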
Cross-references

Signed-off-by: mik-tf