Dangling rpc.sock dentry on service restart — kernel listener alive, file vanished #78
Symptom
After a service restart (sometimes after a hero_proc daemon restart, sometimes after just `proc service restart <X>`), `<service>/rpc.sock` enters a state where:

- `ss -lnp` shows `LISTEN` on the path with the right PID and FD.
- `ls -la` does not show `rpc.sock` in the directory.
- Clients get `ENOENT` (No such file or directory (os error 2)) when they `connect()` to the path.
- The service itself reports `running` to `proc service status`, because hero_proc tracks the action's process state; that part is fine. The inconsistency is between kernel binding and filesystem visibility.
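For reference, a small illustrative detector for that state (not part of hero_proc; the function name and hard-coded path are made up for the example). It cross-checks the kernel's view in `/proc/net/unix`, which keeps listing a bound socket by its path even after the dentry is unlinked, against what is actually on disk, i.e. what `ss` + `ls` do by hand:

```rust
use std::path::Path;

/// True when some kernel UDS listener still references `sock_path` in
/// /proc/net/unix but the path has no dentry on disk.
fn is_dangling_listener(sock_path: &str) -> std::io::Result<bool> {
    let table = std::fs::read_to_string("/proc/net/unix")?;
    let kernel_knows_path = table.lines().any(|line| line.ends_with(sock_path));
    let dentry_exists = Path::new(sock_path).exists();
    Ok(kernel_knows_path && !dentry_exists)
}

fn main() -> std::io::Result<()> {
    // Example path from the report; adjust for the service under test.
    let path = "/home/driver/hero/var/sockets/hero_foundry/rpc.sock";
    if is_dangling_listener(path)? {
        eprintln!("kernel listener alive but {path} has no dentry");
    }
    Ok(())
}
```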
Repro pattern observed
This bit us at least twice in one session on `herodemo.gent01.grid.tf`:

- `hero_agent`, after my session restarted things: `hero_agent_ui` couldn't reach `hero_agent_server` on rpc.sock; `ss` showed PID 1819 (hero_agent_server) listening on the path; `ls` showed only ui.sock. AI Assistant errored with `Backend unavailable: No such file or directory (os error 2)`. `proc service restart hero_agent` cleared it.
- `hero_foundry`, after the same session's mid-session hero_proc daemon crash + recovery: `hero_office_server` errored on `foundry list_files failed: connecting to /home/driver/hero/var/sockets/hero_foundry/rpc.sock`; `ss` showed PID 20606 (hero_foundry_server) listening on the same path; `ls` showed only ui.sock. Photos/videos/Office documents all broke as a knock-on. `proc service restart hero_foundry` cleared it.

The pattern in both cases: the kernel listen socket survived a previous restart cycle, but the file the new instance was supposed to bind to is missing.
Likely cause
The `kill_other.socket` cleanup in the action spec calls `unlink(path)` before the new process starts. If the new process inherits or clones from a still-running predecessor (e.g. exec into a wrapper that already has the FD open) without re-binding, the kernel listener stays bound to the inode but the directory entry that pointed at it is gone: exactly the state we see.

Possible scenarios that produce this:

- A service restart calls `unlink(rpc.sock)`, then re-execs the action. The action's process (or a child it kept) had the FD open from the previous bind; `unlink()` removed the dentry, but the inode and its kernel listener stayed because the FD pinned them. The new process then either attempts `bind(rpc.sock)` and gets `EADDRINUSE` since the inode still exists, or re-uses the same FD and never re-binds (depending on action shape).

A clean restart of just that action repairs it because the launcher is fully re-spawned and binds a fresh dentry.
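A minimal sketch of that unlink-while-open mechanism, using a throwaway path rather than the hero_proc sockets; it reproduces the observed symptom pair (listener alive, path gone, client connect fails with os error 2):

```rust
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

fn main() -> std::io::Result<()> {
    let path = "/tmp/demo_rpc.sock"; // throwaway path for the demo
    let _ = std::fs::remove_file(path);

    let listener = UnixListener::bind(path)?; // creates dentry + inode + kernel listener
    std::fs::remove_file(path)?;              // unlink(): dentry gone, the open FD pins the inode

    // The listener FD is still bound to the now-invisible inode; at this
    // point `ss -lnp` would still show LISTEN on the path.
    assert_eq!(listener.local_addr()?.as_pathname(), Some(Path::new(path)));
    assert!(!Path::new(path).exists());

    // But any client resolving the *path* gets ENOENT, i.e. the
    // "No such file or directory (os error 2)" the downstream services saw.
    let err = UnixStream::connect(path).unwrap_err();
    assert_eq!(err.kind(), std::io::ErrorKind::NotFound);
    Ok(())
}
```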
What would harden this
A few options worth considering — pick whichever fits the supervisor model:
1. After `kill_other.socket` cleanup + new process spawn, hero_proc verifies the listed sockets are visible via `stat()` once the health check passes; if any are missing, it logs a warning and re-cleans + re-execs the action.
2. `SO_REUSEADDR`-equivalent on UDS: ensure the new bind always wins when the old FD is somehow still around. Linux UDS doesn't have `SO_REUSEADDR`, but a `connect()`-then-fail probe before bind can detect the stale FD case.

Option 1 is cheapest to implement; it just adds a post-start invariant check on the listed `kill_other.socket` paths (a sketch of such a check follows).
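A hedged sketch of what that option-1 check could look like; the function name and paths are illustrative, not hero_proc APIs:

```rust
use std::path::PathBuf;

/// Returns the declared socket paths that have no dentry on disk.
fn missing_sockets(declared: &[PathBuf]) -> Vec<PathBuf> {
    declared.iter().filter(|p| !p.exists()).cloned().collect()
}

fn main() {
    // Paths are illustrative; in hero_proc they would come from the action
    // spec's kill_other.socket list.
    let declared = vec![PathBuf::from(
        "/home/driver/hero/var/sockets/hero_agent/rpc.sock",
    )];
    let missing = missing_sockets(&declared);
    if !missing.is_empty() {
        eprintln!("declared sockets missing after start: {missing:?}");
        // Here the supervisor would log a warning and re-clean + re-exec the action.
    }
}
```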
Workaround today
Operators see this as `os error 2` in a downstream service that talks to the affected one; running `proc service restart <affected>` clears it.

Files / context
- `kill_other.socket: ["<path>/rpc.sock"]` pattern from `hero_skills/tools/modules/services/lib.nu`.
- The `service restart` cycle before the dangling state appeared.

Implemented: closes the dangling-dentry race
The fix lands in two layers, both wired through `ServiceSpec::sockets` (a new optional list of socket basenames declared by the service):

Layer 1: pre-spawn cleanup
`crates/hero_proc_server/src/supervisor/executor.rs` now removes any stale socket files (and the `.ready` marker, see #84) before launching a process job for a service that has declared `sockets`. This closes the most common dangling-dentry case, where the previous instance crashed and left the dentry behind. See `cleanup_service_sockets()` in `executor.rs`; a hedged sketch of the shape follows.
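Illustrative only; the signature and directory handling are assumptions, and only the function name `cleanup_service_sockets()` comes from the change itself:

```rust
use std::io::ErrorKind;
use std::path::Path;

/// Remove stale socket dentries (and the readiness marker) for a service
/// that declares `sockets`, before its process job is launched.
fn cleanup_service_sockets(socket_dir: &Path, sockets: &[String]) -> std::io::Result<()> {
    for name in sockets {
        match std::fs::remove_file(socket_dir.join(name)) {
            Ok(()) => {}                                    // stale dentry removed
            Err(e) if e.kind() == ErrorKind::NotFound => {} // nothing to clean
            Err(e) => return Err(e),
        }
    }
    // Drop the readiness marker (#84) so the new instance starts clean; its
    // exact location is an assumption here.
    let _ = std::fs::remove_file(socket_dir.join(".ready"));
    Ok(())
}

fn main() -> std::io::Result<()> {
    cleanup_service_sockets(
        Path::new("/home/driver/hero/var/sockets/hero_agent"), // illustrative dir
        &["rpc.sock".to_string(), "ui.sock".to_string()],
    )
}
```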
Layer 2: periodic invariant check
The check in `crates/hero_proc_server/src/supervisor/service_state.rs` runs every 5s. For every service that has declared `sockets`, every basename must exist on disk. A missing socket flips the service's status to `failed` with `health_reason="declared socket missing on disk: <name>"`. This catches a dentry race that survives the spawn; the operator sees the truth in `service list` instead of green-when-broken. A hedged sketch of the per-tick check follows.
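The struct and function names below are placeholders, not the `service_state.rs` internals; only the 5s cadence, the `failed` status, and the `health_reason` text come from the change:

```rust
use std::path::Path;
use std::time::Duration;

/// Minimal stand-in for the relevant bits of service state.
struct ServiceHealth {
    failed: bool,
    health_reason: Option<String>,
}

/// One tick of the invariant: every declared socket basename must still
/// have a dentry on disk, otherwise the service is flagged failed.
fn check_declared_sockets(socket_dir: &Path, sockets: &[String], health: &mut ServiceHealth) {
    for name in sockets {
        if !socket_dir.join(name).exists() {
            health.failed = true;
            health.health_reason = Some(format!("declared socket missing on disk: {name}"));
            return;
        }
    }
}

fn main() {
    let mut health = ServiceHealth { failed: false, health_reason: None };
    loop {
        check_declared_sockets(
            Path::new("/home/driver/hero/var/sockets/hero_foundry"), // illustrative dir
            &["rpc.sock".to_string(), "ui.sock".to_string()],
            &mut health,
        );
        if health.failed {
            eprintln!("{}", health.health_reason.as_deref().unwrap_or("failed"));
            break;
        }
        std::thread::sleep(Duration::from_secs(5)); // the real check runs on the supervisor's 5s tick
    }
}
```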
How services opt in

Services opt in by declaring `sockets` in their spec, or via the JSON spec stored in `services.spec_json`; the new field is optional and round-trips through the existing column. A sketch of the round-trip behavior follows.
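The struct below is a cut-down stand-in, not the real `ServiceSpec`; only the optional `sockets` field and the round-trip behavior are taken from the change (assumes serde/serde_json):

```rust
use serde::{Deserialize, Serialize};

/// Cut-down stand-in for ServiceSpec: only `sockets` is the real new field.
#[derive(Serialize, Deserialize)]
struct ServiceSpec {
    name: String,
    /// Optional list of socket basenames the service binds, e.g. "rpc.sock".
    #[serde(default, skip_serializing_if = "Option::is_none")]
    sockets: Option<Vec<String>>,
}

fn main() -> serde_json::Result<()> {
    // A spec that opts in, as it might sit in services.spec_json.
    let json = r#"{ "name": "hero_agent", "sockets": ["rpc.sock", "ui.sock"] }"#;
    let spec: ServiceSpec = serde_json::from_str(json)?;
    assert_eq!(spec.sockets.as_ref().map(|s| s.len()), Some(2));

    // Older specs without the field still parse, and re-serializing them
    // does not inject it: the existing column round-trips unchanged.
    let legacy: ServiceSpec = serde_json::from_str(r#"{ "name": "hero_foundry" }"#)?;
    assert!(legacy.sockets.is_none());
    assert_eq!(serde_json::to_string(&legacy)?, r#"{"name":"hero_foundry"}"#);
    Ok(())
}
```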
Out of scope

- The home#202 half-broken-listener case (kernel listener alive, handler poisoned) is not fixed by this issue; that is the OServer panic-isolation problem tracked in home#204. The probe layer in #83 catches it from hero_proc's side.
- `service.reset_failed` (added under #86 P4).

Verification
`cargo check --workspace --all-targets` is clean. Build is clean. Existing suite passes. Closing as implemented.