Dangling rpc.sock dentry on service restart — kernel listener alive, file vanished #78

Closed
opened 2026-04-30 11:22:54 +00:00 by mik-tf · 1 comment
Owner

Symptom

After a service restart (sometimes after a hero_proc daemon restart, sometimes after just proc service restart <X>), <service>/rpc.sock enters a state where:

  • The kernel-side listener is alive — ss -lnp shows LISTEN on the path with the right PID and FD.
  • The filesystem dentry is gone — ls -la does not show rpc.sock in the directory.
  • New clients fail with ENOENT (No such file or directory (os error 2)) when they connect() to the path.
  • The service's UI socket in the same dir typically stays healthy.

The service itself still reports running to proc service status because hero_proc tracks the action's process state, which is accurate as far as it goes; the inconsistency is between the kernel binding and filesystem visibility.

Repro pattern observed

This bit us at least twice in one session on herodemo.gent01.grid.tf:

  1. hero_agent after my session restarted things — hero_agent_ui couldn't reach hero_agent_server on rpc.sock; ss showed PID 1819 (hero_agent_server) listening on the path; ls showed only ui.sock. AI Assistant errored with Backend unavailable: No such file or directory (os error 2). proc service restart hero_agent cleared it.
  2. hero_foundry after the same session's mid-session hero_proc daemon crash + recovery — hero_office_server errored on foundry list_files failed: connecting to /home/driver/hero/var/sockets/hero_foundry/rpc.sock; ss showed PID 20606 (hero_foundry_server) listening on the same path; ls showed only ui.sock. Photos/videos/Office documents all broke as a knock-on. proc service restart hero_foundry cleared it.

The pattern is: the kernel listen socket survived a previous restart cycle but the file the new instance was supposed to bind to is missing.

Likely cause

kill_other.socket cleanup in the action spec calls unlink(path) before the new process starts. If the new process inherits / clones from a still-running predecessor (e.g. exec into a wrapper that already has the FD open) without re-binding, the kernel listener stays bound to the inode but the directory entry that pointed at it is gone — exactly the state we see.

Possible scenarios that produce this:

  • hero_proc receives service restart, calls unlink(rpc.sock), then re-execs the action. The action's process (or a child it kept) had the FD open from the previous bind; unlink() removed the dentry, but the inode and its kernel listener stayed because the open FD pinned them. Depending on action shape, the new process then either re-uses the inherited FD and never re-binds (producing exactly the observed state, since no fresh dentry is ever created), or attempts bind(rpc.sock). Note that bind() on a UDS path only returns EADDRINUSE while a dentry exists at that path; after the unlink a genuine re-bind would succeed and create a fresh dentry, so the never-re-binds variant is the one that matches what we saw.
  • Daemon-side crash where hero_proc dies mid-action-restart, leaving the unlink already done but the new bind never reached.

A clean restart of just that action repairs it because the launcher is fully re-spawned and binds a fresh dentry.
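The FD-pins-the-inode behaviour is easy to demonstrate in isolation. The sketch below (standalone Rust; the path name is illustrative) binds a UDS listener, unlinks the path while the FD is still open, and shows that new clients then get ENOENT while the listener itself stays alive:

```rust
use std::io::ErrorKind;
use std::os::unix::net::{UnixListener, UnixStream};

// Reproduce the dangling-dentry state: bind a UDS listener, unlink the
// path while the FD is still open, then connect as a new client.
// Returns the error kind the new client sees.
fn dangling_dentry_error(path: &str) -> ErrorKind {
    let _ = std::fs::remove_file(path);
    let listener = UnixListener::bind(path).expect("bind");
    // Sanity: with the dentry present, clients can connect (the backlog
    // accepts them even before accept() is called).
    UnixStream::connect(path).expect("connect while dentry exists");
    // kill_other.socket-style cleanup: unlink removes the dentry, but the
    // open FD pins the inode, so the kernel listener stays alive.
    std::fs::remove_file(path).expect("unlink");
    assert!(listener.local_addr().is_ok()); // listener FD still valid
    // New clients now fail with ENOENT, i.e. the observed state.
    UnixStream::connect(path).unwrap_err().kind()
}

fn main() {
    let kind = dangling_dentry_error("/tmp/issue78_demo_rpc.sock");
    println!("new client sees: {kind:?}"); // NotFound == ENOENT (os error 2)
}
```

This matches the field symptom bit for bit: ss would still show the process listening on the path, ls would show nothing, and connect() returns os error 2.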

What would harden this

A few options worth considering — pick whichever fits the supervisor model:

  1. Sanity-check after start: after kill_other.socket cleanup + new process spawn, hero_proc verifies the listed sockets are visible via stat() once the health check passes; if any are missing, it logs a warning and re-cleans + re-execs the action.
  2. Force unbind via SO_REUSEADDR-equivalent on UDS: ensure new bind always wins when the old FD is somehow still around — Linux UDS doesn't have SO_REUSEADDR, but a connect()-then-fail probe before bind can detect the stale FD case.
  3. Rebind-on-startup health check: each action's healthcheck endpoint also stats its own listen path on the first probe and crashes loudly if the file is missing — hero_proc's retry policy then restarts it cleanly.

Option 1 is cheapest to implement; it just adds a post-start invariant check on the listed kill_other.socket paths.
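A minimal sketch of what that invariant check could look like, assuming the supervisor has the declared kill_other.socket paths at hand (the function and names here are illustrative, not existing hero_proc API):

```rust
use std::path::PathBuf;

// Hypothetical post-start check: which of the declared socket paths are
// not visible on disk? Empty result means the invariant holds.
fn missing_sockets(declared: &[PathBuf]) -> Vec<PathBuf> {
    declared.iter().filter(|p| !p.exists()).cloned().collect()
}

fn main() {
    // Illustrative paths; in hero_proc these would come from the
    // kill_other.socket list in the action spec.
    let declared = vec![PathBuf::from("/tmp/nonexistent_dir_78/rpc.sock")];
    let missing = missing_sockets(&declared);
    if !missing.is_empty() {
        // The real supervisor would log a warning, re-clean, and
        // re-exec the action here instead of just reporting.
        eprintln!("post-start invariant violated, sockets missing: {missing:?}");
    }
}
```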

Workaround today

Operators see this as os error 2 in a downstream service that talks to the affected one; running proc service restart <affected> clears it.

Files / context

  • The action specs that exhibited the bug used the standard kill_other.socket: ["<path>/rpc.sock"] pattern from hero_skills/tools/modules/services/lib.nu.
  • Both affected services (hero_agent, hero_foundry) had been running across at least one prior service restart cycle before the dangling state appeared.
Owner

Implemented — closes the dangling-dentry race

Fix lands in two layers, both wired through ServiceSpec::sockets (a new optional list of socket basenames declared by the service):

Layer 1 — pre-spawn cleanup

crates/hero_proc_server/src/supervisor/executor.rs now removes any stale socket files (and the .ready marker, see #84) before launching a process job for a service that has declared sockets. This closes the most common dangling-dentry case where the previous instance crashed leaving the dentry behind:

  • Before: child binds → bind sees existing dentry → behaviour OS-dependent → sometimes ENOENT for new clients
  • After: supervisor unlinks → child binds clean → dentry on disk matches kernel listener

See cleanup_service_sockets() in executor.rs.
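For readers without the source handy, the cleanup shape is roughly the following. This is a hedged sketch: the signature, error handling, and names are assumptions, not the actual executor.rs code:

```rust
use std::io::ErrorKind;
use std::path::Path;

// Sketch of the pre-spawn cleanup: unlink each declared socket basename
// under the service's socket dir, tolerating already-absent files.
fn cleanup_service_sockets(socket_dir: &Path, sockets: &[String]) -> std::io::Result<()> {
    for name in sockets {
        match std::fs::remove_file(socket_dir.join(name)) {
            Ok(()) => {}                                    // stale dentry removed
            Err(e) if e.kind() == ErrorKind::NotFound => {} // nothing stale: fine
            Err(e) => return Err(e),                        // real error: surface it
        }
    }
    // The .ready marker (#84) would get the same treatment here.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    std::fs::write(dir.join("rpc.sock"), b"")?; // simulate a stale dentry
    cleanup_service_sockets(&dir, &["rpc.sock".into(), "ui.sock".into()])?;
    assert!(!dir.join("rpc.sock").exists()); // clean slate for the child's bind
    Ok(())
}
```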

Layer 2 — periodic invariant check

A periodic check in crates/hero_proc_server/src/supervisor/service_state.rs runs every 5s. For every service that has declared sockets, every basename must exist on disk. A missing socket flips the service's status to failed with health_reason="declared socket missing on disk: <name>". This catches a dentry race that survives the spawn — the operator sees the truth in service list instead of green-when-broken.
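The invariant itself is simple; a sketch of the per-service check follows (types and names are illustrative, not the real service_state.rs code):

```rust
use std::path::Path;

// For each declared basename, the dentry must exist under the service's
// socket dir. On violation, return the health_reason string.
fn socket_invariant(socket_dir: &Path, sockets: &[String]) -> Result<(), String> {
    for name in sockets {
        if !socket_dir.join(name).exists() {
            return Err(format!("declared socket missing on disk: {name}"));
        }
    }
    Ok(())
}

fn main() {
    // A supervisor loop would run this every 5s and flip status to failed.
    let r = socket_invariant(Path::new("/tmp/no_such_service_dir_78"), &["rpc.sock".into()]);
    assert_eq!(r, Err("declared socket missing on disk: rpc.sock".into()));
}
```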

How services opt in

```rust
// In your service definition (Rust):
ServiceSpec {
    name: "myservice".into(),
    sockets: vec!["rpc.sock".into(), "ui.sock".into()],
    ..Default::default()
}
```

Or via the JSON spec stored in services.spec_json — the new field is optional and round-trips through the existing column.
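As a rough illustration of the JSON shape (assuming the field serializes under the name sockets; the authoritative layout is whatever ServiceSpec's serde derive produces):

```json
{
  "name": "myservice",
  "sockets": ["rpc.sock", "ui.sock"]
}
```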

Out of scope

  • The home#202 half-broken-listener case (kernel listener alive, handler poisoned) is not fixed by this issue — that's the OServer panic-isolation problem tracked in home#204. The probe layer in #83 catches it from hero_proc's side.
  • No automatic respawn on missing-dentry yet. Reporting goes red; operator decides via service.reset_failed (added under #86 P4).

Verification

cargo check --workspace --all-targets clean. Build clean. Existing suite passes. Closing as implemented.

Reference
lhumina_code/hero_proc#78