Dangling rpc.sock dentry on service restart — kernel listener alive, file vanished #78

Closed
opened 2026-04-30 11:22:54 +00:00 by mik-tf · 1 comment
Owner

Symptom

After a service restart (sometimes after a hero_proc daemon restart, sometimes after just proc service restart <X>), <service>/rpc.sock enters a state where:

  • The kernel-side listener is alive — ss -lnp shows LISTEN on the path with the right PID and FD.
  • The filesystem dentry is gone — ls -la does not show rpc.sock in the directory.
  • New clients fail with ENOENT (No such file or directory (os error 2)) when they connect() to the path.
  • The service's UI socket in the same dir typically stays healthy.

The service itself still reports running to proc service status because hero_proc tracks the action's process state, which is accurate as far as it goes; the inconsistency is between the kernel binding and filesystem visibility.

Repro pattern observed

This bit us at least twice in one session on herodemo.gent01.grid.tf:

  1. hero_agent after my session restarted things — hero_agent_ui couldn't reach hero_agent_server on rpc.sock; ss showed PID 1819 (hero_agent_server) listening on the path; ls showed only ui.sock. AI Assistant errored with Backend unavailable: No such file or directory (os error 2). proc service restart hero_agent cleared it.
  2. hero_foundry after the same session's mid-session hero_proc daemon crash + recovery — hero_office_server errored on foundry list_files failed: connecting to /home/driver/hero/var/sockets/hero_foundry/rpc.sock; ss showed PID 20606 (hero_foundry_server) listening on the same path; ls showed only ui.sock. Photos/videos/Office documents all broke as a knock-on. proc service restart hero_foundry cleared it.

The pattern is: the kernel listen socket survived a previous restart cycle but the file the new instance was supposed to bind to is missing.

Likely cause

kill_other.socket cleanup in the action spec calls unlink(path) before the new process starts. If the new process inherits / clones from a still-running predecessor (e.g. exec into a wrapper that already has the FD open) without re-binding, the kernel listener stays bound to the inode but the directory entry that pointed at it is gone — exactly the state we see.

Possible scenarios that produce this:

  • hero_proc receives service restart, calls unlink(rpc.sock), then re-execs the action. The action's process (or a child it kept) had the FD open from the previous bind; unlink() removed the dentry, but the inode and its kernel listener stayed because the open FD pinned them. Depending on action shape, the new process then either re-uses the inherited FD and never re-binds (producing exactly the observed state, since no fresh dentry is ever created), or attempts bind(rpc.sock). Note that bind() on a UDS path only returns EADDRINUSE while a dentry exists at that path; after the unlink a genuine re-bind would succeed and create a fresh dentry, so the never-re-binds variant is the one that matches what we saw.
  • Daemon-side crash where hero_proc dies mid-action-restart, leaving the unlink already done but the new bind never reached.

A clean restart of just that action repairs it because the launcher is fully re-spawned and binds a fresh dentry.
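The FD-pins-the-inode behaviour is easy to demonstrate in isolation. The sketch below (standalone Rust; the path name is illustrative) binds a UDS listener, unlinks the path while the FD is still open, and shows that new clients then get ENOENT while the listener itself stays alive:

```rust
use std::io::ErrorKind;
use std::os::unix::net::{UnixListener, UnixStream};

// Reproduce the dangling-dentry state: bind a UDS listener, unlink the
// path while the FD is still open, then connect as a new client.
// Returns the error kind the new client sees.
fn dangling_dentry_error(path: &str) -> ErrorKind {
    let _ = std::fs::remove_file(path);
    let listener = UnixListener::bind(path).expect("bind");
    // Sanity: with the dentry present, clients can connect (the backlog
    // accepts them even before accept() is called).
    UnixStream::connect(path).expect("connect while dentry exists");
    // kill_other.socket-style cleanup: unlink removes the dentry, but the
    // open FD pins the inode, so the kernel listener stays alive.
    std::fs::remove_file(path).expect("unlink");
    assert!(listener.local_addr().is_ok()); // listener FD still valid
    // New clients now fail with ENOENT, i.e. the observed state.
    UnixStream::connect(path).unwrap_err().kind()
}

fn main() {
    let kind = dangling_dentry_error("/tmp/issue78_demo_rpc.sock");
    println!("new client sees: {kind:?}"); // NotFound == ENOENT (os error 2)
}
```

This matches the field symptom bit for bit: ss would still show the process listening on the path, ls would show nothing, and connect() returns os error 2.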

What would harden this

A few options worth considering — pick whichever fits the supervisor model:

  1. Sanity-check after start: after kill_other.socket cleanup + new process spawn, hero_proc verifies the listed sockets are visible via stat() once the health check passes; if any are missing, it logs a warning and re-cleans + re-execs the action.
  2. Force unbind via SO_REUSEADDR-equivalent on UDS: ensure new bind always wins when the old FD is somehow still around — Linux UDS doesn't have SO_REUSEADDR, but a connect()-then-fail probe before bind can detect the stale FD case.
  3. Rebind-on-startup health check: each action's healthcheck endpoint also stats its own listen path on the first probe and crashes loudly if the file is missing — hero_proc's retry policy then restarts it cleanly.

Option 1 is cheapest to implement; it just adds a post-start invariant check on the listed kill_other.socket paths.
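A minimal sketch of what that invariant check could look like, assuming the supervisor has the declared kill_other.socket paths at hand (the function and names here are illustrative, not existing hero_proc API):

```rust
use std::path::PathBuf;

// Hypothetical post-start check: which of the declared socket paths are
// not visible on disk? Empty result means the invariant holds.
fn missing_sockets(declared: &[PathBuf]) -> Vec<PathBuf> {
    declared.iter().filter(|p| !p.exists()).cloned().collect()
}

fn main() {
    // Illustrative paths; in hero_proc these would come from the
    // kill_other.socket list in the action spec.
    let declared = vec![PathBuf::from("/tmp/nonexistent_dir_78/rpc.sock")];
    let missing = missing_sockets(&declared);
    if !missing.is_empty() {
        // The real supervisor would log a warning, re-clean, and
        // re-exec the action here instead of just reporting.
        eprintln!("post-start invariant violated, sockets missing: {missing:?}");
    }
}
```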

Workaround today

Operators see this as os error 2 in a downstream service that talks to the affected one; running proc service restart <affected> clears it.

Files / context

  • The action specs that exhibited the bug used the standard kill_other.socket: ["<path>/rpc.sock"] pattern from hero_skills/tools/modules/services/lib.nu.
  • Both affected services (hero_agent, hero_foundry) had been running across at least one prior service restart cycle before the dangling state appeared.
Owner

Implemented — closes the dangling-dentry race

Fix lands in two layers, both wired through ServiceSpec::sockets (a new optional list of socket basenames declared by the service):

Layer 1 — pre-spawn cleanup

crates/hero_proc_server/src/supervisor/executor.rs now removes any stale socket files (and the .ready marker, see #84) before launching a process job for a service that has declared sockets. This closes the most common dangling-dentry case where the previous instance crashed leaving the dentry behind:

  • Before: child binds → bind sees existing dentry → behaviour OS-dependent → sometimes ENOENT for new clients
  • After: supervisor unlinks → child binds clean → dentry on disk matches kernel listener

See cleanup_service_sockets() in executor.rs.
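For readers without the source handy, the cleanup shape is roughly the following. This is a hedged sketch: the signature, error handling, and names are assumptions, not the actual executor.rs code:

```rust
use std::io::ErrorKind;
use std::path::Path;

// Sketch of the pre-spawn cleanup: unlink each declared socket basename
// under the service's socket dir, tolerating already-absent files.
fn cleanup_service_sockets(socket_dir: &Path, sockets: &[String]) -> std::io::Result<()> {
    for name in sockets {
        match std::fs::remove_file(socket_dir.join(name)) {
            Ok(()) => {}                                    // stale dentry removed
            Err(e) if e.kind() == ErrorKind::NotFound => {} // nothing stale: fine
            Err(e) => return Err(e),                        // real error: surface it
        }
    }
    // The .ready marker (#84) would get the same treatment here.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    std::fs::write(dir.join("rpc.sock"), b"")?; // simulate a stale dentry
    cleanup_service_sockets(&dir, &["rpc.sock".into(), "ui.sock".into()])?;
    assert!(!dir.join("rpc.sock").exists()); // clean slate for the child's bind
    Ok(())
}
```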

Layer 2 — periodic invariant check

A periodic check in crates/hero_proc_server/src/supervisor/service_state.rs runs every 5s. For every service that has declared sockets, every basename must exist on disk. A missing socket flips the service's status to failed with health_reason="declared socket missing on disk: <name>". This catches a dentry race that survives the spawn — the operator sees the truth in service list instead of green-when-broken.
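The invariant itself is simple; a sketch of the per-service check follows (types and names are illustrative, not the real service_state.rs code):

```rust
use std::path::Path;

// For each declared basename, the dentry must exist under the service's
// socket dir. On violation, return the health_reason string.
fn socket_invariant(socket_dir: &Path, sockets: &[String]) -> Result<(), String> {
    for name in sockets {
        if !socket_dir.join(name).exists() {
            return Err(format!("declared socket missing on disk: {name}"));
        }
    }
    Ok(())
}

fn main() {
    // A supervisor loop would run this every 5s and flip status to failed.
    let r = socket_invariant(Path::new("/tmp/no_such_service_dir_78"), &["rpc.sock".into()]);
    assert_eq!(r, Err("declared socket missing on disk: rpc.sock".into()));
}
```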

How services opt in

```rust
// In your service definition (Rust):
ServiceSpec {
    name: "myservice".into(),
    sockets: vec!["rpc.sock".into(), "ui.sock".into()],
    ..Default::default()
}
```

Or via the JSON spec stored in services.spec_json — the new field is optional and round-trips through the existing column.
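As a rough illustration of the JSON shape (assuming the field serializes under the name sockets; the authoritative layout is whatever ServiceSpec's serde derive produces):

```json
{
  "name": "myservice",
  "sockets": ["rpc.sock", "ui.sock"]
}
```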

Out of scope

  • The home#202 half-broken-listener case (kernel listener alive, handler poisoned) is not fixed by this issue — that's the OServer panic-isolation problem tracked in home#204. The probe layer in #83 catches it from hero_proc's side.
  • No automatic respawn on missing-dentry yet. Reporting goes red; operator decides via service.reset_failed (added under #86 P4).

Verification

cargo check --workspace --all-targets clean. Build clean. Existing suite passes. Closing as implemented.

Reference
lhumina_code/hero_proc#78