fix(scanner): close TOCTOU race between probe and stale-socket delete (#49) #54

Merged
rawdaGastan merged 1 commit from development_fix_scanner_toctou into development 2026-04-26 07:30:55 +00:00

Summary

cleanup_stale_sockets used to probe every socket in $HERO_SOCKET_DIR first, collect the unresponsive paths into a Vec, then unlink them after the loop finished. Under rapid service churn, a socket could go from connection refused (mid-startup) in the probe pass to fully-bound-and-serving by the time the delete step reached it — we would then unlink a live socket, making the just-respawned service silently unreachable to new clients until the next scan cycle.

This PR closes that window with a pre-delete re-probe and tolerates benign ENOENT on delete.

Closes #49.

Changes

crates/hero_router/src/scanner.rs (+32 / -0)

  • In the cleanup_stale_sockets delete loop, call is_socket_responsive(&path).await immediately before remove_file. On true, log at debug! and continue — the service respawned during the probe pass, and unlinking would silently break it.
  • Add an Err(e) if e.kind() == std::io::ErrorKind::NotFound arm for remove_file that logs at debug! rather than warn!. Another process (or an earlier sweep) already removed the file; it is not an error.
  • Other remove_file errors (EPERM, EIO, etc.) continue to log at warn! with the same format. No changes to is_socket_responsive, the Scanner struct, or any caller.
  • Add a #[cfg(test)] mod tests block with remove_file_on_missing_path_reports_notfound, which documents the invariant the NotFound arm relies on. The module previously had no tests.
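The changed loop can be sketched roughly as follows. This is a hedged, synchronous stand-in, not the actual patch: the real code in `scanner.rs` is async (`is_socket_responsive(&path).await`) and logs through `debug!`/`warn!` rather than printing, and the probe stub here is hypothetical.

```rust
use std::io::ErrorKind;
use std::path::{Path, PathBuf};

// Stand-in for the real 1s-timeout UDS connect probe in scanner.rs.
fn is_socket_responsive(_path: &Path) -> bool {
    false
}

fn delete_stale(stale_paths: Vec<PathBuf>) {
    for path in stale_paths {
        // Re-probe immediately before unlinking: the service may have
        // respawned since the first probe pass collected this path.
        if is_socket_responsive(&path) {
            println!(
                "DEBUG Socket respawned before cleanup, skipping: {}",
                path.display()
            );
            continue;
        }
        match std::fs::remove_file(&path) {
            Ok(()) => println!("DEBUG Removed stale socket: {}", path.display()),
            // Benign: another process (or an earlier sweep) already
            // removed the file.
            Err(e) if e.kind() == ErrorKind::NotFound => {
                println!("DEBUG Stale socket already gone: {}", path.display())
            }
            // EPERM, EIO, etc. remain real errors.
            Err(e) => eprintln!(
                "WARN Failed to remove stale socket {}: {}",
                path.display(),
                e
            ),
        }
    }
}

fn main() {
    delete_stale(vec![PathBuf::from("/tmp/definitely_missing.sock")]);
}
```

The key property is that `continue` runs before `remove_file`, so a socket that turns responsive between the two passes is never unlinked.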

Why no helper extraction or hero_proc integration

The issue's "even better" suggestion — consulting hero_proc's job_list and refusing to unlink sockets owned by live jobs — was explicitly deferred. The pre-delete re-probe plus NotFound handling covers the two concrete failure modes spelled out in the reproduction section and keeps the diff to a surgical 15-line change inside the existing loop body.

Extracting a probe-then-delete private helper was considered and rejected for the same reason — it grows the diff beyond a targeted race fix.

Cost

The re-probe adds at most one extra 1s connect per stale candidate (the existing is_socket_responsive timeout). In steady-state operation stale_paths is small — most sockets are live and filtered in the first pass — so added latency is bounded.

Race reproducibility

The respawn race is timing-dependent and not cleanly reproducible in CI without a real UDS harness. Verified by inspection: is_socket_responsive returning true causes continue before remove_file, so a respawned socket is never unlinked. The NotFound path is covered by the new unit test.
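The invariant the new unit test documents can be shown in a few lines; this is an illustrative sketch, and the actual test name and module layout in `scanner.rs` may differ.

```rust
use std::io::ErrorKind;

fn main() {
    // remove_file on a missing path (including a missing parent
    // directory) reports ErrorKind::NotFound, which the delete loop
    // treats as a benign no-op rather than a warning.
    let err = std::fs::remove_file("/nonexistent_dir_for_test/rpc.sock")
        .expect_err("removing a missing path must fail");
    assert_eq!(err.kind(), ErrorKind::NotFound);
}
```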

Manual reproduction (see issue body):

# While hero_router scanner is running:
rm ~/hero/var/sockets/some_service/rpc.sock      # force next probe to fail
# Rebind the socket before the cleanup loop reaches it.
# Expected log:
#   DEBUG hero_router::scanner: Socket respawned before cleanup, skipping: <path>
# The new socket is left intact.

Test Results

  • cargo test --workspace — 72 passed / 0 failed / 0 ignored (1 new test in scanner::tests).
  • cargo clippy --workspace --all-targets -- -D warnings — clean.
  • cargo fmt --all --check — clean.
  • End-to-end: cargo build --bin hero_router && hero_router scan runs without regression.
fix(scanner): close TOCTOU race between probe and stale-socket delete
All checks were successful
Build & Test / check (pull_request) Successful in 2m53s
Build & Test / check (push) Successful in 3m29s
4e7ee83ce0

cleanup_stale_sockets used to collect all unresponsive socket paths during
the full probe pass, then iterate and unlink them after the loop finished.
A service that respawned during the probe loop (binding a fresh socket at
the same path) could have its live socket deleted, making it silently
unreachable to new clients until the next scan.

- Re-probe immediately before remove_file. If the socket is now
  responsive it respawned during the probe loop — log at debug and
  continue instead of unlinking.
- Treat std::io::ErrorKind::NotFound on remove_file as a no-op. Another
  process (or an earlier sweep of ours) already removed the file; not
  an error.
- Other remove_file errors (EPERM, EIO, etc.) still logged at warn! as
  today. No changes to is_socket_responsive, Scanner fields, or any
  caller.

Add a #[cfg(test)] mod tests with remove_file_on_missing_path_reports_notfound
documenting the invariant the NotFound arm relies on. The scanner module
previously had no tests.

hero_proc job-list integration ("skip deletion for sockets owned by live
jobs") is deferred as a larger follow-up — the surgical re-probe plus
NotFound handling covers the two concrete failure modes spelled out in
the issue's reproduction section.
rawdaGastan merged commit 6166fccda8 into development 2026-04-26 07:30:55 +00:00
rawdaGastan deleted branch development_fix_scanner_toctou 2026-04-26 07:31:00 +00:00