fix(scanner): close TOCTOU race between probe and stale-socket delete (#49) #54
No reviewers
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_router!54
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "development_fix_scanner_toctou"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
cleanup_stale_socketsused to probe every socket in$HERO_SOCKET_DIRfirst, collect the unresponsive paths into aVec, then unlink them after the loop finished. Under rapid service churn, a socket could go fromconnection refused(mid-startup) in the probe pass to fully-bound-and-serving by the time the delete step reached it — we would then unlink a live socket, making the just-respawned service silently unreachable to new clients until the next scan cycle.This PR closes that window with a pre-delete re-probe and tolerates benign
ENOENTon delete.Closes #49.
Changes
crates/hero_router/src/scanner.rs(+32 / -0)cleanup_stale_socketsdelete loop, callis_socket_responsive(&path).awaitimmediately beforeremove_file. Ontrue, log atdebug!andcontinue— the service respawned during the probe pass, and unlinking would silently break it.Err(e) if e.kind() == std::io::ErrorKind::NotFoundarm forremove_filethat logs atdebug!rather thanwarn!. Another process (or an earlier sweep) already removed the file; it is not an error.remove_fileerrors (EPERM,EIO, etc.) continue to log atwarn!with the same format. No changes tois_socket_responsive, theScannerstruct, or any caller.#[cfg(test)] mod testsblock withremove_file_on_missing_path_reports_notfound, which documents the invariant theNotFoundarm relies on. The module previously had no tests.Why no helper extraction or hero_proc integration
The issue's "even better" suggestion — consulting
hero_proc'sjob_listand refusing to unlink sockets owned by live jobs — was explicitly deferred. The pre-delete re-probe plusNotFoundhandling covers the two concrete failure modes spelled out in the reproduction section and keeps the diff to a surgical 15-line change inside the existing loop body.Extracting a
probe-then-deleteprivate helper was considered and rejected for the same reason — it grows the diff beyond a targeted race fix.Cost
The re-probe adds at most one extra 1s connect per stale candidate (the existing
is_socket_responsivetimeout). In steady-state operationstale_pathsis small — most sockets are live and filtered in the first pass — so added latency is bounded.Race reproducibility
The respawn race is timing-dependent and not cleanly reproducible in CI without a real UDS harness. Verified by inspection:
is_socket_responsivereturningtruecausescontinuebeforeremove_file, so a respawned socket is never unlinked. TheNotFoundpath is covered by the new unit test.Manual reproduction (see issue body):
Test Results
cargo test --workspace— 72 passed / 0 failed / 0 ignored (1 new test inscanner::tests).cargo clippy --workspace --all-targets -- -D warnings— clean.cargo fmt --all --check— clean.cargo build --bin hero_router && hero_router scanruns without regression.